Troubles with _mm_loadl_epi64()
Alright, who was the dork who designed this SSE2 compiler intrinsic:
__m128i _mm_loadl_epi64(__m128i const *p);
What does this intrinsic do? It loads a 64-bit integer value from memory and stores it into the low 64 bits of an XMM register, zeroing the upper 64 bits. It's the compiler intrinsic version of the MOVQ instruction. MOVQ is fairly important for image processing routines in SSE2 for a couple of reasons: it's very convenient to process 64 bits of data, since eight 8-bit samples can be loaded and expanded as 16-bit words in a 128-bit register, and 128-bit memory accesses can't be misaligned like 64-bit memory accesses can.
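As a rough illustration of the load-and-expand pattern described above, here is a minimal sketch. The helper name load8_widen_u8_to_u16 is hypothetical and is not taken from the IDCT routine; it only shows the MOVQ-style load followed by zero extension to 16-bit words.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Hypothetical helper (illustration only): load eight 8-bit samples with a
// single 64-bit load and zero-extend them to eight 16-bit words.
inline __m128i load8_widen_u8_to_u16(const std::uint8_t *src) {
    // The intrinsic's signature forces a cast to const __m128i *, even though
    // only 8 bytes are read and no 16-byte alignment is needed.
    __m128i bytes = _mm_loadl_epi64(reinterpret_cast<const __m128i *>(src));

    // Interleave with zero: each source byte becomes the low half of a
    // 16-bit word, and the high half is zero.
    return _mm_unpacklo_epi8(bytes, _mm_setzero_si128());
}
```

The result holds the eight samples as 16-bit lanes, ready for word arithmetic such as PMULLW or PADDW.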
Anyway, I ran into this while porting my old AP-922 based IDCT routine to intrinsics in order to recompute the constants according to a tip I'd found in a whitepaper (folding column rounding into row pass, genetic algorithm to tune... don't ask). I figured, hey... maybe I'll try intrinsics again... couldn't hurt, right? Visual C++ tends not to do well with MMX intrinsics, i.e. it misgenerates code, so I first emulated the MMX instructions with scalar code. When that worked, I tried rewriting the wrappers with SSE2 for speed.
Only to have the routine utterly and completely blow up.
Turns out that VS2005 SP1 has a bad bug with the above intrinsic that causes it to horribly screw up code generation -- the compiler actually generated all of the MOVQ instructions backwards, leading to code like this:
movq xmm1,xmm3 ;should be xmm3, xmm1
movq xmm1,mmword ptr [eax+8]
I submitted a bug on this, and of course, the response I got back was of the form: "we're not planning on fixing this for VS2005, but hey, you can buy VS2008." Thanks, but no, I'm not going to fork out hundreds of dollars for a release that gives me little except bug fixes. Oh, and though its code was correct, VS2008 still generated lots of useless code fragments like this:
Now, after having spent hours diagnosing and futilely trying to work around this bug in VS2005, I finally got tired of the state of Microsoft's code generation and decided to try GCC, which reportedly has much better support for vector intrinsics. So, after a lot of searching around to find a Win32 build of GCC 4 I could use without installing half of Cygwin and a bit of swearing to get all of the environment variables and paths set up, I got the app compiled under GCC 4.
Then I built it under -O3, and watched it output garbage.
It turns out that the _mm_loadl_epi64() intrinsic is evil for another reason. The MOVQ xmmreg, m64 instruction loads a 64-bit word, aligned or unaligned. Note, however, that the intrinsic takes a const __m128i * pointer, which would normally require an aligned 128-bit memory location. In order to use this intrinsic as intended you have to cast the pointer, and it turns out that doing so runs afoul of C++'s type aliasing rules, which then causes GCC 4 to generate broken code. When I inspected the disassembly, I eventually figured out that an entire multiply chain had disappeared due to this. Working around this is a huge pain in the butt (union), so in the end I just used -fno-strict-aliasing to get the code working again. Ultimately, it didn't matter much either way, because although Visual C++ loves orgies with register-to-register moves, GCC ended up wasting a bunch of cycles doing memory spills and movq2dq instructions, apparently unable to realize that only the low 64 bits were ever used. The code didn't run any faster.
_mm_storel_epi64() has similar problems, by the way.
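One possible way to sidestep the pointer cast for both intrinsics is to bounce the 64 bits through a real __m128i object with memcpy, which is exempt from the aliasing rules. This is only a sketch under assumptions: the helper names load_low64 and store_low64 are hypothetical, this is not the workaround used for the routine above (that was -fno-strict-aliasing), and whether a given compiler folds the copy back into a single MOVQ is not guaranteed.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstring>      // std::memcpy

// Hypothetical helper: read 64 bits from any address, aligned or not, into
// the low half of an XMM value with the upper half zeroed.
inline __m128i load_low64(const void *p) {
    __m128i tmp = _mm_setzero_si128();
    std::memcpy(&tmp, p, 8);        // copy bytes; no typed pointer cast
    return _mm_loadl_epi64(&tmp);   // pointer really does point at a __m128i
}

// Hypothetical helper: write the low 64 bits of v to any address.
inline void store_low64(void *p, __m128i v) {
    __m128i tmp;
    _mm_storel_epi64(&tmp, v);      // store into a genuine __m128i object
    std::memcpy(p, &tmp, 8);        // then copy the low 8 bytes out
}
```

The design choice here is to keep every typed access pointing at an object of the type the intrinsic declares, and to leave the untyped byte movement to memcpy.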
Epic debugging sessions like this are the reason why I get so annoyed when people tell me I should stop writing assembly and switch to intrinsics. The fact is that even when the intrinsics actually do what I want and don't end up slower than even the original scalar code I had, every single time I've tried to use intrinsics I've gotten burnt, without exception.
I hope you at least submitted the bugs to the GCC developers... At least with them you wouldn't have to fork over hundreds of bucks for a bug-fix release.
Mitch 74 (link) - 27 02 08 - 02:43
Nope, two reasons I didn't... one is that it's an experimental MinGW build of GCC 4. The more important reason, though, is that I'm pretty sure the cast is invalid according to the C++ standard, due to the aliasing rules. That's why -fno-strict-aliasing fixes the problem. The entity that should really get the bug is Intel, because I think they're the ones who declared the intrinsic with the bogus type. I would have just used void *, myself.
Now, if you _really_ want to see badness, then compile _mm_set_epi16(0,0,0,0,0,0,0,1) with Visual C++, and watch the CPU gates weep....
Phaeron - 27 02 08 - 03:01
Can't NASM do SSE2? Why don't you just code the relevant parts directly in assembly? Yes, manual register management can be a pain in the butt, but it sure beats having non-working code...
Darkstar - 27 02 08 - 03:48
I think these MMX problems are partially addressed in gcc SVN post-4.3, but it doesn't do bitfield tracking on registers in general, so it might always miss cases like this. Anyway, I care about these things, so if you post the code I'll test it.
mvb - 27 02 08 - 12:31
I'd say that the main problem is due to the architecture being a pain, the asm being ugly and the intrinsics taking the best of both (an ugly pain). Sadly, x86 and x86_64 are the main architectures and alternatives are harder and harder to get.
Luca Barbato - 27 02 08 - 13:22
Does the x64 version of VS2005 have the bug? Because with that you can't use inline assembly.
Yuhong Bao - 27 02 08 - 20:47
I answered my own question by referring to the feedback. No, fortunately it doesn't, which is a good thing since you can't use inline asm.
Yuhong Bao - 27 02 08 - 20:50
I don't think that __m64 is supported on x64, BTW.
Yuhong Bao - 27 02 08 - 21:14
The IDCT routine I actually use is in assembly; I only decided to try intrinsics because I needed to hack the routine to tune it and didn't want to spend all of the time reflowing the registers. In retrospect, I would have just hacked the asm temporarily if I'd known I'd break two compilers in one day.
I don't use NASM, because the syntax is different and it can't output debug information compatible with the VC debugger. YASM can, but it's not worth porting my existing asm at the moment. I have 700K of asm and it's all written for MASM 6.15-8.0.
I don't think the compiler does either, but the CPU and OS definitely do -- MMX is context switched just fine and I think a VC team member said it's OK to use it in x64 user mode, just that there is no convention around it.
Phaeron - 27 02 08 - 23:19
Wow, I compiled the repro from connect and look at this:
add32x2(&y0, src+0, src+7);
movq xmm0, QWORD PTR [eax]
movq xmm2, QWORD PTR [eax+56]
add32x2(&y1, src+1, src+6);
movq xmm4, QWORD PTR [eax+48]
add32x2(&y3, src+3, src+4);
movq xmm6, QWORD PTR [eax+32]
movdqa xmm1, xmm0
paddd xmm1, xmm2
movq xmm3, xmm1 !!!
movq xmm1, QWORD PTR [eax+8]
add32x2(&y0, src+0, src+7);
movq xmm0,mmword ptr [eax]
movq xmm2,mmword ptr [eax+38h]
add32x2(&y1, src+1, src+6);
movq xmm4,mmword ptr [eax+30h]
add32x2(&y3, src+3, src+4);
movq xmm6,mmword ptr [eax+20h]
movq xmm1,xmm3 !!!
movq xmm1,mmword ptr [eax+8]
Gabest - 28 02 08 - 10:56
Oooh, now _that's_ interesting -- that implies there's something wrong with the back-end part of the compiler that generates machine code, not the code generator or optimizer. There have been issues where the VC8/9 debugger didn't disassemble instructions properly, but I think that was x64 related and clearly in this case the binary code is bad.
Unfortunately, there are known cases where the compiler generates .asm files that can't be assembled, and based on the feedback on Connect, they're probably not going to fix it. In fact, it's possible that after the grand rewrite for VC10 it may not exist.
Phaeron - 29 02 08 - 01:00
Whoops, hadn't noticed that someone requested the original code. Here it is:
It's pretty nasty code, because I was too lazy to actually create a VC++ project for it and had just been editing it in Notepad. It's already using fairly well tuned rounding constants, so it won't do much, although there's an #if 0 you can toggle to make it use less optimal constants, in which case eventually you'll see it converge. For VC8/9, I used /Ox /Gs /GS-; for GCC, I used -O3 -march=pentium-m -mfpmath=sse, plus -fno-strict-aliasing once I had figured out what was going on.
The program takes the implementation of Intel's AP-922 IDCT algorithm from VirtualDub's mpeg_idct_mmx.asm module, and applies rounding optimizations from an article by id Software about real-time DXT texture compression (http://developer.nvidia.com/object/real-..). The currently disabled path in the program contains the rounding constants from that paper; the ones currently enabled are better constants that were computed after several hours of running the tuning algorithm (during which I took the opportunity to grind a bit in La Pucelle).
Phaeron - 29 02 08 - 01:12
Someone pointed out to me that I had posted the wrong article link for the AP-922 IDCT rounding optimization. The correct paper is one that I'd actually found from the NVIDIA paper through multiple layers of references, called Real-Time Texture Streaming and Decompression:
Phaeron - 09 03 08 - 16:07
Why would you have to fork out hundreds of dollars for VS2008? If you want just the compiler and not the full IDE, hasn't Microsoft been giving that away?
James - 10 03 08 - 02:30
I try to avoid Frankenstein build setups. They do give away the Express Edition, which contains the optimizing compiler and the IDE, but MASM is missing from the toolchain and the IDE is limited in several important ways (no SCC integration, macros, Win32 resource editor, or 64-bit toolchain), so I'd have to mix the VS2008 compiler with the VS2005Pro IDE and that's a bad idea. There's also the problem that if I did switch to VS2008 I'd have to raise VirtualDub's minimum requirements, because VS2008's CRT does not support any version of Win9x. Upgrading Visual Studio is always a huge PITA, so I only want to do it when there's sufficient benefit and for C++ VS2008 doesn't come anywhere near meeting that bar. They might as well have called it Visual Studio 2005 OSR2.
I do actually have VC++ Express 9 installed, but I don't use it for any serious work -- the main reason I have it installed is so I know beforehand when I'm going to be screwed submitting a bug on VS2005, which is where I do all my work both professionally and non-professionally.
Phaeron - 10 03 08 - 04:28
"They might as well have called it Visual Studio 2005 OSR2."
If not for the .NET Framework 3.x parts.
Yuhong Bao - 23 03 08 - 16:42
Avery, have you tried using Intel Compiler with those intrinsics?
Igor Levicki (link) - 20 04 08 - 23:45
Oh and by the way latest Windows SDK should have newer compiler than VS2005 as well.
Igor Levicki (link) - 20 04 08 - 23:47
No, I haven't tried using the Intel compiler lately, although the last time I tried it it generated far more optimal code. There are too many issues with using Intel C/C++ in general, unfortunately, two issues being cost and stability of generated code.
The Windows SDK does indeed have an updated compiler, but it's easier just to get VC++ Express 9.0 for the VS2008 compiler instead. Unfortunately, there aren't any real improvements in intrinsics code generation, other than a couple of bug fixes. That having been said, the PSDK compiler isn't exactly the same as the mainline VS compiler, since the Windows team branches the compiler and maintains it separately, with the recent intrin.h conflicts being a symptom. It's probably possible to use the current VC9-based PSDK compiler with VS2005 SP1 by means of /useenv, but I haven't seen any real reason to do so.
Phaeron - 21 04 08 - 01:51