§ ¶A Visual C++ x64 code generation peculiarity
Take an SSE2 routine to do some filtering:
#include <emmintrin.h>
void filter(__m128i *dst,
const __m128i *table,
const unsigned char *indices,
unsigned n)
{
__m128i acc0 = _mm_setzero_si128();
__m128i acc1 = _mm_setzero_si128();
while(n--) {
const __m128i *kernel = &table[*indices++ * 2];
acc0 = _mm_add_epi16(acc1, kernel[0]);
acc1 = kernel[1];
*dst++ = acc0;
}
}
What this routine does is that it uses each index to look up a premultiplied kernel, and adds that to a short output window (8 samples). The output stream has 4x rate compared to the input stream. In a real routine the kernels would typically be a bit longer, but an example of where you might use something like this is to simultaneously upsample and convert a row of pixels or a block of audio through a non-linear curve.
If we look at the output of the VC++ x86 compiler, the result is decent:
0020: 0F B6 01 movzx eax,byte ptr [ecx]
0023: C1 E0 05 shl eax,5
0026: 66 0F 6F C1 movdqa xmm0,xmm1
002A: 66 0F FD 04 38 paddw xmm0,xmmword ptr [eax+edi]
002F: 66 0F 6F 4C 38 10 movdqa xmm1,xmmword ptr [eax+edi+10h]
0035: 66 0F 7F 02 movdqa xmmword ptr [edx],xmm0
0039: 41 inc ecx
003A: 83 C2 10 add edx,10h
003D: 4E dec esi
003E: 75 E0 jne 00000020
However, if we look at x64:
0010: 41 0F B6 00 movzx eax,byte ptr [r8]
0014: 48 83 C1 10 add rcx,10h
0018: 49 FF C0 inc r8
001B: 03 C0 add eax,eax
001D: 48 63 D0 movsxd rdx,eax
0020: 48 03 D2 add rdx,rdx
0023: 41 FF C9 dec r9d
0026: 66 41 0F FD 0C D2 paddw xmm1,xmmword ptr [r10+rdx*8]
002C: 66 0F 6F C1 movdqa xmm0,xmm1
0030: 66 41 0F 6F 4C D2 movdqa xmm1,xmmword ptr [r10+rdx*8+10h]
10
0037: 66 0F 7F 41 F0 movdqa xmmword ptr [rcx-10h],xmm0
003C: 75 D2 jne 0010
It turns out that there are a couple of weirdnesses involved when the x64 compiler hits this code. The x86 compiler is able to fold the x2 from the indexing expression and the x16 from the 128-bit (__m128i) element size into a single x32, which is then converted into a left shift by 5 bits (shl). The x64 compiler is not, and ends up emitting x2 + x2 + x8. Why?
The clue as to what's going on is in the MOVSXD instruction, which is a sign extension instruction. According to the C/C++ standards, integral expressions involving values smaller than int are promoted to int, which in the case of Win32/Win64 is 32-bit. Therefore, the expression (*indices++ * 2) gives a signed 32-bit integer. For the x86 compiler, pointers are also 32-bit and so it just shrugs and uses the signed value. The x64 compiler has to deal with a conversion to a 64-bit pointer offset, however, and seems unable to recognize that an unsigned char multiplied by 2 will never be negative, so it emits sign extension code.
Therefore, we should change the code to remove the intermediate signed type:
#include <emmintrin.h>
void filter(__m128i *dst,
const __m128i *table,
const unsigned char *indices,
unsigned n)
{
__m128i acc0 = _mm_setzero_si128();
__m128i acc1 = _mm_setzero_si128();
while(n--) {
const __m128i *kernel = &table[*indices++ * 2U];
acc0 = _mm_add_epi16(acc1, kernel[0]);
acc1 = kernel[1];
*dst++ = acc0;
}
}
Now we are multiplying by an unsigned integer, so the result must be an unsigned int. The x64 compiler now generates the following:
0090: 4C 8B D2 mov r10,rdx
0093: 66 0F EF C9 pxor xmm1,xmm1
0097: 45 85 C9 test r9d,r9d
009A: 74 30 je 00CC
009C: 0F 1F 40 00 nop dword ptr [rax]
00A0: 41 0F B6 00 movzx eax,byte ptr [r8]
00A4: 48 83 C1 10 add rcx,10h
00A8: 49 FF C0 inc r8
00AB: 8D 14 00 lea edx,[rax+rax]
00AE: 48 03 D2 add rdx,rdx
00B1: 41 FF C9 dec r9d
00B4: 66 41 0F FD 0C D2 paddw xmm1,xmmword ptr [r10+rdx*8]
00BA: 66 0F 6F C1 movdqa xmm0,xmm1
00BE: 66 41 0F 6F 4C D2 movdqa xmm1,xmmword ptr [r10+rdx*8+10h]
10
00C5: 66 0F 7F 41 F0 movdqa xmmword ptr [rcx-10h],xmm0
00CA: 75 D4 jne 00A0
Better, but still not quite there. The x64 compiler no longer needs to sign extend the offset, and therefore can now take advantage of the implicit zero extension in x64 when working with 32-bit registers. (New x64 programmers are often confused by the compiler emitting MOV EAX, EAX instructions, which are not no-ops as they zero the high dword.) However, the compiler is still unable to fuse the additions together. A bit of experimentation with the kernel size multiplier reveals that the x64 compiler has an unusual attachment to the trick of doing an x2 add followed by an x8 scale in order to index 16-byte elements. In this particular case there's a possibility that the two adds might be faster than a shift on some CPUs, but with larger multipliers the compiler generates a SHL followed by an ADD, which is never optimal. Therefore, let's take over the indexing entirely:
#include <emmintrin.h>
void filter(__m128i *dst,
const __m128i *table,
const unsigned char *indices,
unsigned n)
{
__m128i acc0 = _mm_setzero_si128();
__m128i acc1 = _mm_setzero_si128();
while(n--) {
const __m128i *kernel = (const __m128i *)
((const char *)table + (*indices++ * 32U));
acc0 = _mm_add_epi16(acc1, kernel[0]);
acc1 = kernel[1];
*dst++ = acc0;
}
}
Ugly? Definitely, but we're having to work around optimizer shortcomings here. Result:
0060: 41 0F B6 00 movzx eax,byte ptr [r8]
0064: 48 83 C1 10 add rcx,10h
0068: 49 FF C0 inc r8
006B: C1 E0 05 shl eax,5
006E: 41 FF C9 dec r9d
0071: 66 0F FD 0C 10 paddw xmm1,xmmword ptr [rax+rdx]
0076: 66 0F 6F C1 movdqa xmm0,xmm1
007A: 66 0F 6F 4C 10 10 movdqa xmm1,xmmword ptr [rax+rdx+10h]
0080: 66 0F 7F 41 F0 movdqa xmmword ptr [rcx-10h],xmm0
0085: 75 D9 jne 0060
That's better.
Conclusion: check your critical loops after porting to x64, even if they're using intrinsics and don't require fixes. There may be changes in code generation due to both architectural and compiler differences.
Comments
Comments posted:
GCC 4.6.1 x86-64 has the exact same behavior. This coincidence is peculiar: when checking school tests, when two students have the exact same mistake, the teacher suspects cheating.
Z.T. - 29 06 11 - 23:08
Well, if the compiler isn't clever enough to realize that a byte multiplied by two won't overflow into the sign bit then I wouldn't surprised if it couldn't figure out that it can't overflow the 32-bit integer outright either. What happens if you give it a subtler hint by multiplying by 2ULL instead?
As an aside working with (usually less-than-clever) 8-bit embedded compilers gives you something of an eye for spotting potential integer promotion inefficiencies.
doynax - 01 07 11 - 05:53
You still get screwed by type conversion rules in the above code - that's also why both GCC and VC++ agree about the way the address expression is generated. :)
To be more specific, the multiply by 2u is still done as a U32. So what the compiler sees (after the address expression for the table access has been generated) is, when written with explicit types, "U64_from_U32(U32_from_U8(*indices++) * 2u) * 16ull". For general expressions, the compiler isn't allowed to simplify this further; of course in this case we happen to know (because we start from a U8 so the first multiply can't overflow) that the expression is really the same as "U64_from_U8(*indices++) * 32ull". Your modified code (with the casting) takes a third option - it replaces the address expression with "U64_from_U32(U32_from_U8(*indices++) * 32u)" - again we know this can't overflow so it's identical. It still has one redundant conversion in it, but since x86-64 (and most other 64-bit architectures) implicitly zero-extend results of 32-bit computations by clearing the high 32 bits this conversion is free.
Note that you get a "shl eax, 5" not "shl rax, 5". That's because you still work in U32 temps. For a shift of 5 this doesn't matter, but if you have e.g. a shift of 3 you want (which could use the x86 scaled-index address modes) you really need to use the convert-U8-to-U64-then-multiply version.
The underlying problem for all of this is that the C(++) integral types are int-centric and try to do work with ints whenever possible; this was great for 16-bit processors with 16-bit ints or 32-bit processors with 32-bit ints, but now that we have 64-bit processors and use 32-bit ints it causes a lot of friction.
Fabian Giesen (link) - 01 07 11 - 13:05
You're right about a 64-bit cast also fixing the issue -- I had forgotten about that. (I'd cast to size_t instead of unsigned long long, but it's the same here.) However, looking at the C++ standard again, I'm not sure that the U64_from_U32() cast you mention actually exists. We know that it has to happen in order for the instruction addressing to occur, but we're talking about what the standard specifies directly, since the compiler is allowed by bypass any part of it that it can through the as-if rule. The indexing expression a[b] is converted to *(a + b), and since the "usual arithmetic conversions" don't include any special provisions for a pointer being involved, the offset will be promoted only up to int or unsigned int. The pointer+offset addition itself is then governed by 5.7p5, which simply describes the result of addition with a pointer without any description of specific types. I don't see any language that says that the integer value is to be promoted to pointer size prior to addition, and I'm not even sure that's possible given that there may not be such an integer type. This also fits more naturally with CPU architectures that allow indexing with a smaller data type than the base address, 6502 and 68000 coming to mind.
Again, though, there's nothing that prohibits the compiler from generating a single SHL instruction. The "as-if" rule would allow the compiler to remove all of the extraneous casts here according to detected value ranges. What we're really talking about here is just code generation quality.
Phaeron - 02 07 11 - 09:07
"I don't see any language that says that the integer value is to be promoted to pointer size prior to addition, and I'm not even sure that's possible given that there may not be such an integer type."
Correct; the cast to U64 here is not part of C(++) semantics, it's just there in the case of x86-64 (and all other 64-bit platforms that I've written code for, for that matter), where both pointers are GPRs happen to have 64-bit size. That add has to be done *somehow*; either all values involved are promoted to a common size prior to addition (which I'm implicitly assuming in my post), or there is an add instruction that takes mixed-size operands, or there is an addressing mode that allows a smaller-than-pointer-sized index. In the case of x86-64, there's no mixed-size adds and all the address calculations are done in full 64 bits, so it must be the first option.
"Again, though, there's nothing that prohibits the compiler from generating a single SHL instruction."
Yes, in this case; but if you were passing a "const unsigned int *indices" instead, there would be; that's not meant to imply that compilers shouldn't be doing a better job at this, they should. But I do think that the way the C/C++ type promotion rules work out here (doing the multiply on int-sized values), while well-defined and internally consistent, is spectacularly non-intuitive and bound to cause numerous people to unknowingly shoot themselves in the foot; that's unfortunate.
"The "as-if" rule would allow the compiler to remove all of the extraneous casts here according to detected value ranges. What we're really talking about here is just code generation quality."
Yes - it's just that it does require an additional pass of data-flow analysis (tracking upper and lower bounds for arithmetic expressions based on the original types to check whether an overflow might've occurred or not) that wasn't necessary in 32-bit code; we'll get there eventually, but progress on these kinds of optimizations is way slower than we'd like, especially since all the major compilers have been distracted by other issues for the past few years: MS has been pouring most of its resources into .NET for years only recently realizing that this whole "native code" thing might not just go away after all :), GCC is still caught up in a cascade of major infrastructure upgrades to drag their compiler kicking and screaming into this millennium (plus they have this unhealthy culture of language lawyering beating serious practical concerns sometimes), and ICCs main mission still seems to be raking up good SPEC CPU/FP results. LLVM to the rescue? We'll see :)
Fabian Giesen (link) - 02 07 11 - 15:38
Any reason why writing the asm inline directly is not a better idea? It might be harder to read but you could solve that by leaving the C/pseudocode in comments.
Kentaro (link) - 03 07 11 - 12:38
Kentaro: VC++ on x86-64 doesn't support inline ASM at all, only intrinsics (they had to add a bunch of new intrinsics to make that viable). Which is totally moronic, but there you go.
Fabian Giesen (link) - 03 07 11 - 14:57
Here is what Intel Compiler 12.0.4.147 generates for your C code:
test r9d, r9d
pxor xmm0, xmm0
lea eax, DWORD PTR [-1+r9]
je .B1.5
.B1.3::
movzx r9d, BYTE PTR [r8]
movdqa xmm1, xmm0
add r9d, r9d
dec eax
inc r8
shl r9, 4
paddw xmm1, XMMWORD PTR [r9+rdx]
movdqa xmm0, XMMWORD PTR [16+r9+rdx]
movdqa XMMWORD PTR [rcx], xmm1
add rcx, 16
cmp eax, -1
jne .B1.3
.B1.5::
ret
So, is that better than MSVC/GCC or worse?
Igor Levicki (link) - 10 07 11 - 08:58