A Visual C++ x64 code generation peculiarity

Take an SSE2 routine to do some filtering:

#include <emmintrin.h>

void filter(__m128i *dst,
            const __m128i *table,
            const unsigned char *indices,
            unsigned n)
{
    __m128i acc0 = _mm_setzero_si128();
    __m128i acc1 = _mm_setzero_si128();

    while(n--) {
        const __m128i *kernel = &table[*indices++ * 2];

        acc0 = _mm_add_epi16(acc1, kernel[0]);
        acc1 = kernel[1];

        *dst++ = acc0;
    }
}

This routine uses each index to look up a premultiplied kernel and adds it to a short output window (8 samples). The output stream runs at 4x the rate of the input stream. In a real routine the kernels would typically be a bit longer, but an example of where you might use something like this is to simultaneously upsample a row of pixels or a block of audio and convert it through a non-linear curve.
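For readers who prefer to see the data flow without intrinsics, here is a rough scalar sketch of the same overlap-add. It is not part of the original routine; the names filter_scalar and carry are purely illustrative, and it assumes each __m128i holds eight 16-bit samples (as implied by the use of _mm_add_epi16):

#include <cstdint>

// Scalar sketch: each index selects a 16-sample premultiplied kernel; the
// first 8 samples are added to the carried-over tail of the previous kernel
// and written out, and the last 8 samples are carried into the next step.
void filter_scalar(int16_t *dst,                  // 8 output samples per input index
                   const int16_t *table,          // 16 samples per index value
                   const unsigned char *indices,
                   unsigned n)
{
    int16_t carry[8] = {0};                       // plays the role of acc1

    while(n--) {
        const int16_t *kernel = table + *indices++ * 16;

        for(int i=0; i<8; ++i) {
            dst[i] = (int16_t)(carry[i] + kernel[i]);   // wraps like paddw
            carry[i] = kernel[i + 8];
        }

        dst += 8;
    }
}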

If we look at the output of the VC++ x86 compiler, the result is decent:

0020: 0F B6 01           movzx       eax,byte ptr [ecx]
0023: C1 E0 05           shl         eax,5
0026: 66 0F 6F C1        movdqa      xmm0,xmm1
002A: 66 0F FD 04 38     paddw       xmm0,xmmword ptr [eax+edi]
002F: 66 0F 6F 4C 38 10  movdqa      xmm1,xmmword ptr [eax+edi+10h]
0035: 66 0F 7F 02        movdqa      xmmword ptr [edx],xmm0
0039: 41                 inc         ecx
003A: 83 C2 10           add         edx,10h
003D: 4E                 dec         esi
003E: 75 E0              jne         00000020

However, if we look at x64:

0010: 41 0F B6 00        movzx       eax,byte ptr [r8]
0014: 48 83 C1 10        add         rcx,10h
0018: 49 FF C0           inc         r8
001B: 03 C0              add         eax,eax
001D: 48 63 D0           movsxd      rdx,eax
0020: 48 03 D2           add         rdx,rdx
0023: 41 FF C9           dec         r9d
0026: 66 41 0F FD 0C D2  paddw       xmm1,xmmword ptr [r10+rdx*8]
002C: 66 0F 6F C1        movdqa      xmm0,xmm1
0030: 66 41 0F 6F 4C D2 10  movdqa      xmm1,xmmword ptr [r10+rdx*8+10h]
0037: 66 0F 7F 41 F0     movdqa      xmmword ptr [rcx-10h],xmm0
003C: 75 D2              jne         0010

It turns out that there are a couple of oddities in how the x64 compiler handles this code. The x86 compiler is able to fold the x2 from the indexing expression and the x16 from the 128-bit (__m128i) element size into a single x32, which it then converts into a left shift by 5 bits (shl). The x64 compiler is not, and instead scales in three steps: an x2 add, another x2 add, and an x8 in the addressing mode. Why?

The clue as to what's going on is in the MOVSXD instruction, which is a sign extension instruction. According to the C/C++ standards, integral expressions involving values smaller than int are promoted to int, which in the case of Win32/Win64 is 32-bit. Therefore, the expression (*indices++ * 2) gives a signed 32-bit integer. For the x86 compiler, pointers are also 32-bit and so it just shrugs and uses the signed value. The x64 compiler has to deal with a conversion to a 64-bit pointer offset, however, and seems unable to recognize that an unsigned char multiplied by 2 will never be negative, so it emits sign extension code.
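The promotion is easy to confirm at compile time. The following sketch is not from the original post (promotion_check is a made-up name), but it verifies the type rules and previews the effect of the 2U multiplier used in the fix below:

#include <type_traits>

// unsigned char is promoted to int before any arithmetic, so multiplying by a
// signed constant yields a signed intermediate; multiplying by 2U instead
// forces the result to unsigned int.
void promotion_check(const unsigned char *indices)
{
    static_assert(std::is_same<decltype(*indices * 2),  int>::value,
                  "unsigned char * 2 is a (signed) int");
    static_assert(std::is_same<decltype(*indices * 2U), unsigned int>::value,
                  "unsigned char * 2U is an unsigned int");
}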

Therefore, we should change the code to remove the intermediate signed type:

#include <emmintrin.h>

void filter(__m128i *dst,
            const __m128i *table,
            const unsigned char *indices,
            unsigned n)
{
    __m128i acc0 = _mm_setzero_si128();
    __m128i acc1 = _mm_setzero_si128();

    while(n--) {
        const __m128i *kernel = &table[*indices++ * 2U];

        acc0 = _mm_add_epi16(acc1, kernel[0]);
        acc1 = kernel[1];

        *dst++ = acc0;
    }
}

Now we are multiplying by an unsigned integer, so the result must be an unsigned int. The x64 compiler now generates the following:

0090: 4C 8B D2           mov         r10,rdx
0093: 66 0F EF C9        pxor        xmm1,xmm1
0097: 45 85 C9           test        r9d,r9d
009A: 74 30              je          00CC
009C: 0F 1F 40 00        nop         dword ptr [rax]
00A0: 41 0F B6 00        movzx       eax,byte ptr [r8]
00A4: 48 83 C1 10        add         rcx,10h
00A8: 49 FF C0           inc         r8
00AB: 8D 14 00           lea         edx,[rax+rax]
00AE: 48 03 D2           add         rdx,rdx
00B1: 41 FF C9           dec         r9d
00B4: 66 41 0F FD 0C D2  paddw       xmm1,xmmword ptr [r10+rdx*8]
00BA: 66 0F 6F C1        movdqa      xmm0,xmm1
00BE: 66 41 0F 6F 4C D2 10  movdqa      xmm1,xmmword ptr [r10+rdx*8+10h]
00C5: 66 0F 7F 41 F0     movdqa      xmmword ptr [rcx-10h],xmm0
00CA: 75 D4              jne         00A0

Better, but still not quite there. The x64 compiler no longer needs to sign extend the offset, and therefore can now take advantage of the implicit zero extension in x64 when working with 32-bit registers. (New x64 programmers are often confused by the compiler emitting MOV EAX, EAX instructions, which are not no-ops as they zero the high dword.) However, the compiler is still unable to fuse the additions together. A bit of experimentation with the kernel size multiplier reveals that the x64 compiler has an unusual attachment to the trick of doing an x2 add followed by an x8 scale in order to index 16-byte elements. In this particular case there's a possibility that the two adds might be faster than a shift on some CPUs, but with larger multipliers the compiler generates a SHL followed by an ADD, which is never optimal. Therefore, let's take over the indexing entirely:

#include <emmintrin.h>

void filter(__m128i *dst,
            const __m128i *table,
            const unsigned char *indices,
            unsigned n)
{
    __m128i acc0 = _mm_setzero_si128();
    __m128i acc1 = _mm_setzero_si128();

    while(n--) {
        const __m128i *kernel = (const __m128i *)((const char *)table + (*indices++ * 32U));

        acc0 = _mm_add_epi16(acc1, kernel[0]);
        acc1 = kernel[1];

        *dst++ = acc0;
    }
}

Ugly? Definitely, but we're having to work around optimizer shortcomings here. Result:

0060: 41 0F B6 00        movzx       eax,byte ptr [r8]
0064: 48 83 C1 10        add         rcx,10h
0068: 49 FF C0           inc         r8
006B: C1 E0 05           shl         eax,5
006E: 41 FF C9           dec         r9d
0071: 66 0F FD 0C 10     paddw       xmm1,xmmword ptr [rax+rdx]
0076: 66 0F 6F C1        movdqa      xmm0,xmm1
007A: 66 0F 6F 4C 10 10  movdqa      xmm1,xmmword ptr [rax+rdx+10h]
0080: 66 0F 7F 41 F0     movdqa      xmmword ptr [rcx-10h],xmm0
0085: 75 D9              jne         0060

That's better.

Conclusion: check your critical loops after porting to x64, even if they use intrinsics and compiled without needing any fixes. Code generation can change due to both architectural and compiler differences.

Comments

This blog was originally open for comments when this entry was first posted, but commenting was later closed due to spam and the comments were removed after a migration away from the original blog software. Unfortunately, it would have been a lot of work to reformat the comments for republishing. The author thanks everyone who posted comments and added to the discussion.