§ Compiler intrinsics, revisited

I received an email recently from a member of the Microsoft Visual C++ compiler team who is working on the AMD64 compiler, regarding my comments about intrinsics support in the VC++ compiler. Given my past feedback on this blog and in the MSDN Product Feedback Center on the quality of the intrinsics in VC++, one of two possibilities presented itself: either he was writing to politely set me straight, or he was issuing a formal challenge to the death.

Fortunately, the team member turned out to be a nice guy and informed me that intrinsics support had indeed been improved in Whidbey.

To review, compiler intrinsics are pseudo-functions that expose CPU functionality that doesn't fit well into C/C++ constructs. Simple operations like add and subtract map nicely to + and -, but four-way packed-multiply-signed-and-add-pairs doesn't. Instead, the compiler exposes a __m64 type and a _m_pmaddwd() pseudo-function that you can use. In theory, you get the power and speed of specialized CPU primitives, with some of the portability benefits of using C/C++ over straight assembly language. The problem in the past was that Visual Studio .NET 2003 and earlier generated poor code for these primitives, code that was either incorrect or slower than what could be written in straight assembly language with moderate effort.
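To make that concrete, here is what pmaddwd actually computes on one 64-bit lane, written out in scalar C. The function name is my own, for illustration only:

```c
#include <stdint.h>

/* pmaddwd on one 64-bit lane: multiply four pairs of signed 16-bit
   values, then add adjacent 32-bit products, producing two 32-bit sums. */
void pmaddwd_scalar(const int16_t a[4], const int16_t b[4], int32_t out[2]) {
    out[0] = (int32_t)a[0] * b[0] + (int32_t)a[1] * b[1];
    out[1] = (int32_t)a[2] * b[2] + (int32_t)a[3] * b[3];
}
```

There is no natural C operator for that pattern of multiplies and pairwise adds, which is exactly why the compiler resorts to a pseudo-function.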

The good news

Here's the routine using SSE2 intrinsics that I used to punish the compiler last time I wrote about this problem:

#include <emmintrin.h>

unsigned premultiply_alpha(unsigned px) {
    __m128i px8 = _mm_cvtsi32_si128(px);                        // load the pixel into the low 32 bits
    __m128i px16 = _mm_unpacklo_epi8(px8, _mm_setzero_si128()); // widen each channel to 16 bits
    __m128i alpha = _mm_shufflelo_epi16(px16, 0xff);            // broadcast the alpha word

    __m128i result16 = _mm_srli_epi16(_mm_mullo_epi16(px16, alpha), 8); // channel * alpha >> 8

    return _mm_cvtsi128_si32(_mm_packus_epi16(result16, result16));     // pack back down to bytes
}
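For reference, here is a scalar version of what the routine computes: every 8-bit channel, alpha included, is multiplied by the alpha byte and shifted right by 8, matching the pmullw/psrlw/packuswb sequence. The function name and the assumption that alpha lives in bits 24-31 are mine:

```c
#include <stdint.h>

/* Scalar equivalent of premultiply_alpha: each 8-bit channel, alpha
   included, is multiplied by the alpha byte and shifted right by 8.
   The packuswb saturation in the SSE2 version never triggers, because
   (c * alpha) >> 8 is at most 0xFE. */
uint32_t premultiply_alpha_scalar(uint32_t px) {
    uint32_t alpha = px >> 24;
    uint32_t result = 0;
    for (int i = 0; i < 32; i += 8)
        result |= ((((px >> i) & 0xff) * alpha) >> 8) << i;
    return result;
}
```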

Here's what Visual Studio .NET 2003 generates for this function:

pxor    xmm0, xmm0
movdqa  xmm1, xmm0
movd    xmm0, ecx
punpcklbw xmm0, xmm1
pshuflw xmm1, xmm0, 255
pmullw  xmm0, xmm1
psrlw   xmm0, 8
movdqa  xmm1, xmm0
packuswb xmm1, xmm0
movd    eax, xmm1
ret

Note the unnecessary movdqa instructions; these are expensive on Pentium 4, where each one adds 6 clocks to your dependency chain.

Here's what Visual Studio .NET 2005 generates for this function:

pxor    xmm1, xmm1
movd    xmm0, ecx
punpcklbw xmm0, xmm1
pshuflw xmm1, xmm0, 255
pmullw  xmm0, xmm1
psrlw   xmm0, 8
packuswb xmm0, xmm0
movd    eax, xmm0
ret

That's actually not too bad. Fairly good, even.

The bad news

The last time I did this test, I posted the following result for Visual Studio .NET 2003:

push    ebp
mov     ebp, esp
pxor    xmm0, xmm0
movdqa  xmm1, xmm0
movd    xmm0, DWORD PTR _px$[ebp]
punpcklbw xmm0, xmm1
pshuflw xmm1, xmm0, 255
pmullw  xmm0, xmm1
psrlw   xmm0, 8
movdqa  xmm1, xmm0
packuswb xmm1, xmm0
and     esp, -16
movd    eax, xmm1
mov     esp, ebp
pop     ebp
ret     0

The reason for the discrepancy is that I cheated in the tests above by using the /Gr compiler switch to force the __fastcall calling convention. Part of the problem with the VC++ intrinsics is that they have a habit of forcing an aligned stack frame whenever any stack parameters are accessed in a function that uses intrinsics, even if nothing in the function actually requires alignment. This is unfortunate, as it slows down the prolog/epilog and eats an additional register. Sadly, this is not fixed in Whidbey, although it is a moot point on AMD64, where the stack is always 16-byte aligned. Using the fastcall convention can fix this on x86 if all parameters can be passed in registers, but that isn't possible if you have more than 8 bytes of parameters.

The other bad news is that the MMX intrinsics still produce awful code, although this is only pertinent to x86, since the AMD64 compiler doesn't support MMX instructions. At least the bugs where MMX code could be moved past floating-point or EMMS instructions have been fixed:

pxor      mm1, mm1
movd      mm0, ecx
punpcklbw mm0, mm1
movq      mm1, mm0
movq      mm2, mm0
punpckhwd mm1, mm2
movq      mm2, mm1
punpckhwd mm2, mm1
pmullw    mm0, mm2
psrlw     mm0, 8
movq      mm1, mm0
packuswb  mm1, mm0
movd      eax, mm1
emms
ret

Conclusions

The aligned stack frame is a bummer for codelet libraries, but it isn't as big a deal if you can isolate intrinsics code into big, long-duration functions that aren't under critical register pressure. The improvements to SSE2 code generation make the intrinsics more attractive in Whidbey, but since AMD64 is not yet widespread and SSE2 is only supported on the Pentium 4, Pentium M, and Athlon 64, they remain unusable for mainstream code on x86. They're also rather difficult to read compared to assembly code. I still don't think I'd end up using them even after Whidbey ships, because doing so would make my x86 and AMD64 codebases diverge farther without much gain.

Another problem is that although all SSE2 instructions are available through intrinsics, and many non-vector intrinsics have been added in Whidbey, there are still a large number of tricks that can only be done directly in assembly language, many of which involve extended-precision arithmetic and the carry flag. The one that I use all the time is the split 32:32 fixed-point accumulator, where two 32-bit registers hold the integer and fractional parts of a value. This is very frequently required in scaling and interpolation routines. The advantage is that you can get to the integer portion very quickly. In x86:

add ebx, esi
adc ecx, edi
mov eax, dword ptr [ecx*4]

In AMD64 you can sometimes get away with half the registers if you only need a 32-bit result, by swapping the low and high halves and wrapping the carry around:

add rbx, rcx
adc rbx, 0
mov [rdx], ebx

Compiler intrinsics don't let you do this.
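The closest portable approximation uses a single 64-bit accumulator; whether the compiler finds the add/adc idiom is then up to its optimizer, and the register allocation is out of your hands. A sketch, with function names of my own invention:

```c
#include <stdint.h>

/* 32:32 fixed point in one 64-bit value, integer part in the high half.
   On 32-bit x86 the 64-bit add compiles to an add/adc pair much like
   the assembly above, but the compiler picks the registers, not you. */
uint32_t step_and_fetch(const uint32_t *table, uint64_t *acc, uint64_t step) {
    *acc += step;
    return table[*acc >> 32];   /* integer part indexes the table */
}

/* The AMD64 single-register variant: halves swapped, fraction in the
   high 32 bits, so the carry out of the add is folded back into the
   integer part in the low 32 bits (the adc rbx, 0 above). */
uint32_t step_swapped(uint64_t *acc, uint64_t step) {
    uint64_t old = *acc;
    *acc += step;
    *acc += (*acc < old);       /* wrap the carry around */
    return (uint32_t)*acc;      /* 32-bit integer result */
}
```

Nothing guarantees the compiler emits the three-instruction sequences shown earlier; that is the point of the complaint.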

Another problem I run into often in routines that are MMX or SSE2 heavy is a critical shortage of general purpose registers, usually for scanline pointers, fixed-point accumulators, and counters. The way I get around this on x86 is to make use of the Structured Exception Handling (SEH) chain to temporarily hold the stack pointer, freeing it for use as an eighth general purpose register:

push 0
push fs:dword ptr [0]
mov fs:dword ptr [0], esp
...
mov esp, fs:dword ptr [0]
pop fs:dword ptr [0]
pop eax

...and then be really careful not to cause an exception while within the block.

This allows a routine to use all eight registers and still be reentrant. It's probably unreasonable to expect a compiler to generate code like this, though.

Comments



Damn, I was hoping it'd be the formal challenge to the death option. Phaeron taking on the compiler team a la The Bride vs. the Crazy 88s.

Why'd you say using intrinsics would make the x86 and AMD64 codebases diverge farther? Wouldn't it do the opposite, avoiding the need for separate AMD64 assembly code?

Andrew Dunstan - 21 04 05 - 18:10


The official word from Microsoft is that the x87/MMX register file is banned and that only SSE/SSE2 should be used in x64 code. This means that code that is optimally written in MMX or integer SSE must be rewritten into SSE2. The instructions are so similar that it is possible to use shared assembly code for both with some simple macros. The intrinsics, however, differ more significantly. The MMX intrinsics have both asm-like names and generalized names, but the SSE2 ones only have the generalized names, and those are rather unreadable. It is possible to wrap them in much cleaner operator overloads but my experience with VS2003 was that doing so was a magnet for intrinsics optimization bugs.

The original rumor was that Windows x64 wouldn't even save and restore the x87 register file, which made no sense because it'd be a security hole and would be saved anyway by FXSAVE/FXRSTOR; experiments show that the OS does save x87 and I saw an MS blog a while back that implied that it was OK to use it, just not recommended. There are some cases where the additional parallelism in SSE2 cannot be used and the additional execution loads generated by SSE2 are a waste, such as texture mapping. P4 is issue bottlenecked so this isn't much of an issue, but Athlon and P-M break SSE2 ops into two 64-bit ops and have smaller schedulers.

Phaeron - 23 04 05 - 01:27


I see. Got another question for you (I could probably go on all day, but I'll try not to): What do you propose is the best way to save the non-volatile xmm registers when they need to be used? It used to be only the low 64 bits that needed to be saved, but that seems to have changed to include the whole register. Using the stack seems to require a lot of extra work.

Andrew Dunstan - 23 04 05 - 12:47


I used to stack XMM registers together using MOVLHPS/MOVHLPS, but I think you have to store them on the stack. The reason is that register usage is much more important on AMD64 and the unwind is table-based. If you don't, floating-point values may be trashed after an exception unwind.

Setting up a prolog and epilog manually in AMD64 so you can stack registers is a bit of a pain but can be wrapped in macros. The prolog may be slightly suboptimal due to a lack of scheduling with surrounding code, but the epilog must have one of two specific forms anyway as it is parsed directly.

Phaeron - 23 04 05 - 16:33


The prolog and epilog stuff I understand; it's the function table entry stuff (PDATA and XDATA) that's a bit confusing.

Andrew Dunstan - 23 04 05 - 20:48


Use the % operator to do a remainder.

Yuhong Bao - 21 10 07 - 15:56


"This allows a routine to use all eight registers and still be reentrant. It's probably unreasonable to expect a compiler to generate code like this, though."
And on AMD64 it is not necessary anyway, as it has 8 more registers.

Yuhong Bao - 04 07 08 - 02:44
