§ Compiler intrinsics... again

You know that episode of The Simpsons where Bart reaches for the electrified cookie jar and goes "ow," and then just keeps doing it again and again? Yeah, I'm like that with compiler intrinsics.

Let's take a simple routine:

__m128i fold1(__m128i x) {
    __m128i mask = _mm_set1_epi16(0x5555);
    return _mm_add_epi16(_mm_and_si128(mask, _mm_srli_epi16(x, 1)), _mm_and_si128(mask, x));
}

This is one step of a population count routine, which folds pairs of bits together into two-bit counts. (Yeah, I know this can be done better with subtraction, but popcount isn't the subject here.) Run this through VC10, and you get this:

movdqa      xmm1,xmmword ptr [__xmm@0]
movdqa      xmm2,xmm0
movdqa      xmm0,xmm1
movdqa      xmm3,xmm2
psrlw       xmm3,1
pand        xmm0,xmm3
pand        xmm1,xmm2
paddw       xmm0,xmm1
ret

Unnecessary moves blah blah blah... you've heard it here before. Then again, let's take a closer look. Why did the compiler emit the MOVDQA XMM3, XMM2 instruction? Hmm, it's because it did the shift next, but it still needed to keep "x" around for the second operation. And how about that PAND that follows? Well, it couldn't modify "mask," so it copied that too. Waaaiit a minute, it's just doing everything exactly the way I told it. That might be OK if x86 used three-operand instructions, but since these x86 instructions are two-operand, with the destination doubling as a source, that kinda sucks. What if we rewrote the routine this way:

__m128i fold2(__m128i x) {
    __m128i mask = _mm_set1_epi16(0x5555);
    return _mm_add_epi16(_mm_and_si128(_mm_srli_epi16(x, 1), mask), _mm_and_si128(mask, x));
}

movdqa      xmm1,xmmword ptr [__xmm@0]
movdqa      xmm2,xmm0
psrlw       xmm0,1
pand        xmm0,xmm1
pand        xmm1,xmm2
paddw       xmm0,xmm1
ret

Well, that looks a bit better. It appears that Visual C++ is unable to take advantage of the fact that the binary operations used here are commutative, which means that the efficiency of the code generated can differ significantly based on the order of the arguments even though the result is the same. The upside is that you can swap around arguments to get better code; the downside is that you're doing what the code generator should be doing. Interestingly, based on some experiments it looks like the code generator can do this for scalar operations, so something didn't get hooked up or extended to the intrinsics portion.
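Since the whole point of the swap is that only the generated code changes, not the result, it can be worth confirming that before trusting a reordered expression. The following is a minimal sketch of such a check, assuming an SSE2-capable x86 target; fold_scalar and folds_agree are hypothetical helper names introduced for illustration, not part of the original routine.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// fold1 and fold2 as given in the post: same operations, different operand order.
static __m128i fold1(__m128i x) {
    __m128i mask = _mm_set1_epi16(0x5555);
    return _mm_add_epi16(_mm_and_si128(mask, _mm_srli_epi16(x, 1)),
                         _mm_and_si128(mask, x));
}

static __m128i fold2(__m128i x) {
    __m128i mask = _mm_set1_epi16(0x5555);
    return _mm_add_epi16(_mm_and_si128(_mm_srli_epi16(x, 1), mask),
                         _mm_and_si128(mask, x));
}

// Hypothetical helper: scalar model of the same fold on a single 16-bit lane.
static uint16_t fold_scalar(uint16_t v) {
    return (uint16_t)(((v >> 1) & 0x5555) + (v & 0x5555));
}

// Hypothetical helper: broadcasts v to all eight lanes and returns true if
// fold1 and fold2 both match the scalar model in every lane.
static bool folds_agree(uint16_t v) {
    __m128i x = _mm_set1_epi16((short)v);
    uint16_t a[8], b[8];
    _mm_storeu_si128((__m128i *)a, fold1(x));
    _mm_storeu_si128((__m128i *)b, fold2(x));
    const uint16_t expected = fold_scalar(v);
    for (int i = 0; i < 8; ++i) {
        if (a[i] != expected || b[i] != expected)
            return false;
    }
    return true;
}
```

Calling folds_agree over all 65536 lane values is cheap enough to run exhaustively. It says nothing about which version compiles to fewer moves; that still has to be read off the disassembly.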

Anyway, if you've got extra moves showing up in the disassembly when using intrinsics, try shaking the expression tree a bit and see if some of the moves fall out.
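As an aside, the subtraction form mentioned in passing above can be sketched as follows. For each bit pair with value 2h + l, subtracting the high bit h leaves h + l, so a single AND suffices. This is only an illustration under the assumption of an SSE2 target; fold_add, fold_sub and sub_matches_add are hypothetical names, and no claim is made here about what any particular compiler generates for it.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// The add form from the post (operand order as in fold2).
static __m128i fold_add(__m128i x) {
    __m128i mask = _mm_set1_epi16(0x5555);
    return _mm_add_epi16(_mm_and_si128(_mm_srli_epi16(x, 1), mask),
                         _mm_and_si128(mask, x));
}

// Subtraction form: x - ((x >> 1) & mask).
// Each 2-bit field holds 2h + l; subtracting h yields h + l, and no borrow
// crosses field boundaries because every field result is non-negative.
static __m128i fold_sub(__m128i x) {
    __m128i mask = _mm_set1_epi16(0x5555);
    return _mm_sub_epi16(x, _mm_and_si128(_mm_srli_epi16(x, 1), mask));
}

// Hypothetical helper: true if both forms agree in every lane for input v.
static bool sub_matches_add(uint16_t v) {
    __m128i x = _mm_set1_epi16((short)v);
    uint16_t a[8], s[8];
    _mm_storeu_si128((__m128i *)a, fold_add(x));
    _mm_storeu_si128((__m128i *)s, fold_sub(x));
    for (int i = 0; i < 8; ++i) {
        if (a[i] != s[i])
            return false;
    }
    return true;
}
```

Note that subtraction is not commutative, so this form gives the compiler, and you, less freedom to reorder operands than the add form does.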

Comments

Have you ever tried this with a different compiler? I have GCC and Intel's compiler in mind, would be nice to know if they exhibit the same problem (I hope not)...

Lrdx - 27 05 10 - 21:29


I know popcount wasn't the topic here, but GCC has __builtin_popcount() (and I'm sure all modern compilers have this too). Haven't looked at the code it produces for modern archs, but apparently it's slow http://wm.ite.pl/articles/sse-popcount.h.. which is a bit sad given that you'd want to use it to take advantage of POPCNT in SSE4{2,a}, unless your work is implying that that too is slower?

eloj - 27 05 10 - 23:13


Haven't tried GCC, but the Intel compiler was much better at this, even back at 6.0. I do recall seeing an intrinsics optimization chart a while back that showed GCC doing a lot more optimizations than VC++.

__builtin_popcount() looks like a scalar intrinsic. I don't believe there is a POPCNT equivalent in the ISA for vectors, so you're kind of stuck there.

Phaeron - 28 05 10 - 15:51


... tried to solve it with pandn: _mm_add_epi16(_mm_and_si128(x, mask), _mm_srli_epi16(_mm_andnot_si128(mask, x), 1)), the order of instructions still matters :)

movdqa xmm1, XMMWORD PTR __xmm@0
movdqa xmm2, xmm0
pand xmm0, xmm1
pandn xmm1, xmm2
psrlw xmm1, 1
paddw xmm0, xmm1

Gabest - 28 05 10 - 17:53


Here is what latest Intel Parallel Composer Beta output looks like:

movdqa xmm2, xmm0
movdqa xmm1, XMMWORD PTR [_2il0floatpacket.1]
psrlw xmm2, 1
pand xmm0, xmm1
pand xmm2, xmm1
paddw xmm2, xmm0
movdqa xmm0, xmm2
ret

Not bad, but it could have been better if it had chosen registers so that the result ends up in XMM0 instead of having to be copied there.

Igor Levicki (link) - 29 05 10 - 00:22


I'd be interested in seeing the results of your own experiments with compiler intrinsics on MinGW32 and MinGW64 (now available in the TDM-GCC distribution and based on GCC 4.5.0, the latest version of GCC at this time), in comparison with Visual Studio.

Conan Kudo (ニール・ゴンパ) (link) - 19 06 10 - 03:24
