Current version

v1.10.4 (stable)

Navigation

Main page
Archived news
Downloads
Documentation
   Capture
   Compiling
   Processing
   Crashes
Features
Filters
Plugin SDK
Knowledge base
Donate
Contact info
Forum
 
Other projects
   Altirra

Search

Archives

01 Dec - 31 Dec 2013
01 Oct - 31 Oct 2013
01 Aug - 31 Aug 2013
01 May - 31 May 2013
01 Mar - 31 Mar 2013
01 Feb - 29 Feb 2013
01 Dec - 31 Dec 2012
01 Nov - 30 Nov 2012
01 Oct - 31 Oct 2012
01 Sep - 30 Sep 2012
01 Aug - 31 Aug 2012
01 June - 30 June 2012
01 May - 31 May 2012
01 Apr - 30 Apr 2012
01 Dec - 31 Dec 2011
01 Nov - 30 Nov 2011
01 Oct - 31 Oct 2011
01 Sep - 30 Sep 2011
01 Aug - 31 Aug 2011
01 Jul - 31 Jul 2011
01 June - 30 June 2011
01 May - 31 May 2011
01 Apr - 30 Apr 2011
01 Mar - 31 Mar 2011
01 Feb - 29 Feb 2011
01 Jan - 31 Jan 2011
01 Dec - 31 Dec 2010
01 Nov - 30 Nov 2010
01 Oct - 31 Oct 2010
01 Sep - 30 Sep 2010
01 Aug - 31 Aug 2010
01 Jul - 31 Jul 2010
01 June - 30 June 2010
01 May - 31 May 2010
01 Apr - 30 Apr 2010
01 Mar - 31 Mar 2010
01 Feb - 29 Feb 2010
01 Jan - 31 Jan 2010
01 Dec - 31 Dec 2009
01 Nov - 30 Nov 2009
01 Oct - 31 Oct 2009
01 Sep - 30 Sep 2009
01 Aug - 31 Aug 2009
01 Jul - 31 Jul 2009
01 June - 30 June 2009
01 May - 31 May 2009
01 Apr - 30 Apr 2009
01 Mar - 31 Mar 2009
01 Feb - 29 Feb 2009
01 Jan - 31 Jan 2009
01 Dec - 31 Dec 2008
01 Nov - 30 Nov 2008
01 Oct - 31 Oct 2008
01 Sep - 30 Sep 2008
01 Aug - 31 Aug 2008
01 Jul - 31 Jul 2008
01 June - 30 June 2008
01 May - 31 May 2008
01 Apr - 30 Apr 2008
01 Mar - 31 Mar 2008
01 Feb - 29 Feb 2008
01 Jan - 31 Jan 2008
01 Dec - 31 Dec 2007
01 Nov - 30 Nov 2007
01 Oct - 31 Oct 2007
01 Sep - 30 Sep 2007
01 Aug - 31 Aug 2007
01 Jul - 31 Jul 2007
01 June - 30 June 2007
01 May - 31 May 2007
01 Apr - 30 Apr 2007
01 Mar - 31 Mar 2007
01 Feb - 29 Feb 2007
01 Jan - 31 Jan 2007
01 Dec - 31 Dec 2006
01 Nov - 30 Nov 2006
01 Oct - 31 Oct 2006
01 Sep - 30 Sep 2006
01 Aug - 31 Aug 2006
01 Jul - 31 Jul 2006
01 June - 30 June 2006
01 May - 31 May 2006
01 Apr - 30 Apr 2006
01 Mar - 31 Mar 2006
01 Feb - 29 Feb 2006
01 Jan - 31 Jan 2006
01 Dec - 31 Dec 2005
01 Nov - 30 Nov 2005
01 Oct - 31 Oct 2005
01 Sep - 30 Sep 2005
01 Aug - 31 Aug 2005
01 Jul - 31 Jul 2005
01 June - 30 June 2005
01 May - 31 May 2005
01 Apr - 30 Apr 2005
01 Mar - 31 Mar 2005
01 Feb - 29 Feb 2005
01 Jan - 31 Jan 2005
01 Dec - 31 Dec 2004
01 Nov - 30 Nov 2004
01 Oct - 31 Oct 2004
01 Sep - 30 Sep 2004
01 Aug - 31 Aug 2004

Stuff

Powered by Pivot  
XML: RSS feed 
XML: Atom feed 

§ Weird optimizer behavior

Take this simple function:

int foo(int p0, int p1) {
return p1 + ((p0 - p1) & ((p0 - p1) << 31));
}

This should generate four operations: a subtract, a shift, a bitwise AND, and an add. Well, with VS2005 it generates something a bit different:

  00000000: 8B 54 24 08        mov         edx,dword ptr [esp+8]
  00000004: 8B 4C 24 04        mov         ecx,dword ptr [esp+4]
  00000008: 2B CA              sub         ecx,edx
  0000000A: 8B C1              mov         eax,ecx
  0000000C: 69 C0 00 00 00 80  imul        eax,eax,80000000h ???
  00000012: 23 C1              and         eax,ecx
  00000014: 03 C2              add         eax,edx
  00000016: C3                 ret

Somehow the optimizer changed the shift to a multiply, which is a serious pessimization and thus results in a rare case where the code is actually faster with the optimizer turned off!

Oddly enough, manually hoisting out the common subexpression (p0 - p1) fixes the problem. I've seen this behavior before in VC++ with 64-bit expressions of the form (a*b+c). My guess is that the compiler normally converts left shifts to multiplications and then converts them back later, but somehow the CSE optimization breaks this. Yet another reason that being lazy and repeating common subexpressions all over the place while relying on the optimizer to clean up your mess isn't the greatest idea.

The reason for the repeated subexpression, by the way, is because this is an expanded version of a min() macro. I called the function foo above instead of min because it's actually broken -- the left shift should be a right shift. As long as you can put up with the portability and range quirks, this strange formulation has the advantages of (a) being branchless, and (b) sharing a lot of code with a max() on the same arguments.

Comments

Comments posted:


Nice catch.

Btw, I've used movd + pmin/pmax* + movd before, but will give a try to this one in the future. Frees up one xmm register.

Gabest - 28 12 08 - 07:03


It seems to me that the shift is not portable because

Tobias Rieper - 28 12 08 - 09:58


looks like the html was messed up by my previous post because i used angle brackets to denote the shift op ;-)

It seems to me that the shift is not portable because the exact behaviour of shift regarding sign extension is undefined by the standard. The trick doesnt work in C# either because left shift does not sign extend (this however is guaranteed by the standard).

however, it is astonishing in how many ways you can use simple instructions. maybe you could even write a program to enumerate all possible instructon sequences with length less than 5 to find the fastest ones for a given operation. i think such a tool would great in optimizing hot path and mmx/sse sequences.

Tobias Rieper - 28 12 08 - 11:57


Gabest:
If you're going to use this trick at the asm level, there is a further optimization you can make that isn't ordinarily possible in C, which is to use CDQ instead of MOV + SAR. CDQ is shorter and lower latency on some CPUs. It should end up being something like mov/sub/cdq/and/add. Add an extra sub for both min/max at the same time.

By the way, why were you doing min/max through XMM registers? If you have SSE2, can't you do CMP+CMOV? MOVD between GPRs and XMM registers is slow on a lot of CPUs.

Tobias Rieper:
Stupid blog software. Oddly enough, I did get your whole post in the email notification. Guess I'll have to go hack the output routine.

You're correct that the behavior of a right shift on a signed is not defined by C, but it is not undefined -- it is implementation defined. This means that you are OK doing this trick as long as you stipulate that your software needs to run on a 2's complement machine, which is the case for just about every current platform out there. (You're already in platform specific territory here anyway because of the need to know the bit size of type int.) I'm not sure what you're getting at with regard to C#, because sign extension can't occur for left shifts, and for signed types C# defines right shifts to sign extend, just like is common in C. In fact, just about all of these tricks will work fine in C#, although I'd recommend wrapping them in unchecked() as necessary in case someone wants to throw the overflow switch.

Phaeron - 28 12 08 - 16:40


well i copied your code literally into vs c# and the results were completely wrong. i used the following code:

[PexMethod(MaxRuns = 1000000000, MaxRunsWithoutNewTests = 1000000000, Timeout = 1000000000, MaxExecutionTreeNodes = 1000000000, MaxWorkingSet = 1000000000)]
public void TestMethod1(int a, int b)
{
int min1 = Math.Min(a, b);

int diff = a - b;
int min2 = b + (diff & (diff

Tobias Rieper - 28 12 08 - 17:10


aw man i forgot the left shift chars. the post was meant to be:

well i copied your code literally into vs c# and the results were completely wrong. i used the following code:
[PexMethod(MaxRuns = 1000000000, MaxRunsWithoutNewTests = 1000000000, Timeout = 1000000000, MaxExecutionTreeNodes = 1000000000, MaxWorkingSet = 1000000000)]
public void TestMethod1(int a, int b)
{
int min1 = Math.Min(a, b);

int diff = a - b;
int min2 = b + (diff & (diff LSHIFT 31));

PexAssert.AreEqual(min1, min2);
}

for a = 0, b = 1, i get min2 = -2147...
what am i missing?
btw, i was using microsoft pex to test the algorithm for all possible inputs. thats where the strange attribute comes from.

Tobias Rieper - 28 12 08 - 17:14


You missed the last paragraph of my post. The left shift has to be changed to an arithmetic right shift for the code to actually work. :)

Phaeron - 28 12 08 - 17:31


@Tobias: Actually, you're not the first person to consider such a program... the "superoptimizer." Mind you, that's looking into the shortest byte sequence, not fastest execution time, but there's a good correlation between the two. On the upside, a superoptimizer would not have been satisfied with the example in the original post, because the shift would be expressed in less than 6 bytes, and would thus be preferable. :)

Lorraine - 28 12 08 - 19:48


It's been done:
http://directory.fsf.org/project/superop..

One of the most obvious problems with such an algorithm is, of course, running time. It can take a long time to get even a moderate size routine optimized in this way. The bigger and more subtle problem, though, is that real gains from optimizing at this level usually require restricting the problem domain in order to support non-general optimizations. For instance, if I simply told you that I needed a fast minimum or maximum function, the algorithm described here would be an invalid optimization because it doesn't work over the full 32-bit range -- it fails whenever the signed difference of the two overflows. This is fairly common for integer tricks, and at the very least I'd have to tell you beforehand that the integers had a restricted range. A more realistic scenario, which is even harder for a tool to accommodate, is that I'd have to be told that such an optimization was available so I could see if the surrounding code could be restructured to accommodate it.

Phaeron - 28 12 08 - 20:53


If the flaw exhibits only on wrong (not practical) algorithm, and the code is at least not wrong (i.e. not affecting the result), then can't it be considered as not a bug? After all we're talking about optimizer, which should output best code on most likely usage, not corner case. I'm not sure whether this matter would be triggered in real code or not, though.

sbb edx, edx - 31 12 08 - 02:06


It's certainly not a bug I'm losing sleep over, but it's sufficiently weird that I'd call it a bug anyway. As it turns out, this only triggers for a shift of 31, so it is rare. I can't think of any circumstance in which an IMUL would be advantageous here, given that the src/dst registers are the same and that SHL always has at least as much throughput and same or lower latency.

As for the pattern in question, yes, it's incorrect in this case, but it's simple enough that I could see it occuring in production code, especially if macros or generated code are involved. The essential elements appear to be an add/subtract in the common subexpression and a logical AND or OR with a 31-bit left shift. A check if a difference or sum is odd or negative is not that unusual of a check to imagine.

I should note that I posted this more out of curiosity rather than a gripe. There are definitely much worse optimizer bugs to worry about, such as the bug where _m128 return values aren't always aligned properly....

Phaeron - 31 12 08 - 03:02


> Yet another reason that being lazy and repeating common subexpressions all over the place while relying on the optimizer to clean up your mess isn't the greatest idea.

Trusting the compiler to do its job isn't laziness, even if sometimes compilers don't live up to it...

Also odd: if you replace the left shift with an actual multiply, you get the SHL.
return p1 + ((p0 - p1) & ((p0 - p1) * 2147483648));

Glenn Maynard - 04 01 09 - 18:27


> > Yet another reason that being lazy and repeating common subexpressions all over the place while relying on the optimizer to clean up your mess isn't the greatest idea.
> Trusting the compiler to do its job isn't laziness, even if sometimes compilers don't live up to it...

I might be more willing to accept that if so many optimization algorithms weren't NP time, forcing compilers to do a sub-optimal job for practical reasons. Few compiler vendors provide adequate guarantees as to when optimizations will be applied or provide annotations you can use to control or enforce them.

Phaeron - 04 01 09 - 18:44


"By the way, why were you doing min/max through XMM registers? If you have SSE2, can't you do CMP+CMOV? MOVD between GPRs and XMM registers is slow on a lot of CPUs"

No assembly, I'm only using intrinsics, it's easier to compile for x86 and x64 at the same time.

Gabest - 07 01 09 - 19:26


The compiler's not as dumb as you think - multiply is faster than shift on most Pentium 4s since they don't have a barrel shifter. (I think the Prescott/Presler cores fixed that, but ugh).

Ian S. - 28 01 09 - 19:43


I have never heard of multiply being recommended above shift for P4, and the Pentopt tables say that multiply has considerably worse latency than shifts. It is true that a couple of ADDs is faster than a literal scalar integer left shift, however.

Phaeron - 29 01 09 - 01:09

Comment form


Please keep comments on-topic for this entry. If you have unrelated comments about VirtualDub, the forum is a better place to post them.
Name:  
Remember personal info?

Email (Optional):
Your email address is only revealed to the blog owner and is not shown to the public.
URL (Optional):
Comment: /

An authentication dialog may appear when you click Post Comment. Simply type in "post" as the user and "now" as the password. I have had to do this to stop automated comment spam.



Small print: All html tags except <b> and <i> will be removed from your comment. You can make links by just typing the url or mail-address.