§ ¶Weird optimizer behavior
Take this simple function:
int foo(int p0, int p1) {
return p1 + ((p0 - p1) & ((p0 - p1) << 31));
}
This should generate four operations: a subtract, a shift, a bitwise AND, and an add. Well, with VS2005 it generates something a bit different:
00000000: 8B 54 24 08 mov edx,dword ptr [esp+8]
00000004: 8B 4C 24 04 mov ecx,dword ptr [esp+4]
00000008: 2B CA sub ecx,edx
0000000A: 8B C1 mov eax,ecx
0000000C: 69 C0 00 00 00 80 imul eax,eax,80000000h ???
00000012: 23 C1 and eax,ecx
00000014: 03 C2 add eax,edx
00000016: C3 ret
Somehow the optimizer changed the shift to a multiply, which is a serious pessimization and thus results in a rare case where the code is actually faster with the optimizer turned off!
Oddly enough, manually hoisting out the common subexpression (p0 - p1) fixes the problem. I've seen this behavior before in VC++ with 64-bit expressions of the form (a*b+c). My guess is that the compiler normally converts left shifts to multiplications and then converts them back later, but somehow the CSE optimization breaks this. Yet another reason that being lazy and repeating common subexpressions all over the place while relying on the optimizer to clean up your mess isn't the greatest idea.
The reason for the repeated subexpression, by the way, is because this is an expanded version of a min() macro. I called the function foo above instead of min because it's actually broken -- the left shift should be a right shift. As long as you can put up with the portability and range quirks, this strange formulation has the advantages of (a) being branchless, and (b) sharing a lot of code with a max() on the same arguments.
Comments
Comments posted:
Nice catch.
Btw, I've used movd + pmin/pmax* + movd before, but will give a try to this one in the future. Frees up one xmm register.
Gabest - 28 12 08 - 07:03
It seems to me that the shift is not portable because
Tobias Rieper - 28 12 08 - 09:58
looks like the html was messed up by my previous post because i used angle brackets to denote the shift op ;-)
It seems to me that the shift is not portable because the exact behaviour of shift regarding sign extension is undefined by the standard. The trick doesnt work in C# either because left shift does not sign extend (this however is guaranteed by the standard).
however, it is astonishing in how many ways you can use simple instructions. maybe you could even write a program to enumerate all possible instructon sequences with length less than 5 to find the fastest ones for a given operation. i think such a tool would great in optimizing hot path and mmx/sse sequences.
Tobias Rieper - 28 12 08 - 11:57
Gabest:
If you're going to use this trick at the asm level, there is a further optimization you can make that isn't ordinarily possible in C, which is to use CDQ instead of MOV + SAR. CDQ is shorter and lower latency on some CPUs. It should end up being something like mov/sub/cdq/and/add. Add an extra sub for both min/max at the same time.
By the way, why were you doing min/max through XMM registers? If you have SSE2, can't you do CMP+CMOV? MOVD between GPRs and XMM registers is slow on a lot of CPUs.
Tobias Rieper:
Stupid blog software. Oddly enough, I did get your whole post in the email notification. Guess I'll have to go hack the output routine.
You're correct that the behavior of a right shift on a signed is not defined by C, but it is not undefined -- it is implementation defined. This means that you are OK doing this trick as long as you stipulate that your software needs to run on a 2's complement machine, which is the case for just about every current platform out there. (You're already in platform specific territory here anyway because of the need to know the bit size of type int.) I'm not sure what you're getting at with regard to C#, because sign extension can't occur for left shifts, and for signed types C# defines right shifts to sign extend, just like is common in C. In fact, just about all of these tricks will work fine in C#, although I'd recommend wrapping them in unchecked() as necessary in case someone wants to throw the overflow switch.
Phaeron - 28 12 08 - 16:40
well i copied your code literally into vs c# and the results were completely wrong. i used the following code:
[PexMethod(MaxRuns = 1000000000, MaxRunsWithoutNewTests = 1000000000, Timeout = 1000000000, MaxExecutionTreeNodes = 1000000000, MaxWorkingSet = 1000000000)]
public void TestMethod1(int a, int b)
{
int min1 = Math.Min(a, b);
int diff = a - b;
int min2 = b + (diff & (diff
Tobias Rieper - 28 12 08 - 17:10
aw man i forgot the left shift chars. the post was meant to be:
well i copied your code literally into vs c# and the results were completely wrong. i used the following code:
[PexMethod(MaxRuns = 1000000000, MaxRunsWithoutNewTests = 1000000000, Timeout = 1000000000, MaxExecutionTreeNodes = 1000000000, MaxWorkingSet = 1000000000)]
public void TestMethod1(int a, int b)
{
int min1 = Math.Min(a, b);
int diff = a - b;
int min2 = b + (diff & (diff LSHIFT 31));
PexAssert.AreEqual(min1, min2);
}
for a = 0, b = 1, i get min2 = -2147...
what am i missing?
btw, i was using microsoft pex to test the algorithm for all possible inputs. thats where the strange attribute comes from.
Tobias Rieper - 28 12 08 - 17:14
You missed the last paragraph of my post. The left shift has to be changed to an arithmetic right shift for the code to actually work. :)
Phaeron - 28 12 08 - 17:31
@Tobias: Actually, you're not the first person to consider such a program... the "superoptimizer." Mind you, that's looking into the shortest byte sequence, not fastest execution time, but there's a good correlation between the two. On the upside, a superoptimizer would not have been satisfied with the example in the original post, because the shift would be expressed in less than 6 bytes, and would thus be preferable. :)
Lorraine - 28 12 08 - 19:48
It's been done:
http://directory.fsf.org/project/superop..
One of the most obvious problems with such an algorithm is, of course, running time. It can take a long time to get even a moderate size routine optimized in this way. The bigger and more subtle problem, though, is that real gains from optimizing at this level usually require restricting the problem domain in order to support non-general optimizations. For instance, if I simply told you that I needed a fast minimum or maximum function, the algorithm described here would be an invalid optimization because it doesn't work over the full 32-bit range -- it fails whenever the signed difference of the two overflows. This is fairly common for integer tricks, and at the very least I'd have to tell you beforehand that the integers had a restricted range. A more realistic scenario, which is even harder for a tool to accommodate, is that I'd have to be told that such an optimization was available so I could see if the surrounding code could be restructured to accommodate it.
Phaeron - 28 12 08 - 20:53
If the flaw exhibits only on wrong (not practical) algorithm, and the code is at least not wrong (i.e. not affecting the result), then can't it be considered as not a bug? After all we're talking about optimizer, which should output best code on most likely usage, not corner case. I'm not sure whether this matter would be triggered in real code or not, though.
sbb edx, edx - 31 12 08 - 02:06
It's certainly not a bug I'm losing sleep over, but it's sufficiently weird that I'd call it a bug anyway. As it turns out, this only triggers for a shift of 31, so it is rare. I can't think of any circumstance in which an IMUL would be advantageous here, given that the src/dst registers are the same and that SHL always has at least as much throughput and same or lower latency.
As for the pattern in question, yes, it's incorrect in this case, but it's simple enough that I could see it occuring in production code, especially if macros or generated code are involved. The essential elements appear to be an add/subtract in the common subexpression and a logical AND or OR with a 31-bit left shift. A check if a difference or sum is odd or negative is not that unusual of a check to imagine.
I should note that I posted this more out of curiosity rather than a gripe. There are definitely much worse optimizer bugs to worry about, such as the bug where _m128 return values aren't always aligned properly....
Phaeron - 31 12 08 - 03:02
> Yet another reason that being lazy and repeating common subexpressions all over the place while relying on the optimizer to clean up your mess isn't the greatest idea.
Trusting the compiler to do its job isn't laziness, even if sometimes compilers don't live up to it...
Also odd: if you replace the left shift with an actual multiply, you get the SHL.
return p1 + ((p0 - p1) & ((p0 - p1) * 2147483648));
Glenn Maynard - 04 01 09 - 18:27
> > Yet another reason that being lazy and repeating common subexpressions all over the place while relying on the optimizer to clean up your mess isn't the greatest idea.
> Trusting the compiler to do its job isn't laziness, even if sometimes compilers don't live up to it...
I might be more willing to accept that if so many optimization algorithms weren't NP time, forcing compilers to do a sub-optimal job for practical reasons. Few compiler vendors provide adequate guarantees as to when optimizations will be applied or provide annotations you can use to control or enforce them.
Phaeron - 04 01 09 - 18:44
"By the way, why were you doing min/max through XMM registers? If you have SSE2, can't you do CMP+CMOV? MOVD between GPRs and XMM registers is slow on a lot of CPUs"
No assembly, I'm only using intrinsics, it's easier to compile for x86 and x64 at the same time.
Gabest - 07 01 09 - 19:26
The compiler's not as dumb as you think - multiply is faster than shift on most Pentium 4s since they don't have a barrel shifter. (I think the Prescott/Presler cores fixed that, but ugh).
Ian S. - 28 01 09 - 19:43
I have never heard of multiply being recommended above shift for P4, and the Pentopt tables say that multiply has considerably worse latency than shifts. It is true that a couple of ADDs is faster than a literal scalar integer left shift, however.
Phaeron - 29 01 09 - 01:09