Borland compilers and floating-point exceptions
One of the difficulties of releasing a program into the wild is that you sometimes get crash reports that are simply weird and don't make any sense at first. Take this crash, for instance:
Disassembly:
0043eba0: d8afb0000000    fsubr dword ptr [edi+b0]    <-- FAULT
Crash reason: FP Invalid Operation
I used to get crashes like this periodically, mostly in audio codecs, and for the longest time couldn't figure out what was happening. The users reporting the problem could not reproduce it on demand, and I had never seen it myself, which made the problem difficult to diagnose, much less fix or work around. So I had to file it under Could Not Reproduce and keep going.
The problems, as it turns out, were caused by completely unrelated video codecs that had been compiled with the Borland C/C++ compiler. These codecs weren't actively being used at the time, but merely having them installed was enough to trip the problem.
It took me a while to understand this, and it wasn't until I looked really closely at the crash reports that I realized what was going on. The tipoff turned out to be this line:
FPUCW = ffff1270
FPUCW stands for Floating Point Unit Control Word. Only the low 16 bits of this register are important, and under Windows it has the default value of 027F. At the time of the crash, however, it was 1270. Bit 12 being set doesn't matter, but bits 0-3 being cleared is really important, as those are the mask bits for the invalid operation, denormal, zero-divide, and overflow exceptions, respectively. Clearing those bits enables exceptions that would otherwise be masked (handled silently by the FPU, which substitutes a default result), and since Win32 code doesn't usually expect such exceptions, this results in a fatal crash.
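To make the bit arithmetic concrete, here is a minimal sketch that decodes the exception mask bits of a control word value. The bit positions follow the x87 control word layout; the names MaskBit, kMaskBits, and unmasked_exceptions are made up for this illustration and are not taken from any crash reporter.

```cpp
#include <cstdint>
#include <string>

// One exception mask bit of the x87 control word (bit set = exception masked).
struct MaskBit {
    std::uint16_t bit;
    const char* name;
};

// Mask bits 0-5 of the x87 control word, lowest bit first.
static const MaskBit kMaskBits[] = {
    {0x0001, "invalid"},
    {0x0002, "denormal"},
    {0x0004, "zero-divide"},
    {0x0008, "overflow"},
    {0x0010, "underflow"},
    {0x0020, "precision"},
};

// Returns a comma-separated list of the exceptions that are unmasked
// (enabled) in the given control word. Bits above the low 16 are ignored
// because only bits 0-5 are ever tested.
std::string unmasked_exceptions(std::uint32_t fpucw) {
    std::string out;
    for (const MaskBit& m : kMaskBits) {
        if ((fpucw & m.bit) == 0) {
            if (!out.empty()) out += ", ";
            out += m.name;
        }
    }
    return out;
}
```

Feeding it the default 027F yields an empty list, while the ffff1270 from the crash report yields the four exceptions controlled by bits 0-3.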
Tripping floating-point exceptions
Floating-point exceptions are, as you might expect by virtue of the name, rare. The ones that happen most commonly by mistake in my experience are the zero-divide and invalid operation exceptions. Zero divide tends to happen whenever you have an unchecked normalization operation, such as resetting a 2D or 3D vector to unit length — which works fine, until someone hands you a vector of length zero. Another example would be trying to normalize a portion of audio that was totally silent. When the zero-divide exception is masked, the FPU spits out a signed infinity instead, which sometimes works out in the end. For instance, if the expression is of the form |x/y| > n, then the infinity would give you the correct result.
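As a sketch of what a checked normalization might look like, the routine below refuses to divide when the length is too small. Vec3, normalize, and the min_length threshold are hypothetical names and values chosen for this example; pick whatever fallback makes sense for your data.

```cpp
#include <cmath>

struct Vec3 {
    float x, y, z;
};

// Normalizes v in place and returns true, or leaves v untouched and
// returns false when its length is too small to divide by safely.
// The default threshold is an arbitrary choice for this sketch.
bool normalize(Vec3& v, float min_length = 1e-20f) {
    const float len = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    if (!(len > min_length))  // also rejects NaN, since the comparison is false
        return false;
    const float inv = 1.0f / len;
    v.x *= inv;
    v.y *= inv;
    v.z *= inv;
    return true;
}
```

The caller then decides what a zero-length vector should mean, instead of letting an infinity or a crash decide for it.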
Invalid operation exceptions are more serious and result from operations that don't have a graceful way to degrade, such as 0/0, the square root of -1, etc. These too often result from the lack of bounds checks. For instance, a common way to determine the angle between two vectors is through dot product, since the shortest angle between two vectors is acos(dot(v1 / |v1|, v2 / |v2|)). Unfortunately, the common way of normalizing vectors is to multiply by the reciprocal square root of the squared length (dot(v,v)), which can give you a not-quite-unit-length vector since the squaring operation discards half of the usual precision. This can then lead to taking the arccosine of a number slightly larger than 1. When such an operation occurs and invalid operation exceptions are masked, the FPU spits out a Not a Number (NaN) value and keeps going. You can also trip such an exception by trying to operate on NaNs, especially by loading garbage data that isn't a valid IEEE finite number.
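One common guard for the arccosine case is to clamp the cosine into [-1, 1] before calling acos. The following is a minimal sketch; angle_from_cosine is a hypothetical helper name, and it assumes the input is the dot product of two vectors that are supposed to be unit length.

```cpp
#include <cmath>

// Returns the angle in radians for the cosine c of two nominally
// unit-length vectors. Values that rounding error has pushed slightly
// outside [-1, 1] are clamped first so acos never sees an out-of-domain
// argument. A NaN input fails both comparisons and is passed through.
double angle_from_cosine(double c) {
    if (c > 1.0)
        c = 1.0;
    else if (c < -1.0)
        c = -1.0;
    return std::acos(c);
}
```

This only papers over small rounding errors. If the input is wildly out of range, the real bug is upstream and the clamp will hide it.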
In general, you don't want to be tripping floating-point exceptions, even if they are masked. The reason is that when the FPU hits one, the fast hardware can't handle it and punts to the microcode, which then takes about twenty times longer. This is especially bad with NaNs since any operation with a NaN produces another NaN, causing them to spread throughout your calculations (NaN disease) and slow down everything massively. You can even crash due to NaNs blowing past clamp expressions, since any ordered or equality comparison with a NaN (<, >, <=, >=, ==) is false and converting one to integer form results in integer indefinite (0x80000000). Despite the erroneous results, though, NaNs can appear sporadically in a large Win32 program without anyone knowing, and may go unnoticed in a code base for years.
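The clamp problem is easy to show. In this sketch, clamp_naive is the usual pair of comparisons and clamp_checked is one possible NaN-aware variant; both names, and the policy of mapping NaN to the lower bound, are assumptions made for the example.

```cpp
#include <cmath>
#include <limits>

// Usual clamp: both comparisons are false for NaN, so NaN falls
// through and is returned unchanged.
float clamp_naive(float v, float lo, float hi) {
    if (v < lo) return lo;
    if (v > hi) return hi;
    return v;
}

// NaN-aware clamp: NaN is forced to the lower bound (an arbitrary
// policy for this sketch), everything else is clamped as usual.
float clamp_checked(float v, float lo, float hi) {
    if (std::isnan(v)) return lo;
    return clamp_naive(v, lo, hi);
}
```

If the clamped value is later used as an array index or a table lookup, the naive version is how a NaN turns into an out-of-bounds access.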
Note that although exceptions are really slow and usually indicate mistakes, the results when the exceptions are masked are well-defined. It is possible, and sometimes reasonable, to actually depend on and test for specific results from masked exceptions. So it isn't valid to simply say "don't do that."
How Borland C/C++ factors into the picture
The Borland DLL run-time library, as it turns out, enables floating-point exceptions on initialization. This happens even if you simply load the DLL! Because Windows programs generally don't touch the floating-point control word, the effects of this can persist long after the DLL has been unloaded. For instance, you could:
- launch an Open file dialog in a program and hover over an AVI file in the list,
- ...thus causing Explorer to load its shell media extension to retrieve the file's properties for the tooltip,
- ...causing a codec search for the video stream in the file,
- ...thus loading and unloading the Borland-compiled DLL.
It is possible to disable this behavior of the Borland run-time library and avoid this problem, but most people aren't aware of it, and unintentionally release DLLs that cause this issue. I have heard that DLLs built with Delphi can cause this problem as well. The best way to fix it is to not modify the control word, but I don't know if that is possible; barring that, a usable workaround is to remask the exceptions with _controlfp(), as noted at http://homepages.borland.com/ccalvert/TechPapers/FloatingPoint.html.
It's not just me, either. The Java bug database has an interesting incident where loading a Delphi DLL caused the JVM to subsequently crash with a floating-point divide-by-zero exception: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4644270
Checking for this problem is easy. Execute sqrt(-1) after your DLL loads and see if it crashes.
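The _controlfp() workaround and the sqrt(-1) check can be sketched together as follows. On Microsoft's compiler this uses _controlfp() with _MCW_EM from <float.h>; the fallback branch for other compilers is my own assumption, using the standard fesetenv(FE_DFL_ENV) and relying on the default environment having all exceptions masked. The function names remask_fp_exceptions and sqrt_of_minus_one_is_nan are made up for this example.

```cpp
#include <cfenv>
#include <cmath>
#if defined(_MSC_VER)
#include <float.h>
#endif

// Masks all floating-point exceptions again. Returns true on success.
bool remask_fp_exceptions() {
#if defined(_MSC_VER)
    // Set every bit covered by _MCW_EM, i.e. mask all exceptions,
    // leaving the precision and rounding fields alone.
    _controlfp(_MCW_EM, _MCW_EM);
    return true;
#else
    // Assumption: the default environment masks all exceptions.
    // Unlike the branch above, this also resets the rounding mode
    // and clears the exception status flags.
    return std::fesetenv(FE_DFL_ENV) == 0;
#endif
}

// The check from the text: with invalid operation masked, sqrt(-1)
// quietly returns NaN. With it unmasked, this call faults instead of
// returning.
bool sqrt_of_minus_one_is_nan() {
    volatile double minus_one = -1.0;  // volatile prevents constant folding
    const double r = std::sqrt(minus_one);
    return r != r;                     // only NaN compares unequal to itself
}
```

Calling remask_fp_exceptions() right after loading a suspect DLL, and then sqrt_of_minus_one_is_nan() as a sanity check, is one way to use the pair.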
VirtualDub contains a rather brute-force workaround for this problem: it wraps all calls to video codecs, audio codecs, and video filters with a pair of routines that check for and fix broken FPU/MMX state. This protects VirtualDub from having its floating-point calculations screwed up by a broken driver. It works the other way, too: if I screw up, the FPU state will be reset before the external routine is invoked. 1.6.7 will be even more aggressive and will check for such issues whenever the primary message loop is idle.
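To be clear, the following is not VirtualDub's code, just a minimal sketch of the wrapping idea using the standard <cfenv> facilities. It saves the floating-point environment before an external call and restores it afterward. FpStateGuard, misbehaving_plugin, and call_plugin_guarded are hypothetical names, the plugin here disturbs the rounding mode only because that is easy to observe portably, and MMX state is not handled at all.

```cpp
#include <cfenv>

// Saves the floating-point environment on construction and restores
// it on destruction, so whatever the wrapped code does to the control
// state is undone when the guard goes out of scope.
class FpStateGuard {
public:
    FpStateGuard() { std::fegetenv(&saved_); }
    ~FpStateGuard() { std::fesetenv(&saved_); }
    FpStateGuard(const FpStateGuard&) = delete;
    FpStateGuard& operator=(const FpStateGuard&) = delete;

private:
    std::fenv_t saved_;
};

// Stand-in for an external routine that leaves the FPU state modified.
void misbehaving_plugin() {
    std::fesetround(FE_UPWARD);
}

// Calls the plugin with the caller's floating-point state protected.
void call_plugin_guarded(void (*plugin)()) {
    FpStateGuard guard;
    plugin();
}
```

A guard like this restores whatever state the caller had, good or bad. Forcing a known-good state instead, as described above, is the stricter option.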
It wouldn't be fair if I just knocked Borland for this problem. While DLLs built with Visual C++ don't commit this particular sin, Microsoft has committed a far worse one in the Direct3D API. Initializing Direct3D with default settings causes the precision bits in the floating-point control word to be reset such that FPU calculations always occur with 24-bit precision (single precision / float). This is much more serious as it causes roundoff errors to become much larger, and it means that double-precision math can no longer represent all valid values of a 32-bit int. For this reason, if you invoke Direct3D within an application that may not be expecting it, such as within a video filter, you should set the D3DCREATE_FPU_PRESERVE flag when creating the device. VirtualDub does this in its display code to ensure that the accuracy of its floating-point calculations is not disturbed.
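The loss of integer range is easy to demonstrate without Direct3D. In this sketch the float type, with its 24-bit significand, stands in for an FPU forced into 24-bit precision; that substitution is my assumption for illustration, and the helper names are made up.

```cpp
#include <cstdint>

// True if v comes back unchanged after being rounded to a 24-bit
// significand (float). A 64-bit integer is used for the trip back so
// that values rounding up to 2^31 do not overflow.
bool survives_24bit(std::int32_t v) {
    return static_cast<std::int64_t>(static_cast<float>(v)) == v;
}

// Same test with a 53-bit significand (double), which holds every
// 32-bit integer exactly.
bool survives_53bit(std::int32_t v) {
    return static_cast<std::int64_t>(static_cast<double>(v)) == v;
}
```

The first integer lost at 24 bits is 16777217 (2^24 + 1), which rounds to 16777216. At 53 bits every 32-bit value survives.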