¶CPUs and floating-point math
(Now playing: Rumbling Hearts, Kimi Ga Nozomu Eien game OST)
Some questions were asked in the comments on the previous article, and I decided it would be easier to answer here. The two questions were about the origin of the Intel Pentium-M and about AMD's chips and 3DNow! in general.
I get the feeling that I should probably be posting links to reputable hardware sites instead of pulling this info out of my (null), but it's either this or lamenting about how I wish I had eight hours a day and infinite patience to try Final Fantasy XI Online again. Let me know if I've made any dumb errors in the writeup.
About the Pentium-M:
IIRC, the original Pentium-M design (Banias) was not acquired by Intel; it was created by an Intel design team in Israel based on the Pentium III design. Among the improvements were slowing down parts of the chip that were faster than they needed to be, to save on gates and power, and a quadrant scheme for a power-efficient L2 cache. Banias also got SSE2 support and an upgraded decoder with micro-op fusion, which as I understand it means that the D1 and D2 decoders, which could previously only decode 1-uop instructions, can now handle instructions that decode to fused load+op or store uops. As I said, I don't have a Pentium-M to play with, but everything I've seen so far indicates that the P-M team is kicking major @&(#$*, which explains why they are being given the reins from the Pentium 4 team. Props also have to be given to the original Pentium Pro designers, whose basic design still lives on!
About AMD chips:
I completely skipped the Athlon and Athlon XP series of chips; they looked interesting, but what really put me off was the bad series of support chipsets, most notably the VIA north/southbridges. Stability problems are hellish to diagnose and the last thing I needed or wanted was to put up with hardware conflicts. Also, I've never been much of a high-end gamer and the 3D card I had at the time was an NVIDIA TNT2 Ultra, so I didn't really need the CPU speed anyway.
I used to have an AMD K6 233 (underclocked to 200MHz) in my Linux server. Tough chip. One day the server started crashing, and I thought it was my packet shaper changes, so I tried recompiling them out of the kernel -- and the sucker kept sig11ing in gcc. I ran memtest86 for a while to no avail, until I realized the top of the case was rather warm... and found out the CPU fan had stopped turning. Whoops. A new CPU fan later, the system still works to this day.
Now, about floating point and 3DNow!:
(The reference for this section is Paul Hsieh's 6th generation x86 CPU Comparisons, http://www.azillionmonkeys.com/qed/cpuwar.html).
Roll back in time to the days of around 300MHz. There was no question that in the x86 world, Intel was blowing everyone else away in FPU performance. The Pentium and Pentium II FPUs were fully pipelined and could pump out results at a rate of one per clock; the K6 was further back at one every two clocks, and Cyrix was a distant third at somewhere between 4-8 clocks per result. Part of the problem was the annoying x87 stack architecture, which required (requires) you to write optimized code like this:
fld x0 ;x0
fld y0 ;y0 x0
fld z0 ;z0 y0 x0
fld w0 ;w0 z0 y0 x0
fld mat00 ;mat00 w0 z0 y0 x0
fmul st, st(4) ;(mat00*x0) w0 z0 y0 x0
fld mat01 ;mat01 (mat00*x0) w0 z0 y0 x0
fmul st, st(4) ;(mat01*y0) (mat00*x0) w0 z0 y0 x0
fld mat02 ;mat02 (mat01*y0) (mat00*x0) w0 z0 y0 x0
fmul st, st(4) ;(mat02*z0) (mat01*y0) (mat00*x0) w0 z0 y0 x0
fld mat03 ;mat03 (mat02*z0) (mat01*y0) (mat00*x0) w0 z0 y0 x0
fmul st, st(4) ;(mat03*w0) (mat02*z0) (mat01*y0) (mat00*x0) w0 z0 y0 x0
fxch st(2) ;(mat01*y0) (mat02*z0) (mat03*w0) (mat00*x0) w0 z0 y0 x0
faddp st(3), st ;(mat02*z0) (mat03*w0) (mat00*x0+mat01*y0) w0 z0 y0 x0
fadd ;(mat02*z0+mat03*w0) (mat00*x0+mat01*y0) w0 z0 y0 x0
fld mat10 ;mat10 (mat02*z0+mat03*w0) (mat00*x0+mat01*y0) w0 z0 y0 x0
fmul st, st(6) ;(mat10*x0) (mat02*z0+mat03*w0) (mat00*x0+mat01*y0) w0 z0 y0 x0
fld mat11 ;mat11 (mat10*x0) (mat02*z0+mat03*w0) (mat00*x0+mat01*y0) w0 z0 y0 x0
fmul st, st(6) ;(mat11*y0) (mat10*x0) (mat02*z0+mat03*w0) (mat00*x0+mat01*y0) w0 z0 y0 x0
fxch st(2) ;(mat02*z0+mat03*w0) (mat10*x0) (mat11*y0) (mat00*x0+mat01*y0) w0 z0 y0 x0
faddp st(3), st ;(mat10*x0) (mat11*y0) x_result w0 z0 y0 x0
fld mat12 ;mat12 (mat10*x0) (mat11*y0) x_result w0 z0 y0 x0
fmul st, st(5) ;(mat12*z0) (mat10*x0) (mat11*y0) x_result w0 z0 y0 x0
fxch st(2) ;(mat10*x0) (mat11*y0) (mat12*z0) x_result w0 z0 y0 x0
fadd ;(mat10*x0+mat11*y0) (mat12*z0) x_result w0 z0 y0 x0
fld mat13 ;mat13 (mat10*x0+mat11*y0) (mat12*z0) x_result w0 z0 y0 x0
fmul st, st(4) ;(mat13*w0) (mat10*x0+mat11*y0) (mat12*z0) x_result w0 z0 y0 x0
fxch st(3) ;x_result (mat10*x0+mat11*y0) (mat12*z0) (mat13*w0) w0 z0 y0 x0
fstp x_result ;(mat10*x0+mat11*y0) (mat12*z0) (mat13*w0) w0 z0 y0 x0
You're probably cringing upon seeing this, and rightfully so -- the stack-oriented code was difficult for CPUs to execute and for compilers to schedule, and error-prone to write by hand.
3DNow! was first available on AMD K6-2 CPUs and overlaid 2-vector, single-precision floating-point operations on top of the MMX registers. This had several advantages:
* No stupid stack, just flat registers.
* One-clock throughput instead of two, *AND* two results at a time, for a 4x peak improvement.
* Really fast reciprocal and reciprocal square root approximations, and horizontal add instructions.
* Mixed vector integer (MMX) and floating-point (3DNow!) instructions. Hello, software texture mapping and goofy float bit pattern tricks!
This meant that a K6-2 could now outrun the Pentium II in floating-point performance, given appropriately optimized code. There was just one big problem.
Intel CPUs didn't support 3DNow!.
Now, I own an Athlon 64 and admire the performance of AMD architectures, but let's be clear. There were, and are, a lot more Intel CPUs out there than AMD CPUs. This meant that not only was there the usual adoption lag -- it took a while for MMX and SSE to be used, as well -- but most of the CPUs out there didn't have 3DNow! at all. That, combined with the fact that using 3DNow! meant writing yet another CPU-specific path (there were no compiler intrinsics for it at the time), didn't bode well for its adoption. Also, much like SSE, MMX held a 4:1 throughput advantage over 3DNow!, so it wasn't worth using if you didn't need floating-point range or accuracy.
I've written some sound code in 3DNow! before, and it's pretty nice -- the 2-vector form is just right for stereo audio. The main problem is that you tend to run out of registers pretty quickly because audio filters are generally longer than video filters. With some creative register juggling you can compute a 12-point IIR filter entirely in SSE registers, but there just isn't enough register space with 3DNow!.
AMD eventually got tired of their chip sucking at 3D and came out with the Athlon, which, unlike the K6, was a monster in floating point: instead of one result per two cycles, the Athlon could produce two results per cycle... on scalar x87 code. I suspect that at this point the temptation to use 3DNow! simply drained away, because it was a lot easier to use the Athlon's muscle on the x87 unit, where you didn't have to worry about CPU-specific code or FPU/MMX register file switches.
Floating-point support is still quite annoying in the x86 CPU world due to the various mismatches. The Intel Pentium III supports SSE but not 3DNow!, much to the annoyance of AMD. AMD's original Athlon is available up to 1.4GHz and supports 3DNow! but not SSE, much to the annoyance of Intel. The Pentium 4 supports SSE but is slower at scalar SSE operations than x87, much to the annoyance of everyone else. As a result most programs simply use standard x87 and don't use the optimized FP instructions of any CPU.
The situation is a lot cleaner on AMD64 (x64), where both the Intel Xeons with EM64T and the Athlon 64 support SSE/SSE2, which is the standard for floating-point on that platform, and Microsoft has effectively banned the use of MMX and x87 by threatening not to save/restore the FPU register file.