## § ¶SSE4 finally adds dot products

I've recently been reading the specs for the SSE4 extensions that Intel is adding to *Penryn*, the successor to their Core 2 Duo CPU. It looks very interesting and more extensive than the small additions made in Prescott (SSE3; mainly horizontal add/subtract) and Core 2 Duo (SSSE3; mainly absolute value and double-width align). One thing stood out to me immediately, which is that Intel is re-introducing implicit special registers into SSE -- specifically, the variable blend instructions use XMM0 as an implicit third argument, since the instruction format normally only accommodates two-address opcodes. Argh. Okay, I can deal with that, since there are already precedents in IMUL, SHLD/SHRD, and MASKMOVQ. But there was something else that stood out.

**They're finally giving us dot product instructions.**

This may not seem special, but after writing graphics vertex shaders, the inflexible programming model of SSE seems very constraining. A dot product, or inner product, computes the sum of the component-wise products of two vectors -- that is, for (ax, ay, az) and (bx, by, bz), the dot product result is ax*bx + ay*by + az*bz. This is useful for a number of geometric and signal processing operations, but one very common use is transformation of a vector by a matrix for 3D graphics. Now, on a GPU, both multiply-add (mad) and dot product (dp2/dp3/dp4) are one-clock instructions. This means that you have complete freedom to choose either row-major or column-major storage for transform matrices, because the only change is whether you use a series of mads or dp4s to do the transformation. In SSE code, though, matrix transformations are more awkward, meaning that the matrix layout is constrained by the more efficient path in the assembly code. If you're storing row-major matrices, then you splat the vectors and do multiply-adds; if you're storing column-major, then you multiply and do horizontal adds, or junk the routine and make it row-major because you don't have SSE3. The new DPPS and DPPD instructions allow the CPU to more closely mimic the GPU algorithm, which is nicer from a hand-coding standpoint, and much nicer to a vertex shader emulator.

The new DPPS and DPPD instructions aren't just nice, however, but they're also unusually flexible. I had expected DPPS and DPPD to just be straight 4-D and 2-D dot products that right-align or splat the result, but they're a lot more featureful than that due to an immediate argument. You can mask out some or all components in the dot product as well as some or all of the destination components. For instance, you can compute (ax*bx + az*bz) and store it into Y and W of the output, with X and Z being zero. Among all of the possibilities, this allows DPPS to compute 2-D, 3-D, and 4-D dot products. Cool!

When I thought about it some more, though, it wasn't as impressive as it first seemed:

- You don't necessarily save on instructions in a matrix transform.
- You don't necessarily save on instructions with the destination mask.

Let's take the matrix transform first -- specifically, a 4x4 transform with column-major layout. The components of the result vector from transforming a vector v by such a matrix M are the dot product of v by each of the rows of M, so in SSE3 this takes four MULPS instructions and three HADDPS instructions, with all four dot products happening in parallel in the HADDPSes. With DPPS, you can do the dot products directly... but you still have to merge the results using three ADDPS or ORPS instructions, since DPPS only allows you to place zeroes with the destination mask, not merge the result with other vector. That means that in the end, you're still at seven instructions. In fact, I don't think it's possible to do it in fewer than seven instructions, no matter what the instructions, as long as you are constrained to two-address opcodes. No matter what, you need four instructions to compute intermediate results, and three to merge those together (since you always eliminate one intermediate per merge instruction). You do win in the 4x3 case by one instruction, though... but that assumes that DPPS is as cheap as MULPS, which may not be the case.

The destination mask is neat, but you probably still need to merge something other than zero into those masked channels. For instance, if you're only writing to XYZ, you probably want 1 in W and not 0. Thing is, all of the cases I could think of that used the destination mask followed by ADDPS/ORPS were just as easily handled by BLENDPS, which selectively copies channels using an immediate mask. You don't save instruction count here, so the remaining benefit would be if ADDPS/ORPS were cheaper than BLENDPS, which could be the case. In a vertex shader, just about every instruction has a destination write mask, so this blend operation is free -- you can just have successive dp4 instructions write into r0.x, r0.y, r0.z, and r0.w. Unfortunately, we're still not there yet in x86.

By the way, the instruction encodings are getting a bit ridiculous: DPPS is 66 0F 3A 40 /r ib, for a total of six bytes. I'm sure glad I didn't hardcode my disassembler to one- and two-byte opcodes. This makes PowerPC machine code look compact, too.

It's worth noting that I haven't tried SSE4, not even with the software emulator that Intel supposedly has in the Penryn SDK, so I haven't really pounded on the instructions yet. The packed move-with-zero-extension instructions alone could probably significantly shorten a number of VirtualDub's image processing routines, since I spend a lot of time unpacking 32-bit pixels into 64-bit intermediates -- assuming, of course, PMOVZXBW isn't just immediately split into load and unpack uops by the instruction decoder. The thing is, even though SSE4 is more significant to what I do than SSE3 and SSSE3 were, it still isn't anywhere on the magnitude of MMX and SSE, which added **major** new functionality to the ISA. It also doesn't help with the people who don't have SSE4... heck, I still assume that some people don't have MMX, and in commercial software you can't generally assume more than SSE right now unless you're really aiming for the high end.