§ SSE4 finally adds dot products

I've recently been reading the specs for the SSE4 extensions that Intel is adding to Penryn, the successor to their Core 2 Duo CPU. It looks very interesting and more extensive than the small additions made in Prescott (SSE3; mainly horizontal add/subtract) and Core 2 Duo (SSSE3; mainly absolute value and double-width align). One thing stood out to me immediately, which is that Intel is re-introducing implicit special registers into SSE -- specifically, the variable blend instructions use XMM0 as an implicit third argument, since the instruction format normally only accommodates two-address opcodes. Argh. Okay, I can deal with that, since there are already precedents in IMUL, SHLD/SHRD, and MASKMOVQ. But there was something else that stood out.

They're finally giving us dot product instructions.

This may not seem special, but after writing graphics vertex shaders, the inflexible programming model of SSE seems very constraining. A dot product, or inner product, computes the sum of the component-wise products of two vectors -- that is, for (ax, ay, az) and (bx, by, bz), the dot product result is ax*bx + ay*by + az*bz. This is useful for a number of geometric and signal processing operations, but one very common use is transformation of a vector by a matrix for 3D graphics. Now, on a GPU, both multiply-add (mad) and dot product (dp2/dp3/dp4) are one-clock instructions. This means that you have complete freedom to choose either row-major or column-major storage for transform matrices, because the only change is whether you use a series of mads or dp4s to do the transformation. In SSE code, though, matrix transformations are more awkward, meaning that the matrix layout is constrained by the more efficient path in the assembly code. If you're storing row-major matrices, then you splat the vectors and do multiply-adds; if you're storing column-major, then you multiply and do horizontal adds, or junk the routine and make it row-major because you don't have SSE3. The new DPPS and DPPD instructions allow the CPU to more closely mimic the GPU algorithm, which is nicer from a hand-coding standpoint, and much nicer to a vertex shader emulator.
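For reference, the two strategies can be sketched in scalar C. The function names and the layout convention (row vectors, r = v*M) are mine, not from any particular library; the loops mark which SSE pattern each one corresponds to.

```c
/* Row-major storage: splat each v[i] and multiply-add against row i
   (the mad-style path described above). */
static void xform_rowmajor(float out[4], const float rows[4][4], const float v[4])
{
    for (int c = 0; c < 4; ++c)
        out[c] = 0.0f;
    for (int i = 0; i < 4; ++i)          /* one splat + madd per stored row */
        for (int c = 0; c < 4; ++c)
            out[c] += v[i] * rows[i][c];
}

/* Column-major storage: each stored 4-vector is a column of M, and each
   output component is a single dot product (the dp4-style path). */
static void xform_colmajor(float out[4], const float cols[4][4], const float v[4])
{
    for (int c = 0; c < 4; ++c) {
        out[c] = 0.0f;
        for (int i = 0; i < 4; ++i)      /* one dot product per component */
            out[c] += v[i] * cols[c][i];
    }
}
```

Given the same logical matrix in both layouts, the two routines produce identical results; only the inner-loop shape changes, which is exactly the freedom the GPU gives you for free.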

The new DPPS and DPPD instructions aren't just nice, however, but they're also unusually flexible. I had expected DPPS and DPPD to just be straight 4-D and 2-D dot products that right-align or splat the result, but they're a lot more featureful than that due to an immediate argument. You can mask out some or all components in the dot product as well as some or all of the destination components. For instance, you can compute (ax*bx + az*bz) and store it into Y and W of the output, with X and Z being zero. Among all of the possibilities, this allows DPPS to compute 2-D, 3-D, and 4-D dot products. Cool!
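Here's a scalar model of that immediate argument as described: the upper four bits select which source lanes enter the product sum, and the lower four bits select which destination lanes receive the sum, with unselected lanes zeroed. The function name is made up.

```c
/* Scalar model of the DPPS immediate byte. */
static void dpps_model(float dst[4], const float a[4], const float b[4],
                       unsigned imm8)
{
    float sum = 0.0f;
    for (int i = 0; i < 4; ++i)
        if (imm8 & (0x10u << i))         /* source mask, imm8 bits 4..7 */
            sum += a[i] * b[i];
    for (int i = 0; i < 4; ++i)          /* destination mask, imm8 bits 0..3 */
        dst[i] = (imm8 & (1u << i)) ? sum : 0.0f;
}
```

The example from the text -- (ax*bx + az*bz) stored into Y and W with X and Z zero -- corresponds to imm8 = 0x5A (source bits for X and Z, destination bits for Y and W).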

When I thought about it some more, though, it wasn't as impressive as it first seemed:

Let's take the matrix transform first -- specifically, a 4x4 transform with column-major layout. The components of the result vector from transforming a vector v by such a matrix M are the dot product of v by each of the rows of M, so in SSE3 this takes four MULPS instructions and three HADDPS instructions, with all four dot products happening in parallel in the HADDPSes. With DPPS, you can do the dot products directly... but you still have to merge the results using three ADDPS or ORPS instructions, since DPPS only allows you to place zeroes with the destination mask, not merge the result with another vector. That means that in the end, you're still at seven instructions. In fact, I don't think it's possible to do it in fewer than seven instructions, no matter what instructions are available, as long as you are constrained to two-address opcodes. No matter what, you need four instructions to compute intermediate results, and three to merge those together (since you always eliminate one intermediate per merge instruction). You do win in the 4x3 case by one instruction, though... but that assumes that DPPS is as cheap as MULPS, which may not be the case.
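The SSE3 sequence just described can be sketched in scalar form: mulps and haddps below model the instructions (HADDPS packs pairwise sums: result = (a0+a1, a2+a3, b0+b1, b2+b3)), and four_dots is the seven-instruction sequence, with all four sums finishing in the final HADDPS. The helper names are mine.

```c
/* Scalar model of MULPS: componentwise multiply. */
static void mulps(float r[4], const float a[4], const float b[4])
{
    for (int i = 0; i < 4; ++i) r[i] = a[i] * b[i];
}

/* Scalar model of HADDPS: (a0+a1, a2+a3, b0+b1, b2+b3). */
static void haddps(float r[4], const float a[4], const float b[4])
{
    float t0 = a[0]+a[1], t1 = a[2]+a[3], t2 = b[0]+b[1], t3 = b[2]+b[3];
    r[0] = t0; r[1] = t1; r[2] = t2; r[3] = t3;
}

/* Four dot products of v against m[0..3] -- the 4 MULPS + 3 HADDPS
   sequence, seven instructions total. */
static void four_dots(float r[4], const float m[4][4], const float v[4])
{
    float t0[4], t1[4], t2[4], t3[4], h0[4], h1[4];
    mulps(t0, m[0], v);
    mulps(t1, m[1], v);
    mulps(t2, m[2], v);
    mulps(t3, m[3], v);
    haddps(h0, t0, t1);   /* (t0 pair sums, t1 pair sums) */
    haddps(h1, t2, t3);   /* (t2 pair sums, t3 pair sums) */
    haddps(r, h0, h1);    /* r[i] = full sum of t_i */
}
```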

The destination mask is neat, but you probably still need to merge something other than zero into those masked channels. For instance, if you're only writing to XYZ, you probably want 1 in W and not 0. Thing is, all of the cases I could think of that used the destination mask followed by ADDPS/ORPS were just as easily handled by BLENDPS, which selectively copies channels using an immediate mask. You don't save instruction count here, so the remaining benefit would be if ADDPS/ORPS were cheaper than BLENDPS, which could be the case. In a vertex shader, just about every instruction has a destination write mask, so this blend operation is free -- you can just have successive dp4 instructions write into r0.x, r0.y, r0.z, and r0.w. Unfortunately, we're still not there yet in x86.
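A scalar model of the BLENDPS selection makes the equivalence clear: each immediate bit picks the corresponding lane of the second operand, otherwise the first operand's lane is kept. The function name is mine.

```c
/* Scalar model of BLENDPS: imm8 bit i selects b[i], otherwise a[i]. */
static void blendps(float dst[4], const float a[4], const float b[4],
                    unsigned imm8)
{
    for (int i = 0; i < 4; ++i)
        dst[i] = (imm8 & (1u << i)) ? b[i] : a[i];
}
```

The "force W to 1" case from the text is a blend against a vector of ones with mask 0x8 (W lane only).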

By the way, the instruction encodings are getting a bit ridiculous: DPPS is 66 0F 3A 40 /r ib, for a total of six bytes. I'm sure glad I didn't hardcode my disassembler to one- and two-byte opcodes. This makes PowerPC machine code look compact, too.
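For illustration, here's a minimal recognizer for just that one encoding -- a hypothetical helper, not a piece of my disassembler, and it only handles the register-register form.

```c
/* Match the 66 0F 3A 40 /r ib pattern (DPPS). Returns bytes consumed,
   or 0 if the bytes don't match or use a memory operand. */
static int decode_dpps(const unsigned char *p,
                       int *dst, int *src, unsigned *imm)
{
    if (p[0] != 0x66 || p[1] != 0x0F || p[2] != 0x3A || p[3] != 0x40)
        return 0;
    unsigned char modrm = p[4];
    if ((modrm >> 6) != 3)               /* mod != 11: memory form, skip */
        return 0;
    *dst = (modrm >> 3) & 7;             /* reg field = destination xmm */
    *src = modrm & 7;                    /* rm field = source xmm */
    *imm = p[5];                         /* the ib immediate */
    return 6;                            /* all six bytes consumed */
}
```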

It's worth noting that I haven't tried SSE4, not even with the software emulator that Intel supposedly has in the Penryn SDK, so I haven't really pounded on the instructions yet. The packed move-with-zero-extension instructions alone could probably significantly shorten a number of VirtualDub's image processing routines, since I spend a lot of time unpacking 32-bit pixels into 64-bit intermediates -- assuming, of course, PMOVZXBW isn't just immediately split into load and unpack uops by the instruction decoder. The thing is, even though SSE4 is more significant to what I do than SSE3 and SSSE3 were, it still isn't anywhere near the magnitude of MMX and SSE, which added major new functionality to the ISA. It also doesn't help the people who don't have SSE4... heck, I still assume that some people don't have MMX, and in commercial software you can't generally assume more than SSE right now unless you're really aiming for the high end.
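The unpack operation itself is trivial to model in scalar code -- this hypothetical helper shows the widening step described above, eight bytes zero-extended to eight 16-bit words:

```c
/* Scalar model of PMOVZXBW: zero-extend the low eight bytes of the
   source into eight 16-bit words (the high byte of each word is 0),
   i.e. the 8-bit-channel to 16-bit-intermediate unpack step. */
static void pmovzxbw(unsigned short dst[8], const unsigned char src[8])
{
    for (int i = 0; i < 8; ++i)
        dst[i] = src[i];
}
```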


Comments posted:

...and even then, only 'integer SSE': those 950MHz AMD Durons are still kicking around here and there...

Mitch 74 - 19 04 07 - 04:12

What I don't understand is: why all these extensions if you can't be sure that every CPU has them even after 10 years? How can someone write optimized code if he has to deal with all these exceptions?
I think we'd all be better off with a simple RISC processor like the 68000. No assembler mnemonics like 'PMOVZXBW' where nobody knows if the x in it is just a typo or by design.

McMurmel - 19 04 07 - 13:28

That's what code-paths are for, you can detect the CPU once and use different code for different CPUs in speed-critical tasks.

Blight - 19 04 07 - 15:09

Mitch 74:
Worse than that -- Socket A Athlons shipped up to 1.4GHz. You need to put at least 1.5GHz on the box to rely on full SSE by MHz alone.

Generally, only a small portion of your code is responsible for most of its execution time. The rule of thumb is that 90% of the CPU time is spent in 10% of the code, but when beginning to optimize a video app, the split can be even more tilted than that. That means that spending a couple of days rewriting a ten-line routine can reap big dividends, and writing multiple code paths is thus very profitable. This is why saying that assembly language is dead on Windows is stupid, because the benefits can be so huge. The main problem is that the most advantageous CPU extensions are available on the CPUs that least need them!
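The one-time code-path selection mentioned above can be sketched as a pure function over the CPUID feature bits. The feature bit positions follow Intel's documented CPUID.1 layout (ECX bit 19 = SSE4.1, EDX bits 26/25/23 = SSE2/SSE/MMX); the integer return codes are made up for this sketch. On a real build the ecx/edx values would come from executing CPUID (e.g. via compiler intrinsics); keeping the selection logic separate lets it be exercised with synthetic values.

```c
/* Pick the best available code path from CPUID.1 feature bits.
   Returns 41 = SSE4.1, 20 = SSE2, 10 = SSE, 1 = MMX, 0 = scalar. */
static int pick_path(unsigned ecx, unsigned edx)
{
    if (ecx & (1u << 19)) return 41;   /* SSE4.1 */
    if (edx & (1u << 26)) return 20;   /* SSE2 */
    if (edx & (1u << 25)) return 10;   /* SSE */
    if (edx & (1u << 23)) return 1;    /* MMX */
    return 0;                          /* scalar fallback */
}
```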

As for the 68000, well, it's hardly a RISC chip when it has instructions like MOVE.L (A0)+, 4(A0, A4.L). Classic RISC doesn't just mean orthogonality, but also instructions that do only one thing at a time, generally load/store or arithmetic. MIPS is the classic example.

So far, I think PowerPC beats everything else I've seen for cryptic mnemonics, given ldu, stw, and rlwinm. Granted, they're all very systematic, but looking at it you'd almost think the designer hated vowels. And sadly, there's no store float with update instruction (which would have been stfu).

Phaeron - 20 04 07 - 00:42

I don't see the necessity of the second part of your argument. You say that instructions execute faster in your first half (and other outlets, notably ArsTechnica, confirm this, with some older instructions going from 5 to 3 cycles and a lot being 1). Then you switch gears and talk about how much code it takes, concluding that it takes just as much. That is not a counterargument: if it takes as much code, but each instruction executes faster, it's still just as much of an improvement in speed as if it took fewer instructions. True, you don't gain in the code storage department, but you don't lose either.

Could you clarify your argument about code size?

Lexor - 20 04 07 - 09:56

The powerpc may not have stfu but it does have eieio :)

Monolith - 20 04 07 - 17:29


The thing is, that the relative costs aren't likely to change much. Elementary operations such as add, multiply, and logical-or are among the cheapest operations in the SSE unit and will probably stay that way regardless of global optimizations Intel does. Similarly, as HADDPS is a subset of DPPS, I'd expect it to be the same cost or cheaper. When estimating the performance of a routine, it's useful to count instructions, assuming that instructions within the same class are comparable in cost.

Let's say that the MULPS and ORPS are simple instructions, whereas HADDPS and DPPS are more costly. In the 4x4 matmul example I gave above, you'd be replacing four cheap instructions and three slower ones with three cheap instructions and four slower ones. That's not likely to be a win unless DPPS is cheaper than HADDPS (could be, but unlikely), or unless there's another factor in play such as a better dependency graph.
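Under a toy cost model the comparison works out simply; here k is a made-up relative cost for the complex instructions, not a measured latency.

```c
/* Toy cost comparison: simple ops (MULPS/ADDPS/ORPS) cost 1 unit,
   complex ops (HADDPS/DPPS) cost k units. */
static int cost_sse3(int k) { return 4 * 1 + 3 * k; }  /* 4 MULPS + 3 HADDPS */
static int cost_sse4(int k) { return 3 * 1 + 4 * k; }  /* 3 merges + 4 DPPS */
```

The difference is cost_sse4(k) - cost_sse3(k) = k - 1, so for any k > 1 the DPPS version loses unless something else (such as the dependency graph) compensates.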

This isn't to say that DPPS is useless -- definitely not. Intel doesn't add new instructions for no reason. DPPS may be sufficiently slow, however, that some dot-product-like cases are better off still being done with elementary operations. We won't know until people start pounding on actual silicon.

As for the actual byte size of DPPS itself, that's not much of an issue, given the instruction cache and the enhanced fetcher since the Pentium M, which eliminated some alignment penalties. It's more of a problem of Intel adding to the mess that is already x86 instruction decoding with the new three-byte opcodes. You can decode PPC opcodes in one small routine, but it takes a lot more than that to parse all of x86.

Phaeron - 21 04 07 - 00:33

The SDK is 14 megs, should anyone care to download it :)

Carter - 23 04 07 - 11:12

I may not know too much about this, uh, programming stuff, but it sounds like you're programming in ASM rather than C, which I assume VDub is written in.

Oh, BTW, I've been meaning to ask this for a while (I really miss VDubMod), but are you ever going to put mkv support into VirtualDub?

randomshinichi (link) - 24 04 07 - 12:02

You can always demux the streams of an mkv file and remux them after. It's not that hard to do and there are already enough good tools around to work with them. Also, new features appear in mkv all the time and it would be hard to keep VirtualDub up to date with them. On the other hand, this tool gets updated often, so you don't have to bang your head every time:

Simbou - 01 05 07 - 17:05

The help system works in Vista. You are very thorough!

Ashtonian (link) - 06 05 07 - 15:29

The packed move-with-zero-extension instructions actually seem to be slower than pshufb (SSSE3) according to my initial testing.

As for DPPS, there should have been a third implicit operand (like for BLENDPS).

What really surprises me is that they chose the wrong polynomial for the SSE4.2 CRC32 instruction.

Igor (link) - 07 01 08 - 21:04
