§ SSE4 finally adds dot products

I've recently been reading the specs for the SSE4 extensions that Intel is adding to Penryn, the successor to their Core 2 Duo CPU. It looks very interesting and more extensive than the small additions made in Prescott (SSE3; mainly horizontal add/subtract) and Core 2 Duo (SSSE3; mainly absolute value and double-width align). One thing stood out to me immediately, which is that Intel is re-introducing implicit special registers into SSE -- specifically, the variable blend instructions use XMM0 as an implicit third argument, since the instruction format normally only accommodates two-address opcodes. Argh. Okay, I can deal with that, since there are already precedents in IMUL, SHLD/SHRD, and MASKMOVQ. But there was something else that stood out.

They're finally giving us dot product instructions.

This may not seem special, but after writing graphics vertex shaders, the inflexible programming model of SSE seems very constraining. A dot product, or inner product, computes the sum of the component-wise products of two vectors -- that is, for (ax, ay, az) and (bx, by, bz), the dot product result is ax*bx + ay*by + az*bz. This is useful for a number of geometric and signal processing operations, but one very common use is transformation of a vector by a matrix for 3D graphics. Now, on a GPU, both multiply-add (mad) and dot product (dp2/dp3/dp4) are one-clock instructions. This means that you have complete freedom to choose either row-major or column-major storage for transform matrices, because the only change is whether you use a series of mads or dp4s to do the transformation. In SSE code, though, matrix transformations are more awkward, meaning that the matrix layout is constrained by the more efficient path in the assembly code. If you're storing row-major matrices, then you splat the vectors and do multiply-adds; if you're storing column-major, then you multiply and do horizontal adds, or junk the routine and make it row-major because you don't have SSE3. The new DPPS and DPPD instructions allow the CPU to more closely mimic the GPU algorithm, which is nicer from a hand-coding standpoint, and much nicer to a vertex shader emulator.
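As a rough illustration of the two approaches, here is a minimal scalar sketch in C. The names `dot4`, `transform_dp4`, and `transform_mad` are hypothetical, and the matrix is simply treated as four stored 4-vectors laid end to end. The "dp4 style" routine takes one dot product per stored vector; the "mad style" routine splats each component of the input and accumulates. Feeding the mad version the transposed matrix gives the same result as the dp4 version, which is why a GPU lets you pick either storage order freely.

```c
#include <stddef.h>

/* Dot product of two 4-vectors: ax*bx + ay*by + az*bz + aw*bw. */
float dot4(const float a[4], const float b[4]) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

/* "dp4 style": output component i is the dot product of v with
 * stored vector i of the matrix (m[4*i] .. m[4*i+3]). */
void transform_dp4(const float m[16], const float v[4], float out[4]) {
    for (size_t i = 0; i < 4; ++i)
        out[i] = dot4(m + 4 * i, v);
}

/* "mad style": splat component j of v across a whole vector and
 * multiply-accumulate it with stored vector j of the matrix. */
void transform_mad(const float m[16], const float v[4], float out[4]) {
    for (size_t i = 0; i < 4; ++i)
        out[i] = 0.0f;
    for (size_t j = 0; j < 4; ++j)
        for (size_t i = 0; i < 4; ++i)
            out[i] += v[j] * m[4 * j + i];
}
```

With SSE, the inner loop of `transform_mad` maps onto one multiply and one add per stored vector, while `transform_dp4` is the shape that needs horizontal adds or a dot product instruction.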

The new DPPS and DPPD instructions aren't just nice, however; they're also unusually flexible. I had expected DPPS and DPPD to just be straight 4-D and 2-D dot products that right-align or splat the result, but they're a lot more featureful than that due to an immediate argument. You can mask out some or all components in the dot product as well as some or all of the destination components. For instance, you can compute (ax*bx + az*bz) and store it into Y and W of the output, with X and Z being zero. Among all of the possibilities, this allows DPPS to compute 2-D, 3-D, and 4-D dot products. Cool!
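To make the immediate argument concrete, here is a scalar model of DPPS in C. This is a sketch, not the real instruction: `vec4` and `dpps_model` are hypothetical names, and the model ignores the order in which the hardware adds the products, so rounding may differ from a real CPU. It assumes the documented split of the immediate, where bits 4 through 7 select which component products enter the sum and bits 0 through 3 select which destination components receive it.

```c
#include <stdint.h>

/* Four packed floats; f[0] is X, f[1] is Y, f[2] is Z, f[3] is W. */
typedef struct { float f[4]; } vec4;

/* Scalar model of DPPS (hypothetical helper, not an intrinsic).
 * imm8 bits 4..7: include the product of component 0..3 in the sum.
 * imm8 bits 0..3: write the sum to component 0..3, else write zero. */
vec4 dpps_model(vec4 a, vec4 b, uint8_t imm8) {
    float sum = 0.0f;
    for (int i = 0; i < 4; ++i)
        if (imm8 & (0x10 << i))
            sum += a.f[i] * b.f[i];

    vec4 r;
    for (int i = 0; i < 4; ++i)
        r.f[i] = (imm8 & (1 << i)) ? sum : 0.0f;
    return r;
}
```

Under this model, the example above (ax*bx + az*bz stored into Y and W) would use an immediate of 0x5A, a 3-D dot product into X would use 0x71, and a full 4-D dot product splatted to every component would use 0xFF.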

When I thought about it some more, though, it wasn't as impressive as it first seemed:

Let's take the matrix transform first -- specifically, a 4x4 transform with column-major layout. The components of the result vector from transforming a vector v by such a matrix M are the dot product of v by each of the rows of M, so in SSE3 this takes four MULPS instructions and three HADDPS instructions, with all four dot products happening in parallel in the HADDPSes. With DPPS, you can do the dot products directly... but you still have to merge the results using three ADDPS or ORPS instructions, since DPPS only allows you to place zeroes with the destination mask, not merge the result with another vector. That means that in the end, you're still at seven instructions. In fact, I don't think it's possible to do it in fewer than seven instructions, no matter what the instructions, as long as you are constrained to two-address opcodes. No matter what, you need four instructions to compute intermediate results, and three to merge those together (since you always eliminate one intermediate per merge instruction). You do win in the 4x3 case by one instruction, though... but that assumes that DPPS is as cheap as MULPS, which may not be the case.
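The seven-instruction count can be sketched with scalar stand-ins for the instructions. The helpers `dpps_model` and `addps_model` below are hypothetical models rather than real intrinsics, and the four vectors in `rows` are assumed to be whichever vectors of the matrix get dotted with v. Each call stands for one instruction: four dot products that each land in a different component, then three adds to merge them.

```c
#include <stdint.h>

/* Four packed floats; f[0] is X, f[1] is Y, f[2] is Z, f[3] is W. */
typedef struct { float f[4]; } vec4;

/* Scalar model of DPPS: high nibble of imm8 selects the products
 * to sum, low nibble selects which outputs get the sum (else 0). */
vec4 dpps_model(vec4 a, vec4 b, uint8_t imm8) {
    float sum = 0.0f;
    for (int i = 0; i < 4; ++i)
        if (imm8 & (0x10 << i))
            sum += a.f[i] * b.f[i];
    vec4 r;
    for (int i = 0; i < 4; ++i)
        r.f[i] = (imm8 & (1 << i)) ? sum : 0.0f;
    return r;
}

/* Scalar model of ADDPS: component-wise add. */
vec4 addps_model(vec4 a, vec4 b) {
    vec4 r;
    for (int i = 0; i < 4; ++i)
        r.f[i] = a.f[i] + b.f[i];
    return r;
}

/* 4x4 transform as four dot products plus three merges. */
vec4 transform4x4_dpps(const vec4 rows[4], vec4 v) {
    vec4 x = dpps_model(rows[0], v, 0xF1); /* result in X only */
    vec4 y = dpps_model(rows[1], v, 0xF2); /* result in Y only */
    vec4 z = dpps_model(rows[2], v, 0xF4); /* result in Z only */
    vec4 w = dpps_model(rows[3], v, 0xF8); /* result in W only */
    x = addps_model(x, y);                 /* merge 1 */
    z = addps_model(z, w);                 /* merge 2 */
    return addps_model(x, z);              /* merge 3 */
}
```

Because every masked-out component is zero, the adds act purely as merges here, which is why ORPS would work equally well in their place.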

The destination mask is neat, but you probably still need to merge something other than zero into those masked channels. For instance, if you're only writing to XYZ, you probably want 1 in W and not 0. Thing is, all of the cases I could think of that used the destination mask followed by ADDPS/ORPS were just as easily handled by BLENDPS, which selectively copies channels using an immediate mask. You don't save instruction count here, so the remaining benefit would be if ADDPS/ORPS were cheaper than BLENDPS, which could be the case. In a vertex shader, just about every instruction has a destination write mask, so this blend operation is free -- you can just have successive dp4 instructions write into r0.x, r0.y, r0.z, and r0.w. Unfortunately, we're still not there yet in x86.

By the way, the instruction encodings are getting a bit ridiculous: DPPS is 66 0F 3A 40 /r ib, for a total of six bytes. I'm sure glad I didn't hardcode my disassembler to one- and two-byte opcodes. This makes PowerPC machine code look compact, too.

It's worth noting that I haven't tried SSE4, not even with the software emulator that Intel supposedly has in the Penryn SDK, so I haven't really pounded on the instructions yet. The packed move-with-zero-extension instructions alone could probably significantly shorten a number of VirtualDub's image processing routines, since I spend a lot of time unpacking 32-bit pixels into 64-bit intermediates -- assuming, of course, PMOVZXBW isn't just immediately split into load and unpack uops by the instruction decoder. The thing is, even though SSE4 is more significant to what I do than SSE3 and SSSE3 were, it still isn't anywhere near the magnitude of MMX and SSE, which added major new functionality to the ISA. It also doesn't help with the people who don't have SSE4... heck, I still assume that some people don't have MMX, and in commercial software you can't generally assume more than SSE right now unless you're really aiming for the high end.
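For reference, the unpacking step mentioned above looks like this in scalar form. This is a minimal sketch with a hypothetical function name, `unpack_pixel`, not code from VirtualDub: each 8-bit channel of a 32-bit pixel is zero-extended into its own 16-bit lane of a 64-bit value. PMOVZXBW performs the same byte-to-word zero extension on eight bytes (two such pixels) at once.

```c
#include <stdint.h>

/* Widen one 32-bit pixel (four 8-bit channels) into a 64-bit value
 * holding four 16-bit channels, zero-extending each channel. */
uint64_t unpack_pixel(uint32_t px) {
    uint64_t out = 0;
    for (int i = 0; i < 4; ++i) {
        uint64_t channel = (px >> (8 * i)) & 0xFF; /* extract byte i */
        out |= channel << (16 * i);                /* place in word i */
    }
    return out;
}
```

The extra headroom in each 16-bit lane is what lets per-channel multiplies and adds run without overflowing before the result is packed back down.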


§ Working around display brain damage in Windows Vista

I've been struggling with video display issues in VirtualDub under Windows Vista for a while now, as some of you may know. I hit a couple of snags during the beta, one of which was due to a DirectDraw implementation issue in the OS that was fixed in RTM. 1.6.17 works decently well in Vista, fortunately. However, as I've optimized and reworked the display code in 1.7.x, I'm finding that I'm hitting a lot of weird issues in Windows Vista again that I wasn't seeing in Windows XP. I spent part of last weekend fighting these again in another fit of frustration over things not displaying when they should.

When I see the exact same issues on two machines running Vista, one with an NVIDIA card and one with an ATI card, I'm inclined to believe it's Microsoft's fault.

The first problem, which I've mentioned before, has to do with DirectDraw hardware video overlays -- these are essentially secondary displays that are composited on top of the main one in the video scanout hardware itself. Yeah, yeah, Microsoft's been saying that video overlays are outdated... but they're the only widely available way to do hardware YCbCr color conversion for commonly used formats without requiring 3D pixel shaders of some sort. You'd be hard pressed to find a system out there with a resolution greater than 800x600 that doesn't support a YUY2 overlay. Well, the problem is that Vista will happily let you create a hardware overlay surface, populate it, and show it -- without actually displaying anything. Your program thinks it's happily displaying video, when the user is actually seeing green, magenta, or whatever you use for your colorkey. Lame. I worked around this in VirtualDub 1.7 by calling DwmIsCompositionEnabled() if it is available, and forcibly disabling overlays if the DWM is compositing.
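A check along these lines might look like the following C sketch. It is not VirtualDub's actual code: the function names `dwm_is_compositing` and `should_use_overlay` are hypothetical, and only `LoadLibraryA`, `GetProcAddress`, `FreeLibrary`, and `DwmIsCompositionEnabled` (exported by dwmapi.dll) are real Windows APIs. The function is looked up at run time so the program still loads on Windows versions without the DWM. The non-Windows branch is a stub that exists only so the sketch compiles elsewhere.

```c
#include <stdbool.h>

#ifdef _WIN32
#include <windows.h>

/* Signature of DwmIsCompositionEnabled, resolved dynamically. */
typedef HRESULT (WINAPI *dwm_is_composition_enabled_fn)(BOOL *enabled);

/* Returns true only if dwmapi.dll exists, exports the function,
 * and the call reports that desktop composition is enabled. */
bool dwm_is_compositing(void) {
    bool result = false;
    HMODULE dwm = LoadLibraryA("dwmapi.dll");
    if (dwm) {
        dwm_is_composition_enabled_fn fn =
            (dwm_is_composition_enabled_fn)GetProcAddress(
                dwm, "DwmIsCompositionEnabled");
        BOOL enabled = FALSE;
        if (fn && SUCCEEDED(fn(&enabled)))
            result = (enabled != FALSE);
        FreeLibrary(dwm);
    }
    return result;
}
#else
/* Stub for non-Windows builds: there is no DWM to ask. */
bool dwm_is_compositing(void) { return false; }
#endif

/* Use a hardware overlay only if the driver offers one and the
 * DWM is not compositing the desktop. */
bool should_use_overlay(bool overlay_supported) {
    return overlay_supported && !dwm_is_compositing();
}
```

Composition can be switched on or off while a program is running, so a real implementation would probably want to repeat the check rather than cache the first answer.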

The second problem is more insidious. For various reasons, I'm moving the display code to a separate thread in 1.7.2, and this is exposing a lot of weird threading issues in Windows, like the HTTRANSPARENT issue I mentioned earlier. Well, another problem I hit is that the DWM doesn't seem to consistently update its composition tree when you have a child 3D window in another thread -- you can call Present() in Direct3D or SwapBuffers() in OpenGL, and nothing shows up. In fact, you get junk from underneath the window. I beat my head against the desk for hours trying to figure this out, and made the following conclusions:

The solution I finally came up with was to call SetWindowPos() with the SWP_FRAMECHANGED flag after the first call to Present() or SwapBuffers(). This seems like an utterly bogus solution, and I see a frame of garbage whenever the D3D or OpenGL minidriver reinitializes, but in the absence of any better solution or any diagnostics to determine what's really going wrong, it's the best I can do. Sigh.
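In code, the workaround could be sketched roughly as below. The struct and function names (`display_state`, `force_frame_change`, `after_present`) are hypothetical, and the choice to pass the no-move, no-size, no-z-order, and no-activate flags alongside SWP_FRAMECHANGED is an assumption, made so that the call changes nothing except forcing the frame to be recalculated. The non-Windows branch is a stub so the sketch compiles elsewhere.

```c
#include <stdbool.h>

#ifdef _WIN32
#include <windows.h>
typedef HWND window_handle;

/* Nudge the window without moving, resizing, or reordering it. */
void force_frame_change(window_handle hwnd) {
    SetWindowPos(hwnd, NULL, 0, 0, 0, 0,
                 SWP_NOMOVE | SWP_NOSIZE | SWP_NOZORDER |
                 SWP_NOACTIVATE | SWP_FRAMECHANGED);
}
#else
typedef void *window_handle;

/* Stub for non-Windows builds. */
void force_frame_change(window_handle hwnd) { (void)hwnd; }
#endif

typedef struct {
    window_handle hwnd; /* child window being rendered into */
    bool nudged;        /* has the workaround already run? */
} display_state;

/* Call right after Present() or SwapBuffers(). Issues the nudge on
 * the first call only; returns true if it was issued this time. */
bool after_present(display_state *s) {
    if (s->nudged)
        return false;
    force_frame_change(s->hwnd);
    s->nudged = true;
    return true;
}
```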

I think the most astonishing part to me is how Microsoft can form a movement to get applications "Vista compatible" -- when in reality what they've done is broken a lot of apps and asked the vendors to pick up the pieces. Sure, some apps were doing really broken things, but I'm just trying to use Direct3D to display video....
