§ Software pixel shader emulation in Windows Presentation Foundation (WPF)

Windows Presentation Foundation (WPF) gained an interesting feature in .NET Framework 3.5 SP1, which is the ability to execute pixel shader effects in software via a just-in-time (JIT) compiler. Issues with introducing features in service packs aside, this is a cool addition, since it allows the same pixel shader code to run on the GPU and the CPU with reasonable performance on the latter. It's certainly better than the old effects system, which only supported software mode and required you to write a custom routine in a separate DLL instead.

Of course, being a sort of graphics guy but not a .NET kind of person, I had to dig into the shader jitter....

Interface

The way you use a custom shader effect in WPF is via the System.Windows.Media.Effects namespace, deriving from ShaderEffect and attaching a PixelShader object to it. You attach a precompiled pixel shader to the PixelShader and then attach the ShaderEffect to a UI element. There are facilities for binding properties to samplers and float constants.

You can't get to the vertex shader, so you can't take advantage of the hardware interpolators and all precomputation will have to be done in C#. Samplers can be switched between point and bilinear sampling, but there are no mipmaps and no support for wrapping.

WPF lets you select one of three modes for the pixel shader: Auto, HardwareOnly, and SoftwareOnly. When SoftwareOnly is used, WPF converts the pixel shader to SSE2-based code using just-in-time (JIT) compilation, which is what we're interested in here.

The documentation doesn't give a lot of direction or note gotchas and could use a lot of improvement, but I could say that about a lot of the .NET Framework docs. When I originally looked at the jitter I was stuck without the docs and had to wing it via IntelliSense, but when I got access to the docs again I was surprised to find that they still weren't a lot of help. There's a lot more useful information on Greg Schechter's blog about how to write custom ShaderEffects.

Validation

WPF does do some validation on the pixel shader, and will reject many shaders that are otherwise valid pixel shaders, even if you are running in hardware accelerated mode. First, your shader must use the vanilla ps_2_0 shader model: ps_1_1, ps_1_4, ps_2_a, ps_2_sw, and ps_3_0 will all be rejected. Second, attempting to use some features that aren't supported by WPF will also be caught and rejected, such as reading from color interpolators (v0).

What you can do, however, is cheat somewhat by compiling to the ps_2_a, ps_2_b, or ps_2_sw targets and then hacking the version token to ps_2_0 (FFFF0200). You won't get away with trying to use gradients or predication, but arbitrary swizzle does work. That makes sense, since arbitrary swizzles are easy to do in jitted code, and there isn't any special encoding in shader bytecode for extended swizzles vs. standard ps2.0 swizzles. Doing this also allows you to exceed the standard ps_2_0 limits. I take no responsibility if you do this and your code breaks with a future WPF update, though.
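As a rough illustration of the version token hack, here is a minimal Python sketch that rewrites the first DWORD of a compiled shader blob. It assumes the usual Direct3D 9 bytecode layout (a little-endian version token at offset 0, with 0xFFFF in the high word for pixel shaders and the major version in the next byte); the function name and the sanity checks are made up for this example and are not part of any WPF or Direct3D API.

```python
import struct

PS_2_0_TOKEN = 0xFFFF0200  # version token for ps_2_0, as quoted in the text


def patch_version_token(bytecode: bytes) -> bytes:
    """Return a copy of the shader bytecode with its version token forced to ps_2_0.

    Hypothetical helper: assumes the version token is the first little-endian DWORD.
    """
    if len(bytecode) < 4:
        raise ValueError("bytecode too short to contain a version token")
    (token,) = struct.unpack_from("<I", bytecode, 0)
    # High word 0xFFFF marks a pixel shader; the next byte is the major version.
    if token >> 16 != 0xFFFF or (token >> 8) & 0xFF != 2:
        raise ValueError(f"not a shader model 2 pixel shader: token {token:08X}")
    # Only the token changes; the instruction stream is copied untouched.
    return struct.pack("<I", PS_2_0_TOKEN) + bytecode[4:]
```

Nothing else in the blob is touched, so any instruction that WPF's validator rejects on its own (gradients, predication) will still be rejected.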

Code generation

The jitter requires SSE2. It probably could have been implemented with SSE1+MMX, but the performance probably would have been mediocre, the fastest chips in that range being Athlon XPs. If you're experienced with writing vectorized image processing code, you'll beat the jitter handily, but otherwise, it doesn't do a bad job. I did all experimentation on an SSE4.1-capable CPU, but didn't see any instructions used beyond the SSE2 profile.

All pixel arithmetic is done in single precision using SSE. This may be a bit slower than could be done with fixed-point, but at least there are no precision or range surprises. One gotcha is that this means NaNs can also appear, which you may not be used to if you have a shader model 2 level ATI card.

The jitter reorganizes shaders into structure-of-arrays (SOA) form and executes pixel shaders for four pixels in parallel. This means that a single SSE register holds one component of a register across four pixels. For instance, xmm0 might hold r0.x for pixels 0-3, and a dp3 instruction would look like this:

mulps xmm0, xmm3
mulps xmm1, xmm4
mulps xmm2, xmm5
addps xmm0, xmm1
addps xmm0, xmm2

SOA form avoids a lot of swizzle traffic that would result from cross-component operations like a dot product, since SSE is poor at horizontal data traffic and doesn't have free swizzles or write masks like pixel shaders do. The downsides are much greater register pressure, particularly due to constant bloating, and more complex execution for non-naturally vector operations like table lookups. Pixel shader hardware does this too in a way, but the hardware does 2x2 quads, whereas the jitter does 4x1. The hardware needs to do quads in order to compute mipmapping parameters and gradients, but the jitter never deals with mipmapped textures.
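To make the SOA layout concrete, here is a small Python model of the dp3 sequence above. Each list of four floats stands in for one XMM register holding a single component across four pixels; the function names are invented for the example, and this is only a sketch of the data layout, not of the jitter's actual code.

```python
def dp3_soa(ax, ay, az, bx, by, bz):
    """dp3 for four pixels at once; each argument is one component across 4 pixels."""
    # Every operation is "vertical": lane i only ever touches lane i.
    mx = [a * b for a, b in zip(ax, bx)]   # mulps xmm0, xmm3
    my = [a * b for a, b in zip(ay, by)]   # mulps xmm1, xmm4
    mz = [a * b for a, b in zip(az, bz)]   # mulps xmm2, xmm5
    s = [x + y for x, y in zip(mx, my)]    # addps xmm0, xmm1
    return [t + z for t, z in zip(s, mz)]  # addps xmm0, xmm2


def aos_to_soa(pixels):
    """Transpose [(x, y, z), ...] for four pixels into three 4-wide lists."""
    xs, ys, zs = (list(c) for c in zip(*pixels))
    return xs, ys, zs
```

Calling `dp3_soa(*aos_to_soa(a), *aos_to_soa(b))` with two lists of four (x, y, z) tuples yields four dot products with no horizontal adds, at the cost of six live registers for the inputs.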

Surprisingly, complex scalar operations are expanded inline: sincos turns into a series of muls and adds, and log is also emitted inline (although it is quite expensive). This is different from the Direct3D PSGP, which calls out to CRT transcendental functions instead when compiling vertex shaders.

There isn't a lot of optimization done on the shaders. If you manage to get four rcps in a row, they'll all get coded even if they cancel out. Ordinarily this isn't too much of a problem, since the HLSL compiler will do a lot of optimization for you. It does mean there are some cases that only the jitter can optimize and that it misses, such as a vector multiply where two out of three components are multiplied by constant zero. The jitter will strip dead stores and remove redundant moves, though.

Texture sampling is very slow, as the jitter generates several pages of machine code for every texld instruction. I'm not kidding about this: here's the generated code for just one pixel out of a 4x1 block:

lea         ebx,[edi+1] 
mov         ecx,dword ptr [esp+70h]
mov         edx,dword ptr [esp+74h]
movd        xmm2,ecx
movaps      xmmword ptr [esp+100h],xmm2
movd        xmm3,edx
movaps      xmmword ptr [esp+130h],xmm3
mov         edx,dword ptr [esp+17Ch]
mov         esi,dword ptr [esp+78h]
lea         ecx,[eax+1]
movd        xmm4,esi
movaps      xmmword ptr [esp+160h],xmm4
shl         edx,2
mov         esi,dword ptr [esp+38h]
mov         dword ptr [esp+190h],esi
mov         esi,dword ptr [esp+178h]
imul        eax,edx
imul        ecx,edx
mov         edx,dword ptr [esp+178h]
lea         edx,[edx+eax]
lea         esi,[esi+eax]
mov         eax,dword ptr [esp+178h]
movd        xmm5,dword ptr [edx+edi*4]
movd        xmm6,dword ptr [esi+ebx*4]
mov         esi,dword ptr [esp+178h]
lea         esi,[esi+ecx]
lea         eax,[eax+ecx]
movd        xmm7,dword ptr [esi+edi*4]
movd        xmm0,dword ptr [eax+ebx*4]
punpcklbw   xmm5,xmm5
punpcklbw   xmm6,xmm6
punpcklbw   xmm7,xmm7
punpcklbw   xmm0,xmm0
punpcklwd   xmm5,xmm5
punpcklwd   xmm6,xmm6
punpcklwd   xmm7,xmm7
punpcklwd   xmm0,xmm0
psrld       xmm5,18h
psrld       xmm6,18h
psrld       xmm7,18h
psrld       xmm0,18h
cvtdq2ps    xmm5,xmm5
cvtdq2ps    xmm6,xmm6
cvtdq2ps    xmm7,xmm7
cvtdq2ps    xmm0,xmm0
mov         eax,dword ptr [esp+48h]
mov         ebx,dword ptr [esp+58h]
mov         ecx,dword ptr [esp+68h]
movd        xmm1,dword ptr [esp+190h]
movd        xmm4,eax
pshufd      xmm1,xmm1,0
pshufd      xmm4,xmm4,0
movd        xmm3,ebx
movd        xmm2,ecx
pshufd      xmm3,xmm3,0
pshufd      xmm2,xmm2,0
mulps       xmm0,xmm1
mulps       xmm7,xmm3
mulps       xmm1,xmm6
movaps      xmm6,xmmword ptr [esp+130h]
addps       xmm7,xmm0
mulps       xmm3,xmm5
movaps      xmm5,xmmword ptr [esp+100h]
mulps       xmm4,xmm7
movaps      xmm7,xmmword ptr [esp+160h]
addps       xmm3,xmm1
shufps      xmm5,xmm5,93h
mulps       xmm2,xmm3
shufps      xmm6,xmm6,93h
addps       xmm2,xmm4
movaps      xmmword ptr [esp+80h],xmm2
shufps      xmm7,xmm7,93h
mov         edx,dword ptr [esp+80h]
mov         esi,dword ptr [esp+84h]
movd        xmm0,edx
movd        xmm1,esi
mov         esi,dword ptr [esp+17Ch]
addps       xmm5,xmm0
movaps      xmmword ptr [esp+0F0h],xmm5
addps       xmm6,xmm1
movaps      xmmword ptr [esp+120h],xmm6
mov         edi,dword ptr [esp+88h]
mov         eax,dword ptr [esp+14h]
movd        xmm2,edi
mov         ebx,dword ptr [esp+24h]
addps       xmm7,xmm2
movaps      xmmword ptr [esp+150h],xmm7

Now imagine that included four times for every texld instruction in your shader.
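The punpcklbw / punpcklwd / psrld / cvtdq2ps run in the middle of that listing is how each fetched texel gets turned into four floats. The Python sketch below models that sequence with plain integer arithmetic for one 32-bit pixel; it is only my reading of the listing, and the function name is made up.

```python
def expand_bytes(pixel: int) -> list:
    """Model movd, punpcklbw x,x, punpcklwd x,x, psrld 24, cvtdq2ps on one 32-bit pixel."""
    b = [(pixel >> (8 * i)) & 0xFF for i in range(4)]  # the four bytes, lowest first
    # punpcklbw x,x: each byte is duplicated, giving 16-bit lanes of the form bb.
    words = [(v << 8) | v for v in b]
    # punpcklwd x,x: each word is duplicated, giving 32-bit lanes of the form bbbb.
    dwords = [(w << 16) | w for w in words]
    # psrld 24: the logical shift leaves just the top byte, zero-extended.
    ints = [d >> 24 for d in dwords]
    # cvtdq2ps: integer to floating point.
    return [float(i) for i in ints]
```

That is four instructions per texel before any filtering happens, and a bilinear fetch needs four texels per pixel.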

Needless to say, this bloats the generated code very quickly, and it's not unusual to see a compiled pixel shader exceed 4K. Have you read the SIGGRAPH paper on Larrabee, where they explained that texture sampling couldn't be done efficiently on the main core? Well, here's an example. Part of this is due to SSE2's poor support for expanding byte components into floats and all of the data conversions needed to get coordinates and subtexel offsets to the right places, but there are also some optimization issues in this specific implementation. The most glaring one is the use of DIVPS to divide the components by 255 after bilinear filtering, which is about twenty times slower than multiplying by the inverse. The generated code also does runtime branching based on whether a sampler has bilinear filtering enabled, and computes a*(1-f) + b*f for linear interpolation when a + (b-a)*f probably would be faster. I didn't see any optimization for non-dependent texture fetches, so there's no major advantage to avoiding dependent reads (not that you could optimize them out much without access to the vertex shader anyway). One implication of all of this is that in some cases you may be better off using more ALU ops rather than a texture lookup, even though the texture lookup would be much faster on the GPU.
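For reference, here are the two interpolation forms and the reciprocal trick written out as a scalar Python sketch. The names are invented for the example, and it says nothing about actual timings; it only shows that the cheaper forms compute the same filter.

```python
def lerp_two_mul(a, b, f):
    """a*(1-f) + b*f: two multiplies and one add, given both weights."""
    return a * (1.0 - f) + b * f


def lerp_one_mul(a, b, f):
    """a + (b-a)*f: one subtract, one multiply, one add."""
    return a + (b - a) * f


INV_255 = 1.0 / 255.0  # computed once, so the per-pixel step is a multiply, not a divide


def bilinear(p00, p10, p01, p11, fx, fy, lerp=lerp_one_mul):
    """Filter four texel values in 0..255, then scale the result to 0..1."""
    top = lerp(p00, p10, fx)
    bottom = lerp(p01, p11, fx)
    return lerp(top, bottom, fy) * INV_255
```

Multiplying by a rounded 1/255 is not guaranteed to be bit-identical to dividing by 255, so the two can differ in the last bit; for 8-bit source data that difference is far below anything visible.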

The output section is the other slow part of the generated code. As I said above, the code works on 4x1 blocks, so it has to handle the oddballs at the end. Unfortunately, it does so by storing the vector and then copying each pixel with scalar ops and a branch check, so it incurs store forwarding stalls and is a bit slower than it could be:

movdqa      xmmword ptr [esp-40h],xmm6 
test        esi,esi
mov         edi,dword ptr [esp-40h]
mov         dword ptr [edx],edi
je          047501A3
mov         eax,dword ptr [esp-3Ch]
lea         esi,[esi-1]
mov         dword ptr [edx+4],eax
test        esi,esi
je          047501B5
mov         ebx,dword ptr [esp-38h]
lea         esi,[esi-1]
mov         dword ptr [edx+8],ebx
test        esi,esi
je          047501C7
mov         ecx,dword ptr [esp-34h]
lea         esi,[esi-1]
mov         dword ptr [edx+0Ch],ecx

I would have liked to see a straight 4x loop with a vector store followed by fixup code instead. The difference won't be noticeable for long shaders, but you might notice it on a very short one, such as if you're using the effect to generate an image instead of transforming one.
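The loop shape I have in mind looks roughly like the following Python sketch, where `shade4` and `shade1` are hypothetical stand-ins for the 4-wide shader body and a scalar version of it. The point is only the control flow: full blocks get an unconditional store, and the per-pixel checks run at most once per row.

```python
def shade_row(shade4, shade1, width):
    """Shade `width` pixels: whole blocks of four first, then a scalar tail."""
    out = [0.0] * width
    full = width - width % 4
    for x in range(0, full, 4):
        # Stands in for a single unconditional vector store of four pixels.
        out[x:x + 4] = shade4(x)
    for x in range(full, width):
        # Fixup: at most three leftover pixels, handled once per row.
        out[x] = shade1(x)
    return out
```

With this arrangement the store-then-reload pattern is confined to the last block of the row instead of being paid on every block.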

Overall, the performance of the generated code is decent for ALU operations, but there's significant loop overhead and texture sampling is slow, so you want to avoid multipassing as much as possible. I will say that the overhead of the rest of WPF seems to be a much bigger problem than the jitted code; I did see some major slowdowns in the <10 fps range once I tried more complicated shaders, but a lot of the slowness was due to what looked like a slow alpha blend routine in wpfgfx_0300.dll and a lot of wasteful per-frame allocation, which caused the GC heap size to skyrocket. I don't care if I do have 2GB of memory; it's obnoxious for a simple app displaying an image to jump up to 1.5GB and start swapping things out just because I resized the window.

Bugs

Overall, the shader jitter in WPF is a lot more robust than I had expected.

One of the first mistakes I made was to use sampler s0 without binding anything to it. This works fine in hardware (probably by chance), but it drove me nuts when I was trying to test the software mode and couldn't figure out why no sampler was bound. The software engine also returns the wrong value for an unbound sampler, giving a dull red when the color components should be all zero.

The rsq instruction is supposed to have no more than 2^-22 error, but the jitter compiles it to an RSQRTPS instruction, which only guarantees about 2^-12 error. This means that expressions involving sqrt(), rsqrt(), and length() may be a bit sketchy in precision. The same goes for rcp, though I've heard modern CPUs actually compute it to much higher precision.
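For what it's worth, the standard way to tighten a rough reciprocal square root is a Newton-Raphson step, which the jitter does not appear to emit. The Python sketch below simulates an estimate that is off by a relative error of 2^-12 and refines it once; it runs in double precision, so it only demonstrates the algebra and not what single-precision SSE code would produce bit for bit. All names here are made up for the example.

```python
import math


def refine_rsqrt(x, estimate):
    """One Newton-Raphson step for y ~= 1/sqrt(x): y' = y * (1.5 - 0.5 * x * y * y)."""
    return estimate * (1.5 - 0.5 * x * estimate * estimate)


def relative_error(x, y):
    """Relative error of y as an approximation of 1/sqrt(x)."""
    exact = 1.0 / math.sqrt(x)
    return abs(y - exact) / exact


# Simulate an estimate that is off by a relative error of 2^-12.
x = 3.0
rough = (1.0 / math.sqrt(x)) * (1.0 + 2.0 ** -12)
better = refine_rsqrt(x, rough)
```

Each step roughly squares the relative error (to about 1.5 times its square), so one step on a 2^-12 estimate lands below the 2^-22 target, at the cost of three multiplies and a subtract per rsq.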

There are some funny little issues in the generated code, such as this fragment from a texld instruction:

maxps       xmm5,xmm4
maxps       xmm6,xmm4
maxps       xmm5,xmm4
maxps       xmm6,xmm4

I couldn't think of why you'd need to do this, even with NaN behavior involved (minps/maxps are asymmetric with respect to specials). I initially suspected that this was a blown texture clamp and that two minps instructions were missing, but I couldn't get it to blow up with negative texture coordinates.

Conclusion

The WPF pixel shader jitter is actually fairly robust and performant, and should reliably support just about any shader that works in hardware mode. I believe it's currently the closest you can get to high-performance vectorized code purely from C#. My main complaint is that it's in WPF: this is technology that I would have liked to see in a core API somewhere in DirectX or Windows, rather than in the .NET Framework.

Comments

Comments posted:


Avery, you can expand 4 successive (RGBA) bytes into 4 floats using SSE2:

pxor xmm0, xmm0
movd xmm1, dword ptr [rgba_pixels]
punpcklbw xmm1, xmm0
punpcklwd xmm1, xmm0
cvtdq2ps xmm1, xmm1

Note that you can reuse xmm0 for the next four bytes.

Of course, using SSSE3 pshufb is faster, and in my tests it is even faster than using SSE4.1 movzxbd.

Unfortunate part is that without pshufb you cannot easily de-interleave RGBA quads before converting them to float so you have to use unpcklps/unpckhps after the conversion.

Igor Levicki (link) - 26 09 08 - 22:09


That is the obvious way, but three ops per pixel isn't exactly fast (four if you count the scale afterward). Just fetching and converting the 2x2 pixels for a bilinear fetch already puts you at 12 ops... yuck. It's also a pretty ugly situation on the execution ports, since all three uops have to issue to port 1 on Core 2. It's even worse on P-M, where the shift instructions issue 2 uops to port 0 each. An alternative that sometimes works better is to OR in the bit pattern for a base float (3F800000 for 1.0f), since POR can issue to any of the main ports. If you're doing a bunch of linear operations, such as a bilinear filter, you can just subtract out the bias at the end. It's also much easier to do if you can only depend on SSE support and not SSE2 (Athlon XP).

PSHUFB is cool, but its advantage is a bit negated on 32-bit by either register pressure or the extra load uops you incur for all the swizzle constants.

Not sure what exactly you mean by deinterleaving RGBA quads, but if you're looking to extract the channels across four pixels at a time, that's just a matter of packing four pixels into an XMM register first and then shifting out one channel at a time. UNPCKLPS/HPS don't seem to be very cheap instructions, at least if the uop tables in pentopt are correct.

Phaeron - 27 09 08 - 02:35


By deinterleaving I meant combining reds from four pixels into one register, greens into another, etc. I think that unpcklps/unpckhps should not be slower than the shifting/masking you suggest.
As for PSHUFB, it is a 1 clock instruction on Penryn (0.5 on upcoming Nehalem due to two shuffle units), and the memory mask does not slow things down, at least not in my testing, since it probably stays in the cache.
The worst thing in my opinion is the pixel fetch itself, especially if you have to do it one pixel at a time (unaligned data).

Igor Levicki (link) - 28 09 08 - 21:47
