Taking a look at D3D10.1's WARP driver

I recently went through the exercise of writing a basic Direct3D 10.1 display backend for VirtualDub. The primary motivation was to take advantage of Direct3D 10.1 command remoting... until I realized that the DirectX SDK I was using was a bit old and its documentation didn't mention that D3D10.1 command remoting had been removed in Windows 7 RTM. I did get it working in windowed mode, however, and since I had a working D3D10.1 path I figured I might as well check out the WARP driver.

WARP, or Windows Advanced Rasterization Platform, is a software driver that ships with the Direct3D 11 runtime. As far as I know, it's the first widely available and full-featured software renderer that Microsoft has shipped. The DirectX SDK has long shipped with the reference rasterizer (refrast), but that has several shortcomings: it's not redistributable, it can't be instantiated in a headless environment, and it's so abysmally slow that it barely works for debugging, much less running anything. Microsoft also created RGBRast for DirectX 9, which .NET 3.5 used as a software fallback, but AFAIK it doesn't support shaders and is pretty minimal. The OpenGL software rasterizer works but is pretty slow and lacking in features. WPF has its own software rasterizer that I've written about before and isn't too bad, but it only does pixel shaders on rectangular blits and is internal to WPF. Now we have WARP, which is fully featured, fast, and widely available.

Having used WARP a little bit, I can tell you that you won't be ditching your 3D graphics card anytime soon. When I say WARP is fast, I mean it's fast by software rasterizer standards, which means it might beat an S3 ViRGE. It's still very slow compared to any modern graphics accelerator, even one with "Integrated" in its name, and I get dropped frames drawing one 1440x900 full screen quad on an i5-2500K. That's even before you take into account that even to get that level of performance you have to give up a lot of CPU power that could be used for something else. The main benefit of WARP is that programs can now use 3D rendering without worrying about being 100% screwed in the unusual case where no 3D hardware acceleration whatsoever is available. Considering the difficulty of writing a general 3D software driver, that's a big benefit.

Now that that's out of the way, it's time to look at the details: let's see what code WARP generates.

I haven't done much with WARP, and VirtualDub's display code is extremely undemanding in the 3D features that it uses. However, it's a good start for seeing how WARP works for non-game applications that are basically looking for a good blitter. We'll use this HLSL effect, which is just meant to draw a quad on screen:

extern Texture2D<float4> srct : register(t0);
extern SamplerState srcs : register(s0);

void VS(
    float2 pos : POSITION,
    out float4 oPos : SV_Position,
    out float2 oT0 : TEXCOORD0)
{
    oPos = float4(pos * float2(2, -2) + float2(-1, 1), 0, 1);
    oT0 = pos;
}

float4 PS(float4 pos : SV_Position, float2 t0 : TEXCOORD0) : SV_Target {
    return srct.Sample(srcs, t0).bgra;
}
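For readers less used to HLSL, the arithmetic in VS() can be mirrored on the CPU. The following is only a sketch; the names ClipPos and unit_square_to_clip are made up for illustration and appear nowhere in VirtualDub or WARP. It shows why the same input can double as the texture coordinate: the quad's vertices are assumed to arrive in the 0-1 unit square and are stretched to the -1..1 clip-space range with Y flipped.

```cpp
#include <cassert>

// Hypothetical type for illustration: a clip-space position (x, y, z, w).
struct ClipPos { float x, y, z, w; };

// CPU-side mirror of the vertex shader's position math:
//   oPos = float4(pos * float2(2, -2) + float2(-1, 1), 0, 1)
// Assumes the input position lies in the unit square, which is what lets the
// shader reuse it unchanged as the texture coordinate.
inline ClipPos unit_square_to_clip(float px, float py) {
    assert(px >= 0.0f && px <= 1.0f && py >= 0.0f && py <= 1.0f);
    return { px * 2.0f - 1.0f,   // 0..1 -> -1..+1 (left to right)
             py * -2.0f + 1.0f,  // 0..1 -> +1..-1 (top to bottom)
             0.0f,
             1.0f };             // w is constant, so there is no perspective
}
```

The top-left corner (0, 0) lands at (-1, 1) and the bottom-right corner (1, 1) lands at (1, -1), with w fixed at 1 for every vertex.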

We're drawing a whopping four vertices on the screen, so the vertex shader is basically irrelevant for performance, but the pixel shader will be used much more heavily and is what we are interested in. Any modern software rasterizer is going to make use of a just-in-time (JIT) compiler for generating the inner rasterization loop, and WARP is no exception. Since WARP is going to be chewing up gobs of CPU time and runs in-process, finding the rasterization loop is not a problem: just break execution in the debugger.

Examining the loop

Due to vectorization, the rasterization loop is very large, so I've put the disassembly at the end of this post instead of here, but let's start by walking through the general layout. This was generated on a CPU with AVX; WARP takes advantage of SSE4.1 but not AVX.

WARP makes use of the SSE2 instruction set and register file, and as such has 4x parallelism available for most operations. One way to take advantage of this is to process all four channels of a pixel in parallel (RGBA), but that's slow for a number of operations that require cross-channel interactions, such as swizzles and dot products. The alternate strategy is to process four pixels at a time with each 4-vector holding four scalar values for each pixel. This reduces shuffling bottlenecks and makes code optimization much easier as the intermediate code can just be treated as scalar, with the downside being increased register pressure from having four times as many pixels in flight. This is the strategy that WPF's rasterizer uses, and it's also the one I picked for my vdshader JITter. WARP's situation is a little bit different because it needs to handle mipmapping and gradient determination, so while it also does four pixels at a time, it does 2x2 quads instead of four pixels horizontally. Most stages just treat them as individual pixels, so this doesn't matter except for a few specific stages like interpolation and output.
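The four-pixels-at-a-time layout can be sketched in portable C++, with plain float arrays standing in for SSE registers. This is an illustration only, not WARP's or vdshader's code, and the function name dot3_four_at_a_time is hypothetical. The point is that a dot product, which needs cross-lane shuffles when one register holds RGBA for a single pixel, turns into plain lane-wise multiplies and adds when each register holds one channel for four pixels.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical illustration: a 3-component dot product per pixel, with the
// pixels stored one channel per array (planar). Each group of four
// consecutive floats plays the role of one SSE register holding the same
// channel for four different pixels.
void dot3_four_at_a_time(const float* x, const float* y, const float* z,
                         float wx, float wy, float wz,
                         float* out, std::size_t count) {
    assert(count % 4 == 0);  // whole 4-pixel groups only, as in a SIMD loop

    for (std::size_t i = 0; i < count; i += 4) {
        // The body of this inner loop maps onto three packed multiplies and
        // two packed adds. No shuffles are needed because the three terms
        // for a given pixel sit in the same lane of three different vectors.
        for (std::size_t lane = 0; lane < 4; ++lane) {
            out[i + lane] = x[i + lane] * wx
                          + y[i + lane] * wy
                          + z[i + lane] * wz;
        }
    }
}
```

The cost mentioned above is visible here too: three input vectors and one output vector are live at once for what would be a single register in the channel-parallel layout.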

Unlike WPF's rasterizer, WARP has to handle arbitrary triangles and thus it supports perspective correction. For the pixel shader above, this involves stepping three interpolated values -- u/w, v/w, and 1/w -- and then taking the reciprocal of the interpolated 1/w to recover w, so that u and v can be computed by multiplying it back in. I was disappointed to find that WARP does not optimize either the divide itself or for screen-aligned polygons:

movaps xmm3, one ;1.0
divps xmm3, xmm1 ;w = 1 / (1/w)
mulps xmm0, xmm3 ;u = (u/w) * w
mulps xmm1, xmm3 ;v = (v/w) * w

The perspective divide is done with a straight divide instruction instead of reciprocal estimation and refinement. I understand WARP was designed to be general and accurate, so it may be that approximations were not accurate enough, and in any case with a non-trivial shader this is not going to matter much. In this case, though, it's unnecessary as w is a constant for 2D operations. This means a couple dozen instructions of overhead that 2D applications don't need.
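For reference, "reciprocal estimation and refinement" usually means taking a low-precision hardware estimate (RCPPS is good to about 12 bits) and applying a Newton-Raphson step, which roughly doubles the number of correct bits. Here is a minimal scalar sketch of that idea. Portable C++ has no RCPPS, so the estimate is faked by truncating an exact reciprocal; every function name below is hypothetical and none of this is what WARP does.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Stand-in for a hardware reciprocal estimate. ASSUMPTION for illustration:
// the exact reciprocal is computed and its low 12 mantissa bits discarded,
// leaving 11 explicit mantissa bits. Real SSE code would use _mm_rcp_ps.
inline float rough_reciprocal(float d) {
    float r = 1.0f / d;
    std::uint32_t bits;
    std::memcpy(&bits, &r, sizeof bits);
    bits &= 0xFFFFF000u;
    std::memcpy(&r, &bits, sizeof bits);
    return r;
}

// One Newton-Raphson step for 1/d: x1 = x0 * (2 - d * x0).
// If x0 = (1 - e) / d, then x1 = (1 - e*e) / d, so the relative error is
// squared by each step.
inline float refine_reciprocal(float d, float x0) {
    return x0 * (2.0f - d * x0);
}

struct UV { float u, v; };  // hypothetical type for illustration

// Recovers (u, v) from the interpolated (u/w, v/w, 1/w) without a divide
// instruction (apart from the one hidden inside the faked estimate).
inline UV perspective_correct(float u_over_w, float v_over_w, float one_over_w) {
    assert(one_over_w > 0.0f);
    float w = refine_reciprocal(one_over_w, rough_reciprocal(one_over_w));
    return { u_over_w * w, v_over_w * w };
}
```

The refined result is close to single precision but not guaranteed to be correctly rounded the way DIVPS is, which may be exactly why a renderer that favors accuracy would stick with the real divide.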

The next section is the address setup for the texture fetch. In this case, mipmapping is disabled and the U/V addressing modes are set to CLAMP. What WARP does here is tidy up the texture coordinates in floating-point (SSE), then it switches to integer (SSE2) to continue computing the addresses in parallel. Since bilinear filtering is enabled, a 2x2 block of pixels has to be fetched. The tricky part about this is that the 2x2 block can extend outside of the texture by one pixel even if the texture coordinate is already clamped or wrapped to 0-1. In vdshader, I handled this by adding borders to the texture storage and copying pixels into the borders beforehand so that a 2x2 block of pixels can be fetched with address offsets; WARP eschews this and instead computes 16 addresses, four per pixel. It then extracts them one at a time with PEXTRD and merges pixels back into vectors with PINSRD.
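The per-pixel part of that address setup can be sketched in scalar C++ as follows. This is a simplified reading of the technique, not a transcription of WARP's code: it uses floor() where the generated code works in fixed point, and BilinearTaps, clamp_int and clamp_taps are hypothetical names. The key detail is that each of the four taps is clamped separately, which is what makes a border around the texture unnecessary.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical result type: where to fetch and how to blend.
struct BilinearTaps {
    int index[4];  // texel indices: top-left, top-right, bottom-left, bottom-right
    int frac_x;    // horizontal blend weight, 0..255
    int frac_y;    // vertical blend weight, 0..255
};

inline int clamp_int(int v, int lo, int hi) {
    return std::min(std::max(v, lo), hi);
}

// Computes the four fetch indices for one pixel with CLAMP addressing.
// 'pitch' is the row stride in texels (an assumption of this sketch).
inline BilinearTaps clamp_taps(float u, float v, int width, int height, int pitch) {
    assert(width > 0 && height > 0 && pitch >= width);

    // Clamp to 0-1, scale to texel space, and shift by half a texel so that
    // texel centers land on integer coordinates.
    float x = std::min(std::max(u, 0.0f), 1.0f) * static_cast<float>(width) - 0.5f;
    float y = std::min(std::max(v, 0.0f), 1.0f) * static_cast<float>(height) - 0.5f;
    float fx = std::floor(x);
    float fy = std::floor(y);

    BilinearTaps t;
    t.frac_x = static_cast<int>((x - fx) * 256.0f);
    t.frac_y = static_cast<int>((y - fy) * 256.0f);

    // Even with u and v clamped, the 2x2 footprint can poke one texel past
    // either edge, so every tap coordinate is clamped on its own.
    int x0 = clamp_int(static_cast<int>(fx), 0, width - 1);
    int x1 = clamp_int(static_cast<int>(fx) + 1, 0, width - 1);
    int y0 = clamp_int(static_cast<int>(fy), 0, height - 1);
    int y1 = clamp_int(static_cast<int>(fy) + 1, 0, height - 1);

    t.index[0] = y0 * pitch + x0;
    t.index[1] = y0 * pitch + x1;
    t.index[2] = y1 * pitch + x0;
    t.index[3] = y1 * pitch + x1;
    return t;
}
```

Doing this for the four pixels of a quad gives the 16 addresses mentioned above. The trade-off against the bordered-texture approach is extra clamping work per pixel in exchange for not having to touch the texture storage.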

Afterward is the bilinear filtering, followed by the .bgra swizzle in the pixel shader. (D3D10 wants RGBA textures, so this swizzle is to read a BGRA image that has been aliased into that format.) WARP does the bilinear filtering in integer math by splitting into red/blue and green/alpha pairs. What's surprising is that WARP doesn't then convert the 32-bit pixels into floats -- it instead keeps them packed and does the swizzle in integer math. Chances are this only works for really simple shaders, but this avoids expensive unpacking and conversion to floats followed by conversion and packing back to bytes. It's too bad that WARP didn't take advantage of PSHUFB to do this; it seems that an oddity in the bilinear filtering causes one of the channels to be offset by 1 bit and thus the generated code uses a shift orgy instead to get everything in order.
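To illustrate the packed-integer approach, here is the textbook way of filtering 32-bit pixels two channels at a time, along with a red/blue swap done purely with masks and shifts. It is a sketch of the general technique and deliberately not WARP's exact arithmetic (the generated code uses 16-bit multiplies and carries an extra bit of precision); the function names are hypothetical, and pixels are assumed to be packed as 0xAABBGGRR.

```cpp
#include <cassert>
#include <cstdint>

// Blends two packed 8:8:8:8 pixels, weight f in 0..256 (0 = all a, 256 = all b).
// Bytes 0 and 2 (red/blue) are processed as one pair and bytes 1 and 3
// (green/alpha) as the other, so each 32-bit multiply handles two channels.
// Each 16-bit field holds at most 255 * 256, so fields never overflow into
// their neighbors.
inline std::uint32_t lerp_packed(std::uint32_t a, std::uint32_t b, unsigned f) {
    assert(f <= 256);
    const std::uint32_t mask = 0x00FF00FFu;
    std::uint32_t rb = ((a & mask) * (256u - f) + (b & mask) * f) >> 8;
    std::uint32_t ga = (((a >> 8) & mask) * (256u - f) + ((b >> 8) & mask) * f) >> 8;
    return (rb & mask) | ((ga & mask) << 8);
}

// Bilinear filter of a 2x2 texel block: two horizontal blends, one vertical.
inline std::uint32_t bilinear_packed(std::uint32_t p00, std::uint32_t p10,
                                     std::uint32_t p01, std::uint32_t p11,
                                     unsigned fx, unsigned fy) {
    std::uint32_t top = lerp_packed(p00, p10, fx);
    std::uint32_t bottom = lerp_packed(p01, p11, fx);
    return lerp_packed(top, bottom, fy);
}

// The .bgra swizzle on a packed pixel: swap bytes 0 and 2, keep bytes 1 and 3.
inline std::uint32_t swap_red_blue(std::uint32_t p) {
    return (p & 0xFF00FF00u) | ((p >> 16) & 0xFFu) | ((p & 0xFFu) << 16);
}
```

Nothing here ever leaves integer registers, which is the saving described above: no unpack to four floats per pixel, and no convert-and-pack on the way back out.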

The final section in the rasterization loop is the output stage:

movq xmm2,mmword ptr [ebx]
movq xmm1,mmword ptr [ebx+eax]
punpcklqdq xmm2,xmm1
movdqa xmm0,xmmword ptr [ebp-1A0h]
pblendvb xmm2,xmmword ptr [ebp-1D0h],xmm0
movq [ebx],xmm2
punpckhqdq xmm2,xmm2
movq [ebx+eax],xmm2

As I said earlier, WARP rasterizes in 2x2 quads instead of 4x1 strips, and therefore it has to fetch and store 8 bytes from two adjacent scan lines (MOVQ instructions). It packs them together into a single vector for blending and then unpacks it afterward, which in this case is probably slower than it would have been to split the pixel shader output and do two blends. The PBLENDVB instruction selects between parallel bytes in two different vectors and appears to be for color write mask support, where individual RGBA channels in the frame buffer can be enabled or disabled for write. In this case all color channels were enabled for write and there was no need to read or merge from the source, so this is all unnecessary. As with the perspective divide, though, it's pretty small fish compared to the rest of the loop.
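The shape of that output stage can be mimicked in scalar C++. This is a rough sketch with hypothetical names (blend_bytes, store_quad), and it makes no claim about what the mask encodes in WARP; it only shows the mechanics of a byte-wise select in the style of PBLENDVB applied to a 2x2 quad that spans two scan lines.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Byte-wise select in the style of PBLENDVB: wherever the top bit of a mask
// byte is set, the source byte replaces the destination byte.
inline void blend_bytes(std::uint8_t* dst, const std::uint8_t* src,
                        const std::uint8_t* mask, std::size_t n) {
    assert(dst != nullptr && src != nullptr && mask != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        if (mask[i] & 0x80) dst[i] = src[i];
    }
}

// Writes one 2x2 quad of 32-bit pixels. 'shaded' and 'mask' hold 16 bytes:
// the two top pixels followed by the two bottom pixels. 'pitch' is the
// distance in bytes between scan lines.
inline void store_quad(std::uint8_t* top_row, std::ptrdiff_t pitch,
                       const std::uint8_t* shaded, const std::uint8_t* mask) {
    std::uint8_t packed[16];
    std::memcpy(packed, top_row, 8);              // 8 bytes from the first scan line
    std::memcpy(packed + 8, top_row + pitch, 8);  // 8 bytes from the second
    blend_bytes(packed, shaded, mask, 16);        // one 16-byte select
    std::memcpy(top_row, packed, 8);              // split and write back
    std::memcpy(top_row + pitch, packed + 8, 8);
}
```

When every mask byte is set, the two reads and the select contribute nothing and the routine reduces to two 8-byte stores, which is the redundancy pointed out above.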

It's worth noting that WARP supports x64 as well as x86. I didn't spend much time looking at the generated 64-bit code, but it looks mostly the same except for fewer register spills due to the larger register file.

What this means

As far as I can tell, WARP is a well-written and performant software renderer. It's also trivial to enable in an existing D3D10/11 based program. You could do a lot worse than using WARP, and if you need a general fallback for a bit of 3D rendering I'd seriously consider it over another one or writing a custom one. That's assuming of course that you can use it -- it's a bummer that it's only available for Vista and up and requires a DX10 or DX11 based application. As with any other software renderer, WARP still doesn't perform miracles and if you're doing any non-trivial amount of 3D graphics it will not make up for a missing 3D accelerator.

What it isn't necessarily good for is 2D image operations. It will do image blits and compositing just fine, and undoubtedly it'd be better than a random routine written on top of GetPixel() and PutPixel(). You could use it for rendering UI and get acceptable performance. However, WARP will get stomped hard by specialized 2D rendering code, as just its perspective correction code alone is bigger than the inner loop of a bilinear stretch routine, and it is severely hampered by the complexity of emulating texture fetches on the CPU. I haven't done any benchmarks but it's possible that GDI+ is faster. Therefore, even though it takes advantage of SSE2 and JIT compilation, there are still much better options for a 2D image processing core.

Appendix: WARP generated code

This is the inner loop generated by WARP for the pixel shader.

paddd xmm0,xmmword ptr [ebp-170h]
movaps xmm1,xmmword ptr [ebp-100h]
mulps xmm1,xmmword ptr [ebp-160h]
movdqa xmm2,xmmword ptr [ebp-150h]
pcmpgtd xmm2,xmm0
addps xmm1,xmmword ptr [ebp-140h]
movaps xmm0,xmmword ptr [ebp-50h]
mulps xmm0,xmmword ptr [ebp-160h]
movaps xmm3,xmmword ptr ds:[7EF917A0h]
divps xmm3,xmm1
addps xmm0,xmmword ptr [ebp-180h]
movaps xmm1,xmmword ptr [ebp-80h]
mulps xmm1,xmmword ptr [ebp-160h]
mulps xmm0,xmm3
addps xmm1,xmmword ptr [ebp-120h]
movaps xmm4,xmmword ptr ds:[7EF91790h]
minps xmm4,xmm0
mulps xmm1,xmm3
movaps xmm0,xmmword ptr ds:[7EF91780h]
maxps xmm0,xmm4
movaps xmm3,xmmword ptr ds:[7EF91790h]
minps xmm3,xmm1
mulps xmm0,xmmword ptr [ebp-40h]
movaps xmm1,xmmword ptr ds:[7EF91780h]
maxps xmm1,xmm3
addps xmm0,xmmword ptr ds:[7EF91760h]
mulps xmm1,xmmword ptr [ebp-0D0h]
movaps xmm3,xmm0
cmpeqps xmm0,xmm0
addps xmm1,xmmword ptr ds:[7EF91760h]
pand xmm0,xmm3
movaps xmm3,xmm1
cmpeqps xmm1,xmm1
mulps xmm0,xmmword ptr ds:[7EF91750h]
pand xmm1,xmm3
cvtps2dq xmm0,xmm0
mulps xmm1,xmmword ptr ds:[7EF91750h]
movdqa xmm3,xmm0
psrad xmm0,8
pand xmm3,xmmword ptr ds:[7EF91740h]
movdqa xmm4,xmmword ptr ds:[7EF91730h]
paddd xmm4,xmm0
movdqa xmm5,xmm3
pslld xmm3,10h
pmaxsd xmm0,xmmword ptr ds:[7EF917E0h]
por xmm3,xmm5
pminsd xmm0,xmmword ptr [ebp-0C0h]
pmaxsd xmm4,xmmword ptr ds:[7EF917E0h]
cvtps2dq xmm1,xmm1
pminsd xmm4,xmmword ptr [ebp-0C0h]
movdqa xmm5,xmm1
psrad xmm1,8
pand xmm5,xmmword ptr ds:[7EF91740h]
movdqa xmm6,xmmword ptr ds:[7EF91730h]
paddd xmm6,xmm1
movdqa xmm7,xmm5
pslld xmm5,10h
pmaxsd xmm1,xmmword ptr ds:[7EF917E0h]
por xmm5,xmm7
pminsd xmm1,xmmword ptr [ebp-0E0h]
pmaxsd xmm6,xmmword ptr ds:[7EF917E0h]
pmulld xmm1,xmmword ptr [ebp-0F0h]
pminsd xmm6,xmmword ptr [ebp-0E0h]
movdqa xmmword ptr [ebp-190h],xmm3
movdqa xmm3,xmm0
paddd xmm0,xmm1
paddd xmm1,xmm4
movd eax,xmm0
pextrd ecx,xmm0,1
movd xmm7,dword ptr [edx+eax*4]
pextrd eax,xmm0,2
pinsrd xmm7,dword ptr [edx+ecx*4],1
pextrd ecx,xmm0,3
pinsrd xmm7,dword ptr [edx+eax*4],2
movd eax,xmm1
pinsrd xmm7,dword ptr [edx+ecx*4],3
pextrd ecx,xmm1,1
movdqa xmm0,xmmword ptr ds:[7EF91720h]
pand xmm0,xmm7
psrlw xmm7,8
movdqa xmmword ptr [ebp-1A0h],xmm2
movd xmm2,dword ptr [edx+eax*4]
pextrd eax,xmm1,2
pinsrd xmm2,dword ptr [edx+ecx*4],1
pextrd ecx,xmm1,3
pinsrd xmm2,dword ptr [edx+eax*4],2
pmulld xmm6,xmmword ptr [ebp-0F0h]
pinsrd xmm2,dword ptr [edx+ecx*4],3
paddd xmm3,xmm6
movdqa xmm1,xmmword ptr ds:[7EF91720h]
pand xmm1,xmm2
psrlw xmm2,8
movd eax,xmm3
pextrd ecx,xmm3,1
movdqa xmmword ptr [ebp-1B0h],xmm2
movd xmm2,dword ptr [edx+eax*4]
pextrd eax,xmm3,2
pinsrd xmm2,dword ptr [edx+ecx*4],1
pextrd ecx,xmm3,3
pinsrd xmm2,dword ptr [edx+eax*4],2
paddd xmm4,xmm6
pinsrd xmm2,dword ptr [edx+ecx*4],3
movd eax,xmm4
movdqa xmm3,xmmword ptr ds:[7EF91720h]
pand xmm3,xmm2
psrlw xmm2,8
movd xmm6,dword ptr [edx+eax*4]
pextrd eax,xmm4,1
pextrd ecx,xmm4,2
pinsrd xmm6,dword ptr [edx+eax*4],1
pextrd eax,xmm4,3
pinsrd xmm6,dword ptr [edx+ecx*4],2
movdqa xmm4,xmm0
psllw xmm0,8
pinsrd xmm6,dword ptr [edx+eax*4],3
psubw xmm3,xmm4
movdqa xmm4,xmmword ptr ds:[7EF91720h]
pand xmm4,xmm6
psrlw xmm6,8
pmullw xmm3,xmm5
movdqa xmmword ptr [ebp-1C0h],xmm6
movdqa xmm6,xmm7
psllw xmm7,8
paddw xmm3,xmm0
psubw xmm2,xmm6
psrlw xmm3,1
pmullw xmm2,xmm5
movdqa xmm0,xmm1
psllw xmm1,8
paddw xmm2,xmm7
psubw xmm4,xmm0
psrlw xmm2,1
pmullw xmm4,xmm5
movdqa xmm0,xmmword ptr [ebp-1B0h]
psllw xmm0,8
paddw xmm4,xmm1
movdqa xmm1,xmmword ptr [ebp-1C0h]
psubw xmm1,xmmword ptr [ebp-1B0h]
psrlw xmm4,1
pmullw xmm1,xmm5
psubw xmm4,xmm3
paddw xmm1,xmm0
movdqa xmm0,xmmword ptr [ebp-190h]
psllw xmm0,8
psrlw xmm1,1
movdqa xmm5,xmm0
pmulhuw xmm0,xmm4
psraw xmm4,0Fh
psubw xmm1,xmm2
pand xmm4,xmm5
movdqa xmm6,xmm5
pmulhuw xmm5,xmm1
psubw xmm0,xmm4
psraw xmm1,0Fh
paddw xmm0,xmm3
pand xmm1,xmm6
movdqa xmm3,xmmword ptr ds:[7EF91710h]
pand xmm3,xmm0
psubw xmm5,xmm1
psrld xmm0,10h
paddw xmm5,xmm2
psrld xmm3,7
movdqa xmm1,xmmword ptr ds:[7EF91710h]
pand xmm1,xmm5
psrld xmm5,10h
pslld xmm3,10h
psrld xmm5,7
psrld xmm1,7
pslld xmm5,18h
pslld xmm1,8
psrld xmm0,7
movq xmm2,mmword ptr [ebx]
por xmm0,xmm1
mov eax,dword ptr [ebp-88h]
movq xmm1,mmword ptr [ebx+eax]
por xmm0,xmm3
punpcklqdq xmm2,xmm1
por xmm0,xmm5
movaps xmm1,xmmword ptr [ebp-160h]
addps xmm1,xmmword ptr ds:[7EF91700h]
movdqa xmmword ptr [ebp-1D0h],xmm0
movdqa xmm0,xmmword ptr [ebp-1A0h]
pblendvb xmm2,xmmword ptr [ebp-1D0h],xmm0
movq mmword ptr [ebx],xmm2
punpckhqdq xmm2,xmm2
movq mmword ptr [ebx+eax],xmm2
movdqa xmm0,xmmword ptr [ebp-130h]
paddd xmm0,xmmword ptr ds:[7EF916F0h]
lea ebx,[ebx+8]
sub esi,1
movdqa xmmword ptr [ebp-130h],xmm0
movaps xmmword ptr [ebp-160h],xmm1
jne 7ef91260
