As I said in the previous blog entry, one of the new features in 1.6.11 is the ability to bind custom vertex and pixel shaders to a video display pane. GPUs are great for massive image manipulation (unless you happen to have one with "Extreme" in the name), so it's only natural that they'd be useful for video. More importantly, though, it is easier to quickly write an optimized shader than an optimized software filter. That also makes GPU shaders attractive for rapid prototyping, which is what I hope the new shader support in 1.6.11 will enable.
To activate custom video shader mode, select Options > Preferences from the menu, jump to Display, then enable DirectX, Direct3D, and effect support. Then enter the .fx filename — relative paths are from the program directory, absolute paths are used directly. Dismiss the dialog, then deselect and reselect VirtualDub for the change to take effect.
D3DX DLL hassles
Before I begin, I have to rant about D3DX.
D3DX is the library that ships with the Microsoft DirectX SDK and which includes several useful, or rather, nearly essential, components such as the shader compiler and assembler. Starting sometime around December 2004, the D3DX library was changed from being a statically linked library to a DLL. This has the advantages of avoiding compatibility issues between C++ compilers and reducing working set footprint. However, Microsoft also:
- Removed the static library option.
- Renamed the DLL with each successive version: d3dx9_24.dll, d3dx9_25.dll, d3dx9_26.dll....
- Changed interfaces with each version: ID3DXBaseEffect has different methods and IIDs between the December and April SDKs, but keeps the same name.
- Allowed it to be redistributed only with DirectXSetup, which installs it into the system folder and thus prohibits side-by-side installs.
- Didn't include it in the DirectX 9.0c runtime that's on Windows Update.
- Didn't modify the DirectXSetup installer UI between SDK versions, so it still says it's installing DirectX 9.0c even if it's installing a D3DX DLL you don't have.
So, basically, the D3DX DLLs are treated as system DLLs, but they don't come with the OS or any patches to the OS, and every application is supposed to include one, even though it's often bigger than the application that uses it. Supposedly, part of the reasoning behind this is that it allows Microsoft to update a D3DX version after the fact if it happens to contain a security flaw.
I don't know about other ISVs, but personally, I like to ship applications in as few locally-contained files as possible and for them to never change unless I explicitly update them. From what I gather, this is becoming quite a mess, because applications are forced to start the DirectX installer in their setup process, and lots of people are canceling it because they think they already have DirectX 9.0c installed, only to find that the application doesn't start because of a cryptic link error on the missing D3DX library.
VirtualDub 1.6.11 uses d3dx9_25.dll, which is used by the April 2005 SDK. It dynamically links to D3DX so that the DLL is only required if you are using the video shader support. The distribution doesn't include the D3DX DLL for various reasons, but there is a link in the help file to download the installer from Microsoft. As of today, it's at this location:
Some of you will already have it, as I think it's installed by some popular games.
Video shader compilation and invocation
The video shader support makes use of the D3DX effect system, which means that it takes standard .fx files. The format of these files is documented in the Microsoft DirectX SDK documentation; they allow vertex shaders, pixel shaders, and multiple passes to be defined. This makes them very flexible and powerful. In addition, VirtualDub will reparse and recompile the .fx file every time it gains focus, so iterating on a shader simply requires that you task switch to your editor, save the file, and task switch back. Effect compilation is very fast so this is much less of a problem than you might think.
The effect file can supply up to three techniques, which are bound to the point, bilinear, and bicubic filtering modes on the context menu (right-click menu) of the display panes. These techniques do not have to actually implement those filtering modes; this is simply a convenient hack to allow three different sets of shaders to be used. Once a technique is selected, VirtualDub uses it to render a single quad on screen with the source image bound as a texture. The shader can then do whatever it wants to the backbuffer, using multiple passes if necessary.
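To make the structure concrete, here is a minimal sketch of an effect file with the three techniques. The texture and sampler binding names (srctex, src) are placeholders I've made up for illustration; the actual names VirtualDub binds are documented in its help file.

```hlsl
// Minimal .fx skeleton: passthrough vertex shader, copy pixel shader,
// three techniques mapped to the three context-menu filter modes.
texture srctex;                 // source video frame (bound by the host; name assumed)
sampler src = sampler_state {
    Texture   = <srctex>;
    MinFilter = LINEAR;
    MagFilter = LINEAR;
};

struct VS_OUT {
    float4 pos : POSITION;
    float2 uv  : TEXCOORD0;
};

VS_OUT PassthroughVS(float4 pos : POSITION, float2 uv : TEXCOORD0) {
    VS_OUT o;
    o.pos = pos;
    o.uv  = uv;
    return o;
}

float4 CopyPS(float2 uv : TEXCOORD0) : COLOR0 {
    return tex2D(src, uv);      // sampler state supplies the filtering
}

// These map onto the point / bilinear / bicubic context-menu entries,
// but they can contain any shaders you like.
technique point_mode    { pass p0 { VertexShader = compile vs_1_1 PassthroughVS(); PixelShader = compile ps_1_1 CopyPS(); } }
technique bilinear_mode { pass p0 { VertexShader = compile vs_1_1 PassthroughVS(); PixelShader = compile ps_1_1 CopyPS(); } }
technique bicubic_mode  { pass p0 { VertexShader = compile vs_1_1 PassthroughVS(); PixelShader = compile ps_1_1 CopyPS(); } }
```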
As documented in the help file, there is support for texture shaders to pre-initialize textures. These are software-emulated "shaders" that are used to initialize custom lookup textures. When I first discovered this feature I thought it would be horribly slow, as some of the D3DX image processing routines have been in the past, but it turns out they're decently fast, and generating several small 32x32 lookup textures at startup is no big deal.
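A texture shader is just a function compiled against the software tx_1_0 profile and evaluated once per texel at load time; a sketch, with the caveat that how the function gets attached to a texture (annotation vs. naming convention) is host-specific and covered in the help file:

```hlsl
// Texture shader sketch: fills a small lookup texture with a gamma ramp.
// POSITION is the normalized texel center, PSIZE the texel size -- the
// standard tx_1_0 input semantics.
float4 GammaRampTS(float2 pos : POSITION, float2 texelSize : PSIZE) : COLOR0 {
    // 1D gamma lookup packed into the U axis of a 2D texture.
    float g = pow(pos.x, 1.0/2.2);
    return float4(g, g, g, 1);
}
```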
Video shaders are, unfortunately, only really usable on hardware that supports pixel shader 2.0 or higher; for NVIDIA cards this means GeForce FX or later, and for ATI, Radeon 9500 or later. It is possible to write video shaders for ps1.4 (Radeon 8500), although that is rather limiting, and while useful effects are still possible with pixel shader 1.1-1.3 (GeForce3/4), they become quite difficult without multipass and precomputed textures.
Writing video shaders
The simplest actually-useful shader just samples the source texture once per pixel; this gives you the point and bilinear display modes. After that, you can do a lot of interesting but not-so-useful effects with Sobel edge-detection filters and contrast/brightness/gamma adjustments, but with only one texture lookup there aren't many effects you'd actually want to use for previewing or watching a video.
A custom pixel shader, as it turns out, isn't enough to do very interesting (and actually usable) video shading effects. The main problem is that with only a pixel shader, you have to do a lot of arithmetic per pixel that could otherwise be factored out. For ps2.0 and higher profiles this is merely a performance annoyance in wasted clocks, but for ps1.4 and earlier it can be a deal-killer: among other problems, it can force all of your texture lookups to be dependent texture reads, which can seriously cripple image processing capability. So, despite only processing four vertices, the vertex shader is quite important, as it can precompute as many as 40 interpolated values that then don't have to be computed in the pixel shader. Convolution filters, in particular, can have up to eight tap locations computed by the vertex shader.
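As a sketch of the factoring-out idea, here is a 3-tap horizontal blur where the vertex shader precomputes all three tap coordinates, so the pixel shader does no texture-coordinate math at all and every lookup stays a plain, non-dependent read even on ps1.1. The texsize parameter name is an assumption; use whatever the host actually binds.

```hlsl
// Vertex shader precomputes one tap coordinate per interpolator.
float2 texsize;                 // source size in texels (assumed host-supplied)

struct VS_OUT {
    float4 pos : POSITION;
    float2 uv0 : TEXCOORD0;     // left tap
    float2 uv1 : TEXCOORD1;     // center tap
    float2 uv2 : TEXCOORD2;     // right tap
};

VS_OUT TapVS(float4 pos : POSITION, float2 uv : TEXCOORD0) {
    VS_OUT o;
    float du = 1.0 / texsize.x;         // one texel step in U
    o.pos = pos;
    o.uv0 = uv + float2(-du, 0);
    o.uv1 = uv;
    o.uv2 = uv + float2(+du, 0);
    return o;
}

// Pixel shader: three straight fetches plus a weighted sum --
// cheap enough to fit comfortably in ps_1_1.
sampler src : register(s0);

float4 BlurPS(float2 uv0 : TEXCOORD0,
              float2 uv1 : TEXCOORD1,
              float2 uv2 : TEXCOORD2) : COLOR0 {
    return 0.25*tex2D(src, uv0) + 0.5*tex2D(src, uv1) + 0.25*tex2D(src, uv2);
}
```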
ps1.4, by the way, is very annoying. You would never know it from the SDK documentation, but this shader model is only needed for one range of video cards, the RADEON 8xxx series; everything else either doesn't support it or has ps2.0, which is vastly more powerful. ps1.4 has a bunch of useful abilities that aren't in ps1.1-1.3, but when I tried recoding some of my shaders into ps1.4 so a friend could try them, I ran into precision problems with dependent texture reads. The problem is that in ps1.4, dependent texture coordinates are computed using the fixed-point color ALUs instead of the floating-point addressing ALUs. While the 8500 does have significantly better precision and range than earlier parts, its 4.12 format isn't enough for large textures: with a 1024x512 texture, the 12 fractional bits minus the 10 bits needed to address 1024 texels leave only two bits of subtexel positioning in the U axis, which shows clearly in warpsharp algorithms.
Bicubic interpolation turns out to be difficult to do efficiently in a pixel shader because it requires 16 taps to be sampled. The easiest way to do it is to just compute the entire filter kernel in the pixel shader and sample 16 times, but that results in a pixel shader of about 40-50 instruction slots. NVIDIA used to have a sample shader for this in the FX Composer distribution. This can be reduced dramatically by precomputing the taps in a filter texture and doing two separable filter passes, dropping it down to 10 texlds + 8 ALU ops per pixel. This is what VirtualDub's normal DX9 display minidriver does in bicubic mode. That requires that the sampling pattern be regular, though, which won't work for displacement mapping. For bicubic kernels that consist only of positive values, such as a B-spline kernel, it is possible to do bicubic with a per-pixel varying sample location, by doing four bilinear samples with the appropriate offsets; GPU Gems 2 has an example. It can be done for cardinal spline kernels with negative lobes, but it requires a rather nasty hack (complementing every other sample in a checkerboard pattern).
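The four-bilinear-sample trick for positive kernels can be sketched as follows. This follows the GPU Gems 2 approach rather than VirtualDub's own minidriver code, and texsize is an assumed host parameter: because all B-spline weights are positive, each pair of adjacent taps collapses into one bilinear fetch whose position is offset within the pair by the weight ratio.

```hlsl
// B-spline bicubic with a per-pixel varying sample location,
// using 4 bilinear fetches instead of 16 point fetches (ps_2_0).
float2 texsize;                 // source size in texels (assumed)
sampler src : register(s0);     // must have bilinear filtering enabled

float4 BSplineBicubicPS(float2 uv : TEXCOORD0) : COLOR0 {
    float2 coord = uv * texsize - 0.5;
    float2 f     = frac(coord);
    float2 base  = coord - f;

    // Cubic B-spline weights for the four taps in each axis.
    float2 w0 = (1.0/6.0) * (1-f)*(1-f)*(1-f);
    float2 w1 = (1.0/6.0) * (3*f*f*f - 6*f*f + 4);
    float2 w2 = (1.0/6.0) * (-3*f*f*f + 3*f*f + 3*f + 1);
    float2 w3 = (1.0/6.0) * f*f*f;

    // Collapse taps {0,1} and {2,3} into one bilinear fetch each.
    float2 s0 = w0 + w1, s1 = w2 + w3;
    float2 t0 = (base - 0.5 + w1/s0) / texsize;   // inside the left pair
    float2 t1 = (base + 1.5 + w3/s1) / texsize;   // inside the right pair

    return s0.x * (s0.y*tex2D(src, float2(t0.x, t0.y)) +
                   s1.y*tex2D(src, float2(t0.x, t1.y)))
         + s1.x * (s0.y*tex2D(src, float2(t1.x, t0.y)) +
                   s1.y*tex2D(src, float2(t1.x, t1.y)));
}
```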
Warpedge, the test case
The primary test case I used for the video shader support was an experimental algorithm I'd been working on called warpedge, which is a variant of warpsharp. Warpsharp is an algorithm I originally found in a filter for the GIMP image processing tool, which sharpens an image by warping it toward the interior of edges. It is implemented by computing a bump map from edges in the image, then taking the gradient of the bump map to produce a displacement map. Warpsharp can produce some very impressive results, but it is also sensitive to noise, and can't really darken or brighten edges, only narrow them. There's a software version of the algorithm on my filters page.
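For illustration only, here is a very rough sketch of the warpsharp displacement step; this is not the author's implementation, just the bump-map-gradient idea in shader form, with strength and texsize as made-up parameters and the sign and scale of the displacement chosen arbitrarily.

```hlsl
// Warpsharp core idea: gradient of an edge-intensity bump map drives
// a displaced source lookup (a dependent read, so ps_2_0 territory).
float2 texsize;                 // source size in texels (assumed)
float  strength;                // warp amount (assumed)
sampler src  : register(s0);    // source image
sampler bump : register(s1);    // precomputed blurred edge-intensity map

float4 WarpSharpPS(float2 uv : TEXCOORD0) : COLOR0 {
    float2 du = float2(1.0/texsize.x, 0);
    float2 dv = float2(0, 1.0/texsize.y);

    // Central-difference gradient of the bump map.
    float2 grad;
    grad.x = tex2D(bump, uv + du).x - tex2D(bump, uv - du).x;
    grad.y = tex2D(bump, uv + dv).x - tex2D(bump, uv - dv).x;

    // Sample the source displaced along the gradient.
    return tex2D(src, uv - strength * grad);
}
```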
Warpedge doesn't try to narrow edges; instead, it simply attempts to sharpen them. This is done primarily in two ways: by normalizing and limiting the length of gradient vectors, and by re-applying the difference between a straight interpolated image and the warped image. This seems to work decently well, although admittedly I tested it much more on anime than on real-world material. Also, warpedge is meant for interpolation; a stretch operation is part of the algorithm, and it doesn't work very well on 1:1 processing.
Here's the main version:
This is a four-pass shader; it could probably be reduced to three by merging the blur and gradient passes, but I don't know if it's worthwhile. The final pass barely fits in ps_2_0, because it computes a gradient vector and two bicubic A=-0.75 fetches — I had to hack on it a bit to get it to fit. (Such is the curse of prototyping on a GeForce 6800.) As it is, it takes 18 texture and 59 arithmetic instruction slots. The "bicubic" mode on the right-click context menu of the display pane selects the warpedge algorithm, while the "bilinear" mode selects a bicubic algorithm, and "point" selects bilinear. Yes, this is confusing, but you can't change the names in the context menu from the shader effect file. At least this way you can quickly flip between the modes for comparison.
Once I had finished writing the ps_2_0 version... and, uh, testing it on several episodes of Mai-HiME... I started working on the ps_1_1 version for a challenge. The problem with ps_1_1 is the pitiful support for dependent texture reads. I tried texbem and texbeml at first, but while the shader worked fine on a 6800, it didn't work correctly on a GeForce 4 Ti4600, because the older hardware expects a signed texture format, not the unsigned format that can be rendered to. In the end I found that texm3x2pad + texm3x2tex doesn't suffer from the same issue, so I was able to get a version of warpedge to work: http://www.virtualdub.org/other/warpedge1_1.fx
This version isn't quite as effective as it only uses bilinear interpolation. The mode mappings are: point -> point, bilinear -> bilinear, bicubic -> warpedge. In theory I could get bicubic to work, but it'd take a lot more passes — I'd estimate something like two more passes for the base bicubic layer, and four more passes for the displacement layer — and unfortunately, it'd probably need three render target textures, whereas VirtualDub currently only allows access to two. It might be possible to use the backbuffer as the third, but precision when blending in the framebuffer isn't great.
Also, I continued my tradition of breaking the compiler while writing this one. Attempting to do "dcl_2d t0.xy" in a ps_1_1 asm shader resulted in this nice error:
dep.fx(456): error X5487: Reserved bit(s) set in dcl info token! Aborting validation.
I also managed to trip an internal compiler error about uninitialized components in the vertex shader, but the error went away as mysteriously as it came.
Overall, though, I'm pretty satisfied with the results for a first pass, especially since it runs in real-time at 1920x1200, and isn't overly sensitive to noise.