## § ¶Implementing cardinal spline bicubic filtering on 3D hardware

I had originally intended to post a massive pair of blog entries about bicubic filtering, but I committed the sin of Exiting Without Saving, so I guess I'll wing it instead.

Bicubic filtering is generally the next step up from bilinear filtering; it's not nearly as big of an improvement as bilinear is from point sampling (a.k.a. no filtering), but it's noticeably better if you are doing enlargement or gradient determination. In terms of implementing it on hardware, all decent 3D graphics hardware supports bilinear filtering, but none of them support bicubic. (Well, ATI's R600 is touted as doing so, but I haven't heard any information about how you'd actually use it, so as far as I'm concerned it doesn't exist.) That means bicubic has to be emulated, and as usual, the cheaper the implementation, the better.

Bicubic filtering, or rather, cardinal spline bicubic filtering, is a 4x4 (16 tap) filter composed of two 1-D 4 tap cubic filters. Well, actually only for interpolation, but that's still a very useful case. The main difficulty in implementing this in a shader is that the filter kernel contains two negative lobes, which corresponds to the outer two taps on the discrete filter. GPU Gems 2 has a way to implement B-spline bicubic via the bilinear filter hardware, but that relies on all of the taps being positive. NVIDIA also has a direct 2D implementation in bicubic.fx in their SDK, but it's absurdedly expensive, at ~50 clocks per pixel -- pretty close to not being real-time.

How to improve it, then?

Generally, if you can help it, you never want to implement a bicubic filter as a direct 2D filter. From a pure operation count, implmenting it as two 1-D horizontal/vertical passes takes half the operations, and it drops further when you realize that one of the passes often operates on fewer pixels than the other. This only works if you are interpolating pixels on an axis-aligned grid, however. That's fine if you're doing a resize; it's not applicable if you're doing something like a rotation. Anyway, the straightforward way of doing this in 3D is to use a filter texture containing the four weights for the four taps, and then four texture samplers each fetching the source taps. The pixel shader then just adds the weighted source pixels, i.e. w0*x0 + w1*x1 + w2*x2 + w3*x3. There are a surprising number of caveats in doing this (many of these apply to software implementations as well):

- The outer two weights are negative. This is easily rectified: store the magnitude and negate for free in the shader.
- Intermediate sums are outside of the range [0,1]. Specifically, the sum of the inner two pixels will exceed 1 by the same amount that the sum of the outer two taps is negative, by as much as 0.25. If you are working in pixel shader 1.1, you need to do some biasing tricks to avoid saturation on either end.
- The weights in the weight texture must be normalized
**after**conversion to fixed point. Consider: 1/4, 1/4, 1/4, 1/4. There is no way you can consistently round such that converting those weights to 8-bit fixed point would add up to 255. (And do remember that a 32-bit texture uses 1/255 as its unit, not 1/256. I hate it when people write YCbCr conversion shaders and use -0.5 as the chroma bias and not -128/255.) Don't just use a floating point texture, as that's the sissy way. - If you are using wrapping for the filter texture, you must use point sampling as the filter texture is discontinuous when the 4 tap window steps.
- The source taps have to use point samping.

The worst issue I've found, though, is in the UV positioning. The filter pattern has to restart **exactly** when the source taps step. If it doesn't, you sometimes get an ugly glitch on screen where the filter is off by almost a full pixel. For compatibility with pixel shader 1.1, I've used wrapping for the filter texture, which means that it relies on extremely high precision from the texture coordinate interpolators -- to give you some idea, 6-bit subpixel positioning with a 512x512 source texture requires 16 bit UV accuracy. I've never been able to get this quite right. You can see this if you enable Direct3D display mode and bicubic filtering in VirtualDub 1.6.19. NVIDIA's bicubic.fx instead clamps the filter texture coordinates and uses frac() in the pixel shader, but that doesn't work on ps1.1 and is slower due to a dependent texture read. I haven't seen that version glitch, but I suspect that it would if a non-power-of-two texture were used (I couldn't test this because FX Composer always uses pow2 textures).

So what's the alternative?

If you look closely at the bicubic filter kernel and how the four tap weights vary, you'll notice interesting behavior with regard to the stepping boundary that's causing the problem. Specifically, the outer two taps (negative) always go to zero at that boundary, so no glitch is possible there; it's only the inner two (positive) taps that are problematic. The trick I came up with is to replace one of the two point-sampled inner taps with a linearly interpolated tap and have the linear and point taps both centered. Given two original inner tap weights (w1, w2), the linearly interpolated tap weight (wi) and point sampled tap weight (wp) can be computed as follows:

t = phase (0 - 0.5)

w1 = wi*(1-t) + wp

w2 = wi*twi = w2/t

wp = w2*(1-t)/t - w1

The equations for t in [0.5, 1] can be computed analogously by moving the +wp contribution from w1 to w2, corresponding to when the point-sampled tap hops over a pixel; alternatively, just mirror it as the results are symmetric. There's a little problem with dividing by zero at t=0 or t=1 -- you can solve that either by substituting in the original cubic tap weight expressions for w1 and w2 and determining the limits, or simply fudge them (w2=0 at t=0 and w1=0 at t=1).

Why is this advantageous? Well, for one thing, wi and wp are also the same at t=0 and t=1, meaning that it's also possible to enable linear filtering on the filter texture (although that may introduce precision-related normalization problems in fixed point). More importantly, however, it moves the critical point where the point sampled tap hops from t=0/1 to t=0.5. At that point w1==w2 and if you run that through the above equations, you'll find that the contribution from that tap drops to *zero*, meaning that a glitch there is insignificant. The result is much less sensitivity to the precision of the texture interpolators, and thus a much more robust solution.

I'll end with a few teasers. One, there are a number of patterns in the cubic filter equations that I haven't exploited, one being that the sum of the outer two weights is At(1-t) and the sum of the inner two is (1 - At(1-t)). Two, it's possible to implement 1D cardinal spline cubic filtering on a GeForce 3/4 in a single pass, although not with great accuracy. The above solution doesn't work because you need five samplers, one for the filter kernel and four for the source taps. I have been able to get it down to four samplers, however, using three bump mapped, luma corrected fetches (tex + texbeml + texbeml + texbeml). Three, it's possible to accelerate the direct 2D solution, although in a totally different manner: XORing a checkerboard into the source texture. My current implementation, which I use for displacement mapping, requires 9 texture fetches instead of the regular 18.