## Implementing ELA in a pixel shader

I've been working on rewriting video algorithms in pixel shaders for 3D acceleration lately, and one of the sticking points I hit was Edge-Based Line Averaging (ELA) interpolation.

ELA is a common spatial interpolation algorithm for deinterlacing and works by trying several angles around the desired point and averaging between the points with the lowest absolute difference. The angles are chosen to be regular steps in sample location, i.e. (x+n, y-1) and (x-n, y+1) for n being small integers. This produces reasonable output for cases where a temporal or motion-based estimation is not available. The specific variant I'm dealing with is part of the Yadif deinterlacing algorithm, which checks three horizontally adjacent pixels for each angle and only picks the farthest two if the intermediate angle is a better match as well. In other words:

for dx = -2 to +2:
    error[dx] = difference(top[dx - 1], bottom[-dx - 1])
              + difference(top[dx], bottom[-dx])
              + difference(top[dx + 1], bottom[-dx + 1])

best_offset = 0

for dx = -1 down to -2:
    if error[dx] < error[best_offset]:
        best_offset = dx
    else:
        break

for dx = +1 to +2:
    if error[dx] < error[best_offset]:
        best_offset = dx
    else:
        break

result = average(top[best_offset], bottom[-best_offset])
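For reference, the selection logic above can be written out as scalar C. This is a sketch under my own assumptions -- 8-bit luma rows, with `top` and `bottom` pointing at the samples directly above and below the pixel being interpolated (so indices -3 to +3 must be valid) -- and the function name is mine, not VirtualDub's:

```c
#include <stdlib.h>

/* ELA selection as in the pseudocode above. top/bottom point at the
   sample directly above/below the output pixel; the caller must
   guarantee that offsets -3..+3 are addressable on both rows. */
int ela_interpolate(const unsigned char *top, const unsigned char *bottom)
{
    int error[5];   /* indexed by dx + 2, for dx in [-2, +2] */
    int dx, best_offset = 0;

    for (dx = -2; dx <= 2; ++dx)
        error[dx + 2] = abs(top[dx - 1] - bottom[-dx - 1])
                      + abs(top[dx    ] - bottom[-dx    ])
                      + abs(top[dx + 1] - bottom[-dx + 1]);

    /* walk outward to the left, stopping at the first non-improvement */
    for (dx = -1; dx >= -2; --dx) {
        if (error[dx + 2] < error[best_offset + 2])
            best_offset = dx;
        else
            break;
    }

    /* same walk to the right */
    for (dx = 1; dx <= 2; ++dx) {
        if (error[dx + 2] < error[best_offset + 2])
            best_offset = dx;
        else
            break;
    }

    /* average along the chosen angle, with rounding */
    return (top[best_offset] + bottom[-best_offset] + 1) >> 1;
}
```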

This should be a relatively simple translation to a pixel shader -- convert each source pixel access to a texture sample. Not. It turns out that under the Direct3D ps_2_0 profile, which is what I need to target, there aren't enough temporary registers to run this algorithm. In order to run the algorithm, at least 14 source pixel fetches need to be done, and there are only 12 temp registers in ps2.0. The HLSL compiler valiantly tries to squeeze everything in and fails. Nuts.

There is an important caveat to this implementation tack, which is that I had source pixels mapped in AoS (array of structures) format, i.e. a single pixel held YCbCr components and an unused alpha channel. The CPU implementation of this algorithm, at least the way I wrote it in VirtualDub 1.9.1+, uses SoA (structure of arrays) orientation for speed. SoA arranges the data as planes of identical components, so instead of mixing components together you fetch a bunch of Y values across multiple pixels, then a bunch of Cb values, a bunch of Cr values, and so on. I decided to try this in the pixel shader, since texture fetches were my main bottleneck. It looked something like this:

float4 top0 = tex2D(src, top_uv0);   // top Y 0-3
float4 top1 = tex2D(src, top_uv1);   // top Y 4-7
float4 top2 = tex2D(src, top_uv2);   // top Y 8-11
float4 bot0 = tex2D(src, bot_uv0);   // bottom Y 0-3
float4 bot1 = tex2D(src, bot_uv1);   // bottom Y 4-7
float4 bot2 = tex2D(src, bot_uv2);   // bottom Y 8-11

float4 error_left2 = abs(top0 - bot1)
                   + abs(float4(top0.yzw, top1.x) - float4(bot1.yzw, bot2.x))
                   + abs(float4(top0.zw, top1.xy) - float4(bot1.zw, bot2.xy));

Switching to SoA in a pixel shader nullifies some of the advantages of the GPU, since GPU vectors aren't as wide as those of fixed-point SIMD hardware (4 lanes vs. SSE2's 16 bytes), and because some GPU hardware doesn't directly support the swizzles you need to emulate a shift. It also largely nullifies the advantage of having texture samplers, since you can no longer address the source by individual samples. Well, it turns out in this case that the extra swizzling made the situation even worse than in the AoS case, because the compiler didn't even get halfway down the shader before it gave up.
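The shift-emulating swizzles are easier to see spelled out on the CPU. This is purely an illustration of the data movement (not shader or VirtualDub code): with four Y samples packed per register, reading the same row shifted by one or two samples means recombining lanes from two adjacent registers, exactly what `float4(top0.yzw, top1.x)` does above:

```c
/* Illustration of the swizzle-as-shift trick: a and b are two
   adjacent packed groups of four samples from the same row. */
typedef struct { float x, y, z, w; } vec4;

/* equivalent of float4(a.yzw, b.x): the row shifted left one sample */
vec4 shift_left1(vec4 a, vec4 b)
{
    vec4 r = { a.y, a.z, a.w, b.x };
    return r;
}

/* equivalent of float4(a.zw, b.xy): the row shifted left two samples */
vec4 shift_left2(vec4 a, vec4 b)
{
    vec4 r = { a.z, a.w, b.x, b.y };
    return r;
}
```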

The main lesson here is that sampling textures can quickly become a bottleneck in the ps_2_0 profile. Just because you have 32 texture sampling instructions available doesn't mean you can use them. I've thought about switching to a higher pixel shader profile, like ps_2_a/b, but there are reasons I want to stick with ps_2_0, the main ones being the wide platform availability, the hard resource limits, and the omission of flow control and gradient constructs.

In the end, I had to split the ELA shader into two passes, one which just wrote out offsets to a temporary buffer and another pass that did the interpolation. It works, but the GPU version is only able to attain about 40 fps, whereas the CPU version can hit 60 fps with less than 100% of one core. I guess that mainly speaks to my lopsided computer spec more than anything else. That having been said, it kind of calls into question the "GPUs are much faster" idea. I have no doubt this would run tremendously faster on a GeForce 8800GTX, but it seems that there are plenty of GPUs out there where using the GPU isn't a guaranteed win over the CPU, even for algorithms that are fairly parallelizable.
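In scalar C terms, the two-pass split looks roughly like this. It's a rough sketch, not the actual shaders -- in the real thing each pass is a full-screen draw over textures, and here I assume the caller pads each row by three pixels on both sides:

```c
#include <stdlib.h>

/* Pass 1: pick the best edge offset per pixel and write it to a
   temporary buffer. Rows need 3 pixels of padding on each side. */
void ela_pass1(const unsigned char *top, const unsigned char *bottom,
               signed char *offsets, int width)
{
    for (int x = 0; x < width; ++x) {
        int err[5], dx, best = 0;
        for (dx = -2; dx <= 2; ++dx)
            err[dx + 2] = abs(top[x + dx - 1] - bottom[x - dx - 1])
                        + abs(top[x + dx    ] - bottom[x - dx    ])
                        + abs(top[x + dx + 1] - bottom[x - dx + 1]);
        for (dx = -1; dx >= -2 && err[dx + 2] < err[best + 2]; --dx)
            best = dx;
        for (dx = 1; dx <= 2 && err[dx + 2] < err[best + 2]; ++dx)
            best = dx;
        offsets[x] = (signed char)best;
    }
}

/* Pass 2: read the offsets back and do the actual interpolation. */
void ela_pass2(const unsigned char *top, const unsigned char *bottom,
               const signed char *offsets, unsigned char *out, int width)
{
    for (int x = 0; x < width; ++x) {
        int dx = offsets[x];
        out[x] = (unsigned char)((top[x + dx] + bottom[x - dx] + 1) >> 1);
    }
}
```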

So, is shipping several versions for different profiles out of the question?

One popular game that I know of shipped with several shader archives. It looked the same on different cards; my guess is that it was mostly the very same shaders, compiled targeting different profiles for optimal performance. This makes sense since, while it didn't support 2.0, it did support 2.0b (increased instruction count but not control flow improvements, IIRC) and then 2.0a (increased instruction count and flow control).

It makes quite a lot of sense since things like dynamic branching can be a large performance win with no developer cost. In your case it's even more extreme, since it's 1 pass vs 2, but that would mean writing two code paths.

Also, 2.0 compatibility might not be worthy if it's going to run faster on software on these cards anyway, which is very likely for the GMA

John - 22 05 09 - 19:22

...which is very likely for the GMA ≤ 3100, which leaves just the Radeons 9500-9800.

John - 22 05 09 - 19:24

I've done the multiple profile tack in other situations, and it's something I'd like to avoid because it's a testing headache and requires a lot of 3D expertise. If I were to go that route, I'd probably just raise the minimum bar to 2.b/2.sw or 3.0, since cards that only support ps2.0 are probably too slow to use. They'll render at an acceptable rate, but the cost of reading back the result will cancel any benefits of using the GPU.

The main problems with allowing higher shader profiles are that they complicate translation to another shader language and increase the possibility that shaders may randomly fail to compile, due to the looser requirements. That's something I'd like to avoid.

ELA is probably one of the worst cases I've run into -- no other filter I've converted yet has had shaders come anywhere near the limit, not even resize or warp sharp. You can probably guess what I'm doing at this point, which may give you some clues as to why I'm trying to lock things down relatively conservatively.

Phaeron - 22 05 09 - 21:21
