§ Implementing ELA in a pixel shader

I've been working on rewriting video algorithms in pixel shaders for 3D acceleration lately, and one of the sticking points I hit was Edge-Based Line Averaging (ELA) interpolation.

ELA is a common spatial interpolation algorithm for deinterlacing and works by trying several angles around the desired point and averaging between the points with the lowest absolute difference. The angles are chosen to be regular steps in sample location, i.e. (x+n, y-1) and (x-n, y+1) for n being small integers. This produces reasonable output for cases where a temporal or motion-based estimation is not available. The specific variant I'm dealing with is part of the Yadif deinterlacing algorithm, which checks three horizontally adjacent pixels for each angle and only picks the farthest two if the intermediate angle is a better match as well. In other words:

for dx = -2 to 2:
    error[dx] = difference(top[dx - 1], bottom[-dx - 1])
              + difference(top[dx], bottom[-dx])
              + difference(top[dx + 1], bottom[-dx + 1])

best_offset = 0
for dx = -1 to -2:
    if error[dx] < error[best_offset]:
        best_offset = dx
    else:
        break

for dx = +1 to +2:
    if error[dx] < error[best_offset]:
        best_offset = dx
    else:
        break

result = average(top[best_offset], bottom[-best_offset])
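
As a rough illustration, the pseudocode can be turned into a small runnable Python sketch. This is not the VirtualDub or Yadif code: the function name ela_interpolate, the clamp-to-edge handling and the use of plain lists of luma values are all assumptions made for the example.

```python
def ela_interpolate(top, bottom, x, max_offset=2):
    """Interpolate the missing pixel at column x between two scanlines."""

    def sample(line, i):
        # Clamp to the line edges (edge handling is an assumption here).
        return line[min(max(i, 0), len(line) - 1)]

    def error(dx):
        # Sum of absolute differences over three horizontally adjacent
        # pixel pairs along the angle described by dx.
        return sum(abs(sample(top, x + dx + k) - sample(bottom, x - dx + k))
                   for k in (-1, 0, 1))

    best = 0
    for direction in (-1, +1):
        for step in range(1, max_offset + 1):
            dx = direction * step
            if error(dx) < error(best):
                best = dx
            else:
                # A farther angle is only tried if the nearer one matched better.
                break

    # Average along the winning angle: top is offset by +best, bottom by -best.
    return (sample(top, x + best) + sample(bottom, x - best)) / 2


# A diagonal edge: it sits at column 6 on the top line and column 4 on the bottom.
top = [0] * 6 + [100] * 6
bottom = [0] * 4 + [100] * 8
print(ela_interpolate(top, bottom, 5))
```

On this edge a plain vertical average at column 5 would give 50, a blurred step, whereas following the best-matching angle keeps the edge sharp.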

This should be a relatively simple translation to a pixel shader -- convert each source pixel access to a texture sample. Not. It turns out that under the Direct3D ps_2_0 profile, which is what I need to target, there aren't enough temporary registers to run this algorithm. In order to run the algorithm, at least 14 source pixel fetches need to be done, and there are only 12 temp registers in ps2.0. The HLSL compiler valiantly tries to squeeze everything in and fails. Nuts.
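
The figure of 14 fetches can be checked by enumerating the distinct sample positions the pseudocode touches. The sketch below does only that counting; treating each fetch as occupying one temporary register is a simplification of how the compiler actually allocates registers.

```python
# Offsets tried by the algorithm and the three adjacent pixels checked per angle.
offsets = range(-2, 3)
neighbours = (-1, 0, 1)

# Distinct horizontal positions read from the line above and the line below.
top_taps = {dx + k for dx in offsets for k in neighbours}
bottom_taps = {-dx + k for dx in offsets for k in neighbours}

fetches = len(top_taps) + len(bottom_taps)
PS_2_0_TEMP_REGISTERS = 12  # temp register limit cited in the text

print(sorted(top_taps), sorted(bottom_taps))
print(fetches, "fetches vs", PS_2_0_TEMP_REGISTERS, "temp registers")
```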

There is an important caveat to this implementation tack, which is that I had source pixels mapped in AoS (array of structures) format, i.e. a single pixel held YCbCr components and an unused alpha channel. The CPU implementation of this algorithm, at least the way I wrote it in VirtualDub 1.9.1+, uses SoA (structure of arrays) orientation for speed. SoA arranges the data as planes of identical components, so instead of mixing components together you fetch a bunch of Y values across multiple pixels, a bunch of Cb values, a bunch of Cr values, and so on. I decided to try this in the pixel shader, since texture fetches were my main bottleneck. It looked something like this:

float4 top0 = tex2D(src, top_uv0);   // top Y 0-3
float4 top1 = tex2D(src, top_uv1);   // top Y 4-7
float4 top2 = tex2D(src, top_uv2);   // top Y 8-11
float4 bot0 = tex2D(src, bot_uv0);   // bottom Y 0-3
float4 bot1 = tex2D(src, bot_uv1);   // bottom Y 4-7
float4 bot2 = tex2D(src, bot_uv2);   // bottom Y 8-11

float4 error_left2 = abs(top0 - bot1)
    + abs(float4(top0.yzw, top1.x) - float4(bot1.yzw, bot2.x))
    + abs(float4(top0.zw, top1.xy) - float4(bot1.zw, bot2.xy));

Switching to SoA in a pixel shader nullifies some of the advantages of the GPU, since GPU vectors are much shorter than those of CPU fixed-point SIMD hardware (4 elements vs. SSE2's 16), and because some GPU hardware doesn't directly support the swizzles you need to emulate a shift. It also largely nullifies the advantage of having texture samplers, since you can no longer address the source by individual samples. As it turned out, the extra swizzling made the situation even worse than in the AoS case: the compiler didn't even get halfway down the shader before it gave up.
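
To make the shift emulation concrete, here is a Python sketch that models each float4 as a 4-tuple and builds the shifted vectors the same way the swizzles do. The helper names shift, absdiff and add, and the sample luma values, are invented for the example; the final check confirms that the packed computation matches a plain per-pixel sum for the dx = -2 angle.

```python
def shift(lo, hi, n):
    """Return four consecutive components starting n components into lo.

    shift(a, b, 1) plays the role of float4(a.yzw, b.x), and
    shift(a, b, 2) the role of float4(a.zw, b.xy).
    """
    return (lo + hi)[n:n + 4]


def absdiff(a, b):
    # Component-wise absolute difference, like abs(a - b) on float4 values.
    return tuple(abs(p - q) for p, q in zip(a, b))


def add(*vecs):
    # Component-wise sum of any number of 4-vectors.
    return tuple(sum(c) for c in zip(*vecs))


# Hypothetical luma values: 12 per line, packed four to a "texel".
top = (3, 9, 1, 7, 4, 8, 2, 6, 5, 0, 11, 10)
bot = (6, 2, 8, 1, 9, 3, 7, 0, 4, 10, 5, 11)
top0, top1, top2 = top[0:4], top[4:8], top[8:12]
bot0, bot1, bot2 = bot[0:4], bot[4:8], bot[8:12]

# Same structure as error_left2 in the shader: four error sums at once.
error_left2 = add(
    absdiff(top0, bot1),
    absdiff(shift(top0, top1, 1), shift(bot1, bot2, 1)),
    absdiff(shift(top0, top1, 2), shift(bot1, bot2, 2)),
)

# Scalar cross-check: lane i compares top[i + k] with bot[i + 4 + k], k = 0..2.
scalar = tuple(sum(abs(top[i + k] - bot[i + 4 + k]) for k in range(3))
               for i in range(4))
print(error_left2, scalar)
```

Each shifted operand costs a swizzle and a merge of two registers, which is where the extra instruction and register pressure in the SoA version comes from.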

The main lesson here is that texture sampling can quickly become a bottleneck in the ps_2_0 profile. Just because you have 32 texture sampling instructions available doesn't mean you can use them all. I've thought about switching to a higher pixel shader profile, like ps_2_a/b, but there are reasons I want to stay with ps_2_0, the main ones being the wide platform availability, the hard resource limits, and the omission of flow control and gradient constructs.

In the end, I had to split the ELA shader into two passes, one which just wrote out offsets to a temporary buffer and another pass that did the interpolation. It works, but the GPU version is only able to attain about 40 fps, whereas the CPU version can hit 60 fps with less than 100% of one core. I guess that mainly speaks to my lopsided computer spec more than anything else. That having been said, it kind of calls into question the "GPUs are much faster" idea. I have no doubt this would run tremendously faster on a GeForce 8800GTX, but it seems that there are plenty of GPUs out there where using the GPU isn't a guaranteed win over the CPU, even for algorithms that are fairly parallelizable.
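
The two-pass split can be sketched on the CPU in Python as follows. This only shows how the work is divided: the function names are hypothetical, the offset buffer is a plain list rather than a render target, and how the real shader encodes offsets in its temporary buffer is not described in the post.

```python
def sample(line, i):
    # Clamp to the line edges (an assumption for this sketch).
    return line[min(max(i, 0), len(line) - 1)]


def angle_error(top, bottom, x, dx):
    # Sum of absolute differences over three adjacent pixel pairs.
    return sum(abs(sample(top, x + dx + k) - sample(bottom, x - dx + k))
               for k in (-1, 0, 1))


def pass1_offsets(top, bottom, max_offset=2):
    """First pass: pick the best offset per column and store it."""
    offsets = []
    for x in range(len(top)):
        best = 0
        for direction in (-1, +1):
            for step in range(1, max_offset + 1):
                dx = direction * step
                if angle_error(top, bottom, x, dx) < angle_error(top, bottom, x, best):
                    best = dx
                else:
                    break
        offsets.append(best)
    return offsets


def pass2_interpolate(top, bottom, offsets):
    """Second pass: two fetches per pixel, steered by the stored offset."""
    return [(sample(top, x + dx) + sample(bottom, x - dx)) / 2
            for x, dx in enumerate(offsets)]


top = [0] * 6 + [100] * 6
bottom = [0] * 4 + [100] * 8
offsets = pass1_offsets(top, bottom)
result = pass2_interpolate(top, bottom, offsets)
print(offsets)
print(result)
```

The second pass needs only the stored offset plus two source samples per pixel, so all the register pressure stays in the first pass, at the cost of an extra write and read of the intermediate buffer.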

Comments


So, is shipping several versions for different profiles out of the question?

One popular game that I know of shipped with several shader archives. It looked the same on different cards; my guess is that it was mostly the very same shaders, compiled targeting different profiles for optimal performance. This makes sense since, while it didn't support 2.0, it did support 2.0b (increased instruction count but not control flow improvements IIRC) and then 2.0a (increased instruction count and flow control).

It makes quite a lot of sense since things like dynamic branching can be a large performance win with no developer cost. In your case it's even more extreme, since it's 1 pass vs 2, but that would mean writing two code paths.

Also, 2.0 compatibility might not be worth it if it's going to run faster in software on these cards anyway, which is very likely for the GMA

John - 22 05 09 - 19:22


...which is very likely for the GMA less-than-or-equal 3100, which leaves just the Radeons 9500-9800.

John - 22 05 09 - 19:24


I've done the multiple profile tack in other situations, and it's something I'd like to avoid because it's a testing headache and requires a lot of 3D expertise. If I were to go that route, I'd probably just raise the minimum bar to 2.b/2.sw to 3.0, since cards that only support ps2.0 are probably too slow to use. They'll render at an acceptable rate, but the cost of reading back the result will cancel any benefits of using the GPU.

The main problems with allowing higher shader profiles are that they complicate translation to another shader language and increase the possibility that shaders may randomly fail to compile, due to the looser requirements. That's something I'd like to avoid.

ELA is probably one of the worst cases I've run into -- no other filter I've converted yet has had shaders come anywhere near the limit, not even resize or warp sharp. You can probably guess what I'm doing at this point, which may give you some clues as to why I'm trying to lock things down relatively conservatively.

Phaeron - 22 05 09 - 21:21
