Current version

v1.10.4 (stable)

Navigation

Main page
Archived news
Downloads
Documentation
   Capture
   Compiling
   Processing
   Crashes
Features
Filters
Plugin SDK
Knowledge base
Contact info
Forum
 
Other projects
   Altirra

Search

Archives

01 Dec - 31 Dec 2013
01 Oct - 31 Oct 2013
01 Aug - 31 Aug 2013
01 May - 31 May 2013
01 Mar - 31 Mar 2013
01 Feb - 29 Feb 2013
01 Dec - 31 Dec 2012
01 Nov - 30 Nov 2012
01 Oct - 31 Oct 2012
01 Sep - 30 Sep 2012
01 Aug - 31 Aug 2012
01 June - 30 June 2012
01 May - 31 May 2012
01 Apr - 30 Apr 2012
01 Dec - 31 Dec 2011
01 Nov - 30 Nov 2011
01 Oct - 31 Oct 2011
01 Sep - 30 Sep 2011
01 Aug - 31 Aug 2011
01 Jul - 31 Jul 2011
01 June - 30 June 2011
01 May - 31 May 2011
01 Apr - 30 Apr 2011
01 Mar - 31 Mar 2011
01 Feb - 29 Feb 2011
01 Jan - 31 Jan 2011
01 Dec - 31 Dec 2010
01 Nov - 30 Nov 2010
01 Oct - 31 Oct 2010
01 Sep - 30 Sep 2010
01 Aug - 31 Aug 2010
01 Jul - 31 Jul 2010
01 June - 30 June 2010
01 May - 31 May 2010
01 Apr - 30 Apr 2010
01 Mar - 31 Mar 2010
01 Feb - 29 Feb 2010
01 Jan - 31 Jan 2010
01 Dec - 31 Dec 2009
01 Nov - 30 Nov 2009
01 Oct - 31 Oct 2009
01 Sep - 30 Sep 2009
01 Aug - 31 Aug 2009
01 Jul - 31 Jul 2009
01 June - 30 June 2009
01 May - 31 May 2009
01 Apr - 30 Apr 2009
01 Mar - 31 Mar 2009
01 Feb - 29 Feb 2009
01 Jan - 31 Jan 2009
01 Dec - 31 Dec 2008
01 Nov - 30 Nov 2008
01 Oct - 31 Oct 2008
01 Sep - 30 Sep 2008
01 Aug - 31 Aug 2008
01 Jul - 31 Jul 2008
01 June - 30 June 2008
01 May - 31 May 2008
01 Apr - 30 Apr 2008
01 Mar - 31 Mar 2008
01 Feb - 29 Feb 2008
01 Jan - 31 Jan 2008
01 Dec - 31 Dec 2007
01 Nov - 30 Nov 2007
01 Oct - 31 Oct 2007
01 Sep - 30 Sep 2007
01 Aug - 31 Aug 2007
01 Jul - 31 Jul 2007
01 June - 30 June 2007
01 May - 31 May 2007
01 Apr - 30 Apr 2007
01 Mar - 31 Mar 2007
01 Feb - 29 Feb 2007
01 Jan - 31 Jan 2007
01 Dec - 31 Dec 2006
01 Nov - 30 Nov 2006
01 Oct - 31 Oct 2006
01 Sep - 30 Sep 2006
01 Aug - 31 Aug 2006
01 Jul - 31 Jul 2006
01 June - 30 June 2006
01 May - 31 May 2006
01 Apr - 30 Apr 2006
01 Mar - 31 Mar 2006
01 Feb - 29 Feb 2006
01 Jan - 31 Jan 2006
01 Dec - 31 Dec 2005
01 Nov - 30 Nov 2005
01 Oct - 31 Oct 2005
01 Sep - 30 Sep 2005
01 Aug - 31 Aug 2005
01 Jul - 31 Jul 2005
01 June - 30 June 2005
01 May - 31 May 2005
01 Apr - 30 Apr 2005
01 Mar - 31 Mar 2005
01 Feb - 29 Feb 2005
01 Jan - 31 Jan 2005
01 Dec - 31 Dec 2004
01 Nov - 30 Nov 2004
01 Oct - 31 Oct 2004
01 Sep - 30 Sep 2004
01 Aug - 31 Aug 2004

Stuff

Powered by Pivot  
XML: RSS feed 
XML: Atom feed 

§ Weighted averaging in SSE (part 2)

Last time I talked about a faster way to do parallel averaging between 8-bit components in SSE with unequal weights, specifically the [1 7]/8 case. This time, I'll improve on the [1 7] case and also show how to do the [3 5] case.

To recap, the idea centers around using the SSE pavgb instruction to stay within bytes by rounding off LSBs correctly each time we introduce a new intermediate result with an extra significant bit. Previously, I did this by converting a subtraction to an addition and using a correction factor. It turns out that there's another way to do this that doesn't require a correction factor and requires fewer constants, which means more temp registers available.

Recall that for the [1 7] case, we are trying to compute the expression:

r = round((a*7 + b)/8) = (a*7 + b + 4) >> 3

Given 8-bit input values, this gives a 11-bit intermediate result which must be rounded. Previously, I converted round((a*7 + b)/8) to (a + round((b-a)/8)), but it turns out that the first form is better. The reason why is that it can be decomposed as follows:

  round((a*7+b)/8)
= (a*7 + b + 4) >> 3
= (a*4 + a*2 + a + b + 4) >> 3
= (((((a + b) >> 1) + a) >> 1) + a + 1) >> 1
= average_round_up(average_round_down(average_round_down(a, b), a), a)

average_round_up() is, of course, simply pavgb. The average_round_down() is a bit more tricky. Ordinarily you could do this by add + shift, but we can't do that because the add gives us a 9-bit result and we'd immediately lose one of the bits. There is, unfortunately, no variant of pavgb that rounds down, since it was introduced to speed up MPEG motion compensation and that specifies rounding up on 0.5. What we can do, however, is use a sort of mutant of DeMorgan's Theorem and complement both inputs and outputs to pavgb, which converts its natural round-up behavior to round-down:

average_round_down(a, b) = ~average_round_up(~a, ~b) = ~pavgb(~a, ~b)

This might not seem advantageous since it adds three additional operations, but if we fold it into the last expression, some of the complements cancel and there are some common subexpressions:

  average_round_up(average_round_down(average_round_down(a, b), a), a)
= pavgb(average_round_down(~pavgb(~a, ~b), a), a)
= pavgb(~pavgb(~~pavgb(~a, ~b), ~a), a)
= pavgb(~pavgb(pavgb(~a, ~b), ~a), a)

Now we can write the SSE form:

pcmpeqb mm7, mm7
movd mm0, a
movd mm1, b
movq mm2, mm0
pxor mm0, mm7
pxor mm1, mm7
pavgb mm1, mm0
pavgb mm1, mm0
pxor mm1, mm7
pavgb mm1, mm2

Benchmarks show that this is faster than the previous method, but not by much (<10%). However, it only requires one constant register and three temporaries, so it is much easier to unroll this by 2x on x86 and 4x on x64 for a slight gain. And, of course, it's trivial to widen to 128-bit SSE2 operations.

Now, what about the [3 5] case? Well, it's basically the same thing, but the order of additions is a bit different:

  round((a*5+b*3)/8)
= ((a*5 + b*3) + 4) >> 3
= (a*4 + a + b*2 + b + 4) >> 3
= (((((a + b) >> 1) + b) >> 1) + a + 1) >> 1
= average_round_up(average_round_down(average_round_down(a, b), b), a)

Therefore, we can do the [3 5] case just by switching around which input is dominant in the second average:

pcmpeqb mm7, mm7
movd mm0, a
movd mm1, b
movq mm2, mm0
pxor mm0, mm7
pxor mm1, mm7
pavgb mm0, mm1
pavgb mm0, mm1
pxor mm0, mm7
pavgb mm0, mm2

In fact, if you think about it a little bit, this method can actually be generalized to any weighted sum where the factors add up to a power of two. It doesn't work if we have to add more than two values at any bit position, but because the weights add to a power of two, this never happens. Therefore, this method can be adapted to other useful weightings for chroma subsampling conversion, notably the 1/4, 3/4 case.

Comments

Comments posted:


Neat trick that ~pavgb(~a, ~b). So the 1/4, 3/4 case would be this :-
.
. round((a*3+b*1)/4)
. = ((a*3 + b*1) + 2) >> 2
. = (a*2 + a + b + 2) >> 2
. = (((a + b) >> 1) + a + 1) >> 1
. = average_round_up(average_round_down(a, b), a)
.
. pcmpeqb mm7, mm7
. movq mm0, a
. movq mm1, b
. movq mm2, mm0
. pxor mm0, mm7
. pxor mm1, mm7
. pavgb mm0, mm1
. pxor mm0, mm7
. pavgb mm0, mm2
.
(Any chance of repairing my post in the previous thread and deleteing the duplicate.)

IanB - 30 08 08 - 18:00


Yup, you got it. I went ahead and fixed your comment, too.

Now, what I haven't covered is how to speed up the _horizontal_ cases, which are also needed for chroma sampling conversion. That's because I haven't gotten around to it yet. What I'm doing right now is ramping up a new blitter library in VirtualDub known as "uberblit." It already shipped in 1.8.x as the engine behind the YCbCr capable resize filter, but I'm trying to swap it in as the main library so I can support a lot more formats like interlaced 4:2:0 and 32F (the existing engine is not scalable). Problem is, the uberblit library is currently a lot slower than the main blitter on some format combinations due to multipassing, so I've been working on speeding it up. Right now the benchmarks show it faster than the main blitter for 109/144 combinations and at least 90% as fast for 132/144 combinations. A lot of the slow paths that are left are for conversions that aren't very interesting, like Y8 to RGB555. The other trick is how to expose the new formats without breaking existing code, which could be annoying. Currently the main blitter uses a single format code, whereas uberblit uses colorspace/sampling/format bitfields, which invalidates table lookups.

Phaeron - 31 08 08 - 16:37


Oh how much I love your assembler ingeniousness. I would not remotely come up with this on my own. So lets see if I can understand and learn from it.

-SPM-Mad - 01 09 08 - 15:26


The pavgb round down trick is pretty neat! It's better than pavg(a,b)-(a&b&1). Somewhat related to this is: How do you use pavgb to calculate an signed average of two vectors containing packed signed bytes?

pshufb - 02 09 08 - 03:01


You can rebias a signed value to unsigned space by XORing with the sign bit (0x80 for bytes), so the trick I use here to round down also works for converting the unsigned average into a signed average. You can also combine the two by XORing with 7F in order to do a signed average with rounding down.

Note that in cases where an arithmetic right shift of the right size is available, it may be better to do the carry save trick I mentioned a while back: (a + b) >> 1 = (a & b) + ((a ^ b) >> 1) or (a + b + 1) >> 1 = (a | b) - ((a ^ b) >> 1). It's your only real option if you need to do averaging with just MMX, where you don't have pavgb.

Phaeron - 02 09 08 - 03:15

Comment form


Please keep comments on-topic for this entry. If you have unrelated comments about VirtualDub, the forum is a better place to post them.
Name:  
Remember personal info?

Email (Optional):
Your email address is only revealed to the blog owner and is not shown to the public.
URL (Optional):
Comment: /

An authentication dialog may appear when you click Post Comment. Simply type in "post" as the user and "now" as the password. I have had to do this to stop automated comment spam.



Small print: All html tags except <b> and <i> will be removed from your comment. You can make links by just typing the url or mail-address.