§ Weighted averaging in SSE

I hereby declare today stupid assembly tricks day.

I'm working on optimizing some conversion code, and one of the tasks I need to do is quickly resample an 8-bit chroma plane from a 4:1:0 image up to 4:4:4, essentially removing the subsampling. This is a 4x enlargement in both the horizontal and vertical directions. For the vertical direction, the scanlines of the 4:4:4 image fall between the chroma scanlines of the 4:1:0 image such that I need four filter phases:

[7 1]/8
[5 3]/8
[3 5]/8
[1 7]/8

This can be handled with a generic routine, but the question here is whether it can be done more efficiently with specialized routines. Looking at the above, the list is symmetrical, so we can immediately discard two of the entries: the bottom two filters can be implemented in terms of the top two filters by swapping the scanline inputs. That leaves two filters, a [7 1]/8 and a [5 3]/8.
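
To pin down exactly what each filter computes, here's a scalar reference in plain C (a sketch; the helper name is mine):

#include <stdint.h>

/* round((w0*a + w1*b) / 8); the [7 1]/8 filter is w0=7, w1=1,
   and the [5 3]/8 filter is w0=5, w1=3. */
static inline uint8_t weighted_sample(uint8_t a, uint8_t b, int w0, int w1)
{
    return (uint8_t)((w0 * a + w1 * b + 4) >> 3);  /* +4 rounds to nearest */
}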

Now, I'm dealing with 8-bit components here, and the most straightforward way of doing this is a good old-fashioned weighted sum. In MMX, this can be done as follows:

movd      mm0, [ecx]            ; load 4 samples from scanline A
movd      mm1, [edx]            ; load 4 samples from scanline B
punpcklbw mm0, mm7              ; unpack bytes to words (mm7 = 0)
punpcklbw mm1, mm7
pmullw    mm0, weight1          ; scale scanline A
pmullw    mm1, weight2          ; scale scanline B
paddw     mm0, mm1              ; weighted sum
paddw     mm0, round            ; add rounding constant
psraw     mm0, FRACTIONAL_BITS  ; shift off the fractional bits
packuswb  mm0, mm0              ; pack words back to bytes with saturation
movd      [ebx], mm0            ; store 4 samples

This computes four samples in 11 instructions. It can easily be generalized to SSE2 simply by doubling the register widths. This is the baseline to beat.
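
In intrinsics form, the SSE2 widening looks roughly like this (a sketch under my own naming, specialized to the [7 1]/8 filter with 3 fractional bits; n is assumed to be a multiple of 8):

#include <emmintrin.h>   /* SSE2 */
#include <stdint.h>
#include <stddef.h>

static void weighted_sum_71_sse2(uint8_t *dst, const uint8_t *a,
                                 const uint8_t *b, size_t n)
{
    const __m128i zero  = _mm_setzero_si128();
    const __m128i w0    = _mm_set1_epi16(7);
    const __m128i w1    = _mm_set1_epi16(1);
    const __m128i round = _mm_set1_epi16(4);       /* 0.5 in 3 fractional bits */

    for (size_t i = 0; i < n; i += 8) {
        __m128i va = _mm_loadl_epi64((const __m128i *)(a + i));
        __m128i vb = _mm_loadl_epi64((const __m128i *)(b + i));
        va = _mm_unpacklo_epi8(va, zero);          /* bytes -> words */
        vb = _mm_unpacklo_epi8(vb, zero);
        va = _mm_mullo_epi16(va, w0);              /* scale each scanline */
        vb = _mm_mullo_epi16(vb, w1);
        __m128i sum = _mm_add_epi16(_mm_add_epi16(va, vb), round);
        sum = _mm_srai_epi16(sum, 3);              /* shift off the fraction */
        sum = _mm_packus_epi16(sum, sum);          /* words -> bytes, saturating */
        _mm_storel_epi64((__m128i *)(dst + i), sum);
    }
}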

So how do we speed this up?

The weighted sum itself is already decent. In SSE2, you can sometimes remove one of the additions by interleaving components and using PMADDWD instead of PMULLW x 2 + PADDW, but you then have an extra pack step from dword to word, and so I generally only do that if I need extra precision. Unpacking and packing is painful because it needs additional register space and it stresses the pack/shift unit, which is in short supply compared to addition units. No, in order to beat this significantly, we need to avoid the unpack/pack cycle, and that means working directly on bytes. Unfortunately, working directly on bytes in MMX/SSE has a few problems:

There are no byte-wide shift or multiply instructions; a byte shift has to be synthesized from a full-register shift plus a mask.
The byte arithmetic you do get is essentially limited to add, subtract, and average (paddb, psubb, pavgb).
There is almost no headroom for intermediate results that need more than 8 bits.

If you're confused about the last point, the reason I bring it up has to do with precision in intermediate computations. In particular, if you add or subtract two full-range bytes (8 bits), you gain an extra significant bit and the result is 9 bits. Unless you use a widening operation like pmaddwd or pmaddubsw, you're going to lose a bit either on the high end (paddb/psubb, which wrap) or on the low end (pavgb, which discards the LSB of the 9-bit sum). Now, the trick here is that you can lose an LSB on one intermediate result and still come out with a correctly rounded answer, but lose a bit from two intermediate values, and you've usually lost accuracy. The rewrite I usually use is to convert (a*(1-x) + b*x) to (a + (b-a)*x), but (b-a) is a 9-bit signed result, and that's hard to deal with in byte registers.
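
Here's the double-loss problem in miniature: computing a rounded quarter, (a + b + 2) >> 2, with two chained pavgb-style averages picks up a rounding error (a scalar sketch):

#include <stdio.h>

int main(void)
{
    int a = 1, b = 0;
    int avg1  = (a + b + 1) >> 1;     /* pavgb(a, b):      1 */
    int avg2  = (avg1 + 0 + 1) >> 1;  /* pavgb(avg1, 0):   1 */
    int exact = (a + b + 2) >> 2;     /* correct rounding: 0 */
    printf("chained=%d exact=%d\n", avg2, exact);
    return 0;
}

Both averages round up, so the chained result overshoots by one LSB.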

Fortunately, there's a way around this, at least for the [7 1] case. I've found that it helps to think more like a hardware engineer in cases like this: you need to concentrate on what the operations actually do and not what they're intended to be used for.

First, rewrite round((a*7 + b)/8) = (a*7 + b + 4) >> 3 as (a + ((b - a + 4) >> 3)), which works because the a*8 term passes cleanly through the shift. Now, we'd like to use pavgb here, but pavgb computes an addition ((a+b+1)>>1) and we need a subtraction. What we can do, however, is turn the subtraction into an addition via 8-bit negation: -x == ~x + 1 - 0x100, where ~ is the 8-bit complement. Therefore, we can rewrite as follows:

  (b - a + 4) >> 3
= (b + ~a + 5 - 0x100) >> 3
= ((b + ~a + 5) >> 3) - 0x20

Next, we decompose the operation into individual bit shifts:

  ((b + ~a + 5) >> 3) - 0x20
= ((((b + ~a + 1) >> 1) + 2) >> 2) - 0x20
= (((((b + ~a + 1) >> 1) >> 1) + 1) >> 1) - 0x20

This probably looks a bit crazy, but there is a reason for the madness: it lets us repeatedly convert ((a + b + 1) >> 1) into pavgb(a, b):

  (((((b + ~a + 1) >> 1) >> 1) + 1) >> 1) - 0x20
= (((pavgb(b, ~a) >> 1) + 1) >> 1) - 0x20
= pavgb(pavgb(b, ~a) >> 1, 0) - 0x20

The result, translated to assembly:

movq  mm0, [ecx]              ; a
movq  mm1, [ebx]              ; b
movq  mm2, mm0                ; save a
pxor  mm0, xFFFFFFFFFFFFFFFF  ; ~a
pavgb mm0, mm1                ; (b + ~a + 1) >> 1
pand  mm0, xFEFEFEFEFEFEFEFE  ; clear byte LSBs so the 64-bit shift
psrlq mm0, 1                  ;   below acts as a per-byte >> 1
pavgb mm0, x0000000000000000  ; (x + 1) >> 1
paddb mm2, xE0E0E0E0E0E0E0E0  ; a - 0x20 (mod 256)
paddb mm0, mm2                ; a + ((b - a + 4) >> 3)
movq  [edx], mm0

Note that while this is also 11 instructions, it processes 8 samples at a time instead of 4, and many of the instructions have lower latencies and fewer execution port restrictions than the unpack/mul/add/pack solution. Quick benchmarking shows the byte-oriented solution to be 40% faster, although admittedly for a really good comparison I'd need to convert both routines to SSE2 and double the number of pixels handled per loop. I think I can eliminate one of the constants by moving the complement and replacing pand + psrlq with another pavgb, but I haven't looked into that yet.
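
Since the rounding is the delicate part, it's worth verifying exhaustively. This standalone C program (my harness; it simulates one byte lane per instruction) checks the byte trick against round((7a + b)/8) for all 65,536 input pairs:

#include <stdio.h>

int main(void)
{
    for (int a = 0; a < 256; ++a) {
        for (int b = 0; b < 256; ++b) {
            int na = a ^ 0xFF;               /* pxor: ~a            */
            int t  = (b + na + 1) >> 1;      /* pavgb(b, ~a)        */
            t = (t & 0xFE) >> 1;             /* pand + psrlq        */
            t = (t + 1) >> 1;                /* pavgb(t, 0)         */
            int r = (a + 0xE0 + t) & 0xFF;   /* paddb x2, mod 256   */

            int ref = (7 * a + b + 4) >> 3;  /* correctly rounded   */
            if (r != ref) {
                printf("mismatch: a=%d b=%d got=%d want=%d\n", a, b, r, ref);
                return 1;
            }
        }
    }
    printf("all 65536 cases match\n");
    return 0;
}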

I haven't figured out how to handle the [5 3] case yet. It looks tougher due to the factor of three, but there might still be a way to structure the evaluation order to get all the rounding right. The question is whether or not it'll be faster than the straightforward solutions.

Comments



This ASM is a little over my head, but still fun to read, keep up the good work P.

asf - 30 08 08 - 03:11


movq  mm0, [ecx] ; a
movq  mm1, [ebx] ; b
pavgb mm1, mm0   ; (a+b+1)>>1
pavgb mm0, mm1   ; (a+((a+b+1)>>1)+1)>>1
pavgb mm0, mm1   ; (((a+((a+b+1)>>1)+1)>>1)+((a+b+1)>>1)+1)>>1
movq  [edx], mm0 ; (5*a + 3*b + 7)/8

The rounder is a little high but you could sub it out if you want.

IanB - 30 08 08 - 05:35


The biggest question you didn't ask: If you look at the code in half a year, do you still understand why it works? :)

Murmel - 30 08 08 - 15:35


Sigh... sometime I'll have to find out where in Pivot the lines are getting collapsed.

Yes, just doing pavgb repeatedly does end up with incorrect rounding. Fortunately, I figured out last night how to fix it without slowing things down too much. More on that in the next blog entry....

As for understanding the code later, that's what comments are for. :)

Phaeron - 30 08 08 - 15:45


Can't you use pmaddubsw to outperform both this *and* the bitmath methods in the followup post?

Interleave a and b as byte values, interleave the weights, and pmaddubsw. x264 uses this method for chroma motion compensation and weighted biprediction.

Dark Shikari - 06 01 09 - 12:32


PMADDUBSW requires SSSE3, whereas this method only requires SSE1 if done in MMX registers and SSE2 for XMM.

Also, in terms of operation count, using the mad unit isn't as cheap as you think, because you can only process 8 pairs at a time with PMADDUBSW. That means you're looking at MOVDQA + PUNPCKLBW + PUNPCKHBW + PMADDUBSW + PMADDUBSW + PACKUSWB. As it turns out, all of those instructions but the MOVDQA have a three-cycle latency with 128-bit ops on pre-Penryn Core 2s. The pack/unpack instructions also issue 3 uops each. PAVGB and PXOR, on the other hand, are all single uop and single clock latency. It's not at all clear which would be faster without profiling, and the scheduling of the surrounding code likely matters a lot too. On a Penryn core or later 128-bit pack/unpack are single uop, which changes things a bit, but they also still only issue to port 0. I'm guessing that for the relatively simple factors shown here, these methods will be fairly competitive when done with XMM registers.
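
In intrinsics form, that sequence comes out roughly as follows (a sketch of the approach as just described, using the [7 1]/8 weights and rounding from the post; not actual x264 code):

#include <tmmintrin.h>   /* SSSE3 */

/* 16 samples per call: round((7*a + b) / 8) via pmaddubsw. */
static __m128i weighted_71_ssse3(__m128i va, __m128i vb)
{
    const __m128i w     = _mm_set1_epi16(0x0107);  /* byte pairs {7, 1}  */
    const __m128i round = _mm_set1_epi16(4);

    __m128i lo = _mm_unpacklo_epi8(va, vb);        /* a0 b0 a1 b1 ...    */
    __m128i hi = _mm_unpackhi_epi8(va, vb);
    lo = _mm_maddubs_epi16(lo, w);                 /* 7*a + 1*b as words */
    hi = _mm_maddubs_epi16(hi, w);
    lo = _mm_srai_epi16(_mm_add_epi16(lo, round), 3);
    hi = _mm_srai_epi16(_mm_add_epi16(hi, round), 3);
    return _mm_packus_epi16(lo, hi);               /* back to 16 bytes   */
}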

Phaeron - 07 01 09 - 04:18


Phaeron, here are the numbers (dezicycles, i.e. tenths of a cycle) for x264's avg_weight on a Core i7 (mmx/sse2 use the standard pmullw method):

avg_weight_16x16_mmx: 3112
avg_weight_16x16_sse2: 1823
avg_weight_16x16_ssse3: 1279

avg_weight does a weighted average between two sets of pixels.

mc_chroma_8x8_mmx: 1436
mc_chroma_8x8_sse2: 692
mc_chroma_8x8_ssse3: 527

mc_chroma is an 8thpel bilinear filter, using coefficients matching those in this post for the various 8thpel positions (7/1, 6/2, 5/3, etc).

I imagine the method described in this blog post would be particularly painful for that operation, as it would require an unpredictable indirect jump to one of 8 routines based on which averaging coefficients are needed.

Dark Shikari - 07 01 09 - 17:35


You're probably right on the switching being expensive. I use this for bitmap interpolation rather than macroblock prediction, so my block sizes are much larger.

Note that you don't need 8 separate routines for something like this -- you can halve the number by swapping the averaging sources.

Phaeron - 07 01 09 - 23:31


Yeah, that's true as well. The naive approach would require 64 routines for all possible 8thpel locations, but you can drop that to 8 and then to 4 by just swapping the sources appropriately as you said.

One interesting thing to check would be how well these bitmath algorithms perform on the i7, which has every single "simple" operation (abs, sign, add, sub, bitmath) at 0.5 inverse throughput and 1 cycle latency, so it can do two per clock, twice as fast as Penryn. Multiply is no faster than on Penryn, still 3/1.

Dark Shikari - 08 01 09 - 11:15


Speaking of which, do you have an IRC channel you hang out at or similar? I suspect x264 and Virtualdub could trade quite a number of nice assembly hacks, given that both have absurd numbers of asm functions.

Dark Shikari - 08 01 09 - 11:29


Naah. I left IRC a long time ago when I realized what a horrid soul-eating time sink it was. 'Sides, I've got a day job.

Open to email, though.

Phaeron - 09 01 09 - 00:28
