## Weighted averaging in SSE

I hereby declare today stupid assembly tricks day.

I'm working on optimizing some conversion code, and one of the tasks I need to do is quickly resample an 8-bit chroma plane from a 4:1:0 image up to 4:4:4, essentially removing the subsampling. This is a 4x enlargement in both horizontal and vertical directions. For the vertical direction, the scanlines of the 4:4:4 image fall between the chroma scanlines of the 4:1:0 image such that I need four filter phases:

• 7/8, 1/8
• 5/8, 3/8
• 3/8, 5/8
• 1/8, 7/8
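Where those weights come from: with the interstitial siting described above, output scanline k of each group of four sits at fractional offset (2k+1)/8 between adjacent chroma scanlines, and the bilinear weights are (1-f, f). A quick Python sketch (my derivation, assuming that siting):

```python
from fractions import Fraction

# Output scanline k of each group of four sits at fractional offset
# f = (2k+1)/8 between adjacent chroma scanlines; the bilinear
# weights are then (1 - f, f).
phases = [(1 - Fraction(2 * k + 1, 8), Fraction(2 * k + 1, 8))
          for k in range(4)]
assert phases == [(Fraction(7, 8), Fraction(1, 8)),
                  (Fraction(5, 8), Fraction(3, 8)),
                  (Fraction(3, 8), Fraction(5, 8)),
                  (Fraction(1, 8), Fraction(7, 8))]
```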

This can be handled with a generic routine, but the question here is whether this can be done more efficiently with specialized routines. Looking at the above, the list is symmetrical, so we can immediately discard two of the entries: the bottom two filters can be implemented in terms of the top two filters by swapping the scanline inputs. That leaves two filters, a [7 1]/8 and a [5 3]/8.

Now, I'm dealing with 8-bit components here, and the most straightforward way of doing this is a good old fashioned weighted sum. In MMX, this can be done as follows:

```
movd      mm0, [ecx]
movd      mm1, [edx]
punpcklbw mm0, mm7
punpcklbw mm1, mm7
pmullw    mm0, weight1
pmullw    mm1, weight2
paddw     mm0, mm1
paddw     mm0, round
psraw     mm0, FRACTIONAL_BITS
packuswb  mm0, mm0
movd      [ebx], mm0
```

This computes four samples in 11 instructions. It can easily be generalized to SSE2 simply by doubling the register widths. This is the baseline to beat.
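In scalar form, each sample of the baseline computes the following (a Python sketch of the same arithmetic; weight1 = 7, weight2 = 1, and FRACTIONAL_BITS = 3 correspond to the [7 1]/8 filter):

```python
def weighted_sum(a, b, w1=7, w2=1, frac_bits=3):
    """Scalar model of the unpack/pmullw/paddw/psraw/packuswb sequence."""
    rounder = 1 << (frac_bits - 1)      # the 'round' constant, 4 for /8
    r = (a * w1 + b * w2 + rounder) >> frac_bits
    return min(max(r, 0), 255)          # packuswb saturation

# For byte inputs and these weights, this is exact rounding of (7a + b)/8.
assert all(weighted_sum(a, b) == (7 * a + b + 4) // 8
           for a in range(256) for b in range(256))
```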

So how do we speed this up?

The weighted sum itself is already decent. In SSE2, you can sometimes remove one of the additions by interleaving components and using PMADDWD instead of PMULLW x 2 + PADDW, but you then have an extra pack step from dword to word, and so I generally only do that if I need extra precision. Unpacking and packing is painful because it needs additional register space and it stresses the pack/shift unit, which is in short supply compared to addition units. No, in order to beat this significantly, we need to avoid the unpack / pack cycle, and that means working directly on bytes. Unfortunately, working directly on bytes in MMX/SSE has a few problems:

• There are no psrlb, psllb, or psrab instructions, thus extra masking operations are required.
• pcmpgtb only works on signed bytes.
• There are no multiply operations that work on bytes until SSSE3 (pmaddubsw).
• The SSE pavgb instruction only works on unsigned bytes.

If you're confused about the last point, the reason why I bring it up has to do with precision issues in intermediate computations. In particular, if you add or subtract two full-range bytes (8 bits), you gain an extra significant bit and the result is 9 bits. Unless you use a widening operation like pmaddwd or pmaddubsw, you're going to lose a bit either on the high end (paddb/psubb), or on the low end (pavgb). Now, the trick here is that you can lose an LSB on one intermediate result and still come out with a correctly rounded answer, but lose a bit from two intermediate values, and you've usually lost accuracy. The trick I usually use is to convert (a*(1-x) + b*x) to (a + (b-a)*x), but (b-a) is a 9-bit signed result and that's kind of hard to deal with.
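The rewrite and its pitfall are easy to check numerically (a Python sketch in 3-bit fixed point, weight x out of 8):

```python
# One multiply instead of two: a*(8-x) + b*x == 8*a + (b-a)*x.
assert all(a * (8 - x) + b * x == 8 * a + (b - a) * x
           for a in range(256) for b in range(256) for x in range(9))

# ...but (b - a) spans -255..255, which needs 9 signed bits, one more
# than a byte lane can hold.
diffs = [b - a for a in range(256) for b in range(256)]
assert min(diffs) == -255 and max(diffs) == 255
```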

Fortunately, there's a way around this, at least for the [7 1] case. I've found that it helps to think more like a hardware engineer in cases like this: you need to concentrate on what the operations actually do and not what they're intended to be used for.

First, rewrite round((a*7 + b)/8) to (a + ((b-a+4) >> 3)). Now, we'd like to use pavgb here, but pavgb computes an addition ((a+b+1)>>1) and we need a subtraction. What we can do, however, is turn the addition into a subtraction via the two's complement identity for 8-bit negation: -x == ~x + 1 - 0x100. Therefore, we can rewrite as follows:
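Both steps can be verified by brute force over all byte pairs (a Python sketch; Python's >> floors, like an arithmetic shift):

```python
for a in range(256):
    for b in range(256):
        # round((a*7 + b)/8) == a + ((b - a + 4) >> 3)
        assert (7 * a + b + 4) >> 3 == a + ((b - a + 4) >> 3)
        # ...and, via -a == ~a + 1 - 0x100 on bytes (~a modeled as a ^ 0xFF):
        assert (b - a + 4) >> 3 == ((b + (a ^ 0xFF) + 5) >> 3) - 0x20
```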

```
  (b - a + 4) >> 3
= (b + ~a + 5 - 0x100) >> 3
= ((b + ~a + 5) >> 3) - 0x20
```

Next, we decompose the operation into individual bit shifts:

```
  ((b + ~a + 5) >> 3) - 0x20
= ((((b + ~a + 1) >> 1) + 2) >> 2) - 0x20
= (((((b + ~a + 1) >> 1) >> 1) + 1) >> 1) - 0x20
```

This probably looks a bit crazy, but there is a reason for the madness, and that is that we can repeatedly convert ((a + b + 1) >> 1) to pavgb(a, b):

```
  (((((b + ~a + 1) >> 1) >> 1) + 1) >> 1) - 0x20
= (((pavgb(b, ~a) >> 1) + 1) >> 1) - 0x20
= pavgb(pavgb(b, ~a) >> 1, 0) - 0x20
```
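Before translating to assembly, the whole chain can be brute-force checked against a byte-accurate model of pavgb and the masked shift (a Python sketch; every intermediate stays in 0..255, as it would in a byte lane):

```python
def pavgb(x, y):
    """Unsigned byte average with round-up, as the SSE instruction does."""
    return (x + y + 1) >> 1             # byte inputs, result fits a byte

def interp_7_1(a, b):
    """round((7*a + b)/8) using only byte-wide operations."""
    t = pavgb(b, a ^ 0xFF)              # pavgb(b, ~a)
    t = (t & 0xFE) >> 1                 # byte-safe shift (pand + psrlq)
    t = pavgb(t, 0)                     # (t + 1) >> 1
    return (a + 0xE0 + t) & 0xFF        # add a - 0x20, wrapping mod 256

assert all(interp_7_1(a, b) == (7 * a + b + 4) >> 3
           for a in range(256) for b in range(256))
```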

The result, translated to assembly:

```
movq  mm0, [ecx]
movq  mm1, [ebx]
movq  mm2, mm0
pxor  mm0, xFFFFFFFFFFFFFFFF
pavgb mm0, mm1
pand  mm0, xFEFEFEFEFEFEFEFE
psrlq mm0, 1
pavgb mm0, x0000000000000000
paddb mm2, xE0E0E0E0E0E0E0E0
paddb mm0, mm2
movq  [edx], mm0
```

Note that while this is also 11 instructions, it processes 8 samples at a time instead of 4, and many of the instructions have lower latencies and fewer execution port restrictions than the unpack/mul/add/pack solution. Quick benchmarking shows the byte-oriented solution to be 40% faster, although admittedly for a really good comparison I'd need to convert both routines to SSE2 and double the number of pixels handled per loop. I think I can eliminate one of the constants by moving the complement and replacing pand + psrlq with another pavgb, but I haven't looked into that yet.

I haven't figured out how to handle the [5 3] case yet. It looks tougher due to the factor of three, but there might still be a way to structure the evaluation order to get all the rounding right. The question is whether or not it'll be faster than the straightforward solutions.

This ASM is a little over my head, but still fun to read, keep up the good work P.

asf - 30 08 08 - 03:11

```
movq  mm0, [ecx]   ; a
movq  mm1, [ebx]   ; b
pavgb mm1, mm0     ; (a+b+1)>>1
pavgb mm0, mm1     ; (a+((a+b+1)>>1)+1)>>1
pavgb mm0, mm1     ; (((a+((a+b+1)>>1)+1)>>1)+((a+b+1)>>1)+1)>>1
movq  [edx], mm0   ; (5*a + 3*b + 7)/8
```

The rounder is a little high but you could sub it out if you want.

IanB - 30 08 08 - 05:35

The biggest question you didn't ask: If you look at the code in half a year, do you still understand why it works? :)

Murmel - 30 08 08 - 15:35

Sigh... sometime I'll have to find out where in Pivot the lines are getting collapsed.

Yes, just doing pavgb repeatedly does end up with incorrect rounding. Fortunately, I figured out last night how to fix it without slowing things down too much. More on that in the next blog entry....

As for understanding the code later, that's what comments are for. :)

Phaeron - 30 08 08 - 15:45

Can't you use pmaddubsw to outperform both this *and* the bitmath methods in the followup post?

Interleave a and b as byte values, interleave the weights, and pmaddubsw. x264 uses this method for chroma motion compensation and weighted biprediction.

Dark Shikari - 06 01 09 - 12:32

PMADDUBSW requires SSSE3, whereas this method only requires SSE1 if done in MMX registers and SSE2 for XMM.

Also, in terms of operation count, using the mad unit isn't as cheap as you think, because you can only process 8 pairs at a time with PMADDUBSW. That means you're looking at MOVDQA + PUNPCKLBW + PUNPCKHBW + PMADDUBSW + PMADDUBSW + PACKUSWB. As it turns out, all of those instructions but the MOVDQA have a three-cycle latency with 128-bit ops on pre-Penryn Core 2s. The pack/unpack instructions also issue 3 uops each. PAVGB and PXOR, on the other hand, are all single uop and single clock latency. It's not at all clear which would be faster without profiling, and the scheduling of the surrounding code likely matters a lot too. On a Penryn core or later 128-bit pack/unpack are single uop, which changes things a bit, but they also still only issue to port 0. I'm guessing that for the relatively simple factors shown here, these methods will be fairly competitive when done with XMM registers.

Phaeron - 07 01 09 - 04:18

Phaeron, here's the numbers (dezicycles) for x264's avg_weight on a Core i7 (mmx/sse2 use the standard pmullw method):

avg_weight_16x16_mmx: 3112
avg_weight_16x16_sse2: 1823
avg_weight_16x16_ssse3: 1279

avg_weight does a weighted average between two sets of pixels.

mc_chroma_8x8_mmx: 1436
mc_chroma_8x8_sse2: 692
mc_chroma_8x8_ssse3: 527

mc chroma is an 8thpel bilinear filter--using coefficients matching those in this post for the various 8thpel positions (7/1, 6/2, 5/3, etc).

I imagine the method described in the blog post here would be particularly painful for this operation, as it would require an unpredictable indirect jump (to one of 8 routines, based on which averaging coefficients are needed).

Dark Shikari - 07 01 09 - 17:35

You're probably right on the switching being expensive. I use this for bitmap interpolation rather than macroblock prediction, so my block sizes are much larger.

Note that you don't need 8 separate routines for something like this -- you can halve the number by swapping the averaging sources.

Phaeron - 07 01 09 - 23:31

Yeah, that's true as well. The naive approach would require 64 routines for all possible 8thpel locations, but you can drop that to 8 and then to 4 by just swapping the sources appropriately as you said.

One interesting thing to check would be how well these bitmath algorithms perform on the i7, which has every single "simple" operation (abs, sign, add, sub, bitmath) as 0.5 inverse throughput and 1 cycle latency, so it can do two per clock, twice as fast as the Penryn. Multiply is no faster than on Penryn, still 3/1.

Dark Shikari - 08 01 09 - 11:15

Speaking of which, do you have an IRC channel you hang out at or similar? I suspect x264 and Virtualdub could trade quite a number of nice assembly hacks, given that both have absurd numbers of asm functions.

Dark Shikari - 08 01 09 - 11:29

Naah. I left IRC a long time ago when I realized what a horrid soul-eating time sink it was. 'Sides, I've got a day job.

Open to email, though.

Phaeron - 09 01 09 - 00:28