## § ¶Weighted averaging in SSE

I hereby declare today stupid assembly tricks day.

I'm working on optimizing some conversion code, and one of the tasks I need to do is quickly resample an 8-bit chroma plane from a 4:1:0 image up to 4:4:4, essentially removing the subsampling. This is a 4x enlargement in both horizontal and vertical directions. For the vertical direction, the scanlines of the 4:4:4 image fall between the chroma scanlines of the 4:1:0 image such that I need four filter phases:

- 7/8, 1/8
- 5/8, 3/8
- 3/8, 5/8
- 1/8, 7/8

This can be handled with a generic routine, but the question here is whether this can be done more efficiently with specialized routines. Looking at the above, the list is symmetrical, so we can immediately discard two of the entries: the bottom two filters can be implemented in terms of the top two filters by swapping the scanline inputs. That leaves two filters, a [7 1]/8 and a [5 3]/8.

Now, I'm dealing with 8-bit components here, and the most straightforward way of doing this is a good old fashioned weighted sum. In MMX, this can be done as follows:

movd mm0, [ecx]

movd mm1, [edx]

punpcklbw mm0, mm7

punpcklbw mm1, mm7

pmullw mm0, weight1

pmullw mm1, weight2

paddw mm0, mm1

paddw mm0, round

psraw mm0, FRACTIONAL_BITS

packuswb mm0, mm0

movd [ebx], mm0

This computes four samples in 11 instructions. It can easily be generalized to SSE2 simply by doubling the register widths. This is the baseline to beat.

So how do we speed this up?

The weighted sum itself is already decent. In SSE2, you can sometimes remove one of the additions by interleaving components and using PMADDWD instead of PMULLW x 2 + PADDW, but you then have an extra pack step from dword to word, and so I generally only do that if I need extra precision. Unpacking and packing is painful because it needs additional register space and it stresses the pack/shift unit, which is in short supply compared to addition units. No, in order to beat this significantly, we need to avoid the unpack / pack cycle, and that means working directly on bytes. Unfortunately, working directly on bytes in MMX/SSE has a few problems:

- There are no psrlb, psllb, or psrab instructions, thus extra masking operations are required.
- pcmpgtb only works on
*signed*bytes. - There are no multiply operations that work on bytes until SSSE3 (pmaddubsw).
- The SSE pavgb instruction only works on
*unsigned*bytes.

If you're confused about the last point, the reason why I bring it up has to do with precision issues in intermediate computations. In particular, if you add or subtract two full-range bytes (8 bits), you gain an extra significant bit and the result is 9 bits. Unless you use a widening operation like pmaddwd or pmaddubsw, you're going to lose a bit either on the high end (paddb/psubb), or on the low end (pavgb). Now, the trick here is that you can lose an LSB on *one* intermediate result and still come out with a correctly rounded answer, but lose a bit from two intermediate values, and you've usually lost accuracy. The trick I usually use is to convert (a*(1-x) + b*x) to (a + (b-a)*x), but (b-a) is a 9-bit signed result and that's kind of hard to deal with.

Fortunately, there's a way around this, at least for the [7 1] case. I've found that it helps to think more like a hardware engineer in cases like this: you need to concentrate on what the operations actually *do* and not what they're *intended* to be used for.

First, rewrite round((a*7 + b)/8) to (a + ((b-a+4) >> 3)). Now, we'd like to use pavgb here, but pavgb computes an addition ((a+b+1)>>1) and we need a subtraction. What we can do, however, is change addition into subtraction by an algorithm for 8-bit negation: -x == ~x + 1 - 0x100. Therefore, we can rewrite as follows:

(b - a + 4) >> 3

= (b + ~a + 5 - 0x100) >> 3

= ((b + ~a + 5) >> 3) - 0x20

Next, we decompose the operation into individual bit shifts:

((b + ~a + 5) >> 3) - 0x20

= ((((b + ~a + 1) >> 1) + 2) >> 2) - 0x20

= (((((b + ~a + 1) >> 1) >> 1) + 1) >> 1) - 0x20

This probably looks a bit crazy, but there is a reason for the madness, and that is that we can repeatedly convert ((a + b + 1) >> 1) to pavgb(a, b):

(((((b + ~a + 1) >> 1) >> 1) + 1) >> 1) - 0x20

= (((pavgb(b, ~a) >> 1) + 1) >> 1) - 0x20

= pavgb(pavgb(b, ~a) >> 1, 0) - 0x20

The result, translated to assembly:

movq mm0, [ecx]movq mm1, [ebx]

movq mm2, mm0

pxor mm0, xFFFFFFFFFFFFFFFF

pavgb mm0, mm1

pand mm0, xFEFEFEFEFEFEFEFE

psrlq mm0, 1

pavgb mm0, x0000000000000000

paddb mm2, xE0E0E0E0E0E0E0E0

paddb mm0, mm2

movq [edx], mm0

Note that while this is also 11 instructions, it processes 8 samples at a time instead of 4, and many of the instructions have lower latencies and fewer execution port restrictions than the unpack/mul/add/pack solution. Quick benchmarking shows the byte-oriented solution to be 40% faster, although admittedly for a really good comparison I'd need to convert both routines to SSE2 and double the number of pixels handled per loop. I think I can eliminate one of the constants by moving the complement and replacing pand + psrlq with another pavgb, but I haven't looked into that yet.

I haven't figured out how to handle the [5 3] case yet. It looks tougher due to the factor of three, but there might still be a way to structure the evaluation order to get all the rounding right. The question is whether or not it'll be faster than the straightforward solutions.