## § ¶SIMD averaging without SIMD hardware

If you need to average two sets of bitfields using larger word operations, there is a trick derived from a hardware algorithm for constructing adders that can come in handy. The basic comes from the design of a *carry-save adder*, which splits apart carry and sum signals to allow multiple values to be added with only one carry propagation pass at the end:

a+b = ((a & b) << 1) + (a ^ b)

(& is the C operator for bitwise AND, ^ is bitwise exclusive OR, and << is a left shift.)

To convert this into an average, propagate through the necessary shift:

(a+b) >> 1 = (a & b) + ((a ^ b) >> 1)

...and to apply this to bitfields, just mask off the least significant bits from each bitfield sum to avoid cross-field contamination:

(pel0+pel1) >> 1 = (pel0 & pel1) + (((pel0 ^ pel1) & 0xfefefefe)>>1)

This can be useful even if you have SIMD hardware that has built-in averaging. For instance, SSE has support for unsigned byte averages, but the above algorithm can be used to average four 565 pixels at a time by using the mask 0xf7def7def7def7de.

Note that the above algorithm always rounds down, which isn't always appropriate; MPEG motion prediction, for instance, requires rounding up. There is a trick that can be used to average up without using any additional operations.

The trick is to recognize that the addition can also be formulated using inclusive OR and subtraction:

a+b = ((a & b) << 1) + (a^b) = ((a|b) << 1) - (a^b)

Shifting out the LSB from the exclusive OR result then has the effect of rounding up instead of rounding down, giving:

(a+b+1) >> 1 = (a|b) - ((a^b)>>1)

(pel0+pel1+1)>>1 = (pel0 | pel1) - (((pel0 ^ pel1) & 0xfefefefe)>>1)

### Comments

**Comments posted:**

cool!

in the last line, you mean:

(pel0+pel1+1)>>1 = (pel0 | pel1) ***MINUS*** (((pel0 ^ pel1) & 0xfefefefe)>>1)

**User** - 06 07 06 - 04:45

How about compiling something like HAKMEM 2... :)

**[maven]** (link) - 06 07 06 - 06:57

@User:

Whoops, fixed. Thanks!

**Phaeron** - 06 07 06 - 23:34

Any smart-math way of doing an 8bit alpha blend (8bit source, 8bit destination, 8bit alpha)?

**Blight** - 07 07 06 - 01:04

Isn't it quite senseless to program without doing it 'the SIMD way'?

I mean, basicly every processor in existence today has some form of SIMD instruction set.

/Zaro

**Zaro** - 07 07 06 - 04:52

@Zaro, cleaverness like this is probably even more appropriate in SIMD processing. Think harder about the 565 pixel example Avery gave.

@Blight, I take it you, like me, want an improved way to do

Dst += (Src - Dst) * Alpha / 255;

which in regular code could be implemented quickish as

Dst += LookupTable[(Src-Dst)

**IanB** - 07 07 06 - 22:33

Doh!, Smartarse blog parser! Take 2

*Dst += LookupTable[(Src-Dst) ._lshift_. 8 | alpha]; // 17 bit table*

Which doesn't play well in SIMD.

Thoughts anybody?

**IanB** - 07 07 06 - 22:42

@Zaro:

You're assuming that the SIMD instruction set natively supports the datatypes you are working with. MMX and SSE are rather rigid in this regard, with some support for unsigned bytes, a lot of support for signed words, and very sketchy support for everything else. Conversion to and from signed words tends to put a lot of pressure on the shift unit, so to the extent that you can find alternate ways to work with odd bitfields, the better performance you can get. I hear that Altivec is a lot nicer with regard to data type and operation orthogonality, but it still isn't going to support every combination of packed data type that you will encounter in the wild.

**Phaeron** - 08 07 06 - 00:07