## § ¶Optimizing a FIR filter routine

Recently I had to implement a low-pass audio filter in software. A low-pass filter is so named because it passes low frequencies while muting high ones, similar to what you'd get by turning treble all the way down on a stereo. Low-pass filters have a number of uses, the particular use in this case being to prevent aliasing in a subsequent resampling pass.

There are many ways to implement a low-pass filter, but the method that I used was a finite impulse response (FIR) filter. FIR filters have a few advantages, such as simplicity of implementation in software and ease of making linear-phase filters. The cutoff frequency was fairly high, so the FIR filter kernel didn't need that many taps -- a 15 tap symmetric filter was enough.

To retell the tale, let's start with this routine:

void filter(float *dst, const float *src, size_t n, const float *kernel) {

const float k0 = kernel[0];

const float k1 = kernel[1];

const float k2 = kernel[2];

const float k3 = kernel[3];

const float k4 = kernel[4];

const float k5 = kernel[5];

const float k6 = kernel[6];

const float k7 = kernel[7];do {

float v = src[7] * k0

+ (src[ 6] + src[ 8]) * k1

+ (src[ 5] + src[ 9]) * k2

+ (src[ 4] + src[10]) * k3

+ (src[ 3] + src[11]) * k4

+ (src[ 2] + src[12]) * k5

+ (src[ 1] + src[13]) * k6

+ (src[ 0] + src[14]) * k7;++src;*dst++ = v;

} while(--n);

}

So... how fast does it run? Well, compiled using Visual C++ 2010 and run on a Core 2, the timing is about 125,000 cycles for 4,096 samples, or 30.5 cycles/sample. Not too surprising, considering that there are 22 elementary operations per sample (14 additions, 8 multiplications). This is with cache effects mostly discounted, as I test routines like this by running them 10 times in a loop and taking the minimum time -- cache effects are important, but they're too often noise when profiling a routine like this and can make it hard to compare routines. Therefore, all of these timings are ideal timings with regard to memory delays.

Another thing you might notice is the lack of __restrict; I'm omitting them for brevity here, although they were used on the timings. I know why the restrict keyword exists and how it helps optimization, but truth be told, I've never seen a single case where the Visual C++ compiler took advantage of it, no matter how many times I've tried it.

**Alright, that's the easy case....**

It's not terribly useful to benchmark a routine on a fast system, so we're going to try something harder: my trusty little Atom-based netbook. Survey says: 442,000 cycles, or 108 cycles/sample. Ugh. Well, we know that an Atom isn't as fast as a Core 2, so this isn't surprising. However, we can do better than this. First stop is to look at the disassembly:

fld dword ptr [eax+8]

add eax,4

fadd dword ptr [eax-4]

add ecx,4

dec edx

fmul st,st(4)

fld dword ptr [eax]

fmul st,st(6)

faddp st(1),st

fld dword ptr [eax+8]

fadd dword ptr [eax-8]

fmul st,st(4)

faddp st(1),st

fld dword ptr [eax+0Ch]

fadd dword ptr [eax-0Ch]

fmul st,st(3)

faddp st(1),st

fld dword ptr [eax+10h]

fadd dword ptr [eax-10h]

fmul st,st(2)

faddp st(1),st

fld dword ptr [eax+14h]

fadd dword ptr [eax-14h]

fmul dword ptr [esp+10h]

faddp st(1),st

fld dword ptr [eax+18h]

fadd dword ptr [eax-18h]

fmul dword ptr [esp+14h]

faddp st(1),st

fld dword ptr [eax+1Ch]

fadd dword ptr [eax-1Ch]

fmul dword ptr [esp+18h]

faddp st(1),st

fstp dword ptr [esp+2Ch]

fld dword ptr [esp+2Ch]

fstp dword ptr [ecx-4]

jne 0000005F

Not too bad, but there are some obvious issues here. In my experience, Visual C++ tends not to interleave calculations and instead prioritizes clustering related operations together. This is not a bad idea for a CPU like the Core 2, which has both out-of-order execution and an anemic lack of general purpose registers. However, it's not great for the in-order Atom because it means that the whole routine is full of stalls, since nothing can run in parallel. We need to do something about this.

The most obvious thing to do would be to try to reorder the instructions to gain some parallelism. I won't spend time here doing so, but suffice it to say that I tried this and it executed even more poorly than the straightforward code above. A bit of checking with Agner Fog's optimization tome reveals that the Atom has two big problems in this area, one being pathological scheduling with back-to-back x87 instructions and another being a non-free cost for the FXCH instruction. The Core 2, of course, doesn't mind and actually likes the rescheduled instruction stream better. We know that it's possible to do better on both CPUs, though, so it's not worth wasting any more time with x87.

**Time for SSE**

You might wonder why I didn't try throwing /arch:SSE on the compiler. Well, I did, but the x86 version of Visual C++ is generally reluctant to use SSE scalar math instructions even when you tell it to, and refused to recompile the inner loop as such. Part of the reason is that it's constrained by the ABI, which requires returning parameters in st(0), and I suspect it's also partly due to some older CPUs that executed x87 better than scalar SSE. Therefore, we'll have to go to assembly for this one. The good news is that in SSE we don't have to deal with a stupid register stack:

movss xmm0, [ecx+7*4]movss xmm1, [ecx+6*4]

mulss xmm0, [edx+0*4]movss xmm2, [ecx+5*4]

addss xmm1, [ecx+8*4]movss xmm3, [ecx+4*4]

addss xmm2, [ecx+9*4]

mulss xmm1, [edx+1*4]movss xmm4, [ecx+3*4]

addss xmm3, [ecx+10*4]

mulss xmm2, [edx+2*4]

addss xmm0, xmm1movss xmm5, [ecx+2*4]

addss xmm4, [ecx+11*4]

mulss xmm3, [edx+3*4]

addss xmm0, xmm2movss xmm6, [ecx+1*4]

addss xmm5, [ecx+12*4]

mulss xmm4, [edx+4*4]

addss xmm0, xmm3movss xmm7, [ecx+0*4]

addss xmm6, [ecx+13*4]

mulss xmm5, [edx+5*4]

addss xmm0, xmm4mulss xmm6, [edx+6*4]

addss xmm0, xmm5addss xmm7, [ecx+14*4]

addss xmm0, xmm6mulss xmm7, [edx+7*4]

add ecx, 4addss xmm0, xmm7movss [ebx], xmm0

add ebx, 4dec eax

jne xloop

Core 2 likes this about the same as the interleaved x87 asm routine at about 88,000 cycles (21.5 cycles/sample). Atom appreciates it a lot more, however, and grinds through it twice as fast at 200,000 cycles (49 cycles/sample). Not bad.

So we've established that scalar SSE floating-point runs much, much better on Atom than x87. I guess the hardware designers are sick of the register stack, too. We're still doing everything in scalar math, however, so next stop is vectorization.

**Vectorizing a FIR routine**

If you take a look at the original routine again you'll notice that I took advantage of the symmetric filter kernel and added symmetric taps together before scaling by the common weighting factors. Well, this doesn't work as well for a vectorized routine, because the symmetry is around a single tap, and when dealing with vectors that means some annoying shuffling to get everything in place. In this case, it seems more trouble than it's worth, so we'll ditch it and just evaluate the filter kernel without the symmetry optimization. This means double the multiplications, but those are about the same cost as the shuffles would be anyway as long as we keep the pipelines fed.

That leaves the question of how to evaluate the filter. Ordinarily I construct FIR loops to read from a sliding window, weighting input samples to produce a single output sample. However, for vectorized routines I've warmed up to the alternate strategy of inverting the filter and accumulating output samples into a sliding window instead. There are two reasons to do this: (1) read/modify/write is no more costly if you can keep the destination in a register, and more importantly (2) it avoids the need for expensive horizontal adds.

First stab, using intrinsics:

__m128 zero = _mm_setzero_ps();

__m128 x0 = zero;

__m128 x1 = zero;

__m128 x2 = zero;

__m128 x3 = zero;

__m128 f0;

__m128 f1;

__m128 f2;

__m128 f3;// init filter

__m128 k0 = _mm_loadu_ps(kernel + 0);

__m128 k1 = _mm_loadu_ps(kernel + 4);f0 = _mm_shuffle_ps(k1, k1, _MM_SHUFFLE(0, 1, 2, 3));

f1 = _mm_shuffle_ps(k0, k0, _MM_SHUFFLE(0, 1, 2, 3));

f2 = _mm_move_ss(k0, k1);

f2 = _mm_shuffle_ps(f2, f2, _MM_SHUFFLE(0, 3, 2, 1));

f3 = _mm_move_ss(k1, zero);

f3 = _mm_shuffle_ps(f3, f3, _MM_SHUFFLE(0, 3, 2, 1));// prime

for(int i=0; i<14; ++i) {

x0 = _mm_move_ss(x0, x1);

x0 = _mm_shuffle_ps(x0, x0, _MM_SHUFFLE(0, 3, 2, 1));

x1 = _mm_move_ss(x1, x2);

x1 = _mm_shuffle_ps(x1, x1, _MM_SHUFFLE(0, 3, 2, 1));

x2 = _mm_move_ss(x2, x3);

x2 = _mm_shuffle_ps(x2, x2, _MM_SHUFFLE(0, 3, 2, 1));

x3 = _mm_move_ss(x3, zero);

x3 = _mm_shuffle_ps(x3, x3, _MM_SHUFFLE(0, 3, 2, 1));

__m128 s = _mm_load1_ps(src++);

x0 = _mm_add_ps(x0, _mm_mul_ps(f0, s));

x1 = _mm_add_ps(x1, _mm_mul_ps(f1, s));

x2 = _mm_add_ps(x2, _mm_mul_ps(f2, s));

x3 = _mm_add_ps(x3, _mm_mul_ps(f3, s));

}// pipeline

do {

x0 = _mm_move_ss(x0, x1);

x0 = _mm_shuffle_ps(x0, x0, _MM_SHUFFLE(0, 3, 2, 1));

x1 = _mm_move_ss(x1, x2);

x1 = _mm_shuffle_ps(x1, x1, _MM_SHUFFLE(0, 3, 2, 1));

x2 = _mm_move_ss(x2, x3);

x2 = _mm_shuffle_ps(x2, x2, _MM_SHUFFLE(0, 3, 2, 1));

x3 = _mm_move_ss(x3, zero);

x3 = _mm_shuffle_ps(x3, x3, _MM_SHUFFLE(0, 3, 2, 1));

__m128 s = _mm_load1_ps(src++);

x0 = _mm_add_ps(x0, _mm_mul_ps(f0, s));

x1 = _mm_add_ps(x1, _mm_mul_ps(f1, s));

x2 = _mm_add_ps(x2, _mm_mul_ps(f2, s));

x3 = _mm_add_ps(x3, _mm_mul_ps(f3, s));_mm_store_ss(dst++, x0);

} while(--n);

This is a bit more obscure than the original routine, so a bit of explanation: x0-x3 form a 16 output sample pipeline where we shift in zeros at the high end (x3.w) and shift out samples at the low end (x0.x). At the same time, we bring in input samples one at a time and accumulate its contribution to each of the 16 output samples according to the input sample's relative location to the output sample. This is done simply by splatting out the input sample 16 times and scaling that vector by the reversed kernel (which is the same, since it's symmetric). Each turn of the crank merges one input sample, shifts the window over one, and writes out one output sample. There is one other minor gotcha, which is that we have to prime the pipe with 14 input samples first -- but that's just the normal loop with the output sample discarded.

Core 2 clocks in at 66,000 cycles (16.1 cycles/sample), Atom at 141,000 cycles (34.4 cycles/sample). That looks a bit better. We're now at about 2-2.5x over the original scalar code.

**Problem in the machine code**

If we dump the disassembly generated by VC10, there's a problem:

movss xmm6,xmm1

movss xmm3,xmm6

movss xmm6,xmm2

movss xmm1,xmm6

movss xmm6,xmm0

movss xmm2,xmm6

movss xmm6,xmm4

movss xmm0,xmm6

movaps xmmword ptr [esp+10h],xmm0

movss xmm0,dword ptr [eax]

shufps xmm0,xmm0,0

movaps xmm6,xmm0

mulps xmm6,xmm5

shufps xmm3,xmm3,39h

addps xmm3,xmm6

movaps xmm6,xmm0

mulps xmm6,xmmword ptr [esp+20h]

shufps xmm1,xmm1,39h

addps xmm1,xmm6

movaps xmm6,xmm0

mulps xmm6,xmmword ptr [esp+30h]

mulps xmm0,xmmword ptr [esp+40h]

shufps xmm2,xmm2,39h

addps xmm2,xmm6

movaps xmm6,xmmword ptr [esp+10h]

mov ecx,edx

add eax,4

add edx,4

dec esi

shufps xmm6,xmm6,39h

addps xmm0,xmm6

movss dword ptr [ecx],xmm3

jne 00000330

What the heck is going on with the movss instructions at the top of the loop? Not too bash the VC++ team too much, but unnecessary register-to-register moves have been an issue with intrinsics in VC++ all the way back to the VC6 Processor Pack, and it's astonishing that even with the renewed focus on intrinsics code generation in VC10 the problem still hasn't been solved (bug 556347). Sigh. Back to assembly it is:

movss xmm7, [ecx]

movss xmm0, xmm1

shufps xmm7, xmm7, 0movaps xmm4, [esp]

movss xmm1, xmm2

shufps xmm0, xmm0, 00111001bmovaps xmm5, [esp+16]

movss xmm2, xmm3

shufps xmm1, xmm1, 00111001bpxor xmm6, xmm6

add ecx, 4

shufps xmm2, xmm2, 00111001bmovss xmm3, xmm6

movaps xmm6, [esp+32]

mulps xmm4, xmm7shufps xmm3, xmm3, 00111001b

mulps xmm5, xmm7

mulps xmm6, xmm7mulps xmm7, [esp+48]

addps xmm0, xmm4

addps xmm1, xmm5

addps xmm2, xmm6

addps xmm3, xmm7movss [edx], xmm0

add edx, 4dec eax

jne xloop

Pipeline is in XMM0-XMM3, source pointer is ECX, dest pointer is EDX. I've omitted setup code here, but don't worry, I'll provide it later. Results: Core 2 takes 40,000 cycles (9.8 cycles/sample), Atom takes 89,000 cycles (21.8 cycles/sample). That's about a third faster than the compiler.

**Scheduling for Atom**

Can we get this lower? Definitely, at least for the Atom. Specifically, there are some stalls in the above code that are causing problems:

- MULPS can only be issued in pipe 0 every two cycles, and has a five cycle latency.
- ADDPS can only be issued in pipe 1, and has a five cycle latency.

Therefore, we've got to reorder the instructions to clear the stalls. Brace yourself, this is going to be messy:

movss xmm7, [ecx]

movss xmm0, xmm1shufps xmm7, xmm7, 0mulps xmm4, xmm7

movss xmm1, xmm2shufps xmm0, xmm0, 00111001b

add ecx, 4mulps xmm5, xmm7

movss xmm2, xmm3shufps xmm1, xmm1, 00111001b

movaps xmm6, xmm7mulps xmm6, [esp+32]shufps xmm2, xmm2, 00111001b

addps xmm0, xmm4mulps xmm7, [esp+48]

add edx, 4psrldq xmm3, 4

addps xmm1, xmm5movaps xmm4, [esp]

dec eaxmovaps xmm5, [esp+16]

addps xmm2, xmm6movss [edx-4], xmm0addps xmm3, xmm7

jne xloop

I've added a single SSE2 instruction, PSRLDQ, to avoid an issue with register pressure (couldn't spare a register for zero). It's worth noting that while the Atom is capable of running two instructions per cycle at peak, there are unused instruction slots in the loop above. The reason is that the loop is bottlenecked on instructions that can only execute in the first pipe, so there's no possible way to drop a clock by reordering instructions alone. This is because pipe 0 handles too many types of instructions we need here: load/store, shift, and vector multiply. It would be easier if load/store instructions executed in pipe 1, but the problem is that would likely make MULPS xmm, m128 instructions unpairable, which would have made it a wash here.

Incidentally, Agner's tome says there is a 4-5 clock latency to read memory for vector instructions on the Atom, but I wasn't seeing that at all on loads. Good thing, too, because it would have made the code run much slower.

This runs at about the same speed on Core 2, but on Atom, it runs another third faster: 62,000 cycles, or 15 cycles/sample. By my estimates the code should actually run at 14 cycles/sample, but I have been unable to eliminate the last cycle for some reason -- it could be that branches on Atom are two cycles, in which case 15 cycles would be optimal for this instruction mix. This is about the fastest I've been able to get the routine so far, for an improvement of 3x over the scalar code on Core 2 and 7x on Atom.

**Left as exercises for the reader**

If you want to play with the code: http://www.virtualdub.org/downloads/fir.zip. It's not polished, it doesn't check CPU flags, and it doesn't even have a project file. I trust you can deal with this, of course. It works on VS2005 and VS2010.

The routine would likely run faster if written in x64, as with the extra registers some of the load/stores could be eliminated and the loop unrolled. To bad I'm still stuck in x86 land. :P

I tried an SSSE3 version to take advantage of PALIGNR instead of MOVSS + SHUFPS, but it wasn't as advantageous as I had hoped -- 10% faster on Core 2, but 7% slower on Atom. One reason is that PALIGNR is a shift instruction, so on Atom it still requires pipe 0, while MOVSS is a cheap instruction that can fit in a pipe 1 hole. Another reason is that PALIGNR, like SHUFPS, has the source and destination registers swapped from the way you would expect. In this case, it forces the pipeline to be run in the opposite direction with left shifts, which then requires extra instructions to pull the output sample from the top of a register. I tried EXTRACTPS to do this, until I found out the hard way -- via a crash -- that it's an SSE4.1 instruction. Oops.

Best I can tell, the only way to get a major speedup from this code on x86 would be to ditch floating-point and go back to good old fashioned 16-bit fixed point. Fixed-point code is still viable in many cases because integer instructions can often execute in more pipes and with lower latencies. Going to fixed point would mean switching the algorithm back to the original style of a sliding source window, as the multiplies widen values from 16-bit to 32-bit, and we'd need to take advantage of PMADDWD in order to get a speedup over the FP code. The downside is that we'd be back to doing a horizontal add at the end, and that would be tricky to optimize and interleave.

Final note: The level of optimization achieved here is a bit of overkill in that the original application runs this on a real-time, stereo 64KHz audio stream. If you crunch the numbers, that's 0.8% of a 1.6GHz Atom even with the original scalar code, and at a 10% CPU target for battery life and heat reasons, that's still only 8% of total CPU usage. Truth be told, I got a little carried away again optimizing this thing, but I figure that I might as well take the 3-7x improvement now that I have it.