§ SSE4.1, first impressions

Trimbo's been bugging me lately about SSE4.1 and my experiences with it.

Well, I've just started playing around with it, now that I have the tools and build set up, and my experience has been mixed.

The main problem is alignment. To make good use of a Core 2+ and of SSE4.1, you have to go from MMX (64-bit) to XMM (128-bit) registers. The annoyance that comes along with this is that while 64-bit memory accesses can be freely unaligned, 128-bit accesses effectively have to be aligned: the aligned load/store forms fault on a misaligned address, and the unaligned forms (MOVDQU/MOVUPS) carry a significant penalty on current CPUs. That means it isn't as trivial as just taking a 64-bit loop and processing pairs of pixels at a time. It's true that misaligned loads hurt performance even in 64-bit code, but there are two mitigating factors there. One is that on modern CPUs, not all misaligned accesses cause a penalty -- only the ones that cross an L1 cache line boundary and trigger a DCU split do. This means that for sequential accesses only a quarter or less of your loads take the penalty, which may be acceptable if the cost of avoiding the misalignment is higher. The other factor is that when processing pixels, it's common to have to expand 8-bit unsigned channels to 16-bit signed, which means that the loads are frequently 32-bit and thus can always be aligned. Going to 128-bit vectors and 64-bit loads spoils this. What this all means is that several algorithms I looked at for rewriting in SSE4.1 looked great until I examined the load/store paths and realized that I was going to burn more time dealing with misalignment than I would save with the faster ALU operations.
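For illustration, a minimal intrinsics sketch of the load-width point (not actual VirtualDub code): the 64-bit path expands one 32-bit pixel per load, so the load is always naturally aligned, while the 128-bit path wants a 64-bit load that may straddle a cache line.

    #include <mmintrin.h>    /* MMX */
    #include <emmintrin.h>   /* SSE2 */

    /* 64-bit path: one 32-bit pixel per iteration; a 32-bit load of a
       32-bit pixel is always aligned */
    static __m64 expand_one(const unsigned char *src) {
        __m64 px = _mm_cvtsi32_si64(*(const int *)src);
        return _mm_unpacklo_pi8(px, _mm_setzero_si64());
    }

    /* 128-bit path: two 32-bit pixels per iteration; the 64-bit load is
       only aligned if the pair happens to start on an 8-byte boundary */
    static __m128i expand_pair(const unsigned char *src) {
        __m128i px = _mm_loadl_epi64((const __m128i *)src);
        return _mm_unpacklo_epi8(px, _mm_setzero_si128());
    }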

You might say that memory buffers should just be aligned, and yes, you can do that to some extent, particularly with temporary buffers. The gotcha is that you don't always control the buffers involved, and at some point you simply can't control the alignment. For example, display buffers probably aren't guaranteed to be 16-byte aligned, and neither is a GDI DIB. Similar problems occur at the ends of scanlines, depending on the width of the image; it's lame and inflexible to just have your library require that all working images be multiples of 4/8/16 pixels. Working with .NET? Oh, sorry, the GC heap doesn't support alignment at all -- screwed. The compromise that I generally shoot for is a routine that works with odd widths and non-optimal alignment, even if it's a little slower due to the fixup code.
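A rough sketch of that compromise, with hypothetical per-pixel and per-block helpers: peel off a scalar head until the destination is aligned, run the fast aligned middle, and mop up whatever odd width remains; temporary buffers, at least, can simply be allocated aligned.

    #include <stddef.h>
    #include <stdint.h>
    #include <malloc.h>      /* _aligned_malloc (MSVC) */

    /* hypothetical helpers -- one pixel at a time, and 16 bytes at a time */
    void process_pixel(unsigned char *dst, const unsigned char *src);
    void process_block16(unsigned char *dst, const unsigned char *src);

    void filter_row(unsigned char *dst, const unsigned char *src, size_t n) {
        /* scalar head: advance until dst is 16-byte aligned */
        while (n && ((uintptr_t)dst & 15)) {
            process_pixel(dst++, src++);
            --n;
        }
        /* aligned middle: dst can use MOVDQA; src may still need MOVDQU */
        while (n >= 16) {
            process_block16(dst, src);
            dst += 16; src += 16; n -= 16;
        }
        /* scalar tail: the leftover odd width */
        while (n--)
            process_pixel(dst++, src++);
    }

    /* temporary buffers are the easy case:
       unsigned char *tmp = (unsigned char *)_aligned_malloc(rowbytes, 16); */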

There are also cases that simply don't scale to larger vectors. For example, in a scaling or rotation algorithm, I might need to pull 32-bit pixels at various locations and expand them to 64-bit for processing. What am I going to do with 128-bit vectors? I can't pull pairs of pixels from each location, because I only need one. I could process more pixels in parallel, except that I only have eight registers and having four source pointers is hard enough as it is. It's more doable in long mode with 16 GPRs, but I haven't even gotten to that yet.
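To illustrate why, here is a hypothetical rotation-style fetch (not code from VirtualDub): each destination pixel needs exactly one 32-bit source pixel from a computed address, so half of an XMM register goes unused unless two such fetches are packed together -- which doubles the address arithmetic and the register pressure.

    #include <stddef.h>
    #include <emmintrin.h>

    /* fetch one 32-bit pixel from a computed (u, v) and widen it to
       16-bit channels; only the low 64 bits of the result are used */
    static __m128i fetch_expand(const unsigned char *base, int u, int v, ptrdiff_t pitch) {
        const unsigned char *p = base + v * pitch + u * 4;
        __m128i px = _mm_cvtsi32_si128(*(const int *)p);
        return _mm_unpacklo_epi8(px, _mm_setzero_si128());
    }

    /* filling the whole register means two of everything:
       __m128i pair = _mm_unpacklo_epi64(fetch_expand(base, u0, v0, pitch),
                                         fetch_expand(base, u1, v1, pitch)); */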

In terms of instruction selection, the SSE4.1 instruction that seems most useful so far is PMOVZXBW, because it replaces the load/move+PUNPCKLBW pair that I commonly use without eating an extra register for the zero. PBLENDW is also looking useful for some alignment scenarios. Beyond that, most of the instructions I think I can abuse are actually from SSSE3, since I'm not that interested in the floating point part of SSE4.1 for VirtualDub. In SSSE3, PHADDD (packed horizontal add doubleword) is turning out quite useful, because I frequently do high precision integer dot products, and that means PMADDWD followed by a horizontal add. PSHUFB is also promising, especially given how fast it is on SSE4.1-capable CPUs, but it's annoying that it requires the shuffle pattern to be in memory or in a register and that it works in-place. PALIGNR looks useful but often requires unrolling because its shift amount is an immediate argument.
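A small intrinsics sketch of those two points, for illustration only: PMOVZXBW versus the SSE2 load plus unpack-with-zero, and PMADDWD followed by PHADDD for an integer dot product.

    #include <tmmintrin.h>   /* SSSE3: PHADDD */
    #include <smmintrin.h>   /* SSE4.1: PMOVZXBW */

    /* SSE2 way: needs a separate zero register for the unpack */
    static __m128i widen_sse2(const unsigned char *p) {
        __m128i v = _mm_loadl_epi64((const __m128i *)p);
        return _mm_unpacklo_epi8(v, _mm_setzero_si128());
    }

    /* SSE4.1 way: PMOVZXBW zero-extends directly (a compiler will
       typically fold the load into the memory form of the instruction) */
    static __m128i widen_sse41(const unsigned char *p) {
        return _mm_cvtepu8_epi16(_mm_loadl_epi64((const __m128i *)p));
    }

    /* integer dot product: PMADDWD makes 4 partial dword sums, PHADDD
       collapses them without the old shuffle+add dance */
    static int dot16(__m128i a, __m128i b) {
        __m128i s = _mm_madd_epi16(a, b);
        s = _mm_hadd_epi32(s, s);
        s = _mm_hadd_epi32(s, s);
        return _mm_cvtsi128_si32(s);
    }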

The 8-bit resamplers -- which are used in 1.8.0 when using the "resize" filter with YCbCr video -- got about 20-25% faster with SSE4.1 in my first attempt compared to the MMX version. Unfortunately, I don't know how much of this is due to SSE4.1 and how much is just due to the move to 128-bit vectors, since Core 2 is twice as fast at those and Enhanced Core 2 is even faster. I have some ideas for abusing PBLENDW and PSHUFB to optimize conversion between RGB24 and RGB32, but the alignment issue is a bear. I've also been thinking about whether I can speed up the RGB<->YCbCr converters; PMADDUBSW is the most promising instruction there, but the coefficient precision would be marginal. I also got the idea of abusing MPSADBW for a fast box filter, although the fixed radius would be a bit restrictive, it'd only help horizontally, and I'm not sure what I would use it for.
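One straightforward way to put PSHUFB to work on RGB24-to-RGB32, sketched here for illustration (not necessarily the approach hinted at above): load 16 source bytes, shuffle the first four 3-byte pixels into 4-byte slots -- an index with the high bit set makes PSHUFB write a zero -- and OR in an opaque alpha. The alignment problem shows up immediately: the source pointer advances by 12 bytes per group of four pixels, so at best only every fourth load is aligned, and the 16-byte load over-reads past the last group of a row.

    #include <tmmintrin.h>   /* SSSE3 */

    static void rgb24_to_rgb32_x4(unsigned char *dst, const unsigned char *src) {
        const __m128i shuf = _mm_setr_epi8(
             0,  1,  2, (char)0x80,
             3,  4,  5, (char)0x80,
             6,  7,  8, (char)0x80,
             9, 10, 11, (char)0x80);
        const __m128i alpha = _mm_set1_epi32((int)0xFF000000);

        __m128i v = _mm_loadu_si128((const __m128i *)src);   /* 16 bytes, 12 used */
        v = _mm_shuffle_epi8(v, shuf);                        /* repack, alpha lanes zeroed */
        _mm_storeu_si128((__m128i *)dst, _mm_or_si128(v, alpha));
    }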

I'm not seeing a revolution here compared to what I was doing with MMX and SSE2, but it is a bit nicer overall -- I'm not spending as much time on register-to-register moves and unpacks as I used to. My guess is that if you already have a CPU that's at least SSSE3 capable (Core 2), then you're already getting most of the benefits instruction-wise, and what you're missing is in the microarchitecture rather than in the lack of SSE4.1. I'm also beginning to see some strengths and weaknesses of SSSE3/SSE4.1 against AMD's SSE5, at least for image processing. The data movement capabilities of SSSE3/SSE4.1 look superior, but SSE5 has some really compelling ALU operations: PMADCSSWD (packed multiply, add and accumulate signed word to signed doubleword with saturation) looks perfect for what I do. The main question is how fast AMD can get it. I'd heard that the fused multiply-add unit in Altivec-capable PPC chips was a problem in terms of gating clock speed, and that the solution was to pipeline it to the point that it became less compelling due to latency; we'll see what happens with SSE5.
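For reference, the idiom that a single PMADCSSWD would collapse (ignoring its saturation behavior) is just today's multiply-add followed by an accumulate:

    #include <emmintrin.h>

    /* PMADDWD + PADDD: multiply 16-bit pairs, sum to dwords, accumulate */
    static __m128i mac16(__m128i acc, __m128i a, __m128i b) {
        return _mm_add_epi32(acc, _mm_madd_epi16(a, b));
    }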

Comments

I think the blend instructions were meant to replace the common branch elimination code (compare a, b => (a & ~mask) | (b & mask)), or at least that's how I use it.

Gabest - 26 04 08 - 21:25


If you use PBLENDVB... the other use is to emulate destination write masks in vertex shader code (add r0.yw, r1, r2). But yeah, it's faster than using three instructions (pxor+pand+pxor or pand+pandn+por).

Phaeron - 27 04 08 - 01:10
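For reference, the select idiom under discussion, sketched with intrinsics: the classic three-op branchless select versus SSE4.1's PBLENDVB (whose non-VEX form takes the mask implicitly in XMM0).

    #include <smmintrin.h>

    /* SSE2: r = (a & ~mask) | (b & mask) */
    static __m128i select_sse2(__m128i a, __m128i b, __m128i mask) {
        return _mm_or_si128(_mm_andnot_si128(mask, a), _mm_and_si128(mask, b));
    }

    /* SSE4.1: one instruction, selecting per byte on the mask's high bit */
    static __m128i select_sse41(__m128i a, __m128i b, __m128i mask) {
        return _mm_blendv_epi8(a, b, mask);
    }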


Avery, you can always pack two 64-bit parts into one 128-bit register; the pack/shuffle/unpack path is ridiculously fast on Core 2+.
As for aligning memory, in most situations you can specify your own buffers, right? Unaligned access to 16-byte data is up to 50% slower than an aligned one, so it quickly pays off to care about alignment all the time.

Igor Levicki (link) - 27 04 08 - 14:49


Yeah, you can do load+load+unpack, but that blunts a lot of the advantage that 128-bit code would have over 64-bit code. It's also less convenient in 128-bit mode because the XMM forms of the unpack low instructions do 128-bit loads, whereas the 64-bit versions only do 32-bit loads (this is now correct in the Intel manuals and can be verified via page fault behavior).

Don't forget as well that unaligned access has a benefit other than just not having to align data: it is essentially an embedded double-size shift operation. Any sort of FIR filter running at an irregular step can really benefit from this capability, even despite periodic misalignment penalties -- the cost of doing aligned loads and a manual fixup is very high (load + load + shift + shift + or, by my quick estimate, not counting shift computation). Sometimes you can pre-shift the coefficient tables to counteract this, as I did in 1.8.x, but you pay for it with more ALU operations and higher memory bandwidth requirements for the coefficient stream. It's also possible in some cases, though, to arrange the misaligned loads so that they don't fall across cache line boundaries and thus end up being free. That's hard to beat with an aligned load requirement, even with the fixed offsets making use of PALIGNR possible.

I do try harder to avoid unaligned writes than unaligned reads, though, because if you're writing directly to video memory, unaligned writes can really kill your performance.

Phaeron - 27 04 08 - 16:21
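A sketch of the point about the embedded shift, for illustration: the unaligned load works for any offset and only pays on a cache-line split, while the aligned-plus-PALIGNR alternative needs two loads and takes its shift count as an immediate, which is exactly why it tends to force unrolling.

    #include <tmmintrin.h>   /* SSSE3: PALIGNR */

    /* direct: any offset, penalized only when the access splits a cache line */
    static __m128i fetch_window(const unsigned char *row, int offset) {
        return _mm_loadu_si128((const __m128i *)(row + offset));
    }

    /* aligned + fixup: row + base must be 16-byte aligned, and k must be a
       compile-time constant, so each offset needs its own code path */
    #define FETCH_WINDOW_ALIGNED(row, base, k) \
        _mm_alignr_epi8( \
            _mm_load_si128((const __m128i *)((row) + (base) + 16)), \
            _mm_load_si128((const __m128i *)((row) + (base))), (k))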


Some compelling reasons to upgrade to VS2008:
http://blogs.msdn.com/vcblog/archive/200..
http://blogs.msdn.com/vcblog/archive/200..

Yuhong Bao - 12 05 08 - 14:20
