§ MMX throughout the years

VirtualDub, as a video program, is a very heavy user of MMX integer vector math instructions in x86 CPUs; in fact, most of the inner processing loops are almost exclusively MMX. Part of the reason is ease of coding: it's easier to operate on an (R,G,B) triplet with one MMX instruction than with three separate scalar ones while juggling three times as many values in only eight registers. The other reason is the significant performance gain that results.
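As a rough illustration of the "one instruction per pixel" idea, here is a portable C model of a packed add with unsigned saturation, which is what the MMX paddusb instruction does to eight bytes at once. The function name packed_add_sat_u8 is made up for this sketch and is not taken from VirtualDub; real MMX code would do all eight lanes in a single instruction rather than in a loop.

```c
#include <stdint.h>

/* Portable model of an MMX-style packed add with unsigned saturation
   (the operation paddusb performs on eight bytes). Illustration only:
   the name is hypothetical, and hardware handles all eight lanes in
   one instruction instead of looping. */
uint64_t packed_add_sat_u8(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 8; ++lane) {
        unsigned x = (unsigned)((a >> (lane * 8)) & 0xFF);
        unsigned y = (unsigned)((b >> (lane * 8)) & 0xFF);
        unsigned sum = x + y;
        if (sum > 255)
            sum = 255;              /* clamp to white instead of wrapping */
        result |= (uint64_t)sum << (lane * 8);
    }
    return result;
}
```

With a 32-bit pixel in the low half of each operand, one call brightens the red, green, and blue channels together, and a channel that would overflow simply stays at 255.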

A problem with using MMX, and its successor instruction set extensions SSE and SSE2, is that you have to pay careful attention to which CPUs support which extensions and how each extension performs on each CPU. Here is a braindump of my experiences over the years while working on VirtualDub.
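One common way to deal with this is to detect the extensions once at startup and pick a routine to match. The sketch below is a minimal version of that idea and is not VirtualDub's actual dispatch code: the CPUF_* flags, the variant table, and the function names are all hypothetical, and the detection relies on the GCC/Clang builtin __builtin_cpu_supports, falling back to "no extensions" on other compilers or architectures.

```c
#include <stddef.h>

/* Hypothetical feature flags; the names are made up for this sketch. */
enum { CPUF_MMX = 1, CPUF_SSE = 2, CPUF_SSE2 = 4 };

/* Query the CPU once. Uses a GCC/Clang builtin on x86; elsewhere it
   reports no extensions so that the scalar fallback gets selected. */
unsigned detect_cpu_flags(void)
{
    unsigned flags = 0;
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("mmx"))  flags |= CPUF_MMX;
    if (__builtin_cpu_supports("sse"))  flags |= CPUF_SSE;
    if (__builtin_cpu_supports("sse2")) flags |= CPUF_SSE2;
#endif
    return flags;
}

/* One entry per tuned variant, listed from most to least demanding. */
typedef struct {
    const char *name;      /* label for the variant */
    unsigned    required;  /* every flag this variant needs */
} variant;

static const variant variants[] = {
    { "sse2",   CPUF_MMX | CPUF_SSE | CPUF_SSE2 },
    { "mmx",    CPUF_MMX },
    { "scalar", 0 },
};

/* Return the first variant whose requirements are all present. */
const char *select_variant(unsigned flags)
{
    for (size_t i = 0; i < sizeof variants / sizeof variants[0]; ++i) {
        if ((flags & variants[i].required) == variants[i].required)
            return variants[i].name;
    }
    return "scalar";
}
```

A real program would store a function pointer per variant instead of a name, and call select_variant(detect_cpu_flags()) once during initialization. Note that this only answers "is it supported"; whether a supported extension is actually faster on a given CPU is a separate question.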

The Pentium MMX:

It all started with the Pentium MMX, which introduced the first "official" vector instruction set extension to the architecture. The rules for MMX execution were as follows:

* The Pentium had a pair of execution pipes, U and V, so its peak throughput was two instructions per clock.
* Most instructions could decode and execute in either pipe in one clock.
* Only one multiply and one shift could execute per cycle.
* Multiplies had a latency of three clocks.
* Instructions that accessed memory or the integer register file could only execute in the U pipe, and could not pair with non-MMX instructions.
* One cycle had to pass between the last write to an MMX register and a store from it.

The rules weren't that hard to follow and thus it wasn't that hard to achieve two MMX ops per clock. The toughest parts were trying to cover the 3 clock latency on multiplies, and the store latency, which usually meant twisting the end of a loop a bit to avoid a stall on the store and also avoid an address generation interlock (AGI) at the top of the next loop.

A lot of VirtualDub's older MMX code is tuned against the Pentium MMX, such as the reduce and resize filters. MMX was a huge jump in performance for many video tasks -- some of VirtualDub's routines run three times faster with MMX than without it. I have fond memories of the Pentium MMX because the additional MMX instructions were the key to making my MPEG-1 video decoder achieve full frame rate on my 200MHz machine at the time.

The Pentium II:

Intel's Pentium II CPU brought MMX to the Pentium Pro's out-of-order architecture. It used the same MMX unit as the Pentium MMX, so the execution behavior was the same. Even the 4-1-1 decode template didn't make much difference: all of the MMX execute instructions were 1 uop and load-execute or store instructions were 2, which gave you the same decoding behavior as the PMMX's memory-in-U restriction. The main change was that out-of-order (OOO) execution meant you no longer had to cover multiply latency manually -- putting a dependent op right up against a multiply wasn't a guaranteed stall anymore.

I should note that optimizing for the Pentium II was frustrating compared to the Pentium, because the out-of-order architecture made it difficult to determine bottlenecks. However, this was a period of very rapid CPU power increase -- the Pentium Pro architecture started around 150MHz and made it all the way up to 1.13GHz with the Pentium III. With such a ridiculous rate of increase, real-time 320x240 soon became a no-brainer, and full-size 640x480 was a reality.

The Pentium III:

The Pentium III added SSE, which was a mix of a few integer and a bunch of floating-point vector instructions. The integer instructions were welcome, particularly shuffle, prefetch, and streaming store. A couple of new averaging instructions helped speed up MPEG decoders, but the big one was the packed sum of absolute differences instruction (psadbw), which boosted encoder performance.
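To make the encoder case concrete, here is a scalar C model of what psadbw computes on one 64-bit operand, plus the kind of 16x16 block comparison a motion search performs with it. This is a sketch only: sad8 and sad_block16 are hypothetical names, the block size and pitch parameter are assumptions, and the hardware instruction produces the eight-byte sum in one operation instead of a loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of psadbw on one 64-bit operand: the sum of absolute
   differences of eight unsigned bytes. */
unsigned sad8(const uint8_t a[8], const uint8_t b[8])
{
    unsigned sum = 0;
    for (int i = 0; i < 8; ++i) {
        int d = (int)a[i] - (int)b[i];
        sum += (unsigned)(d < 0 ? -d : d);
    }
    return sum;
}

/* SAD of a 16x16 block, as used when scoring a candidate motion vector.
   pitch is the distance in bytes between rows (assumed equal for both
   images in this sketch). */
unsigned sad_block16(const uint8_t *cur, const uint8_t *ref, size_t pitch)
{
    unsigned total = 0;
    for (int row = 0; row < 16; ++row) {
        total += sad8(cur, ref);          /* left half of the row  */
        total += sad8(cur + 8, ref + 8);  /* right half of the row */
        cur += pitch;
        ref += pitch;
    }
    return total;
}
```

Without psadbw, each of those eight-byte sums takes a sequence of subtract, absolute value, unpack, and add steps, which is why a single instruction for it pays off so well in an encoder's inner loop.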

The floating-point instructions, on the other hand... were a mixed bag. Part of the problem was the awkward data movement; getting any integer values smaller than 32-bit into SSE registers was a pain, and the shuffle instruction was weird. What was really bad, though, was that all 128-bit SSE ops actually executed as two 64-bit ops to the same single execution port. This meant the CPU could only decode one 128-bit instruction per cycle and only execute one every two clocks! For algorithms that could use either MMX or SSE, this meant a hefty 4:1 advantage in peak instruction throughput in MMX's favor: two MMX ops per clock against one SSE op every two clocks. In fact, except for the additional registers, SSE operations act a lot like pairs of 3DNow! instructions.

For the above reasons, and also because AMD didn't support floating-point SSE until the Athlon XP, I haven't used much FP SSE in VirtualDub. There is a little of it in the audio code for sample conversion, but that's it.

The Pentium 4:

Pentium 4's NetBurst architecture brought a revamped pipeline and the SSE2 instruction set to the table, the latter of which added double-precision and integer operations to SSE. However, it also brought a set of new challenges. One is that the Pentium 4 has pretty bad latency over a wide range of instructions; another is that only one execution port can execute MMX ALU operations, instantly halving peak throughput relative to the Pentium II architecture. Even worse, register-to-register moves have a ridiculous six clock latency, which makes MMX dependency chains on the Pentium 4 quite long. Overall, this makes the Pentium 4 rather bad at MMX compared to the Pentium II and III.

The saving grace, however, is that SSE and SSE2 128-bit ops only take a single micro-op on P4, compared to two on PII/III. This means that 128-bit instructions can issue twice as fast as they execute, and more importantly, means that you can alternate between execution subunits. In particular, multiply and add/shift operations can overlap. When perfectly balanced this leads to two 64-bit operations per cycle and thus parity with the PII/III's peak throughput. You can usually get decent gains in a FIR loop, where you have a mix of pmaddwd and paddd instructions.
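For readers who haven't seen such a loop, the following scalar C model shows the shape of a FIR inner loop built from pmaddwd (multiply 16-bit pairs and add adjacent products) and paddd (add 32-bit lanes). It models the 64-bit MMX forms for brevity; the SSE2 forms do the same on twice as many lanes. The names pmaddwd64 and fir_dot are hypothetical, the tap count is assumed to be a multiple of four, and this is not VirtualDub's actual filter code.

```c
#include <stddef.h>
#include <stdint.h>

/* Wrap a wide value to 32 bits the way the hardware lanes do. The final
   unsigned-to-signed conversion is implementation-defined in C, but it
   wraps on the usual two's complement compilers. */
static int32_t wrap32(int64_t v)
{
    return (int32_t)(uint32_t)v;
}

/* Scalar model of 64-bit pmaddwd: four signed 16-bit products, with
   adjacent pairs summed into two 32-bit results. */
void pmaddwd64(int32_t out[2], const int16_t a[4], const int16_t b[4])
{
    out[0] = wrap32((int64_t)a[0] * b[0] + (int64_t)a[1] * b[1]);
    out[1] = wrap32((int64_t)a[2] * b[2] + (int64_t)a[3] * b[3]);
}

/* FIR dot product: one pmaddwd and one paddd per four taps, followed by
   a horizontal add of the two accumulator lanes. taps is assumed to be
   a multiple of 4; any leftover taps are ignored in this sketch. */
int32_t fir_dot(const int16_t *samples, const int16_t *coeffs, size_t taps)
{
    int32_t acc[2] = {0, 0};
    for (size_t i = 0; i + 4 <= taps; i += 4) {
        int32_t prod[2];
        pmaddwd64(prod, samples + i, coeffs + i);   /* multiply unit  */
        acc[0] = wrap32((int64_t)acc[0] + prod[0]); /* paddd, lane 0  */
        acc[1] = wrap32((int64_t)acc[1] + prod[1]); /* paddd, lane 1  */
    }
    return wrap32((int64_t)acc[0] + acc[1]);
}
```

The loop body alternates one multiply-type operation with one add-type operation, which is the kind of mix that lets the multiply and add/shift subunits overlap as described above.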

I should note that the Pentium 4's problems have to be balanced against its hefty 50% clock rate lead. It's difficult to achieve, but a Pentium 4 running at peak rate crunches pixels at a very scary rate.

The definitive resource for information on Pentium, PPro/II/III, and P4 tuning is Agner Fog's "How to optimize for the Pentium microprocessors."

The Pentium-M:

Pentium-M uses a revamped Pentium III architecture with support for SSE2 instructions. I don't have a Pentium-M and thus don't have tuning experience with it, although from what I hear its per-clock performance is a lot more pleasing than the Pentium 4's. What would be interesting is to find out whether SSE2 actually helps or hurts the Pentium-M compared to MMX. Its decoder is improved compared to the Pentium III, but if it still has problems decoding multi-uop instructions then it may actually be faster to use MMX than SSE2.

The Athlon 64:

AMD's Athlon 64 has a burly front-end decoder, so it has far fewer decode bottlenecks than Intel CPUs. From what I can tell so far in CodeAnalyst's pipeline analysis mode, its execution performance for MMX/SSE/SSE2 code is very comparable to the Pentium II/III: two 64-bit MMX ops/clock or one 128-bit SSE2 op/clock. This isn't surprising considering that benchmarks have shown the Pentium-M's enhanced PPro architecture to be competitive with Athlons in performance. The one execution advantage that I know of so far is that the Athlon 64 can do two 64-bit shifts per clock, which is of dubious value.

I've only recently begun profiling VirtualDub code on Athlon 64, but for the most part it's a lot simpler: to figure out the decode clock time in a loop, take the number of instructions in the loop and divide by three. After that, keep the pipes full by breaking dependencies and balancing the execution units.
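The decode rule of thumb is simple enough to write down directly. In this small sketch the function name is hypothetical, and rounding up to a whole clock is an assumption layered on top of the "divide by three" rule, not something stated above.

```c
/* Rough decode-time estimate for a loop on the Athlon 64, following the
   "divide the instruction count by three" rule of thumb. Rounding up to
   a whole clock is an assumption of this sketch. */
unsigned decode_clocks(unsigned instructions)
{
    return (instructions + 2) / 3;
}
```

For example, a 12-instruction loop body needs about 4 clocks just to decode, so there is little point in shaving execution time below that without also removing instructions.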

Comments

Comments posted:


Holy crap! That's informative, AND interesting. I also understood it.
Good job!

Kizawa (link) - 29 10 04 - 19:50


that is great:)

meilin - 30 10 04 - 06:05


and what about 3dnow and 3dnow enh?

Guillermo - 30 10 04 - 12:22


Very informative. Thank you again for this wonderful program. I for one am glad you had too much free time in college!

Natakel - 31 10 04 - 02:02


i liked it. being a young cs major i am greatly curious about such insights from experienced programmers. though i am disappointed that i can't yet get a version working for my mac

Corey Tebo - 31 10 04 - 17:30


I have to agree: informative and somewhat understandable (at least for someone who's never written in x86 assembly before).

I too would like to get your perspective/thoughts/complaints on 3DNow! and any other experiences, adventures, and/or heartbreaks you've experienced with AMD chips (I've got a K6-2 500, an Athlon-C 1.4GHz, and a really nice Athlon64 3000+, so can you tell why I'm curious?).

I love the frequency of your updates with your new blogging/news publishing software, by the way. : -D

Abu Hurayrah (link) - 01 11 04 - 03:10


Very nice and informative article! Thanks for all the input to my brain :-)

Andreas - 01 11 04 - 07:24


Good article.. BTW: Who made Pentium M, with its excellent efficiency per clock? Where Intel got(bought) it?

TomK - 01 11 04 - 07:32


Cool! Just some random guy looking for Virtualdub and decided to read this article. Reminds me of my softice days.

Vidness (link) - 01 11 04 - 11:29


blimey. I also was just looking for Virtualdub and read this. Reminds me of 1984 when writing assembler for the Z80 was the only way to get performance out of a machine. Makes me think that if the guys writing Excel or whatever knew as much as you, we might all be a LOT more productive. Good on you (a) for knowing (b) for doing and (c) for inspiring.

greg - 01 11 04 - 18:34


Fantastic article. Even more fantastic program. I use it all the while. Thank God for college!!!

Tropical - 02 11 04 - 00:41


That was a lot of good information I had no idea of. I understood most of it even though I have never done any assembly coding. It makes me want to learn x86 assembly and start working on some projects with it!!

Jeff - 02 11 04 - 14:26
