## New CPU's a bit faster than I expected

Midway through the development of the VirtualDub 1.5.x series, one of the decisions that I made was to deprecate the old VBitmap-based blit library and write a new library based on what I call pixmap objects. These were new descriptor structures that solved a number of annoyances in the old VBitmap structure, such as:

- not being able to distinguish between 555 and 565 16-bit RGB
- not supporting YCbCr or multi-plane formats
- *#($&# upside down encoding
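To make the list above concrete, here is a minimal sketch of what a descriptor along these lines could look like. The names (`PixFormat`, `Pixmap`, `RowPtr`, `WrapBottomUp`) are invented for illustration and are not the actual VirtualDub declarations, and it assumes the "upside down" complaint refers to bottom-up row storage. The idea is that 555 and 565 get distinct format tags, each plane carries its own pointer and pitch, and a bottom-up buffer is handled by a negative pitch so that row 0 is always the top row.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical format tags: 555 and 565 are distinct, planar YCbCr is allowed.
enum class PixFormat { XRGB1555, RGB565, RGB888, XRGB8888, Y8, YUV420Planar };

// Hypothetical descriptor: up to three planes, each with its own base pointer
// and signed pitch (bytes from the start of one row to the start of the next).
struct Pixmap {
    PixFormat format;
    int width;
    int height;
    void* data[3];
    std::ptrdiff_t pitch[3];
};

// Address of row y in a plane; y = 0 is always the top row of the image.
inline std::uint8_t* RowPtr(const Pixmap& px, int plane, int y) {
    return static_cast<std::uint8_t*>(px.data[plane]) + px.pitch[plane] * y;
}

// Wrap a single-plane buffer stored bottom-up: point at the last stored row
// (which is the top of the image) and negate the pitch.
inline Pixmap WrapBottomUp(void* bits, int w, int h, std::ptrdiff_t stride,
                           PixFormat fmt) {
    Pixmap px{fmt, w, h, {nullptr, nullptr, nullptr}, {0, 0, 0}};
    px.data[0] = static_cast<std::uint8_t*>(bits) + stride * (h - 1);
    px.pitch[0] = -stride;
    return px;
}
```

With this layout a blitter only ever walks rows through `RowPtr`, so it never needs to know whether the underlying buffer was stored top-down or bottom-up.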

In addition, I started writing a unified blit interface (VDPixmapBlt) that would support orthogonal blitting between any two formats, with the only exception being paletted formats, which were source-only. This turned out to be a great decision, because now I don't have to worry about whether a conversion exists for a particular path and can just do blits as I need them. This isn't always advisable for performance or quality reasons -- Y8 is not a good intermediate format for conversion from 24-bit RGB to 32-bit RGB -- but it's wonderful for less critical paths such as UI and just migrating old code in general. It's the main reason that the number of RGB-only paths in VirtualDub has been shrinking over time and will continue to dwindle.
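One way an any-to-any interface like this can be organized is a lookup table indexed by source and destination format, with a fallback through an intermediate format when no direct routine is registered. The sketch below is only an illustration under that assumption: the `Format` enum, the `Blit` signature, and the choice of XRGB8888 as the intermediate are all made up here and are not the real VDPixmapBlt code. It also treats Y8 as full-range gray to keep the arithmetic trivial.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative subset of formats; the names are invented for this sketch.
enum Format { kRGB888, kXRGB8888, kY8, kFormatCount };

using Pixels = std::vector<std::uint8_t>;  // tightly packed pixel data
using BlitFn = Pixels (*)(const Pixels& src, std::size_t count);

Pixels Copy(const Pixels& src, std::size_t) { return src; }

Pixels Rgb888ToXrgb8888(const Pixels& src, std::size_t count) {
    Pixels dst(count * 4);
    for (std::size_t i = 0; i < count; ++i) {
        for (int c = 0; c < 3; ++c) dst[4 * i + c] = src[3 * i + c];
        dst[4 * i + 3] = 0;  // unused X byte
    }
    return dst;
}

Pixels Xrgb8888ToRgb888(const Pixels& src, std::size_t count) {
    Pixels dst(count * 3);
    for (std::size_t i = 0; i < count; ++i)
        for (int c = 0; c < 3; ++c) dst[3 * i + c] = src[4 * i + c];
    return dst;
}

// Simplification: Y8 is treated as full-range gray.
Pixels Y8ToXrgb8888(const Pixels& src, std::size_t count) {
    Pixels dst(count * 4);
    for (std::size_t i = 0; i < count; ++i) {
        for (int c = 0; c < 3; ++c) dst[4 * i + c] = src[i];
        dst[4 * i + 3] = 0;
    }
    return dst;
}

// kTable[src][dst]; nullptr means "no direct routine for this pair".
const BlitFn kTable[kFormatCount][kFormatCount] = {
    /* from RGB888   */ {Copy, Rgb888ToXrgb8888, nullptr},
    /* from XRGB8888 */ {Xrgb8888ToRgb888, Copy, nullptr},
    /* from Y8       */ {nullptr, Y8ToXrgb8888, Copy},
};

// Returns true and fills dst if a path exists, either directly or through
// the XRGB8888 intermediate; returns false if neither is available.
bool Blit(Pixels& dst, Format dstFmt, const Pixels& src, Format srcFmt,
          std::size_t count) {
    if (BlitFn direct = kTable[srcFmt][dstFmt]) {
        dst = direct(src, count);
        return true;
    }
    BlitFn toMid = kTable[srcFmt][kXRGB8888];
    BlitFn fromMid = kTable[kXRGB8888][dstFmt];
    if (!toMid || !fromMid) return false;
    dst = fromMid(toMid(src, count), count);
    return true;
}
```

The caller just asks for Y8 to RGB888 and gets it, without caring that no dedicated routine exists. The cost is the extra pass through the intermediate, which is why a poorly chosen intermediate hurts both speed and quality.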

I've now begun to run up against the limits of this current scheme, and a while ago I started toying with a new scheme I call uberblit, which is a scanline-based scheme consisting of a tree of generators. The main motivations behind uberblit are to increase flexibility (dithering, chroma resampling options, more formats) and to eliminate the current table-based scheme where I have to initialize an O(N^2) table of blitters. It's turned out to be more complex than I'd like and hasn't yet become a good general-purpose replacement for the existing blitter, but it has some advantages in terms of flexibility and data cache strategy, and it was useful enough that I pulled it into 1.8.0 for YCbCr resampling.
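For readers who haven't seen this kind of design, a scanline-based generator tree can be sketched roughly as follows. This is a toy under my own assumptions, not the uberblit code: the `RowGen` interface and the three node types are hypothetical, samples are 8-bit only, and the upsampler just repeats samples. Each node produces one row on demand by pulling rows from its children through a virtual call.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical node interface: produce one scanline on request.
struct RowGen {
    virtual ~RowGen() = default;
    virtual int Width() const = 0;                        // samples per row
    virtual void FetchRow(int y, std::uint8_t* dst) = 0;  // writes Width() samples
};

// Leaf: reads rows straight out of an 8-bit plane in memory.
class PlaneSource : public RowGen {
public:
    PlaneSource(const std::uint8_t* data, int width, std::ptrdiff_t pitch)
        : data_(data), width_(width), pitch_(pitch) {}
    int Width() const override { return width_; }
    void FetchRow(int y, std::uint8_t* dst) override {
        const std::uint8_t* row = data_ + pitch_ * y;
        std::copy(row, row + width_, dst);
    }
private:
    const std::uint8_t* data_;
    int width_;
    std::ptrdiff_t pitch_;
};

// Interior node: doubles the row width by repeating each sample.
class HorizUpsample2x : public RowGen {
public:
    explicit HorizUpsample2x(std::unique_ptr<RowGen> src)
        : src_(std::move(src)), tmp_(src_->Width()) {}
    int Width() const override { return 2 * src_->Width(); }
    void FetchRow(int y, std::uint8_t* dst) override {
        src_->FetchRow(y, tmp_.data());
        for (std::size_t x = 0; x < tmp_.size(); ++x)
            dst[2 * x] = dst[2 * x + 1] = tmp_[x];
    }
private:
    std::unique_ptr<RowGen> src_;    // declared first: tmp_ is sized from it
    std::vector<std::uint8_t> tmp_;  // one-row scratch buffer
};

// Interior node with two children: interleaves their samples (a0 b0 a1 b1 ...).
// Assumes both children have the same width.
class Interleave2 : public RowGen {
public:
    Interleave2(std::unique_ptr<RowGen> a, std::unique_ptr<RowGen> b)
        : a_(std::move(a)), b_(std::move(b)),
          ta_(a_->Width()), tb_(b_->Width()) {}
    int Width() const override { return 2 * a_->Width(); }
    void FetchRow(int y, std::uint8_t* dst) override {
        a_->FetchRow(y, ta_.data());
        b_->FetchRow(y, tb_.data());
        for (std::size_t x = 0; x < ta_.size(); ++x) {
            dst[2 * x] = ta_[x];
            dst[2 * x + 1] = tb_[x];
        }
    }
private:
    std::unique_ptr<RowGen> a_, b_;
    std::vector<std::uint8_t> ta_, tb_;
};
```

Adding a new format or resampling option means adding a node rather than another row and column of a table, and the working set is a handful of row buffers instead of whole intermediate frames. The price is one virtual call, and often one copy, per node per row.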

As part of some work to see if I could make it usable for general blitting, I wrote a benchmark to compare its performance against the existing blitters, and the result wasn't pretty. For conversions the uberblitter is almost always slower than the primary blitter, usually because there are extra conversions, optimized span routines aren't hooked up, or there is simply overhead in the row fetch paths (a few layers of virtual calls can be slower than the turbo SSE2 routine they control). After a little bit of tweaking I got depressed, since it was pretty obvious that it would take more than hooking up a couple of routines to get uberblit up to snuff. Still, being able to more easily handle a wider variety of formats like YCbCr with PAL DV style chroma and float3/4 is a compelling reason to keep trying.

Anyway, back to the subject... after getting my new laptop, I decided to compare the two quickly just to see what the performance of the new CPU was like. The old laptop is a 1.8GHz Pentium M, while the new laptop is a current generation Core 2 Duo with SSE4.1 support ("Penryn" core). The dual core part is irrelevant here as the blitters are single threaded, as are SSE3/SSSE3/SSE4.1 since I don't use them yet, but the Core 2 Duo part is important as the extra execution unit and the single clock SSE throughput are *very* applicable.

Results are below (all figures in megapixels/second), but the gist is that the Penryn-based system is a **lot** faster at running the existing blitters than I had expected given the GHz ratings: the clock ratio is only about 1.4:1, yet the direct XRGB1555 blit runs about 6.4 times faster. The most likely reason is that memory and the front side bus are faster, because the direct blit (no conversion) cases are ridiculously faster and those are definitely not execution bound, as they're memcpy() based. It's also possible that the Core 2 Duo is cheating due to larger caches, but it wins handily even in the cases where heavy computation is involved (RGB<->YCbCr conversion with chroma subsampling). The blitters are also designed to reduce this effect by interleaving calculations via scanline stripes whenever multipassing is required, so that data flows in and out of the caches smoothly instead of having performance drop off a cliff after a certain bitmap size.
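The stripe idea can be sketched like this. It is a simplified illustration and not the actual blitter code: `PassA` and `PassB` are placeholder stages standing in for real conversion steps, and `ConvertStriped` is a hypothetical driver. Instead of running the first pass over the whole frame into a frame-sized temporary and then running the second pass, both passes run over a few rows at a time, so the temporary stays small no matter how big the bitmap is.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Placeholder first stage: widen 8-bit samples to 16-bit and scale them.
inline void PassA(const std::uint8_t* src, std::uint16_t* tmp, std::size_t w) {
    for (std::size_t x = 0; x < w; ++x)
        tmp[x] = static_cast<std::uint16_t>(src[x] * 3);
}

// Placeholder second stage: narrow back down to 8-bit.
inline void PassB(const std::uint16_t* tmp, std::uint8_t* dst, std::size_t w) {
    for (std::size_t x = 0; x < w; ++x)
        dst[x] = static_cast<std::uint8_t>(tmp[x] >> 2);
}

// Two-pass conversion over tightly packed w*h 8-bit images, done in stripes
// of stripeRows rows. The intermediate buffer holds one stripe, not a frame.
void ConvertStriped(const std::uint8_t* src, std::uint8_t* dst,
                    std::size_t w, std::size_t h, std::size_t stripeRows) {
    stripeRows = std::max<std::size_t>(stripeRows, 1);
    std::vector<std::uint16_t> tmp(w * stripeRows);

    for (std::size_t y0 = 0; y0 < h; y0 += stripeRows) {
        const std::size_t rows = std::min(stripeRows, h - y0);

        // Pass 1 over this stripe only.
        for (std::size_t r = 0; r < rows; ++r)
            PassA(src + (y0 + r) * w, tmp.data() + r * w, w);

        // Pass 2 consumes the stripe while it is still likely to be in cache.
        for (std::size_t r = 0; r < rows; ++r)
            PassB(tmp.data() + r * w, dst + (y0 + r) * w, w);
    }
}
```

The output is identical for any stripe height, including one that covers the whole frame; only the size of the intermediate working set changes. Stripes of more than one row are useful when a later stage needs vertical neighbors, as chroma subsampling does.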

| Source | Dest | Pentium M 1.8GHz (MP/sec) | Penryn 2.5GHz (MP/sec) |
|---|---|---:|---:|
| XRGB1555 | XRGB1555 | 632.32 | 4044.64 |
| RGB565 | XRGB1555 | 614.11 | 2896.16 |
| RGB888 | XRGB1555 | 102.98 | 443.46 |
| XRGB8888 | XRGB1555 | 338.51 | 1503.78 |
| Y8 | XRGB1555 | 234.12 | 747.10 |
| UYVY | XRGB1555 | 34.46 | 126.74 |
| YUYV | XRGB1555 | 34.46 | 126.74 |
| YUV444 | XRGB1555 | 110.97 | 455.51 |
| YUV422 | XRGB1555 | 94.33 | 383.32 |
| YUV420 | XRGB1555 | 86.44 | 343.97 |
| YUV411 | XRGB1555 | 62.96 | 250.90 |
| YUV410 | XRGB1555 | 50.78 | 210.21 |
| XRGB1555 | RGB565 | 546.83 | 2277.56 |
| RGB565 | RGB565 | 634.02 | 4044.64 |
| RGB888 | RGB565 | 102.84 | 445.14 |
| XRGB8888 | RGB565 | 302.31 | 1379.94 |
| Y8 | RGB565 | 228.20 | 733.09 |
| UYVY | RGB565 | 39.54 | 148.85 |
| YUYV | RGB565 | 39.53 | 149.32 |
| YUV444 | RGB565 | 266.58 | 520.15 |
| YUV422 | RGB565 | 225.57 | 430.44 |
| YUV420 | RGB565 | 206.14 | 382.69 |
| YUV411 | RGB565 | 148.95 | 255.27 |
| YUV410 | RGB565 | 120.36 | 213.07 |
| XRGB1555 | RGB888 | 193.56 | 334.17 |
| RGB565 | RGB888 | 193.56 | 338.02 |
| RGB888 | RGB888 | 985.67 | 2826.37 |
| XRGB8888 | RGB888 | 715.21 | 1413.19 |
| Y8 | RGB888 | 344.48 | 454.63 |
| UYVY | RGB888 | 172.11 | 242.34 |
| YUYV | RGB888 | 172.24 | 242.09 |
| YUV444 | RGB888 | 103.89 | 141.92 |
| YUV422 | RGB888 | 97.95 | 135.76 |
| YUV420 | RGB888 | 92.61 | 128.54 |
| YUV411 | RGB888 | 79.60 | 113.71 |
| YUV410 | RGB888 | 70.57 | 103.94 |
| XRGB1555 | XRGB8888 | 637.47 | 1167.11 |
| RGB565 | XRGB8888 | 687.94 | 1221.82 |
| RGB888 | XRGB8888 | 902.27 | 1563.93 |
| XRGB8888 | XRGB8888 | 744.73 | 2152.19 |
| Y8 | XRGB8888 | 358.70 | 700.27 |
| UYVY | XRGB8888 | 171.61 | 239.38 |
| YUYV | XRGB8888 | 171.86 | 262.11 |
| YUV444 | XRGB8888 | 278.61 | 543.03 |
| YUV422 | XRGB8888 | 238.65 | 448.55 |
| YUV420 | XRGB8888 | 216.01 | 398.96 |
| YUV411 | XRGB8888 | 153.23 | 285.04 |
| YUV410 | XRGB8888 | 123.27 | 236.72 |
| XRGB1555 | Y8 | 165.90 | 325.82 |
| RGB565 | Y8 | 177.58 | 320.92 |
| RGB888 | Y8 | 270.89 | 503.41 |
| XRGB8888 | Y8 | 267.49 | 506.67 |
| Y8 | Y8 | 2792.73 | 7330.91 |
| UYVY | Y8 | 787.21 | 1111.80 |
| YUYV | Y8 | 789.86 | 1111.80 |
| YUV444 | Y8 | 2759.87 | 7108.76 |
| YUV422 | Y8 | 2759.87 | 7108.76 |
| YUV420 | Y8 | 2792.73 | 7108.76 |
| YUV411 | Y8 | 2759.87 | 7108.76 |
| YUV410 | Y8 | 2759.87 | 7108.76 |
| XRGB1555 | UYVY | 54.56 | 98.28 |
| RGB565 | UYVY | 56.42 | 100.68 |
| RGB888 | UYVY | 65.66 | 115.50 |
| XRGB8888 | UYVY | 65.51 | 114.16 |
| Y8 | UYVY | 541.78 | 853.05 |
| UYVY | UYVY | 1457.07 | 4044.64 |
| YUYV | UYVY | 617.34 | 1015.54 |
| YUV444 | UYVY | 202.41 | 367.69 |
| YUV422 | UYVY | 468.24 | 781.96 |
| YUV420 | UYVY | 386.47 | 634.02 |
| YUV411 | UYVY | 390.33 | 644.48 |
| YUV410 | UYVY | 244.11 | 410.84 |
| XRGB1555 | YUYV | 54.08 | 97.14 |
| RGB565 | YUYV | 55.80 | 99.61 |
| RGB888 | YUYV | 64.97 | 112.51 |
| XRGB8888 | YUYV | 64.63 | 112.14 |
| Y8 | YUYV | 551.97 | 849.96 |
| UYVY | YUYV | 617.34 | 1015.54 |
| YUYV | YUYV | 1457.07 | 4044.64 |
| YUV444 | YUYV | 201.88 | 368.85 |
| YUV422 | YUYV | 465.45 | 784.58 |
| YUV420 | YUYV | 385.84 | 634.02 |
| YUV411 | YUYV | 390.33 | 648.04 |
| YUV410 | YUYV | 243.60 | 413.01 |
| XRGB1555 | YUV444 | 75.99 | 136.15 |
| RGB565 | YUV444 | 77.19 | 140.81 |
| RGB888 | YUV444 | 96.34 | 164.28 |
| XRGB8888 | YUV444 | 96.30 | 163.82 |
| Y8 | YUV444 | 678.00 | 2277.56 |
| UYVY | YUV444 | 351.18 | 572.17 |
| YUYV | YUV444 | 351.18 | 570.78 |
| YUV444 | YUV444 | 981.54 | 2792.73 |
| YUV422 | YUV444 | 849.96 | 1533.26 |
| YUV420 | YUV444 | 639.21 | 1133.28 |
| YUV411 | YUV444 | 284.01 | 509.98 |
| YUV410 | YUV444 | 148.76 | 286.08 |
| XRGB1555 | YUV422 | 62.59 | 109.11 |
| RGB565 | YUV422 | 63.59 | 115.56 |
| RGB888 | YUV422 | 76.34 | 132.76 |
| XRGB8888 | YUV422 | 75.33 | 132.99 |
| Y8 | YUV422 | 1091.11 | 2969.48 |
| UYVY | YUV422 | 445.14 | 694.05 |
| YUYV | YUV422 | 445.14 | 689.97 |
| YUV444 | YUV422 | 302.70 | 612.50 |
| YUV422 | YUV422 | 1457.07 | 4044.64 |
| YUV420 | YUV422 | 1101.36 | 2039.90 |
| YUV411 | YUV422 | 1066.31 | 2005.03 |
| YUV410 | YUV422 | 297.70 | 566.64 |
| XRGB1555 | YUV420 | 55.63 | 100.04 |
| RGB565 | YUV420 | 55.79 | 102.62 |
| RGB888 | YUV420 | 65.51 | 123.21 |
| XRGB8888 | YUV420 | 65.29 | 123.08 |
| Y8 | YUV420 | 1574.42 | 4511.33 |
| UYVY | YUV420 | 232.73 | 383.32 |
| YUYV | YUV420 | 232.50 | 382.07 |
| YUV444 | YUV420 | 186.33 | 341.97 |
| YUV422 | YUV420 | 402.38 | 681.94 |
| YUV420 | YUV420 | 1922.86 | 5213.09 |
| YUV411 | YUV420 | 339.49 | 583.55 |
| YUV410 | YUV420 | 1325.36 | 2522.46 |
| XRGB1555 | YUV411 | 65.77 | 112.30 |
| RGB565 | YUV411 | 66.87 | 120.55 |
| RGB888 | YUV411 | 81.23 | 137.19 |
| XRGB8888 | YUV411 | 80.61 | 137.11 |
| Y8 | YUV411 | 1404.72 | 3449.84 |
| UYVY | YUV411 | 274.69 | 457.29 |
| YUYV | YUV411 | 274.69 | 463.61 |
| YUV444 | YUV411 | 392.95 | 704.47 |
| YUV422 | YUV411 | 523.64 | 953.61 |
| YUV420 | YUV411 | 679.97 | 1209.22 |
| YUV411 | YUV411 | 1907.23 | 4887.27 |
| YUV410 | YUV411 | 539.29 | 1024.41 |
| XRGB1555 | YUV410 | 59.33 | 104.68 |
| RGB565 | YUV410 | 60.15 | 107.31 |
| RGB888 | YUV410 | 71.83 | 119.75 |
| XRGB8888 | YUV410 | 71.63 | 119.87 |
| Y8 | YUV410 | 2234.18 | 5585.45 |
| UYVY | YUV410 | 193.88 | 313.20 |
| YUYV | YUV410 | 193.88 | 313.20 |
| YUV444 | YUV410 | 247.98 | 410.84 |
| YUV422 | YUV410 | 295.08 | 481.70 |
| YUV420 | YUV410 | 592.40 | 1101.36 |
| YUV411 | YUV410 | 508.87 | 811.73 |
| YUV410 | YUV410 | 2495.63 | 6516.36 |