New CPU's a bit faster than I expected

Midway through the development of the VirtualDub 1.5.x series, one of the decisions that I made was to deprecate the old VBitmap-based blit library and write a new library based on what I call pixmap objects. These were new descriptor structures that solved a number of annoyances in the old VBitmap structure.

In addition, I started writing a unified blit interface (VDPixmapBlt) that would support orthogonal blitting between any two formats, with the only exception being paletted formats, which were source-only. This turned out to be a great decision, because now I don't have to worry about whether a conversion exists for a particular path and can just do blits as I need them. This isn't always advisable for performance or quality reasons -- Y8 is not a good intermediate format for conversion from 24-bit RGB to 32-bit RGB -- but it's wonderful for less critical paths such as UI and just migrating old code in general. It's the main reason that the number of RGB-only paths in VirtualDub has been shrinking over time and will continue to dwindle.
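To make the "any format to any format" idea a bit more concrete, here is a minimal sketch of a table-driven dispatcher. It is not VirtualDub's code: the three-entry format list, the `Blit` function, the `kTable` layout, and the fallback through XRGB8888 are all assumptions made for illustration, and the luma weights are just a common integer approximation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical format list; a real blitter would have many more entries.
enum Format { kXRGB8888, kRGB888, kY8, kFormatCount };

using Pixels = std::vector<uint8_t>;
using SpanFn = Pixels (*)(const Pixels& src, size_t count);

static size_t BytesPerPixel(Format f) {
    return f == kXRGB8888 ? 4 : f == kRGB888 ? 3 : 1;
}

// Same-format "conversion": just copy the data.
static Pixels Copy(const Pixels& src, size_t) { return src; }

static Pixels Rgb888ToXrgb8888(const Pixels& src, size_t n) {
    Pixels dst(n * 4);
    for (size_t i = 0; i < n; ++i) {
        dst[i * 4 + 0] = src[i * 3 + 0];  // B
        dst[i * 4 + 1] = src[i * 3 + 1];  // G
        dst[i * 4 + 2] = src[i * 3 + 2];  // R
        dst[i * 4 + 3] = 0;               // X (unused)
    }
    return dst;
}

static Pixels Xrgb8888ToRgb888(const Pixels& src, size_t n) {
    Pixels dst(n * 3);
    for (size_t i = 0; i < n; ++i)
        for (size_t c = 0; c < 3; ++c)
            dst[i * 3 + c] = src[i * 4 + c];
    return dst;
}

static Pixels Y8ToXrgb8888(const Pixels& src, size_t n) {
    Pixels dst(n * 4);
    for (size_t i = 0; i < n; ++i) {
        dst[i * 4 + 0] = dst[i * 4 + 1] = dst[i * 4 + 2] = src[i];
        dst[i * 4 + 3] = 0;
    }
    return dst;
}

static Pixels Xrgb8888ToY8(const Pixels& src, size_t n) {
    Pixels dst(n);
    for (size_t i = 0; i < n; ++i) {
        int b = src[i * 4 + 0], g = src[i * 4 + 1], r = src[i * 4 + 2];
        dst[i] = uint8_t((77 * r + 150 * g + 29 * b) >> 8);  // approximate luma
    }
    return dst;
}

// N x N table indexed [source][destination]; nullptr marks a missing direct path.
static const SpanFn kTable[kFormatCount][kFormatCount] = {
    /* XRGB8888 -> */ { Copy,             Xrgb8888ToRgb888, Xrgb8888ToY8 },
    /* RGB888   -> */ { Rgb888ToXrgb8888, Copy,             nullptr      },
    /* Y8       -> */ { Y8ToXrgb8888,     nullptr,          Copy         },
};

// Uses the direct routine if one exists, otherwise converts via XRGB8888.
Pixels Blit(Format srcFmt, Format dstFmt, const Pixels& src, size_t n) {
    assert(src.size() == n * BytesPerPixel(srcFmt));
    if (SpanFn direct = kTable[srcFmt][dstFmt])
        return direct(src, n);
    SpanFn toHub = kTable[srcFmt][kXRGB8888];
    SpanFn fromHub = kTable[kXRGB8888][dstFmt];
    assert(toHub && fromHub);
    return fromHub(toHub(src, n), n);
}
```

The caller never has to ask whether a path exists; the cost is that the table grows with the square of the format count, and a two-step fallback is slower than a dedicated routine.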

I've now begun to run up against the limits of this current scheme, and a while ago I started toying with a new scheme I call uberblit, which is a scanline-based scheme consisting of a tree of generators. The main motivations behind uberblit are to increase flexibility (dithering, chroma resampling options, more formats) and to eliminate the current table-based scheme where I have to initialize an O(N^2) table of blitters. It's turned out to be more complex than I like and hasn't yet become a good general purpose replacement for the existing blitter, but it has some advantages in terms of flexibility and data cache strategy, and it was useful enough that I pulled it into 1.8.0 for YCbCr resampling.
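A scanline-based tree of generators can be pictured roughly as follows. This is only a sketch of the general pull model, not the uberblit design itself: the `RowGen` interface and the `PlaneSource`, `HalveHoriz`, and `Interleave` nodes are hypothetical names, and the real thing presumably avoids allocating a fresh row on every fetch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

using Row = std::vector<uint8_t>;

// Hypothetical node interface: each generator produces one scanline on demand.
struct RowGen {
    virtual ~RowGen() = default;
    virtual int Width() const = 0;
    virtual Row Fetch(int y) = 0;
};

// Leaf node: reads rows from an 8-bit plane held in memory.
class PlaneSource : public RowGen {
public:
    PlaneSource(std::vector<uint8_t> data, int w, int h)
        : mData(std::move(data)), mW(w), mH(h) {
        assert(mData.size() == size_t(w) * size_t(h));
    }
    int Width() const override { return mW; }
    Row Fetch(int y) override {
        assert(y >= 0 && y < mH);
        auto first = mData.begin() + std::ptrdiff_t(y) * mW;
        return Row(first, first + mW);
    }
private:
    std::vector<uint8_t> mData;
    int mW, mH;
};

// Interior node: 2:1 horizontal downsample by averaging adjacent pairs.
class HalveHoriz : public RowGen {
public:
    explicit HalveHoriz(std::unique_ptr<RowGen> src) : mSrc(std::move(src)) {}
    int Width() const override { return mSrc->Width() / 2; }
    Row Fetch(int y) override {
        Row in = mSrc->Fetch(y);
        Row out(in.size() / 2);
        for (size_t i = 0; i < out.size(); ++i)
            out[i] = uint8_t((in[2 * i] + in[2 * i + 1] + 1) >> 1);
        return out;
    }
private:
    std::unique_ptr<RowGen> mSrc;
};

// Interior node with two children: interleaves their rows sample by sample.
class Interleave : public RowGen {
public:
    Interleave(std::unique_ptr<RowGen> a, std::unique_ptr<RowGen> b)
        : mA(std::move(a)), mB(std::move(b)) {
        assert(mA->Width() == mB->Width());
    }
    int Width() const override { return mA->Width() * 2; }
    Row Fetch(int y) override {
        Row a = mA->Fetch(y), b = mB->Fetch(y);
        Row out;
        out.reserve(a.size() * 2);
        for (size_t i = 0; i < a.size(); ++i) {
            out.push_back(a[i]);
            out.push_back(b[i]);
        }
        return out;
    }
private:
    std::unique_ptr<RowGen> mA, mB;
};
```

New behavior is added by writing another node type and wiring it into the tree rather than by filling in more table entries, which is where the flexibility comes from. Every `Fetch` is a virtual call per row per node, though, so the plumbing itself has a cost.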

As part of some work to see if I could make it usable for general blitting, I wrote a benchmark to compare its performance against the existing blitters, and the result wasn't pretty. For conversions the uberblitter is almost always slower than the primary blitter, usually because either there are extra conversions, optimized span routines aren't hooked up, or simply just overhead in the row fetch paths (a few layers of virtual calls can be slower than the turbo SSE2 routine it controls). After a little bit of tweaking I got depressed, since it was pretty obvious that it would take more than hooking up a couple of routines to get uberblit up to snuff. Still, being able to more easily handle a wider variety of formats like YCbCr with PAL DV style chroma and float3/4 is a compelling reason to keep trying.
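For reference, a throughput benchmark of this kind can be as simple as the sketch below. It is not the benchmark I actually used; `MegapixelsPerSecond`, `BenchmarkBlit`, and `BenchmarkCopy32` are hypothetical helpers, and a serious harness would also warm up the caches and repeat runs to reduce noise.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Converts a pixel count and an elapsed time into megapixels per second.
double MegapixelsPerSecond(uint64_t pixels, double seconds) {
    assert(seconds > 0.0);
    return double(pixels) / seconds / 1e6;
}

// Runs `blit` the given number of times over a w x h image and reports the
// throughput. Returns 0 if the timer was too coarse to measure anything.
template <class BlitFn>
double BenchmarkBlit(BlitFn blit, int w, int h, int iterations) {
    using Clock = std::chrono::steady_clock;
    const auto start = Clock::now();
    for (int i = 0; i < iterations; ++i)
        blit();
    const std::chrono::duration<double> elapsed = Clock::now() - start;
    const double seconds = elapsed.count();
    const uint64_t pixels = uint64_t(w) * uint64_t(h) * uint64_t(iterations);
    return seconds > 0.0 ? MegapixelsPerSecond(pixels, seconds) : 0.0;
}

// Example workload: a straight 32-bit copy, analogous to a no-conversion blit.
double BenchmarkCopy32(int w, int h, int iterations) {
    std::vector<uint8_t> src(size_t(w) * size_t(h) * 4, 0x55);
    std::vector<uint8_t> dst(src.size());
    return BenchmarkBlit(
        [&] { std::memcpy(dst.data(), src.data(), src.size()); },
        w, h, iterations);
}
```

Running the same `BenchmarkBlit` wrapper over two different blit implementations with identical image sizes gives directly comparable MP/sec figures.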

Anyway, back to the subject... after getting my new laptop, I decided to compare the two quickly just to see what the performance of the new CPU was like. The old laptop is a 1.8GHz Pentium M, while the new laptop is a current generation 2.5GHz Core 2 Duo with SSE4.1 support ("Penryn" core). The dual core part is irrelevant here as the blitters are single threaded, as are SSE3/SSSE3/SSE4.1 since I don't use them yet, but the Core 2 Duo part is important as the extra execution unit and the single clock SSE throughput are very applicable.

Results are below (all results in megapixels/second), but the gist is that the Penryn-based system is a lot faster at running the existing blitters than I had expected given the GHz ratings. The most likely reason is memory and the front side bus being faster, because the direct blit (no conversion) cases are ridiculously faster and those definitely are not execution bound as they're memcpy() based. It's also possible that the Core 2 Duo is cheating due to larger caches, but it wins handily even in the cases where heavy computation is involved (RGB<->YCbCr conversion with chroma subsampling). The blitters are also designed to reduce this effect by interleaving calculations via scanline stripes whenever multipassing is required, so that data flows in and out of the caches smoothly instead of having performance drop off a cliff after a certain bitmap size.
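The scanline stripe idea can be sketched like this. The two passes here are made up purely for illustration (a widening gain followed by a horizontal average), and `TwoPassStriped` is a hypothetical name; the point is only that the intermediate buffer holds one stripe rather than a whole frame, so it can stay cache resident regardless of bitmap size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Pass 1 (hypothetical): widen one row to 16 bits and apply a 2x gain.
static void Pass1Row(const uint8_t* src, uint16_t* tmp, int w) {
    for (int i = 0; i < w; ++i)
        tmp[i] = uint16_t(src[i] * 2);
}

// Pass 2 (hypothetical): average horizontal pairs and clamp back to 8 bits.
static void Pass2Row(const uint16_t* tmp, uint8_t* dst, int w) {
    for (int i = 0; i < w / 2; ++i) {
        int v = (tmp[2 * i] + tmp[2 * i + 1] + 1) >> 1;
        dst[i] = uint8_t(std::min(255, v));
    }
}

// Runs both passes stripe by stripe. Only stripeRows rows of intermediate
// data exist at any time, instead of a full-frame temporary.
std::vector<uint8_t> TwoPassStriped(const std::vector<uint8_t>& src,
                                    int w, int h, int stripeRows) {
    assert(w > 0 && h > 0 && w % 2 == 0 && stripeRows > 0);
    assert(src.size() == size_t(w) * size_t(h));
    const int dw = w / 2;
    std::vector<uint8_t> dst(size_t(dw) * size_t(h));
    std::vector<uint16_t> tmp(size_t(w) * size_t(stripeRows));

    for (int y0 = 0; y0 < h; y0 += stripeRows) {
        const int rows = std::min(stripeRows, h - y0);
        for (int r = 0; r < rows; ++r)
            Pass1Row(&src[size_t(y0 + r) * w], &tmp[size_t(r) * w], w);
        for (int r = 0; r < rows; ++r)
            Pass2Row(&tmp[size_t(r) * w], &dst[size_t(y0 + r) * dw], w);
    }
    return dst;
}
```

Calling it with `stripeRows` equal to the image height degenerates into the naive whole-frame two-pass approach, and the output is identical either way; only the memory access pattern changes.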

Source    Dest    Pentium M 1.8GHz    Penryn 2.5GHz
XRGB1555 XRGB1555 632.32MP/sec 4044.64MP/sec
RGB565 XRGB1555 614.11MP/sec 2896.16MP/sec
RGB888 XRGB1555 102.98MP/sec 443.46MP/sec
XRGB8888 XRGB1555 338.51MP/sec 1503.78MP/sec
Y8 XRGB1555 234.12MP/sec 747.10MP/sec
UYVY XRGB1555 34.46MP/sec 126.74MP/sec
YUYV XRGB1555 34.46MP/sec 126.74MP/sec
YUV444 XRGB1555 110.97MP/sec 455.51MP/sec
YUV422 XRGB1555 94.33MP/sec 383.32MP/sec
YUV420 XRGB1555 86.44MP/sec 343.97MP/sec
YUV411 XRGB1555 62.96MP/sec 250.90MP/sec
YUV410 XRGB1555 50.78MP/sec 210.21MP/sec
XRGB1555 RGB565 546.83MP/sec 2277.56MP/sec
RGB565 RGB565 634.02MP/sec 4044.64MP/sec
RGB888 RGB565 102.84MP/sec 445.14MP/sec
XRGB8888 RGB565 302.31MP/sec 1379.94MP/sec
Y8 RGB565 228.20MP/sec 733.09MP/sec
UYVY RGB565 39.54MP/sec 148.85MP/sec
YUYV RGB565 39.53MP/sec 149.32MP/sec
YUV444 RGB565 266.58MP/sec 520.15MP/sec
YUV422 RGB565 225.57MP/sec 430.44MP/sec
YUV420 RGB565 206.14MP/sec 382.69MP/sec
YUV411 RGB565 148.95MP/sec 255.27MP/sec
YUV410 RGB565 120.36MP/sec 213.07MP/sec
XRGB1555 RGB888 193.56MP/sec 334.17MP/sec
RGB565 RGB888 193.56MP/sec 338.02MP/sec
RGB888 RGB888 985.67MP/sec 2826.37MP/sec
XRGB8888 RGB888 715.21MP/sec 1413.19MP/sec
Y8 RGB888 344.48MP/sec 454.63MP/sec
UYVY RGB888 172.11MP/sec 242.34MP/sec
YUYV RGB888 172.24MP/sec 242.09MP/sec
YUV444 RGB888 103.89MP/sec 141.92MP/sec
YUV422 RGB888 97.95MP/sec 135.76MP/sec
YUV420 RGB888 92.61MP/sec 128.54MP/sec
YUV411 RGB888 79.60MP/sec 113.71MP/sec
YUV410 RGB888 70.57MP/sec 103.94MP/sec
XRGB1555 XRGB8888 637.47MP/sec 1167.11MP/sec
RGB565 XRGB8888 687.94MP/sec 1221.82MP/sec
RGB888 XRGB8888 902.27MP/sec 1563.93MP/sec
XRGB8888 XRGB8888 744.73MP/sec 2152.19MP/sec
Y8 XRGB8888 358.70MP/sec 700.27MP/sec
UYVY XRGB8888 171.61MP/sec 239.38MP/sec
YUYV XRGB8888 171.86MP/sec 262.11MP/sec
YUV444 XRGB8888 278.61MP/sec 543.03MP/sec
YUV422 XRGB8888 238.65MP/sec 448.55MP/sec
YUV420 XRGB8888 216.01MP/sec 398.96MP/sec
YUV411 XRGB8888 153.23MP/sec 285.04MP/sec
YUV410 XRGB8888 123.27MP/sec 236.72MP/sec
XRGB1555 Y8 165.90MP/sec 325.82MP/sec
RGB565 Y8 177.58MP/sec 320.92MP/sec
RGB888 Y8 270.89MP/sec 503.41MP/sec
XRGB8888 Y8 267.49MP/sec 506.67MP/sec
Y8 Y8 2792.73MP/sec 7330.91MP/sec
UYVY Y8 787.21MP/sec 1111.80MP/sec
YUYV Y8 789.86MP/sec 1111.80MP/sec
YUV444 Y8 2759.87MP/sec 7108.76MP/sec
YUV422 Y8 2759.87MP/sec 7108.76MP/sec
YUV420 Y8 2792.73MP/sec 7108.76MP/sec
YUV411 Y8 2759.87MP/sec 7108.76MP/sec
YUV410 Y8 2759.87MP/sec 7108.76MP/sec
XRGB1555 UYVY 54.56MP/sec 98.28MP/sec
RGB565 UYVY 56.42MP/sec 100.68MP/sec
RGB888 UYVY 65.66MP/sec 115.50MP/sec
XRGB8888 UYVY 65.51MP/sec 114.16MP/sec
Y8 UYVY 541.78MP/sec 853.05MP/sec
UYVY UYVY 1457.07MP/sec 4044.64MP/sec
YUYV UYVY 617.34MP/sec 1015.54MP/sec
YUV444 UYVY 202.41MP/sec 367.69MP/sec
YUV422 UYVY 468.24MP/sec 781.96MP/sec
YUV420 UYVY 386.47MP/sec 634.02MP/sec
YUV411 UYVY 390.33MP/sec 644.48MP/sec
YUV410 UYVY 244.11MP/sec 410.84MP/sec
XRGB1555 YUYV 54.08MP/sec 97.14MP/sec
RGB565 YUYV 55.80MP/sec 99.61MP/sec
RGB888 YUYV 64.97MP/sec 112.51MP/sec
XRGB8888 YUYV 64.63MP/sec 112.14MP/sec
Y8 YUYV 551.97MP/sec 849.96MP/sec
UYVY YUYV 617.34MP/sec 1015.54MP/sec
YUYV YUYV 1457.07MP/sec 4044.64MP/sec
YUV444 YUYV 201.88MP/sec 368.85MP/sec
YUV422 YUYV 465.45MP/sec 784.58MP/sec
YUV420 YUYV 385.84MP/sec 634.02MP/sec
YUV411 YUYV 390.33MP/sec 648.04MP/sec
YUV410 YUYV 243.60MP/sec 413.01MP/sec
XRGB1555 YUV444 75.99MP/sec 136.15MP/sec
RGB565 YUV444 77.19MP/sec 140.81MP/sec
RGB888 YUV444 96.34MP/sec 164.28MP/sec
XRGB8888 YUV444 96.30MP/sec 163.82MP/sec
Y8 YUV444 678.00MP/sec 2277.56MP/sec
UYVY YUV444 351.18MP/sec 572.17MP/sec
YUYV YUV444 351.18MP/sec 570.78MP/sec
YUV444 YUV444 981.54MP/sec 2792.73MP/sec
YUV422 YUV444 849.96MP/sec 1533.26MP/sec
YUV420 YUV444 639.21MP/sec 1133.28MP/sec
YUV411 YUV444 284.01MP/sec 509.98MP/sec
YUV410 YUV444 148.76MP/sec 286.08MP/sec
XRGB1555 YUV422 62.59MP/sec 109.11MP/sec
RGB565 YUV422 63.59MP/sec 115.56MP/sec
RGB888 YUV422 76.34MP/sec 132.76MP/sec
XRGB8888 YUV422 75.33MP/sec 132.99MP/sec
Y8 YUV422 1091.11MP/sec 2969.48MP/sec
UYVY YUV422 445.14MP/sec 694.05MP/sec
YUYV YUV422 445.14MP/sec 689.97MP/sec
YUV444 YUV422 302.70MP/sec 612.50MP/sec
YUV422 YUV422 1457.07MP/sec 4044.64MP/sec
YUV420 YUV422 1101.36MP/sec 2039.90MP/sec
YUV411 YUV422 1066.31MP/sec 2005.03MP/sec
YUV410 YUV422 297.70MP/sec 566.64MP/sec
XRGB1555 YUV420 55.63MP/sec 100.04MP/sec
RGB565 YUV420 55.79MP/sec 102.62MP/sec
RGB888 YUV420 65.51MP/sec 123.21MP/sec
XRGB8888 YUV420 65.29MP/sec 123.08MP/sec
Y8 YUV420 1574.42MP/sec 4511.33MP/sec
UYVY YUV420 232.73MP/sec 383.32MP/sec
YUYV YUV420 232.50MP/sec 382.07MP/sec
YUV444 YUV420 186.33MP/sec 341.97MP/sec
YUV422 YUV420 402.38MP/sec 681.94MP/sec
YUV420 YUV420 1922.86MP/sec 5213.09MP/sec
YUV411 YUV420 339.49MP/sec 583.55MP/sec
YUV410 YUV420 1325.36MP/sec 2522.46MP/sec
XRGB1555 YUV411 65.77MP/sec 112.30MP/sec
RGB565 YUV411 66.87MP/sec 120.55MP/sec
RGB888 YUV411 81.23MP/sec 137.19MP/sec
XRGB8888 YUV411 80.61MP/sec 137.11MP/sec
Y8 YUV411 1404.72MP/sec 3449.84MP/sec
UYVY YUV411 274.69MP/sec 457.29MP/sec
YUYV YUV411 274.69MP/sec 463.61MP/sec
YUV444 YUV411 392.95MP/sec 704.47MP/sec
YUV422 YUV411 523.64MP/sec 953.61MP/sec
YUV420 YUV411 679.97MP/sec 1209.22MP/sec
YUV411 YUV411 1907.23MP/sec 4887.27MP/sec
YUV410 YUV411 539.29MP/sec 1024.41MP/sec
XRGB1555 YUV410 59.33MP/sec 104.68MP/sec
RGB565 YUV410 60.15MP/sec 107.31MP/sec
RGB888 YUV410 71.83MP/sec 119.75MP/sec
XRGB8888 YUV410 71.63MP/sec 119.87MP/sec
Y8 YUV410 2234.18MP/sec 5585.45MP/sec
UYVY YUV410 193.88MP/sec 313.20MP/sec
YUYV YUV410 193.88MP/sec 313.20MP/sec
YUV444 YUV410 247.98MP/sec 410.84MP/sec
YUV422 YUV410 295.08MP/sec 481.70MP/sec
YUV420 YUV410 592.40MP/sec 1101.36MP/sec
YUV411 YUV410 508.87MP/sec 811.73MP/sec
YUV410 YUV410 2495.63MP/sec 6516.36MP/sec
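To put the numbers in perspective against the clock speeds, the arithmetic below divides each speedup by the 2.5GHz/1.8GHz clock ratio. The three rows are copied from the table above; the `Result` struct and helper functions are just illustrative names.

```cpp
#include <cassert>
#include <cstdio>

// One row of the results table, in megapixels per second.
struct Result {
    const char* src;
    const char* dst;
    double pentiumM;
    double penryn;
};

// Clock ratio between the two machines: 2.5GHz / 1.8GHz, about 1.39.
constexpr double kClockRatio = 2.5 / 1.8;

// Raw throughput ratio, new machine over old.
double Speedup(const Result& r) {
    assert(r.pentiumM > 0.0);
    return r.penryn / r.pentiumM;
}

// Speedup with the clock difference factored out. Values above 1.0 mean the
// new machine is faster than the GHz ratings alone would predict.
double PerClockGain(const Result& r) {
    return Speedup(r) / kClockRatio;
}

// Sample rows copied from the table above.
static const Result kSamples[] = {
    { "XRGB1555", "XRGB1555", 632.32, 4044.64 },
    { "RGB888",   "YUV420",    65.51,  123.21 },
    { "YUV444",   "RGB888",   103.89,  141.92 },
};

void PrintSamples() {
    for (const Result& r : kSamples)
        std::printf("%-8s -> %-8s  speedup %.2fx  per-clock %.2fx\n",
                    r.src, r.dst, Speedup(r), PerClockGain(r));
}
```

For these rows the direct XRGB1555 copy comes out at roughly 6.4x, RGB888 to YUV420 at about 1.88x, and YUV444 to RGB888 at about 1.37x, so the gain varies a lot by path and the last one is actually slightly under the clock ratio.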
