UYVY vs. YUY2
Video formats on Windows are described primarily by a "four character code," or FOURCC for short. While most describe compressed formats such as Cinepak and various MPEG-4 variants, FOURCCs are also assigned for uncompressed YCbCr formats that aren't natively supported by the regular GDI graphics API. These FOURCCs are used to allow video codecs to recognize their own formats for decoding purposes, as well as to allow two codecs to agree on a common interchange format.
Two common YCbCr FOURCCs are UYVY and YUY2. They are interleaved, meaning that all YCbCr components are stored in one stream. Chroma (color) information is stored at half horizontal resolution compared to luma, with chroma samples located on top of the left luma sample of each pair; luma range is 16-235 and chroma range is 16-240. The formats are named after the byte order, so for UYVY it is U (Cb), Y (luma 1), V (Cr), Y (luma 2), whereas the luma and chroma bytes are swapped for YUY2 -- Y/U/Y/V.
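To make the two byte orders concrete, here is a minimal C++ sketch that unpacks one four-byte group from each format and converts a row between them by swapping adjacent bytes. The MacroPixel struct and the function names are made up for this illustration; they are not taken from any codec or from VirtualDub.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Two horizontally adjacent pixels that share one Cb/Cr pair.
struct MacroPixel {
    std::uint8_t y0, y1;  // luma of the left and right pixel
    std::uint8_t cb, cr;  // shared chroma, sited on the left pixel
};

// UYVY byte order: Cb, Y0, Cr, Y1
inline MacroPixel unpackUYVY(const std::uint8_t* p) {
    return MacroPixel{p[1], p[3], p[0], p[2]};
}

// YUY2 byte order: Y0, Cb, Y1, Cr
inline MacroPixel unpackYUY2(const std::uint8_t* p) {
    return MacroPixel{p[0], p[2], p[1], p[3]};
}

// Swap each pair of adjacent bytes in place. The same routine converts a
// row from UYVY to YUY2 and from YUY2 back to UYVY, without any loss.
inline void swapLumaChroma(std::uint8_t* row, std::size_t widthInPixels) {
    assert(widthInPixels % 2 == 0);  // both formats store pixels in pairs
    for (std::size_t i = 0; i < widthInPixels * 2; i += 2) {
        const std::uint8_t t = row[i];
        row[i] = row[i + 1];
        row[i + 1] = t;
    }
}
```

Because the swap is a pure byte permutation, applying it twice returns the original row.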
On Windows, it seems that YUY2 is the more common of the two formats -- Avisynth and Huffyuv prefer it, the MPEG-1 decoder lists it first, etc. Most hardware is also capable of using both formats. Ordinarily I would consider supporting only YUY2, except that the Adaptec Gamebridge device I recently acquired only supports UYVY. Now, when working with these formats in regular CPU based code, the distinction between these formats is minimal, as permuting the indices is sufficient to accommodate both. (In VirtualDub, conversion between UYVY and YUY2 is lossless.) When working with vector units, however, the difference between them can become problematic.
In my particular case, I'm looking at Direct3D-accelerated conversion of these formats to RGB, so the graphics card's vector unit is the pertinent one.
There are a few reasons I'm pursuing this path. One is that DirectDraw support on Windows Vista RTM seems to be pretty goofed up; video overlays seem to be badly broken on the current NVIDIA drivers for Vista, even with Aero Glass disabled. Second, I'm experimenting with real-time shader effects on live video, and want to eliminate the current CPU-based YCbCr-to-RGB conversion that occurs when Direct3D display is enabled in VirtualDub. Third, I've never done it before.
If you're familiar with Direct3D, you might wonder why I don't just use UYVY or YUY2 hardware support. Well, unfortunately, although YCbCr textures are supported by ATI, they're not supported on NVIDIA hardware. Both do support StretchRect() from a YCbCr surface to an RGB render target, but there are luma range problems when doing this. So it's down to pixel shaders.
Now, I have a bit of fondness for older hardware, and as such, I want this to work on the lowest pixel shader profile, pixel shader 1.1. The general idea is to upload the UYVY or YUY2 data to the video card as A8R8G8B8 data, and then convert that in the pixel shader to RGB data. The equations for converting UYVY/YUY2 data to RGB are as follows:
R = 1.164(Y-16) + 1.596(Cr-128)
G = 1.164(Y-16) - 0.813(Cr-128) - 0.391(Cb-128)
B = 1.164(Y-16) + 2.018(Cb-128)
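As a rough CPU reference for the three equations above, the following C++ sketch evaluates them for a single pixel. The names ycbcrToRgb and RGB8 are invented for this example, and rounding to nearest followed by clamping to 0-255 is an assumption, since the equations themselves don't say how out-of-range results are handled.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

struct RGB8 { std::uint8_t r, g, b; };

// Clamp to the displayable range, then round to the nearest integer.
inline std::uint8_t roundAndClamp(double v) {
    const double c = std::min(255.0, std::max(0.0, v));
    return static_cast<std::uint8_t>(std::lround(c));
}

// Convert one pixel using the coefficients quoted in the text.
inline RGB8 ycbcrToRgb(int y, int cb, int cr) {
    assert(y >= 0 && y <= 255 && cb >= 0 && cb <= 255 && cr >= 0 && cr <= 255);
    const double luma = 1.164 * (y - 16);
    const double r = luma + 1.596 * (cr - 128);
    const double g = luma - 0.813 * (cr - 128) - 0.391 * (cb - 128);
    const double b = luma + 2.018 * (cb - 128);
    return RGB8{roundAndClamp(r), roundAndClamp(g), roundAndClamp(b)};
}
```

For instance, Y=16 with Cb=Cr=128 comes out as black and Y=235 with Cb=Cr=128 as white, which matches the 16-235 luma range described earlier.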
As it turns out, this works out very well for UYVY. Cb and Cr naturally fall into the blue and red channels of the A8R8G8B8 texture; chroma green can be computed via a dot product and merged with a lerp. A little logic for selecting between the two luma samples based on even/odd horizontal position, and we're done. Heck, we can even use the bilinear filtering hardware to interpolate the chroma, too.
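A CPU-side sketch may help show how the UYVY bytes line up with the texture channels. It relies on the fact that an A8R8G8B8 texel is stored in memory as the bytes B, G, R, A, so the UYVY bytes Cb, Y0, Cr, Y1 land in B, G, R, A respectively. The chroma interpolation here is only an assumption about what bilinear filtering would give with suitably aligned texture coordinates: even pixels take the chroma sample as-is, odd pixels take the average of the two neighbouring samples, and the right edge is clamped. All names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct Texel { std::uint8_t b, g, r, a; };
struct Chroma { std::uint8_t cb, cr; };

// Read texel i of a row: UYVY bytes Cb, Y0, Cr, Y1 become B, G, R, A.
inline Texel texelAt(const std::uint8_t* row, std::size_t i) {
    const std::uint8_t* p = row + 4 * i;
    return Texel{p[0], p[1], p[2], p[3]};
}

// Even output pixels take luma from green, odd ones from alpha.
inline std::uint8_t lumaAt(const std::uint8_t* row, std::size_t x) {
    const Texel t = texelAt(row, x / 2);
    return (x & 1) ? t.a : t.g;
}

// Chroma for output pixel x, linearly interpolated for odd pixels.
inline Chroma chromaAt(const std::uint8_t* row, std::size_t width, std::size_t x) {
    assert(width % 2 == 0 && x < width);
    const std::size_t i = x / 2;
    const Texel t0 = texelAt(row, i);
    if ((x & 1) == 0) return Chroma{t0.b, t0.r};  // sited on the sample
    const std::size_t last = width / 2 - 1;
    const Texel t1 = texelAt(row, i < last ? i + 1 : last);  // clamp at edge
    return Chroma{static_cast<std::uint8_t>((t0.b + t1.b + 1) / 2),
                  static_cast<std::uint8_t>((t0.r + t1.r + 1) / 2)};
}
```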
YUY2, however, is more annoying because Cb and Cr fall into green and alpha, respectively. Pixel shader 1.1 is very restricted in the channel manipulation available and instructions can neither swizzle the RGB channels nor write to only some of them; also, there is no dp4 instruction for including alpha in a dot product in 1.1. Just moving the scaled Cb and Cr into position consumes two of the precious eight vector instructions:
def c0, 0, 0.5045, 0, 0 ;c0.g = Cb_to_B_coeff / 4
def c1, 1, 0, 0, 0.798 ;c1.rgb = red | c1.a = Cr_to_R_coeff / 2
dp3 r0.rgb, t1_bx2, c0 ;decode Cb (green) -> chroma blue / 2
+ mul r0.a, t1_bias, c1.a ;decode Cr (alpha) -> chroma red / 2
lrp r0.rgb, c1, r0.a, r0 ;merge chroma red
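To check the arithmetic of those three instructions, here is a small C++ emulation. It assumes the documented ps1.1 semantics: _bx2 computes 2*(x-0.5), _bias computes x-0.5, and lrp computes src0*src1 + (1-src0)*src2. Channel values are normalized to 0-1, hardware precision and range limits are ignored, and Float4 and the function names are invented for this sketch.

```cpp
#include <cassert>

struct Float4 { float r, g, b, a; };

inline float bx2(float x)  { return 2.0f * (x - 0.5f); }  // _bx2 modifier
inline float bias(float x) { return x - 0.5f; }           // _bias modifier

// t1 is a YUY2 texel read as A8R8G8B8: B=Y0, G=Cb, R=Y1, A=Cr.
inline Float4 decodeChromaYUY2(const Float4& t1) {
    assert(t1.g >= 0.0f && t1.g <= 1.0f && t1.a >= 0.0f && t1.a <= 1.0f);
    const Float4 c0{0.0f, 0.5045f, 0.0f, 0.0f};  // c0.g = 2.018 / 4
    const Float4 c1{1.0f, 0.0f, 0.0f, 0.798f};   // c1.a = 1.596 / 2

    // dp3 r0.rgb, t1_bx2, c0 : only green (Cb) has a non-zero weight
    const float d = bx2(t1.r) * c0.r + bx2(t1.g) * c0.g + bx2(t1.b) * c0.b;
    Float4 r0{d, d, d, 0.0f};

    // + mul r0.a, t1_bias, c1.a : scaled Cr, co-issued in the alpha pipe
    r0.a = bias(t1.a) * c1.a;

    // lrp r0.rgb, c1, r0.a, r0 : c1.rgb = (1,0,0), so only red is replaced
    r0.r = c1.r * r0.a + (1.0f - c1.r) * r0.r;
    r0.g = c1.g * r0.a + (1.0f - c1.g) * r0.g;
    r0.b = c1.b * r0.a + (1.0f - c1.b) * r0.b;
    return r0;
}
```

After these steps r0.r holds 0.798*(Cr-0.5), which is half the red chroma term, and r0.b holds 1.009*(Cb-0.5), half the blue one. Green still carries a copy of the blue term at this point and would have to be overwritten by later instructions.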
The net result is that so far, my YUY2 shader requires one instruction pair more than the UYVY shader. I don't know if this is significant in practice, since the underlying register combiner setup of a GeForce 3 is very different and considerably more powerful than Direct3D ps1.1 -- it can do dot(A,B)+dot(C,D) or A*B+C*D in one cycle -- but I have no idea how effective the driver is at recompiling the shader for that architecture.
(If you're willing to step up to a RADEON 8500 and ps1.4, all of this becomes moot due to availability of channel splatting, arbitrary write masks, and four-component dot product operations... but where's the fun in that!?)
It seems that, at least for vector units without cheap swizzling, UYVY is a better match for BGRA image formats than YUY2 due to the way that channels line up. I've been trying to think of where YUY2 might be more appropriate, but the best I can come up with is ABGR, which is a rare format. The other possibility is that someone was doing a weird SIMD-in-scalar trick on a CPU that involved taking advantage of the swapped channels; doing an 8 bit shift on an 80286 or 68000 would have been expensive.
On the other hand, selecting a proper Y sample can be done by one dp3 with a filter texture, so it's not that bad.
Haali - 04 01 07 - 01:52
True. UYVY only needs one extra alpha instruction for that, though, and I usually have more alpha slots than color slots available.
Phaeron - 04 01 07 - 23:22
Why don't you use pre-calculated 3D lookup tables (16 MB) for YUV-to-RGB conversion? It works in real time on the CPU even in HD and can probably be adapted to the graphics card for a speedup.
tv-player - 12 01 07 - 08:14
A bit hard on the cache, don't you think? Also, it's 48MB (256x256x256x3). Considering the work it takes to pack the indices together for a 3D lookup on the CPU, and the cache thrashing, a much smaller set of Y/Cb/Cr tables would be competitive. The trick is to pack the RGB impacts 12:10:10 in three 1K tables, do rgb = Y_table[Y] + Cb_table[Cb] + Cr_table[Cr], and then split out the channels.
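One way the packed-table idea could look is sketched below in C++. The details are assumptions made for illustration and not the actual VirtualDub code: red sits in the top 12 bits, green and blue in 10 bits each, every contribution is rounded to an integer, and a bias of 300 per channel is folded into the Y table so that no field of the final sum goes negative. Each table has 256 four-byte entries, which is where the 1K size comes from.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

struct YCbCrTables { std::uint32_t y[256], cb[256], cr[256]; };
struct RGB8 { std::uint8_t r, g, b; };

const int kBias = 300;  // assumed offset that keeps every channel sum >= 0

// Pack signed contributions modulo 2^32. Negative fields borrow from the
// field above, but the borrow cancels once the biased Y entry is added.
inline std::uint32_t pack(int r, int g, int b) {
    return (static_cast<std::uint32_t>(r) << 20) +
           (static_cast<std::uint32_t>(g) << 10) +
            static_cast<std::uint32_t>(b);
}

inline int roundToInt(double v) { return static_cast<int>(std::lround(v)); }

inline YCbCrTables buildTables() {
    // Worst-case rounded sums (blue spans -277..534) must fit in 10 bits.
    assert(kBias - 277 >= 0 && kBias + 534 < 1024);
    YCbCrTables t;
    for (int i = 0; i < 256; ++i) {
        const int luma = roundToInt(1.164 * (i - 16)) + kBias;
        t.y[i]  = pack(luma, luma, luma);
        t.cb[i] = pack(0, roundToInt(-0.391 * (i - 128)), roundToInt(2.018 * (i - 128)));
        t.cr[i] = pack(roundToInt(1.596 * (i - 128)), roundToInt(-0.813 * (i - 128)), 0);
    }
    return t;
}

inline std::uint8_t clampByte(int v) {
    return static_cast<std::uint8_t>(std::min(255, std::max(0, v)));
}

// Three lookups and two adds, then split the fields and remove the bias.
inline RGB8 convert(const YCbCrTables& t, std::uint8_t y, std::uint8_t cb, std::uint8_t cr) {
    const std::uint32_t sum = t.y[y] + t.cb[cb] + t.cr[cr];
    const int r = static_cast<int>(sum >> 20) - kBias;
    const int g = static_cast<int>((sum >> 10) & 0x3FF) - kBias;
    const int b = static_cast<int>(sum & 0x3FF) - kBias;
    return RGB8{clampByte(r), clampByte(g), clampByte(b)};
}
```

Because each term is rounded separately, results can differ from a full-precision conversion by a count or so; spending the spare bits of the 12-bit field on fractional precision would be one way to tighten that.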
As for the graphics card, you'd need at least pixel shader 1.4 to do that, as ps1.1 isn't flexible enough (you can't lerp Y, bias, or lookup with alpha). You'd also need a 64MB 3D texture for full accuracy unless you had trilinear filtering, which requires a fairly new graphics card... NVIDIA's had it for a while, but I think you need an X300+ to get it on ATI. And you still have to do the chroma interpolation and the luma switching.
Phaeron - 12 01 07 - 23:37
One beginner question :-)
How can I efficiently move YUV data from a YUV surface to an RGB surface, so that it can be used as input to a pixel shader based conversion?
I'm trying to implement the color conversion routines described here to avoid the color conversion inside VMR9 (get the YUV surface from it, copy it, and then convert it with pixel shaders for display). Right now I have it done by StretchRect...
Piotr Wozniak - 26 01 07 - 05:07
That depends on how you're processing it. I simply uploaded the data to an A8R8G8B8 texture, so that Y1/Cb/Y2/Cr ended up in B, G, R, and A channels, respectively. There isn't a good way to handle the interleaved formats. If you use UYVY/YUY2 textures, or R8G8_B8G8/G8R8_G8B8, you're at the mercy of how the GPU handles interpolation of the chroma components (usually, badly). If you use A8L8 then luma is easier, but chroma interpolation is harder. The planar formats are easy, because you can just use A8 or L8.
Phaeron - 26 01 07 - 05:15
Output will always be A8R8G8B8, for rendering only. For inputs, I think YUY2, NV12 and YV12 are a must, since these seem to be the most commonly used in DirectShow decoders.
And if you are saying to just copy the data, then the RGB surface is not completely filled with data? Do I understand that correctly?
Piotr Wozniak - 26 01 07 - 06:07
OK, it was quite easy to get it working, but I have performance problems. I'm using memcpy to copy data from one locked surface to another. The code looks like this:
unsigned int * pSrc = (unsigned int *)lrectsource.pBits;
unsigned int * pDst = (unsigned int *)lrectdest.pBits;
const int srclinesize = m_VideoWidth * 2;
for (int i = 0; i < m_VideoHeight; i++)
{
    memcpy((void *)pDst, (void *)pSrc, srclinesize); // copy one row of YUY2 (2 bytes per pixel)
    pDst += lrectdest.Pitch / 4;   // Pitch is in bytes; the pointers step in 4-byte units
    pSrc += lrectsource.Pitch / 4;
}
And it is very slow! With a small 640*272 surface it takes 100% of my P4 2.4... When I replace memcpy with memset it drops to 20% (with decoding). Hmm, is there any way to speed this up?
Piotr Wozniak - 28 01 07 - 15:36
Not sure what type your source surface is, but your perf problems would seem to indicate that you are reading from video or AGP memory, which is VERY slow. These would include: Direct3D surfaces, Direct3D textures in default pool (especially dynamic textures), and DirectDraw surfaces. If this is the case, then you should override the allocator to force system memory surfaces.
Phaeron - 28 01 07 - 16:08
Well, the source YUY2 surface is a plain offscreen surface created in the default pool; that is what VMR9 requests, so there is not much choice here. The destination ARGB surface comes from a texture that is also created in the default pool, and dynamic to make it lockable. I don't know if I have any possibility of avoiding this copy. As far as I understand, the surface for VMR9 must be in device memory to be accessible to the driver for DXVA operation (for me this is mostly for deinterlacing).
And since the only way to copy surface to surface in device memory is StretchRect... Argh, it looks like writing my own renderer wasn't such a bad idea ;-)
Piotr Wozniak - 29 01 07 - 02:48
Yup, you're definitely reading from video memory, which means uncached reads. You need to avoid that GPU-to-CPU copy.
You do have the option of splitting the allocator chain at your filter. This would allow you to perform the copy on the GPU and then copy that into the surface that VMR9 wants. The problem you have is that there is no good way in Direct3D to copy VRAM-to-VRAM into an offscreen plain surface, which is what VMR9 frequently uses. If you can place a custom allocator-presenter on VMR9 -- which is already sketchy if you're trying to make a standalone filter -- then you might be able to force render target textures, which you can copy into either via rendering or by UpdateSurface/Texture(), and also exposes default pool surfaces for its mip levels. Problem is, the chances of the driver being able to create a DXVA render target texture are basically zero. You might have a better chance if you could control the format and type of the surfaces, but VMR9's Allocator-Presenter interface is a bit too inflexible in the way that it can request pretty much anything from you.
There is always the GetRenderTargetData() route, which can produce a respectable 400MB/sec on a fast AGP machine, but unfortunately it's difficult to call that and not stall the pipeline every time. OpenGL gives you a lot more freedom in copying textures and surfaces on the GPU than Direct3D9 does. :-/
Phaeron - 29 01 07 - 04:26
On my system VMR9 requests only offscreen surfaces, nothing else, no matter what the input is. When I allocate in another pool, nothing is rendered on the surface (the driver doesn't have access to it?).
So combining proper, manual color conversion (avoiding StretchRect) with DXVA is not possible. There is always copying of data between GPU and CPU. For DXVA we need an offscreen surface on the GPU, but to manually transfer it to another surface without color conversion it must be copied to the CPU and back to the GPU...
Piotr Wozniak - 29 01 07 - 05:15
I have found something interesting. In the VIDEOINFOHEADER2 structure there is a field named dwControlFlags, which was previously named dwReserved1 and required to be zero. AMCONTROL_COLORINFO_PRESENT can be set in dwControlFlags, and the field can then be cast to a DXVA_ExtendedFormat structure, where very detailed information about the input format can be filled in.
I don't know if this information is passed by any decoder at the moment, or how it is interpreted by VMR9, but it could be useful.
Piotr Wozniak - 31 01 07 - 05:14
I'm going to try using GLSL for YUYV (YUY2) to RGB colour conversion. After thinking about it for a while, I think I shall do this:
Load the raw YUYVYUYV... data into two textures - once as L8A8 and again as RGBA8. Then stretch the RGBA one horizontally by 2, load the textures into two texture units, and use a fragment shader to combine them.
As far as I can see it should work and interpolate properly. Comments?
Tim - 16 07 07 - 09:53
Yeah, that should work -- both NVIDIA and ATI support A8L8 natively. The trick is not uploading the texture twice to the video card. It's impossible in Direct3D, but I think you could do it in OpenGL by pushing it to a pixel buffer object and loading the textures from that. Trouble is, I don't think ATI supports PBOs. Dunno if it'd be faster than just reading the texture directly, either.
Phaeron - 17 07 07 - 02:01