§ ¶GPU acceleration of video processing
I've gotten to a stable enough point that I feel comfortable in revealing what I've been working on lately, which is GPU acceleration for video filters in VirtualDub. This is something I've been wanting to try for a while. I hacked a video filter to do it a while back, but it had the severe problems of (a) only supporting RGB32, and (b) being forced to upload and download immediately around each instance. The work I've been doing in the past year to support YCbCr processing and to decouple the video filters from each other cleaned up the filter system enough that I could actually put in GPU acceleration without significantly increasing the entropy of the code base.
There are two problems with the current implementation.
The first problem is the API that it uses, which is Direct3D 9. I chose Direct3D 9 as the baseline API for several reasons:
- It's the API I'm most familiar with, by far.
- The debug runtime is much more thorough than what I've had available with other APIs.
- PIX and NVPerfHUD are free.
- It runs on just about any modern video card.
- Shaders have well-defined profiles, are portable between graphics card vendors, and use standardized byte code.
On top of this sit a 3D portability layer and then the filter acceleration layer (VDXA). The API for the low-level layer is designed so that it could be retargeted to Direct3D 9Ex, D3D10, or OpenGL; the VDXA layer is much more restricted in feature set, but adds easier-to-use 2D abstractions on top. The filter system, in turn, has been extended so that it inserts filters as necessary to upload or download frames from the accelerator and can initiate RGB<->YUV conversions on the graphics device. So far, so good...
...except for getting data back off the video card.
There are only two ways to download non-trivial quantities of data from the video card in Direct3D 9, which are (1) GetRenderTargetData() and (2) lock and copy. In terms of speed, the two methods are slow and pathetically slow, respectively. GetRenderTargetData() is by far the preferred method nowadays as it is decently well optimized to copy down 500MB/sec+ on any decent graphics card. The problem is that it is impossible to keep the CPU and GPU smoothly running in parallel if you use it, because it blocks the CPU until the GPU completes all outstanding commands. The result is that you spend far more time blocking on the GPU than actually doing the download and your effective throughput drops. The popular suggestion is to double-buffer render target and readback surface pairs, and as far as I can tell this doesn't help because you'll still stall on any new commands that are issued even if they go to a different render target. This means that the only way to keep the GPU busy is to sit on it with the CPU until it becomes idle, issue a single readback, and then immediately issue more commands. That sucks, and to circumvent it I'm going to have to implement another back end to see if another platform API is faster at readbacks.
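Stepping back for a moment, the chain restructuring mentioned above -- inserting upload and download filters wherever the frame crosses a CPU/GPU boundary -- can be sketched roughly like this. The names and types here are purely illustrative, not VirtualDub's actual API:

```cpp
#include <string>
#include <vector>

// Hypothetical sketch: walk a filter chain and insert transfer nodes
// wherever a CPU filter feeds a GPU filter or vice versa.
enum class Device { CPU, GPU };

struct Filter {
    std::string name;
    Device device;
};

std::vector<Filter> InsertTransfers(const std::vector<Filter>& chain) {
    std::vector<Filter> out;
    Device cur = Device::CPU;   // frames start in system memory
    for (const Filter& f : chain) {
        if (f.device == Device::GPU && cur == Device::CPU)
            out.push_back({"upload", Device::GPU});
        else if (f.device == Device::CPU && cur == Device::GPU)
            out.push_back({"download", Device::CPU});
        out.push_back(f);
        cur = f.device;
    }
    if (cur == Device::GPU)     // final frame must come back for encoding
        out.push_back({"download", Device::CPU});
    return out;
}
```

The last step is where the readback pain below comes in: a chain ending in a GPU filter always forces at least one download per frame.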
The other problem is that even after loading up enough filters to ensure that readback and scheduling are not the bottlenecks, I still can't get the GPU to actually beat the CPU.
I currently have six filters accelerated: invert, deinterlace (yadif), resize, blur, blur more, and warp sharp. At full load, five out of the six are faster on the CPU by about 20-30%, and I cheated on warp sharp by implementing bilinear sampling on the GPU instead of bicubic. Part of the reason is that the CPU has less of a disadvantage on these algorithms: when dealing with 8-bit data using SSE2, it has 2-4x the bandwidth it gets with 32-bit float data, since the narrower data types pack 2-4x more parallelism into 128-bit registers. The GPU's texture cache also isn't as advantageous when the algorithm simply walks regularly over the source buffers. Finally, the systems I have for testing are a bit lopsided in terms of GPU vs. CPU power. For instance, take the back-of-the-envelope calculations for the secondary system:
- GPU (GeForce 6800): 2600Mpix/sec * 4 components/vector = 10.4 billion operations/sec
- CPU (Pentium M): 1.86GHz * 8 components/vector/clock = 14.9 billion operations/sec
It's even worse for my primary system (which I've already frequently complained about):
- GPU (Quadro NVS 140M): 3200Mpix/sec * 4 components / vector = 12.8 billion operations/sec
- CPU (Core 2): 2.5GHz * 16 components / vector / clock = 40 billion operations/sec (single core)
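These estimates are nothing more than rate times SIMD width; a trivial sketch of the arithmetic, for anyone who wants to plug in their own hardware:

```cpp
// Back-of-the-envelope peak throughput from the figures above.
// GPU: pixel fill rate (Mpix/sec) times components per vector.
// CPU: clock (GHz) times SIMD components processed per clock.
double GpuOpsPerSec(double mpixPerSec, double componentsPerVector) {
    return mpixPerSec * 1e6 * componentsPerVector;
}

double CpuOpsPerSec(double ghz, double componentsPerClock) {
    return ghz * 1e9 * componentsPerClock;
}
```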
There are, of course, a ton of caveats in these numbers, such as memory bandwidth and the relationship between theoretical peak ops and pixel throughput. The Quadro, for instance, is only about half as fast as the GeForce in real-world benchmarks. Still, it's plausible that the CPU isn't at a disadvantage here, particularly when you consider the extra overhead in uploading and downloading frames and that some fraction of the GPU power is already used for display. I need to try a faster video card, but I don't really need one for anything else, and more importantly, I no longer have a working desktop. But then again, I could also get a faster CPU... or more cores.
The lesson here appears to be that it isn't necessarily a given that the GPU will beat the CPU, even if you're doing something that seems GPU-friendly like image processing, and particularly if you're on a laptop where the GPUs tend to be a bit underpowered. That probably explains why we haven't seen a huge explosion of GPU-accelerated apps yet, although they do exist and are increasing in number.
Download from the GPU could be overlapped with the compression of the previous frame, and I don't really know why, but CopyResource in DX10 is a lot faster than GetRenderTargetData in DX9.
Gabest - 27 05 09 - 03:38
But in general, writing a fast GPU shader is much easier and clearer than writing optimized assembly to do the same thing. I'll take the faster development time for the same performance in a heartbeat.
And, of course, if there are no blocking frame dependencies, parallelizing CPU and GPU filters across frames would let each do what they're better at.
And once you have an algorithm that needs to operate in floating-point internally, or has texture access patterns less friendly to the CPU ...
Glenn Maynard - 27 05 09 - 03:39
Another idea: lock and copy with SSE4's movntdqa. If I understand it correctly, this instruction was made exactly for that task, but they are probably already using it deep inside GetRenderTargetData ^_^
Gabest - 27 05 09 - 03:45
... and to add something to the GPU vs. CPU debate, I'm a bit disappointed that AVX only extends the ISA to floats and Larrabee's smallest data unit will be 32 bits. 16 bits per color component would double the bandwidth for free, but they only like floating point nowadays.
Gabest - 27 05 09 - 03:53
Have you also investigated using systems like CUDA instead? They seem more suited to this task, but support has been a bit flaky when I last looked at them (a few years back).
sagacity - 27 05 09 - 03:55
CUDA has a number of different optimized asynchronous data streaming options; you might want to investigate them. (You can easily combine CUDA and DX.) Of course, if you're hoping to be GPU/CPU independent, maybe waiting for OpenCL might be a better plan. (Any minute now... AMD and NVIDIA both have dev drivers available (CPU/GPU).) Comparing those old GPU performances isn't really a great comparison either; you'll find the new generation of GPUs will far outperform the CPU for image processing operations, even with the data copying overheads.
Adrian - 27 05 09 - 04:01
Amusingly in games people are working on switching back from GPU processing to other types of processing.
CUDA filters can be faster than normal GPU filters, and will be much faster on future CUDA hardware.
On the PS3, SPU filters are way faster than RSX filters.
On LRB, filters designed to run as x86 code will be way faster than filters that are written as GPU shaders.
The shader model is just not well designed for filter-like operations.
For my product Oodle I'll be providing some filter-like ops that can run either as CPU code in little work fibers, or as GPGPU code on hardware. I'm guessing that most people on modern CPUs like Core i7 will have more free CPU cycles than GPU cycles.
cb - 27 05 09 - 12:30
I agree that it's easier to make a decently optimized shader than a CPU routine, which is one of the reasons I'm investigating this path. If possible, I'd like to support CPU emulation as well. One thing I'm finding, though, is that debugging the shaders is harder. NVPerfHUD doesn't work on my primary machine, so I'm forced to use PIX, and debugging a shader with PIX is a lot more annoying than debugging CPU code with Visual Studio. I find pixel shader debugging in Visual Studio to be almost useless since you have to use refrast and you can't select the pixel you want to debug.
I disagree that shaders are a poor fit for filters -- they're a very good fit for convolution filters, which can implement a wide variety of processing algorithms. The main thing missing is the ability to apply a texel offset in the shader instead of having to pass 1/w and 1/h directly. I believe the Xbox 360 has this via a tex2DOffset() intrinsic in its xps_3_0 profile.
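To make the "very good fit" concrete, here is a plain C++ reference for the per-pixel kernel a 3x3 convolution shader evaluates, with coordinate clamping standing in for the sampler's CLAMP addressing mode. This is a hypothetical illustration, not code from VirtualDub:

```cpp
#include <algorithm>
#include <vector>

// Each output pixel is a weighted sum of its 3x3 neighborhood -- exactly
// the computation a convolution pixel shader performs per fragment.
// Edge clamping mimics what the texture unit gives you for free.
std::vector<float> Convolve3x3(const std::vector<float>& src, int w, int h,
                               const float k[9]) {
    std::vector<float> dst(src.size());
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            float acc = 0.0f;
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    int sx = std::min(std::max(x + dx, 0), w - 1);  // CLAMP
                    int sy = std::min(std::max(y + dy, 0), h - 1);
                    acc += src[sy * w + sx] * k[(dy + 1) * 3 + (dx + 1)];
                }
            }
            dst[y * w + x] = acc;
        }
    }
    return dst;
}
```

On the GPU, the nine taps become nine texture fetches at offsets of +/-1/w and +/-1/h, which is exactly where the missing texel-offset intrinsic would help.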
Problems with CUDA, from what I can tell so far:
- NVIDIA only.
- GeForce 8 and up only.
- Opaque compiled format, which would make it impossible to retarget CUDA filters to any other API.
- Debugging problems. You can recompile a CUDA kernel to run on the CPU, but the documentation says this is only possible with the high level API, and I would probably have to use the low level API.
- Undefined threading model. I need to be able to start and stop the acceleration engine on a worker thread and it may not be the same thread each time.
Whether or not it would be faster than a pixel shader depends on whether the texture unit is the bottleneck. Faster readback would be nice, but is less important if the GPU ceases to be the bottleneck (I can poll events to prevent the readback from stalling the CPU, limiting the problem to synchronization).
OpenCL looks more interesting, but it isn't widely available yet. NVIDIA's implementation is still in closed beta. It also looks like the program distribution model for non-embedded systems is source code, which is scary.
Because the plan is to export this to the plugin API, the requirements are significantly tighter than for VirtualDub itself: I need an API that is widely available, easy to program, and that will have a long lifetime. Asking plugin authors to write their algorithm six times and test it on five machines is not an option, nor do I want to deal with filter chains that have filters using five different acceleration APIs.
Phaeron - 27 05 09 - 14:50
While I was working in a completely different area (and used CUDA), my experience was that the GPU is _only_ faster because (i.e. when) it has a vastly higher memory bandwidth (at much higher latency though, that's why you need obscene amounts of parallelism).
Why? Because if the calculation is complex, you often don't get the parallelism you need or waste performance in some other way in the GPU implementation, particularly if you can't use the hardware as efficiently because you are using shaders and not CUDA (do I understand you right that you even have to use floats and can't do calculations on byte arrays?). If the calculation is simple, the CPU is definitely bound by the memory bandwidth.
So if you have the kind of laptop GPU that uses the main memory, it's usually completely useless.
And even the "proper" GPUs in laptops often will have a scaled-down memory connection.
Just in case memory bandwidth is the issue, you have the additional problem that uploading and re-reading the image from the graphics card can already use up as much memory bandwidth as calculating the filter on the CPU.
Even though my project was (mostly) successful, I think it is a really thin line of algorithms where the GPU can be used _really_ successfully.
Reimar - 03 06 09 - 06:26
Not sure I agree, because the GPU really does have a lot of compute bandwidth... when dealing with 32-bit floats. Where it gets screwed is when you're dealing with smaller quantities.
As far as I can tell from the CUDA and PTX documentation, and what I know about the GeForce 8 architecture, dealing with quantities smaller than 32-bit only has one benefit, and that's memory bandwidth. The ALUs are all scalar and don't work with widths below 32-bit -- the best you can get is to work with 16-bit ints and even then you're still only working on one value at a time. You can load bytes and words, but not work on them any faster. That means that when the compute width drops to bytes, the CPU suddenly gets a 4x boost because it can slice its ALUs and the GPU can't. The CPU also appears to have other interesting advantages in this area. I use averaging a lot at the byte level, which is implemented in modern CPUs for MPEG prediction and executes extremely fast since it's just a perversion of the adder; on the GPU, you have to emulate it with add/mul/mad ops instead and can end up slightly behind.
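The byte-level averaging mentioned above maps to a single SSE2 instruction (pavgb, exposed as the `_mm_avg_epu8` intrinsic), which processes 16 components per instruction; a minimal sketch of how a CPU routine uses it:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>
#include <cstdint>

// Rounded average of two rows of bytes: dst[i] = (a[i] + b[i] + 1) >> 1.
// The SIMD path handles 16 components per pavgb; the GPU, with 32-bit
// scalar ALUs, has to emulate this one value at a time.
void AverageRows(const uint8_t* a, const uint8_t* b, uint8_t* dst, size_t n) {
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i*)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i*)(b + i));
        _mm_storeu_si128((__m128i*)(dst + i), _mm_avg_epu8(va, vb));
    }
    for (; i < n; ++i)  // scalar tail for leftover pixels
        dst[i] = (uint8_t)((a[i] + b[i] + 1) >> 1);
}
```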
CUDA has the advantage of a sane and asynchronous readback API, an easier programming model, on-chip mutable storage, and direct linear memory access. Fragment programs (pixel shaders), on the other hand, might have access to a ROP that CUDA programs don't, with free packing in particular being potentially lucrative. I think what it comes down to is whether the texture units are a bottleneck that can be overcome with CUDA direct memory access, which can pull contiguous memory very quickly. Compute-wise I don't see a difference. I haven't played around with CUDA enough to be able to tell, but one thing's becoming clear: it'll be more of a pain due to the alignment requirements (esp. compute model pre-1.2). Problems dealing with edge conditions also resurface, which is otherwise very nicely handled by the texture unit, and one of the biggest PITAs I have to deal with when writing CPU routines.
One thing that I noticed is that on the GeForce 8, most integer ops seem to run at full speed compared to FP equivalents. That means SWAR tricks for doing narrow integer 2-vectors or 4-vectors in 32-bit integer math become potentially lucrative. Unfortunately, with 32-bit ALUs, it's not that interesting; averaging 4x8, for instance, takes four ops and is a wash. I seem to recall that ATI relegates more integer ops to scarce units, too, so doing this in DX10 is probably a good way to really tank a bunch of graphics cards.
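For reference, one common formulation of that SWAR trick computes a rounded average of four packed bytes in a 32-bit integer without letting carries cross byte boundaries:

```cpp
#include <cstdint>

// Rounded per-byte average of two 4x8 packed values, using the identity
// avg(a,b) = (a|b) - floor((a^b)/2): the OR supplies the common bits plus
// the sum of differing bits, and the masked shift removes the half of the
// differing bits that would otherwise be counted twice. The 0x7F mask
// keeps the shift from leaking a bit across each byte boundary.
uint32_t AvgPacked4x8(uint32_t a, uint32_t b) {
    return (a | b) - (((a ^ b) >> 1) & 0x7F7F7F7Fu);
}
```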
Phaeron - 04 06 09 - 04:10
First, when I used it, the readback API of CUDA was not that sane: you either had to add explicit sleeps or it would busy-wait for the GPU to complete. Probably still better than what you have to use for shaders, but still rather annoying when there's work to do for the CPU.
And I hope I never claimed that using byte-wise calculation/CUDA would improve calculation speed (though they would for operations like shifts, xor, and, or etc., the kind of thing that is SIMD-able with pure C types and code).
My argument was about memory bandwidth. Though I just realized that this might be nonsense, since shaders can read from 8-bit-per-component textures and write to framebuffers in such a format.
Having shared memory directly accessible might still help, and IMO eliminates most problems due to alignment requirements. I also think the alignment requirements do not exist when you go through the texture cache, i.e. using texture reads (which also fixes your edge conditions). To be honest, that seemed to me like the only good use for texture reads, since the texture cache has the same latency and bandwidth as the main memory - at least on the bigger cards. Well, that, and maybe when you can't figure out how to use the shared memory efficiently for caching.
Reimar - 06 06 09 - 06:09
I wouldn't touch CUDA with a 10-foot pole -- I am sick of proprietary closed APIs which don't work on all hardware. On the other hand, writing a filter for the CPU guarantees that it will work on Larrabee.
What pisses me off is that not a single video codec author has recognized the opportunity to add a hook for deinterlacing into the encoding pipeline, so we don't have to do motion estimation twice, and that nobody has considered offloading motion estimation for encoding to the GPU.
Igor Levicki (link) - 17 06 09 - 11:48
Your comment that "I disagree that shaders are a poor fit for filters -- they're a very good fit for convolution filters, which can implement a wide variety of processing algorithms" has just given me a completely insane idea for a project I've been thinking about.
For a while now I've been toying with the idea of capturing the output from a Gamecube or a Wii on my computer and then writing a program that controls the console based on the video output (possibly by then getting the computer to pretend to be a cube/wii controller). I was going to just grab the video with a TV card and then process it on the CPU, but given that I'm mainly going to be looking for patterns (and if I remember my Computer Vision course right, convolution is a good way to do that), I've just had the idea of using a graphics card with onboard video capture. The idea is that the onboard capture chip dumps the video straight into video memory, where I can then run shaders on it looking for whatever pattern I'm after. I then just read the result off the card (most likely a greyscale texture) and use that as the input to whatever console-controlling program I end up with.
The only trouble is the only card I have with onboard capture is a GeForce 4. In any case, it's a completely insane idea and so must be tried!
Torkell (link) - 26 06 09 - 18:38
One problem with doing that is that I don't know of a way to get capture data directly into a place where CUDA or a shader could access it. The hardware is almost certainly capable of doing it, even if the capture device is separate, but the OS and APIs don't allow it. Well, at least on Windows. Linux and V4L might.
Phaeron - 27 06 09 - 00:07