Pipelines and performance - virtualdub.org

¶Pipelines and performance

VirtualDub, although not really designed for multi-CPU systems, is moderately multithreaded. On an average render it will use about 4-5 threads for UI, reading from disk, processing, and writing to disk. Trying to keep these threads busy is a challenge and to do so VirtualDub's rendering engine is pipelined -- all of the stages attempt to work in parallel with queues between them. The idea is that you add enough buffering between the different threads that they are all working on different places in the output and the stages only block on the single bottleneck within the system, which is usually either disk (I/O) bandwidth or CPU power.
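To make the idea of stages with queues between them concrete, here is a minimal sketch in Python. It is not VirtualDub's code (which is C++); the stage names, the `run_pipeline` helper, and the "multiply by two" stand-in for processing are all made up for illustration. The only point is that each bounded queue lets one stage run ahead of the next by at most `buffers` items before it blocks.

```python
import queue
import threading

SENTINEL = None  # hypothetical end-of-stream marker


def reader(out_q, frames):
    """Stand-in for the disk read stage."""
    for frame in frames:
        out_q.put(frame)      # blocks while the queue is full
    out_q.put(SENTINEL)


def processor(in_q, out_q):
    """Stand-in for the filtering/compression stage."""
    while True:
        frame = in_q.get()    # blocks while the queue is empty
        if frame is SENTINEL:
            out_q.put(SENTINEL)
            return
        out_q.put(frame * 2)  # placeholder for real work


def writer(in_q, sink):
    """Stand-in for the disk write stage."""
    while True:
        frame = in_q.get()
        if frame is SENTINEL:
            return
        sink.append(frame)


def run_pipeline(frames, buffers=4):
    """Run the three stages in parallel with bounded queues between them."""
    read_q = queue.Queue(maxsize=buffers)
    write_q = queue.Queue(maxsize=buffers)
    sink = []
    threads = [
        threading.Thread(target=reader, args=(read_q, frames)),
        threading.Thread(target=processor, args=(read_q, write_q)),
        threading.Thread(target=writer, args=(write_q, sink)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sink
```

Calling `run_pipeline(list(range(10)), buffers=2)` pushes ten "frames" through all three stages in order; whichever stage is slowest ends up being the one everything else waits on.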

In VirtualDub, pipelining parameters are set under Options > Performance in the editing mode; in capture mode, you can adjust disk buffering parameters in Capture > Disk I/O. But what do the parameters mean, and how should they be set?

First, a classic laundry analogy: think of pipeline stages in terms of your washer and dryer. The reason your washer and dryer are separate instead of being one washer-dryer amalgam is that you can get more done by keeping both busy: while one load is drying, you can start the next one in the washer. However, there are some gotchas to this:

The washer and dryer probably don't run at the same speed, so one of them will always be the bottleneck. If your washer takes longer than the dryer, the dryer will be idle some of the time (pipeline stall). If your dryer takes longer, you'll sometimes have clothes sitting in the washer (underutilization).
The pipeline only runs as fast as its slowest stage. If your washer takes 30 minutes and your dryer takes an hour, getting a faster washer isn't going to speed up four loads of laundry as much as a faster dryer.
You can keep the washer and dryer running more often if you have an extra basket lying around. That way, you can empty the washer and fill the dryer as soon as they're ready, and you can merge or split loads as necessary. (When I was a college student, I quickly learned that a full washing machine will work about as well as a half-empty one, but a full dryer takes forever compared to a half-empty dryer, so sometimes it pays to dry smaller loads. Part of the reason is that you can empty the lint trap more often.) However, having ten baskets instead of one or two doesn't help and is just a waste of space.
There is some overhead to shuffling clothes into the washer, from the washer to the dryer, and out of the dryer; the longer you take, the longer the units sit idle and the lower the overall throughput. This means that putting one load through actually takes slightly longer than if you had both in one unit. The longer the washer and dryer take, though, the less your movement delay matters. One way to make each run longer relative to the shuffling is to do bigger loads.
It's embarrassing to store up dirty clothes for three weeks and then have to tell people you can't do something because you have to spend the next three hours doing laundry.

Okay, what does this have to do with anything? I ain't washin' my video!

The first two points are about bottlenecks and are very important. If you are doing a high-bandwidth operation with Huffyuv or uncompressed video and only a light amount of processing, you are almost certainly going to be disk bound. This means that your CPU utilization is actually going to drop well below 100% because reading from and writing to the disk is the bottleneck. At this point you largely don't care about MMX/SSE2 optimized code or other kinds of CPU-based performance optimizations, because the only thing they'll do is make your CPU run cooler. Conversely, if you're using a ton of video filters and your render is running at 1 fps, it probably isn't so bad to have your video files stored over the network, because the network will keep up just fine.
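The washer/dryer numbers from the list above are easy to work through. The sketch below is plain arithmetic for a two-stage pipeline, assuming there is always a basket available between the stages and ignoring shuffling overhead; the function names are invented for this example.

```python
def serial_minutes(loads, wash, dry):
    """Total time if each load is washed and dried before the next one starts."""
    return loads * (wash + dry)


def pipelined_minutes(loads, wash, dry):
    """Total time when the washer and dryer overlap.

    The first load has to pass through both stages; after that, one
    finished load comes out every max(wash, dry) minutes, because the
    slower stage sets the pace.
    """
    if loads == 0:
        return 0
    return wash + dry + (loads - 1) * max(wash, dry)


baseline = pipelined_minutes(4, wash=30, dry=60)       # 270 minutes
faster_washer = pipelined_minutes(4, wash=15, dry=60)  # 255 minutes
faster_dryer = pipelined_minutes(4, wash=30, dry=30)   # 150 minutes
```

Halving the washer time saves 15 minutes over four loads; halving the dryer time saves two hours. The same thing happens when you optimize the CPU side of a disk-bound render.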

The third point is a bit stretched, but has to do with insufficient pipelining and the fact that the pipeline stages process discrete blocks, not a continuous stream. If you don't have enough buffering between two stages then the pipeline will flood when the first stage writes to it, causing the first stage to stall, and then empty when the second stage pulls data from it, causing the second stage to stall. When this happens you essentially are locking stages together (serialization) and losing the benefits of pipelining. This happens most often in VirtualDub when doing very fast Direct Stream Copy operations, where the render engine can process thousands of frames per second on a highly-compressed file because it's doing nothing but copying them into the write buffer. In this case having only 32 buffers is likely to be a bottleneck because the render engine is going to exhaust the pipeline buffer very quickly and then stall while waiting for the disk to bring in the next batch, so you'll want more buffers. Another way you get stalls like this is if you're filtering compressed video with frame rate decimation on, because the engine will decompress a bunch of frames in a burst and then sit down for a while to do some image processing. However, if you're processing 640x480 uncompressed video with some filters you really don't want 256 pipeline buffers because neither the disk nor the render engine can process frames that fast and you'll potentially waste a ton of RAM, as much as hundreds of megabytes. Using so many pipeline buffers that you swap is really counterproductive.
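For a sense of scale on the RAM cost, here is the arithmetic for the 640x480 case. It assumes 24-bit RGB at 3 bytes per pixel and one whole frame per pipeline buffer; neither is stated above, so treat the exact numbers as an illustration only.

```python
def frame_bytes(width, height, bytes_per_pixel=3):
    """Size of one uncompressed frame (3 bytes/pixel = 24-bit RGB, an assumption)."""
    return width * height * bytes_per_pixel


def pipeline_ram_mib(buffers, width, height, bytes_per_pixel=3):
    """RAM needed if every pipeline buffer holds one full frame, in MiB."""
    return buffers * frame_bytes(width, height, bytes_per_pixel) / (1024 * 1024)


small = pipeline_ram_mib(32, 640, 480)   # 28.125 MiB
large = pipeline_ram_mib(256, 640, 480)  # 225.0 MiB
```

With 32 buffers the pipeline costs under 30MB; with 256 it is well into the hundreds of megabytes, for frames that neither the disk nor the render engine can chew through quickly anyway.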

The fourth point has to do with block size. You want to process data in big chunks to minimize the management and shuffling overhead. The primary culprit in this department is I/O. Windows likes to do I/O in relatively small chunks (4K-64K), so VirtualDub has quite a bit of logic to defeat buffered I/O and force the kernel to do larger blocks. Between blocks the hard disk may have to seek somewhere else and that can cost 20-50ms, so you want to keep the blocks large. I like to aim for no more than about 10 blocks per second when tuning capture I/O, so when I capture uncompressed or Huffyuv I'll typically use a 1MB block size.
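A rough way to see why small blocks hurt: count how many blocks per second a given data rate produces, then pretend the disk has to seek between every one of them. That is a deliberately pessimistic assumption, and the 10MB/s capture rate below is a made-up figure chosen to match the "10 blocks per second at 1MB" rule of thumb, not a measurement.

```python
MIB = 1024 * 1024
KIB = 1024


def blocks_per_second(data_rate_bytes, block_bytes):
    """How many I/O requests per second a given data rate turns into."""
    return data_rate_bytes / block_bytes


def worst_case_seek_ms(data_rate_bytes, block_bytes, seek_ms=20):
    """Milliseconds of seeking per second of data, if every block needed a seek."""
    return blocks_per_second(data_rate_bytes, block_bytes) * seek_ms


rate = 10 * MIB                                   # hypothetical capture data rate
small_blocks = blocks_per_second(rate, 64 * KIB)  # 160 requests per second
large_blocks = blocks_per_second(rate, 1 * MIB)   # 10 requests per second
small_seek = worst_case_seek_ms(rate, 64 * KIB)   # 3200 ms of seeking per second
large_seek = worst_case_seek_ms(rate, 1 * MIB)    # 200 ms of seeking per second
```

In the worst case, 64K blocks would need more than three seconds of seeking for every second of video, which obviously can't keep up, while 1MB blocks leave most of each second for actually moving data.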

The fifth point is even more contrived but has to do with blocks that are too large -- specifically, I/O blocks. There are diminishing returns to larger and larger block sizes and there are latency problems as well. 1MB is likely to work just as well as 2MB, and gives the system more flexibility in scheduling. If you configure VirtualDub to do 8MB block writes to disk and then try to open Internet Explorer you're going to be waiting a while because every time IE wants to read a 4K file off disk it has to wait for 8MB of data to go through! The long latency associated with waiting for a large disk write also has implications for buffering efficiency. A disk buffer that is configured as 8 x 1MB can buffer more effectively than one split as 2 x 4MB, even though they're both 8MB, because with the 2 x 4MB buffer the front end can't put anything into the buffer until an entire 4MB block is finished, whereas with the 8 x 1MB buffer the front end can write and do something else as soon as 1MB has gone through. What this means is that there's a sweet spot for disk block size -- bigger isn't always better.
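The latency side can be put into numbers the same way. The sketch below assumes a disk that sustains 20MB/s, which is a hypothetical figure picked only to make the arithmetic readable; plug in whatever your drive actually does.

```python
MIB = 1024 * 1024


def block_write_ms(block_bytes, disk_bytes_per_sec):
    """How long one block ties up the disk, in milliseconds."""
    return 1000 * block_bytes / disk_bytes_per_sec


disk_rate = 20 * MIB  # hypothetical sustained write speed

wait_8mb = block_write_ms(8 * MIB, disk_rate)  # 400 ms before anything else gets a turn
wait_4mb = block_write_ms(4 * MIB, disk_rate)  # 200 ms until a 2 x 4MB buffer frees a slot
wait_1mb = block_write_ms(1 * MIB, disk_rate)  # 50 ms until an 8 x 1MB buffer frees a slot
```

Under that assumption, another program's 4K read can sit behind an 8MB write for 400ms, and a front end feeding a full 2 x 4MB buffer waits four times longer for free space than one feeding a full 8 x 1MB buffer, even though both buffers hold the same 8MB.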

So, what signs should you look out for?

Highly fluctuating frame rate during a render may be a sign of insufficient buffering. Figure out whether you're CPU or disk bound, and then examine Task Manager or the HDD light. If you're CPU bound, you want that sucker at 100% CPU; if you're disk bound, you want that light always on and for the HDD to not make too much noise seeking. The real-time profiler will also help diagnose this, by showing if the threads are serializing against each other in chunks at a time rather than running smoothly in parallel. Generally, though, the disk and memory buffer defaults are fine; one of the queues is usually empty and the other full, and the buffers are large enough to cover any momentary bumps. As noted above, though, there are exceptions; in the direct stream case you should consider using more memory buffers, and if you're working with really big video frames, like 640x480 uncompressed, consider a larger file write buffer.
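The serialization pattern is easy to reproduce in a toy model. The simulation below is not how VirtualDub schedules anything; it is a made-up tick-based model in which a producer delivers one frame per tick into a bounded queue, and a consumer gathers a burst of frames and then goes away to process them for a while. All parameter names and defaults are invented for the example.

```python
def simulate(capacity, burst=8, work_ticks=8, total_ticks=1000):
    """Return (frames_done, producer_stalls, consumer_stalls) for a toy pipeline."""
    queued = 0           # frames waiting in the queue
    collected = 0        # frames the consumer has gathered for its current burst
    busy = 0             # ticks of processing the consumer still has left
    done = 0             # frames fully processed
    producer_stalls = 0  # ticks where the producer found the queue full
    consumer_stalls = 0  # ticks where the consumer was waiting for frames

    for _ in range(total_ticks):
        # Producer: one frame per tick unless the queue is full.
        if queued < capacity:
            queued += 1
        else:
            producer_stalls += 1

        # Consumer: either processing a burst or collecting the next one.
        if busy > 0:
            busy -= 1
            if busy == 0:
                done += burst
        else:
            take = min(queued, burst - collected)
            queued -= take
            collected += take
            if collected == burst:
                collected = 0
                busy = work_ticks
            else:
                consumer_stalls += 1

    return done, producer_stalls, consumer_stalls


done_small, _, waits_small = simulate(capacity=2)
done_large, _, waits_large = simulate(capacity=16)
```

With a queue of 2, the producer fills it and stalls while the consumer is busy, and then the consumer stalls while the producer catches up, so the two stages take turns instead of overlapping. With a queue of 16 the producer keeps working through the consumer's busy period, a full burst is waiting when the consumer comes back, and more frames get done in the same number of ticks.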

If you're dropping frames during capture when CPU utilization is low and the bandwidth isn't that high, you may be using too large a disk I/O block size, such that the huge writes are disruptive. This is unlikely to happen with the defaults, but might if you're capturing to a slow device or to a highly fragmented drive, in which case you should consider lowering the block size. In VirtualDub 1.6.1 or higher, you can use the real-time profiler to detect this case, as it will show up as the audio or video channels blocking for long periods of time on a write. Note that there is currently a problem in VirtualDub's AVI write routines in that they periodically extend the file ahead of the current write position; this greatly reduces seeks to the directory and free space bitmap on disk, but unfortunately I recently found out that file extension is a synchronous operation even if overlapped I/O is used. (1.6 uses overlapped I/O under NT to pack disk requests back-to-back.) 1.6.3 does file extension in a separate thread and shouldn't show the video/audio channels blocking on writes unless the disk hitches for a moment and backs up the buffer.

If your render operation is projected to take 14 hours and its speed is better measured in SPF than FPS, just leave the buffering settings alone. Tweaking them isn't likely to speed anything up.

7 comments | Dec 19, 2004 at 03:34 | default
