¶Filter system changes in VirtualDub 1.9.1
A major part of the work I did in VirtualDub 1.9.1 was to rework the filter system in order to support filters running in an N:1 frame configuration rather than lock-step 1:1 mode. This is particularly important for filters that need to reference a frame window, which is tricky to do in earlier versions. Making this change involved a lot more rework than I had anticipated and I learned a lot along the way.
To review, the filter system in 1.9.0 and earlier is based on a lock-step pipeline model where one frame goes in, each filter runs exactly once, and one frame comes out:
frame = input[i];
for each filter F:
frame = F.run(frame);
output[i] = frame;
This is relatively simple, and also memory efficient: only two main frame buffers are ever active at a time, and thus they can be overlapped to save space. In 1.4x, this was extended to accommodate internal delays by tracking the delays of all the filters and compensating for it at the end. In 1.8.x, this was further extended to support frame rate changing, resulting in the slightly more complex flow:
delay = 0;
for each filter F in reverse:
i = F.prefetch(i);
delay += F.get_delay();frame = input[i];
for each filter F:
frame = F.run(frame);output[i + delay] = frame;
A lot could be done with this model, but there were several shortcomings -- filters that did change the frame rate tended to do ugly things to upstream filters due to requesting duplicate frames, and lag wasn't compensated on rendering, not on the timeline. In addition, the need for filters to buffer frames internally to support windows added complexity and slowed down rendering speed.
I should point out some of the similarities to Microsoft DirectShow at this point. DirectShow also uses a push model, but it has two additional weapons that improve its performance. One is that each filter can control whether it passes frames to the downstream and how many frames are passed, which accommodates variable frame rates. The second is that it uses refcounted buffers with flexible allocators, which means that filters can cache frames as they pass through simply by calling AddRef() on them. I had thought about switching to a model like this, but decided against it as it still had some of the same shortcomings with regard to random access and serialization between filters.
In 1.9.1, the filter system was rewritten from a push model to a pull model, with each filter being modeled as a function with a frame cache after it. The pipeline then runs a series of independent filters that all use logic like this:
for each new request:
if the requested frame is in the cache:
return the cached frame
build a list of source frames needed
issue a prefetch request to the upstream filter for each source frame
queue the requestwait until a request in the queue is readyif any source frames have failed on the request:
fail this request
process the frame
cache the result
This model is similar to that used by Avisynth, in which each filter fetches frames from the upstream and then pushes a result frame downstream through a cache. The main difference, however, is that fetching source frames and processing those into an output frame are decoupled. The reason I did this is that it allows filters to run in parallel, which is particularly important for the head of the filter chain where frames are read from disk and you don't want to stall frame processing on I/O. The 1.9.1 pipeline queues up to 32 frame requests with default settings, and that keeps the I/O thread busy without stalling the processing thread as in earlier versions.
Those of you who have been tracking the VirtualDub source may have noticed a period of time when I had attempted to rewrite the filter system in this fashion before and had shipped it as unused modules in the source. The main problem with my earlier attempt was that I had tried to make the entire filter system multithreaded, which turned into a nightmare, both in terms of stability and memory usage. The stability problem comes with the need to make nearly everything thread-safe, which is difficult when you run into situations like a filter request completing on another thread while the RequestFrame() function is still running. The memory usage problem comes into play when upstream filters are able to produce frames faster than the downstream can consume them -- more on that later. A third problem is that although I got the new filter system running, it was never completely integrated into the rest of the engine, nor did it ever get to the point of fully supporting the existing filter API. A new and better system that doesn't cleanly interface with anyone who needs to use it is pretty useless. As a result, I eventually abandoned that branch and continued evolving the existing filter pipeline through 1.9.0.
There are, of course, a number of subtleties to getting a system like this working.
In-place filters. VirtualDub supports two main buffering modes for video filters, swap and in-place. In swap mode, filters are asked to process a source frame buffer to a separate destination buffer; in in-place mode, the filter receives only one frame buffer and modifies the pixels in-place. The choice of mode is up to the filter and whichever one is better depends on the filter algorithm. When it is convenient to implement, in-place mode is more efficient because it only uses one buffer, which means less memory for the CPU caches to deal with and less pointers to maintain in the inner loop. Filters that are purely a function on each pixel value or are doing simple rendering on top of the video can usually work this way. Caching frames throws a kink into this mode, however, because in-place processing destroys the source frame and that means it can't be cached. This can be a severe performance problem if two or more clients are pulling frames from a filter and one of the downstream clients is an in-place filter. 1.9.1 solves this by requiring that frame requests indicate whether a writable frame is needed and through a predictor that tracks frames even after they've been evicted from the cache. If within a certain window more than one request arrives for a frame and at least one of the requests is a writable request, the predictor marks the frame as shareable and in-place filters do a copy instead of stealing the source buffer.
Caching and memory usage. Caching is wonderful for avoiding redundant work, but Raymond Chen reminds us that a bad caching policy is another name for a memory leak. Doing a little more caching than necessary is fine when you're dealing with 100 byte strings; it's a bit more problematic when you are caching 3MB video frames. In 1.9.1, the frame caches are primarily intended to avoid redundant frame fetching at the local level, and thus have an aggressive trimming policy: the allocators periodically track the high watermark for referenced frames and continuously trim down to working set without allowing for speculative caching. This results in memory usage close to 1.8.8/1.9.0, and I have some allocator merging improvements in 1.9.2 to improve this further. I may allow for speculative caching in the future, but I'm not a fan of the "use 50% of physical memory" method of caching -- that generally leads to wasteful memory usage and also pretty bad swapping if three applications each decide to take 50%. Instead, I'd probably try to borrow some algorithms from virtual memory literature in order to predict cache hit rates based on past allocation pattern, since tracking frame requests is cheap compared to storing and processing the frames themselves.
Frame allocation timing. As I noted earlier, VirtualDub prefetches multiple frames in advance in order to keep the pipelines full. Initially I had rigged the filter system to allocate result frame buffers as soon as the requests came in, and that was a huge mistake as it caused the application to exceed 300MB when all 32 frame requests immediately allocated frame buffers all through the filter chain. The key to solving this turned out to be twofold: allocate result buffers on the fly as frames are processed, and always give downstream filters priority in execution order. The combination of having the upstream filters allocate frames as late as possible and the downstream filters process and release them as soon as possible results in memory usage that is no longer proportional to the number of requests in flight. Note that this simple strategy only works if only one filter can run at a time, as in the parallel case an upstream filter can continue to run and chew through frames -- that's a bridge I'll have to cross later.
Memory allocation strategy. The easiest way to implement the frame allocators is simply to use new/malloc to allocate the frame buffers. If you try that, though, you quickly find that fragmentation and expansion of the memory heap is a problem. Early 1.9.1 builds used that strategy and the result that VirtualDub very quickly exceeded 50MB+ and stayed there even after it had shut down the filter chain. The final version uses VirtualAlloc() for allocating frames over a certain size, which largely sidesteps the problem since the OS is forced to commit and decommit pages; this is fairly easy since the buffers are fixed size, and the allocators recycle buffers to avoid excessive allocation traffic. Small buffers still go into the heap, which I may fix at some point with bundling.
I have some ideas on future work in the filter system, too. As usual, I can provide no guarantees as to if or when any of this might be implemented.
Bugs. Yeah, it's buggy in some areas, most notably filters that declare a lag (delay). The known issues will be fixed in 1.9.2 and definitely before the 1.9.x branch goes stable. (Side note: It turns out I forgot to cross-integrate the 1.8.8 fixes into 1.9.1. Oopsie.)
Multi-threading. The VirtualDub filter API doesn't allow an individual filter instance to be run in parallel, but it does allow separate instances within the same chain to execute concurrently, because filter instances never talk to each other. Current versions don't do this and serialize everything except disk I/O in the frame fetcher. I once tried to multithread the filter system by making the entire filter system thread-safe, which was a mess I'm not keen to repeat. The way I would try to do it now would be to keep the entire filter system single-threaded, including all frame management and sequencing, and only farm out individual calls to runProc() on filters.
32-bit/64-bit interop. An annoyance with 64-bit applications in general on Windows is that they can't use 32-bit DLLs. That includes codecs and filters in this case. One thing that would be nice would be the ability to use 32-bit filters from the 64-bit version. Doing this requires a mechanism for interprocess communication (IPC), as well as getting the frames across the process barrier. Copying frames through IPC is an expensive proposition, so using a shared memory mapping between the processes is likely the way to go. That requires that the frames be allocated through CreateFileMapping(), which is a big change from heap allocation, but trivial if they're currently allocated via VirtualAlloc() (see "memory allocation strategy" above).
Frame caching to disk. It'd be nice to be able to cache frames to disk when the multiple passes are required and the cost of reading and writing a frame to disk is a lot lower than computing it. Unfortunately, this is difficult to do currently because although a filter can tell what requests may be coming via the prefetch function, it isn't able to track which of those are in flight. I'd need to allow some sort of tagging system to allow this.
Avisynth compatibility. This is something that I kept in mind, although I haven't actually tried to do it. Currently Avisynth is able to run some older VirtualDub filters through a wrapper on the Avisynth side, although it doesn't support the new prefetch features. I tried to make the new API compatible such that it would be possible to make a dual mode filter that work directly as an Avisynth filter through a layer that calls the prefetch half of the filter, fetches the frames, and then runs the processing half. This layer may be something I add to the Plugin SDK at a later date. Going the other way -- VirtualDub natively running Avisynth filters -- is more difficult since the merged fetch/process GetFrame() function is incompatible with the split prefetch/process model. I had experimented in the past with using fibers to suspend and resume Avisynth filters to some success, but doing this fully requires the ability to do a late prefetch from the runProc() function and push the frame back into the waiting queue, which the filter system currently doesn't support.
3D hardware acceleration. I've wanted to come up with a general API for this for a long time, but couldn't ever come up with something I liked. Multithreading support is a preferred dependency for getting this running, but the main problems are (a) which API to use and (b) shaders. I don't like either OpenGL or Direct3D straight for 2D work as there are too many sharp corners and opportunities for API usage errors, so I'd really like to wrap it. Shaders, however, throw a huge kink into the works because I haven't found a shader compilation path that would work well. I don't like the idea of having filters embed Direct3D shader bytecode, and for various reasons I do not want to have a dependency on D3DX or Cg in the main application. GLSL is promising, but I've found that OpenGL implementations on Windows tend to be lousy in general at reporting errors.