Capture timing and capture sync
When I first started doing digital video in Windows, I started with video capture, since I was looking for ways to abuse the TV-tuner style PCI card I had just gotten. Back then I think I was running Windows 3.1 on an 80486 CPU, so just capturing postage stamp size video (160x120) was taxing, and half-size video (320x240) was possible only on a good day with a tail wind — and, of course, enough hard drive space, which I didn't have nearly enough of. And that was assuming that the cosmos was nice enough to give me enough contiguous memory below 640K for the framebuffers, which occasionally was thwarted by a particularly frustrating sound driver.
Thus, at the time, I wasn't particularly concerned about audio sync drift.
Nowadays I have far more CPU, disk, and memory bandwidth than is necessary to support analog video capture, and I can dump 720x480 video uncompressed to disk at full frame rate without problems. When you capture for more than an hour at a time, however, suddenly that little 0.5% deviation in clock rate starts to show up in the video, and tweaking the frame rate of the output over and over until it looks and sounds right starts to get tiresome. Soon, capture programs with audio synchronization capabilities were born, and VirtualDub eventually joined the pack in tardy fashion. These attempted to analyze timing on the fly and correct for deviations, thus bringing audio and video closer in sync. And usually, they were more than adequately successful.
VirtualDub's current audio synchronization mechanism sometimes works, and sometimes doesn't. I'm working on tweaking it again within the 1.6.x series to improve its effectiveness. In the meantime, here's some of what I've learned along the way.
To play back captured audio and video in synchronization, you need to know which audio samples match up with which video samples, or equivalently, be able to match both up to a common time base. Timestamps are the most direct way to do this, but a rigid correspondence between samples and time also works, which is how AVI does it. So all we have to do is record the timestamp at which each video sample or audio block arrives and we're done, right?
Well, not exactly.
Sure, AVICap (VFW) and DirectShow both give you timestamps with at least 1ms resolution, but the actual accuracy of the timestamps can be rather bad, off by as much as 10ms. And for audio devices, you may not get timestamps at all. Already we're at a disadvantage, since we have to infer the audio timestamps, probably by trying to read the video or system clock as soon as we get a block of audio samples. That adds additional latency which we have to account for and which, more importantly, we don't actually know. Not to mention that there is noise introduced by the time between the audio interrupt and when we actually receive the block and take the timestamp. So basically we have audio and video timestamps of rather dubious quality to work with.
A more serious problem is that you don't actually know when the two streams start relative to each other. Hopefully they're as close as possible, but there's some error to account for there. More on this later.
The final, and actually biggest, problem is that the audio and video devices don't capture at the same rate. The video capture device is using a clock of some sort, but it has some small error from real time. The audio capture device also uses a clock (a different one from video, unless you have both combined in one unit), and that has a different error from real time. Rather than capturing at 44100Hz exactly, it's actually producing audio samples at 44089.1Hz or something slightly off, a tiny error that can start to become visible after an hour or so. On older cards, this error can be much larger; I had a card once that was as far off as 43400Hz. To correct for this error, we need to estimate the difference in rate between the streams and correct one or both streams to compensate.
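To put rough numbers on that, here is a small Python sketch of how much sync error a rate mismatch builds up. It assumes, purely for illustration, that the video clock is exact and that the file is played back at the nominal 44100Hz; the function name drift_seconds is made up for this example.

```python
def drift_seconds(nominal_hz, actual_hz, capture_seconds):
    """Audio/video offset accumulated over a capture, in seconds.

    Assumption: the file is labeled nominal_hz, but the device really
    produced actual_hz samples per second of real time. Played back at
    the nominal rate, the audio runs short (or long) by this amount.
    """
    samples_captured = actual_hz * capture_seconds
    playback_seconds = samples_captured / nominal_hz
    return capture_seconds - playback_seconds


one_hour = 3600.0
slight = drift_seconds(44100.0, 44089.1, one_hour)
bad_card = drift_seconds(44100.0, 43400.0, one_hour)

print(f"44089.1Hz device: {slight:.2f} s short after an hour")
print(f"43400Hz device:   {bad_card:.1f} s short after an hour")
```

Under those assumptions the 44089.1Hz device ends up a bit under a second off after an hour, and the 43400Hz card is off by most of a minute.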
Measuring the stream timing
The direct way to measure rate is to take the period of video samples or audio blocks, which is the difference between adjacent timestamps. Do this once, though, and you'll get a really noisy sample; an error of 5ms out of 33ms in a video timestamp is huge. An imprecise clock can be one source of problems, background task interference is another, and audio buffering is a third. A subtle fourth source of error is that the capture program can only service audio and video synchronously; in particular, if video takes a long time to compress, it can delay the receipt and timestamping of the next audio block. I think that with DirectShow it may be fairly easy to push the actual audio and video processing to another thread, which would allow the capture thread to timestamp everything very quickly, but I haven't tried it yet.
It goes without saying, but obviously Windows XP is not a real-time operating system (RTOS).
There is a similar problem with determining the start of the stream, as just taking the first sample is similarly noisy. To get this error down, multiple measurements have to be combined to produce a statistic with lower variance and thus better accuracy. Averaging is the usual way to do this, which scales the standard deviation by 1/sqrt(N) for N data points. However, in the case of the frame period, it doesn't work:
$$\bar{x} = \frac{(x_1 - x_0) + (x_2 - x_1) + \cdots + (x_N - x_{N-1})}{N}$$
$$\bar{x} = \frac{x_N - x_0}{N}$$
All of the adjacent differences cancel and what you end up with is the difference between the last and first samples divided by the number of frames. The estimate therefore rests entirely on two noisy timestamps; the samples in between contribute nothing, and the noise in that end-to-end difference doesn't shrink as more frames arrive.
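The cancellation is easy to check numerically. The following Python sketch builds some simulated timestamps (the 29.97fps rate and the +/-5ms uniform jitter are invented for the example, not measurements) and compares the average of adjacent differences against the plain endpoint calculation.

```python
import random


def mean_of_differences(timestamps):
    """Average of all adjacent timestamp differences."""
    diffs = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return sum(diffs) / len(diffs)


def endpoint_estimate(timestamps):
    """(last - first) / number of intervals, ignoring everything between."""
    return (timestamps[-1] - timestamps[0]) / (len(timestamps) - 1)


rng = random.Random(1)            # fixed seed so the run is repeatable
true_period = 1000.0 / 29.97      # milliseconds per frame (assumed)
stamps = [i * true_period + rng.uniform(-5.0, 5.0) for i in range(301)]

avg = mean_of_differences(stamps)
end = endpoint_estimate(stamps)
print(f"mean of differences: {avg:.6f} ms")
print(f"endpoint estimate:   {end:.6f} ms")
```

The two numbers agree to within floating-point rounding, which is the telescoping sum at work: the 299 interior timestamps have no effect on the result.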
Is the latency difference significant? Usually, not really. A 10ms constant error is hardly noticeable. On a USB-based capture device with hardware compression, though, I've seen the video stream delayed by 300ms compared to audio. You don't want to run an audio resampler with that kind of estimation error.
The solution I came up with for 1.6.2+ is to compute a linear regression fit to the samples instead, which produces a slope (the frame period) and an intercept (the start of the stream). This assumes that the actual frame period relative to the time base is constant over time, which it isn't (clocks can drift in speed due to changes in temperature), but it seems to work well enough, and more importantly, if the linear fit is good the variance in the outputs drops as more frames are captured. We can do this for both the audio and video streams and determine both the relative difference in starting point and rate. This isn't necessary, though; all we're actually concerned about is the current error between the two streams, and constantly driving that difference to zero. This takes care of both differences in start and rate.
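As an illustration of the idea (not VirtualDub's actual code), here is a minimal least-squares fit in Python with the frame index on one axis and the timestamp on the other. The name fit_line and the simulated data (a 250ms start offset and +/-5ms jitter) are assumptions made up for the example.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


rng = random.Random(7)
true_period = 1000.0 / 29.97      # ms per frame (assumed)
true_start = 250.0                # ms at which frame 0 really began (assumed)

frames = list(range(300))
stamps = [true_start + i * true_period + rng.uniform(-5.0, 5.0) for i in frames]

# The slope estimates the frame period, the intercept the stream start.
period, start = fit_line(frames, stamps)
print(f"estimated period: {period:.4f} ms (true {true_period:.4f})")
print(f"estimated start:  {start:.2f} ms (true {true_start:.2f})")
```

Unlike the endpoint calculation, every timestamp contributes to the fit, so individual jitter gets averaged down as more frames come in.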
Missing frames are most straightforwardly dealt with just by looking for abnormally long delays between frames and inserting pad frames in the output file to take up the extra time. This is standard operating procedure for many capture programs in Windows, and contributes to the evil "dropped frames" statistic. If your system can't keep up with the incoming frames, the hardware will start dropping frames to make room for new ones, and so you'll get a bunch of holes in your output file, leading to an unsatisfying, stuttery video. Note that this doesn't necessarily mean that the audio sync in the file is screwed; this is a common misconception. The purpose of putting padding frames in the output file when dropped frames have been inferred is to space out the frames that were captured to match the actual timing as closely as possible. As an example, have VirtualDub process a video in Direct Stream Copy mode with it set to convert to a 120fps frame rate. You'll find that about three-quarters of the frames in the output file are pad frames (zero-byte), but audio sync will be preserved.
Unfortunately, the heavens aren't so nice as to deliver a clean, wide 67ms timestamp delta where there is one frame missing; noisy timestamps mean the exact delta can vary quite a bit, and even worse, interesting glitches in the signal can mean that the delay isn't an integral number of frames. That means the error has to be tracked cumulatively over time, to keep subframe errors from building up and gradually wrecking audio sync. Doing this ensures that the error doesn't grow larger than half a frame, at the cost of a periodic single-frame glitch. An alternative solution is to tweak the video timing to cover the small errors, basically resampling both audio and video. This can work for small rate errors or field drops, and I'm experimenting with such a system, but it's important to have an escape hatch that turns off the system and forces pad frames in case of a bad glitch, like a half-second interruption. I've found that if this isn't done, the resamplers go nuts trying to cover the gap between frames, resulting in some chipmunkiness in the audio.
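Here is one way the difference between per-gap rounding and cumulative tracking could look, as a hedged Python sketch. Both function names are hypothetical, the timestamps are contrived, and a real capture program would also need the escape hatch described above; this only shows why the running position matters.

```python
def naive_pad_frames(timestamps_ms, period_ms):
    """Round each gap on its own: sub-frame leftovers are silently lost."""
    pads = [0]
    for prev, cur in zip(timestamps_ms, timestamps_ms[1:]):
        pads.append(max(0, round((cur - prev) / period_ms) - 1))
    return pads


def plan_pad_frames(timestamps_ms, period_ms):
    """Track the running output position so leftovers carry forward.

    Returns the number of pad frames to insert before each captured frame.
    """
    pads = []
    next_slot = 0                  # next free frame slot in the output file
    start = timestamps_ms[0]
    for t in timestamps_ms:
        # Output slot whose ideal time is nearest this frame's timestamp.
        ideal_slot = round((t - start) / period_ms)
        # Never move backwards: an early frame just takes the next slot.
        gap = max(0, ideal_slot - next_slot)
        pads.append(gap)
        next_slot += gap + 1
    return pads


# Contrived case: every gap is 1.4 frame periods long (40 ms per frame).
stamps = [0.0, 56.0, 112.0, 168.0, 224.0]
naive = naive_pad_frames(stamps, 40.0)
tracked = plan_pad_frames(stamps, 40.0)
print("naive:  ", naive)
print("tracked:", tracked)
```

With these inputs the naive version never inserts a pad frame, because each 1.4-frame gap rounds down to one frame, so the output ends up almost two frames short. The tracked version inserts a pad frame whenever the leftover error adds up to more than half a frame.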
Large disturbances can also result in the correlation between audio and video steps being a bit more broken than a line. Maybe the hard drive got sidetracked for a while by another task, or the clock just decided to jump forward a few seconds (which I have seen happen). This means that the linear regression fit becomes poor, and the error in the relative latency estimate rises. If the break is bad enough, this can cause the estimate to rise far enough that the audio resampler actually desynchronizes the audio. I haven't come up with a really good solution for this yet. A possibility is to check the variance in the estimate and restrict the estimation to only recent samples if that helps. Doing this, though, requires (1) a heuristic to pick the best break point, and (2) a very fast way to do the search. The estimated variance of the estimate (ugh) might be enough for the first, and I think that either sum tables or a tree-like summing structure might work for the second problem.
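For the second problem, a sum-table approach might look roughly like the Python sketch below. It is untested against real capture data, and the class name WindowedFit and the simulated 2-second clock jump are made up for illustration. Running totals of x, y, x*x and x*y let a least-squares fit over any run of samples be computed in constant time, so many candidate break points can be tried cheaply.

```python
from itertools import accumulate


def prefix(values):
    """Running totals with a leading zero: sum(values[i:j]) == p[j] - p[i]."""
    return [0.0] + list(accumulate(values))


class WindowedFit:
    """Least-squares line fit over any sample range, using sum tables."""

    def __init__(self, xs, ys):
        self.px = prefix(xs)
        self.py = prefix(ys)
        self.pxx = prefix(x * x for x in xs)
        self.pxy = prefix(x * y for x, y in zip(xs, ys))

    def fit(self, i, j):
        """Slope and intercept over samples i..j-1, in constant time."""
        n = j - i
        sx = self.px[j] - self.px[i]
        sy = self.py[j] - self.py[i]
        sxx = self.pxx[j] - self.pxx[i]
        sxy = self.pxy[j] - self.pxy[i]
        slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
        intercept = (sy - slope * sx) / n
        return slope, intercept


# Simulated stream: 33 ms per frame, with the clock jumping 2000 ms at frame 50.
frames = list(range(100))
stamps = [33.0 * i + (2000.0 if i >= 50 else 0.0) for i in frames]

table = WindowedFit(frames, stamps)
full_slope, _ = table.fit(0, 100)      # fit across the break
recent_slope, _ = table.fit(50, 100)   # fit only after the break
print(f"slope over everything:   {full_slope:.2f} ms/frame")
print(f"slope over recent half:  {recent_slope:.2f} ms/frame")
```

One caveat with this design: subtracting large running totals loses floating-point precision over a long capture, so a real implementation would probably want to rebase the sums periodically or use the tree-like structure instead.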