Capture timing and capture sync
When I first started doing digital video in Windows, I started with video capture, since I was looking for ways to abuse the TV-tuner style PCI card I had just gotten. Back then I think I was running Windows 3.1 on an 80486 CPU, so just capturing postage stamp size video (160x120) was taxing, and half-size video (320x240) was possible only on a good day with a tail wind — and, of course, enough hard drive space, which I didn't have nearly enough of. And that was assuming that the cosmos was nice enough to give me enough contiguous memory below 640K for the framebuffers, which occasionally was thwarted by a particularly frustrating sound driver.
Thus, at the time, I wasn't particularly concerned about audio sync drift.
Nowadays I have far more CPU, disk, and memory bandwidth than is necessary to support analog video capture, and I can dump 720x480 video uncompressed to disk at full frame rate without problems. When you capture for more than an hour at a time, however, suddenly that little 0.5% deviation in clock rate starts to show up in the video, and tweaking the frame rate of the output over and over until it looks and sounds right starts to get tiresome. Soon, capture programs with audio synchronization capabilities were born, and VirtualDub eventually joined the pack in tardy fashion. These attempted to analyze timing on the fly and correct for deviations, thus bringing audio and video closer in sync. And usually, they were more than adequately successful.
VirtualDub's current audio synchronization mechanism sometimes works, and sometimes doesn't. I'm working on tweaking it again within the 1.6.x series to improve its effectiveness. In the meantime, here's some of what I've learned along the way.
To play back captured audio and video in synchronization, you need to know which audio samples match up with which video samples, or equivalently, be able to match both up to a common time base. Timestamps are the most direct way to do this, but a rigid correspondence between samples and time also works, which is the approach AVI takes. So all we have to do is record the timestamp at which each video sample or audio block arrives and we're done, right?
Well, not exactly.
Sure, AVICap (VFW) and DirectShow both give you timestamps to at least 1ms precision, but the actual precision of the timestamps can be rather bad, as bad as 10ms. And for audio devices, you may not get timestamps at all. Already we're at a disadvantage, since we have to infer the audio timestamps, probably by trying to read the video or system clock as soon as we get a block of audio samples. That adds additional latency which we have to account for and which, more importantly, we don't actually know. Not to mention that there is noise introduced by the time between the audio interrupt and when we actually receive the block and take the timestamp. So basically we have audio and video timestamps of rather dubious quality to work with.
A more serious problem is that you don't actually know when the two streams start relative to each other. Hopefully they're as close as possible, but there's some error to account for there. More on this later.
The final, and actually biggest problem, is that the audio and video devices don't capture at the same rate. The video capture device is using a clock of some sort, but it has some small error from real time. The audio capture device also uses a clock — a different one from video, unless you have both combined in one unit — and that has a different error from real time. Rather than capturing at 44100Hz exactly, it's actually producing audio samples at 44089.1Hz or something slightly off, a tiny error that can start to become visible after an hour or so. On older cards, this error can be much larger; I had a card once that was as far off as 43400Hz. To correct for this error, we need to estimate the difference in rate between the streams and correct one or both streams to compensate.
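To put numbers on this, here is a rough back-of-the-envelope sketch. The function name is made up for illustration; the rates are the two mentioned above, measured against a nominal 44100Hz.

```python
def drift_seconds(nominal_hz, actual_hz, capture_seconds):
    """Seconds by which the audio, played back at nominal_hz, falls short of
    the real capture duration (negative if it runs long)."""
    samples_captured = actual_hz * capture_seconds
    played_duration = samples_captured / nominal_hz
    return capture_seconds - played_duration

# One hour of capture at each of the two actual rates from the text.
for actual in (44089.1, 43400.0):
    drift = drift_seconds(44100.0, actual, 3600.0)
    print(f"{actual} Hz: {drift:.2f} s of drift per hour")
```

The first case works out to roughly 0.9 seconds per hour, the second to roughly 57 seconds per hour.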
Measuring the stream timing
The direct way to measure rate is to take the period of video samples or audio blocks, which is the difference between adjacent timestamps. Do this once, though, and you'll get a really noisy sample; an error of 5ms out of 33ms in a video timestamp is huge. An imprecise clock can be one source of problems, background task interference is another, and audio buffering is a third. A subtle fourth source of error is that the capture program can only service audio and video synchronously; in particular, if video takes a long time to compress, it can delay the receipt and timestamping of the next audio block. I think that with DirectShow it may be fairly easy to push the actual audio and video processing to another thread, which would allow the capture thread to timestamp everything very quickly, but I haven't tried it yet.
It goes without saying, but obviously Windows XP is not a real-time operating system (RTOS).
There is a similar problem with determining the start of the stream, as just taking the first sample is similarly noisy. To get this error down, multiple measurements have to be combined to produce a statistic with lower variance and thus better accuracy. Averaging is the usual way to do this, which scales the standard deviation by 1/sqrt(N) for N data points. However, in the case of the frame period, it doesn't work:
mean period = [(x_1 - x_0) + (x_2 - x_1) + ... + (x_N - x_(N-1))] / N
mean period = (x_N - x_0) / N
All of the adjacent differences cancel and what you end up with is the difference between the last and first samples divided by the number of frames. The variance, therefore, isn't improved with the number of samples, and the estimate never gets any less noisy.
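The cancellation is easy to check numerically. This is a small sketch with synthetic timestamps; the 5ms jitter and the NTSC frame period are assumptions chosen to match the figures used earlier.

```python
import random

random.seed(1)
frame_period = 1001 / 30  # nominal NTSC frame period in ms (about 33.37)
n = 1000

# Synthetic timestamps with up to +/-5 ms of jitter on each one.
stamps = [i * frame_period + random.uniform(-5.0, 5.0) for i in range(n + 1)]

# Average of all adjacent differences...
deltas = [b - a for a, b in zip(stamps, stamps[1:])]
mean_of_deltas = sum(deltas) / n

# ...versus just the last timestamp minus the first, divided by N.
endpoints_only = (stamps[-1] - stamps[0]) / n

print(mean_of_deltas, endpoints_only)
```

The two values agree up to floating-point rounding. Only the first and last timestamps influence the result; every timestamp in between drops out of the sum.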
Is the latency difference significant? Usually, not really. A 10ms constant error is hardly noticeable. On a USB-based capture device with hardware compression, though, I've seen the video stream delayed by 300ms compared to audio. You don't want to run an audio resampler with that kind of estimation error.
The solution I came up with for 1.6.2+ is to compute a linear regression fit to the samples instead, which produces a slope (the frame period) and an intercept (the start of the stream). This assumes that the actual frame period relative to the time base is constant over time, which it isn't (clocks can drift in speed due to changes in temperature), but it seems to work well enough, and more importantly, if the linear fit is good the variance in the outputs drops as more frames are captured. We can do this for both the audio and video streams and determine both the relative difference in starting point and rate. This isn't necessary, though; all we're actually concerned about is the current error between the two streams, and constantly driving that difference to zero. This takes care of both differences in start and rate.
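As an illustration of the idea (not VirtualDub's actual code), here is a minimal ordinary least-squares fit of timestamp against frame number. The slope is the estimated frame period and the intercept is the fitted timestamp of frame 0. The "true" period, start time, and jitter are made-up test values.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

random.seed(2)
true_period, true_start = 33.3667, 120.0  # ms; hypothetical values
frames = list(range(2000))
stamps = [true_start + i * true_period + random.uniform(-5.0, 5.0)
          for i in frames]

period, start = fit_line(frames, stamps)
print(f"period ~ {period:.4f} ms, start ~ {start:.2f} ms")
```

Unlike the averaged differences, every timestamp contributes to the fit, so the estimates tighten as more frames come in, provided the samples really do lie near a line.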
Missing frames are most straightforwardly dealt with just by looking for abnormally long delays between frames and inserting pad frames in the output file to take up the extra time. This is standard operating procedure for many capture programs in Windows, and contributes to the evil "dropped frames" statistic. If your system can't keep up with the incoming frames, the hardware will start dropping frames to make room for new ones, and so you'll get a bunch of holes in your output file, leading to an unsatisfying, stuttery video. Note that this doesn't necessarily mean that the audio sync in the file is screwed; this is a common misconception. The purpose of putting padding frames in the output file when dropped frames have been inferred is to space out the frames that were captured to match the actual timing as closely as possible. As an example, have VirtualDub process a video in Direct Stream Copy mode with it set to convert to a 120fps frame rate. You'll find that about three-quarters of the frames in the output file are pad frames (zero-byte), but audio sync will be preserved.
Unfortunately, the heavens aren't so nice as to deliver a clean, wide 67ms timestamp delta where there is one frame missing; noisy timestamps mean the exact delta can vary quite a bit, and even worse, interesting glitches in the signal can mean that the delay isn't an integral number of frames. That means the timing error has to be tracked over time and carried forward from frame to frame, to keep subframe errors from piling up and gradually wrecking audio sync. Doing this ensures that the error doesn't grow larger than half a frame, at the cost of a periodic single-frame glitch. An alternative solution is to tweak the video timing to cover the small errors, basically resampling both audio and video. This can work for small rate errors or field drops, and I'm experimenting with such a system, but it's important to have an escape hatch that turns off the system and forces pad frames in case of a bad glitch, like a half-second interruption. I've found that if this isn't done, the resamplers go nuts trying to cover the gap between frames, resulting in some chipmunkiness in the audio.
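One way to picture the pad-frame bookkeeping is the sketch below. It is an assumed, simplified scheme rather than what VirtualDub does: each captured frame fills at least one output slot, and the leftover timing error is carried into the next frame instead of being thrown away. The function name and sample timestamps are hypothetical, and the escape hatch for large glitches is left out.

```python
def frames_to_emit(timestamps_ms, period_ms):
    """For each frame after the first, return how many output slots it
    should occupy: 1 for the frame itself plus any pad frames before it."""
    counts = []
    error = 0.0  # real time elapsed that the output has not yet covered
    prev = timestamps_ms[0]
    for t in timestamps_ms[1:]:
        error += t - prev
        # Round to the nearest whole number of frame periods, minimum one.
        slots = max(1, round(error / period_ms))
        # Keep the sub-frame remainder so it is not lost.
        error -= slots * period_ms
        counts.append(slots)
        prev = t
    return counts

# Frames arriving every 1.4 periods: rounding each gap on its own would give
# one slot apiece and lose time; carrying the remainder alternates 1 and 2.
print(frames_to_emit([0, 14, 28, 42, 56, 70], 10))

# Noisy 33.37 ms timestamps with one missing frame before the fourth sample.
print(frames_to_emit([0, 33, 67, 140, 173], 33.37))
```

In the first example the slots add up to seven, matching the 70ms that actually elapsed, whereas rounding every gap independently would emit only five.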
Large disturbances can also result in the correlation between audio and video steps being a bit more broken than a line. Maybe the hard drive got sidetracked for a while by another task, or the clock just decided to jump forward a few seconds (which I have seen happen). This means that the linear regression fit becomes poor, and the error in the relative latency estimate rises. If the break is bad enough, this can cause the estimate to rise far enough that the audio resampler actually desynchronizes the audio. I haven't come up with a really good solution for this yet. A possibility is to check the variance in the estimate and restrict the estimation to only recent samples if that helps. Doing this, though, requires (1) a heuristic to pick the best break point, and (2) a very fast way to do the search. The estimated variance of the estimate (ugh) might be enough for the first, and I think that either sum tables or a tree-like summing structure might work for the second problem.
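For the second problem, one reading of "sum tables" is plain prefix sums of the regression terms, which let a line be fitted over any contiguous range of samples in constant time. The sketch below assumes that reading; the class name and the simulated 3-second clock jump are hypothetical, and it does not address the break-point heuristic.

```python
class RunningFit:
    """Prefix sums that allow a least-squares line over any sample range
    [start, end) to be computed without revisiting the samples."""

    def __init__(self):
        # Each entry holds cumulative (sum x, sum y, sum x*x, sum x*y).
        self.sums = [(0.0, 0.0, 0.0, 0.0)]

    def add(self, x, y):
        sx, sy, sxx, sxy = self.sums[-1]
        self.sums.append((sx + x, sy + y, sxx + x * x, sxy + x * y))

    def fit(self, start, end):
        """Return (slope, intercept) for samples start..end-1."""
        n = end - start
        lo, hi = self.sums[start], self.sums[end]
        sx, sy, sxx, sxy = (b - a for a, b in zip(lo, hi))
        slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
        return slope, (sy - slope * sx) / n

fit = RunningFit()
for i in range(200):
    jump = 3000.0 if i >= 100 else 0.0  # simulated clock jump at sample 100
    fit.add(float(i), 40.0 * i + jump)

print(fit.fit(0, 200))    # whole capture: slope is pulled well away from 40
print(fit.fit(100, 200))  # samples after the break only: slope is 40 again
```

With real timestamps the raw sums get large, so a practical version would probably need to subtract an offset first or use a more careful accumulation to avoid losing precision.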
Personally, I like the fact that we are slowly moving to the DVB age of video capture (digital), where audio/video comes pre-packed and pre-compressed and these issues disappear.
Blight - 03 12 05 - 03:17
Yes, but how many roadblocks will be deliberately placed to prevent you from capturing the streams?
Phaeron - 03 12 05 - 03:35
because the source is analog, some of the video signal always leaks into the audio path. and this leak is always in sync, by definition... you could try to correlate the captured video with the leaked video present in the audio path - perfect sync! i guess the interesting leakage (for correlation) would be in the 5-20kHz range, so it would be good to apply such bandpass filter to the audio and the reconstructed video before correlating them.
User - 03 12 05 - 11:51
What I don't understand is why there are so many problems maintaining a/v sync even with DV streams (particularly between certain combinations of camera brands and capture software - canon and windv). Wouldn't the timestamp information in a unified device be accurate?
David - 03 12 05 - 12:52
That only works if the video source is RF-modulation; it doesn't work for composite or S-Video inputs, where the audio and video are never merged. The horizontal sync alone is 15KHz, so I can't imagine any meaningful data for correlation leaking through the filters on the audio path. Still, it's an interesting idea.
I don't know. Although audio is interleaved with video in DV, there is still a little bit of play since DV has an "unlocked" mode for audio where you're not guaranteed that they're exactly synchronized per-frame. Also, dropouts and errors in the tape can add a bit of havoc into the stream (a single 48KHz frame in a group of 10 32KHz frames??).
Phaeron - 03 12 05 - 17:19
it works also for composite / s-video. alternating black/white lines would appear as 7.5kHz, but it's a quite rare image. blocks of 5-10 lines of different luminance (a common occurrence) would appear in the kHz range... check your raw video recordings (before audio compression) in cooledit or matlab - you'll see the 15kHz carrier even in s-video/composite...
User - 03 12 05 - 18:16
It would be great to have a reliable method of syncing the audio without dropped frames. I have no problems with sync when capturing with frames dropped or added (ie. from TV). But when capturing for editing, when it is important not to lose any frames, and I set this in prefs, currently my audio is out by more than a second or two after 20 minutes of capturing. I then resync shots in Vegas Video where necessary. This isn't a big deal, as I used to edit 16mm film in the past and syncing up separate sound and picture was a fact of life then. In Vegas I can also stretch or shrink the audio for a shot without changing the pitch.
The real-time audio resample feature in Virtualdub isn't reliable on either of my set-ups and it is easier not to use it, rather than possibly end up with audio that has been resampled AND is out of sync.
VirtualDub is the best capture software around though IMO. Thanks for making it.
rgs_uk - 03 12 05 - 22:40
Why resample audio at all? At 30fps, inserting a drop frame or dropping an actual frame to keep in sync with the audio results in a maximum error of 33ms, which is below the threshold for humans to detect something as out-of-sync.
Also, if you were to support WMV (now that MS has removed the more onerous licensing terms, at least for non-DRM use), this would get rid of most sync issues since the frame rate can be variable (unlike AVI) since each frame has its own timestamp.
Whoever - 04 12 05 - 05:53
The problem with DV capture is that the DV stream contains a number of "locked" samples per frame. This means that the samples are to be played back in sync with the frame, which is already quite good. But the problem is that the camera doesn't always deliver the correct number of samples per frame: 48kHz audio on PAL, for example, should give 1920 smp/frm, but I've seen my camera returning 1940 samples once in 10 seconds or so. This means that audio resampling is needed even though the audio already is locked. And apparently, some DV capture programs don't do the resampling, maybe because the developers only tested with cameras that did *real* locking, i.e. frame-synced and with really constant number of samples per frame :)
KeyJ - 04 12 05 - 12:55
I worked on this problem in the late '90s for several years. The core problem is when you have more than one crystal, both trying to give you timing information -- there will always be some rate slew (CPU clock / video clock / sound card crystal / etc).
The only way to really solve this problem is to use a house clock, and make sure your audio and video gear all syncs off the same clock (for more on this, google "blackburst"). Needless to say, cheap consumer hardware won't go through the trouble of implementing this crucial feature, so we'll get wow and flutter from "compensating" software forever to come.
If you actually CARE about your multimedia, you need to get gear that can sync right, but expect an audio+video capture rig to set you back over $1,000. On the bright side, you'll be able to play back OUT with the same sync :-)
Jon - 04 12 05 - 21:55
Adding or removing frames can play havoc with inverse telecine (IVTC) algorithms. Also, you can't drop frames from a compressed video stream. As for using a timestamped container format, that won't work because you don't want the timing jitter from the capture to be faithfully reproduced in the playback (there's generally a lot more error in the received timestamps than in the actual source).
At least with SMPTE 314M, audio locking is optional (AAUX PC1 bit 7), and even when it is locked, the camera still has to emit a varying number of samples for NTSC. I think the range over which audio has to be aligned is as much as 15 frames, according to some Adobe tech note I saw. VirtualDub resamples type-1 DV audio in blocks of 10 frames worth of samples to solve this problem.
Phaeron - 05 12 05 - 02:00
That was interesting, even if a bit over my head. I usually use VirtualVCR for long captures, as it has always resampled the audio instead of video framerate, and it is more important to me to maintain perfect video framerate. But what you propose above makes more sense.
Oftentimes, I can capture a long program, and it maintains sync in the captured file, until I apply the IVTC filters afterward. At that time, it loses sync, which is very frustrating. Maybe the new system will help that.
Of course, in the past few months, I've found that the best way to get a program on my hard drive is to record on my set-top DVD recorder, and rip it afterward. It maintains perfect sync, and looks good. Of course, I can't get it uncompressed onto the hard drive that way, and it does occasionally show some compression artifacting, due to realtime compression, even if hardware based.
Brian Young (link) - 05 12 05 - 19:04
FWIW, why does capture have to happen in real time? E.g. ATI's latest MMC versions delay both streams (though perhaps for other reasons), but the lag is only noticeable if you're watching a TV next to your PC.
Again FWIW, when I have had problems with sync (not in V/Dub), it's always been because the company that made the capture device designed their software to alter the video rate to chase the audio. Seems they create a bigger problem trying to guard against a smaller one.
@Brian, One problem with a DVD recorder, or using a PC at similar bitrate to mpg2, is that the bitrate is almost never ideal. Either you use too much compression and the file is smaller than it has to be, or you use too little, and the file won't fit. Also if you re-compress, a higher capture bit rate helps.
@Brian, IVTC in V/Dub does work, if you feed it good audio at the same time that's in sync. OTOH, in my experience it does work better with captured mjpg1 than something like mpg2 with audio stripped and converted.
RE: DV camera work, I believe when/if possible, it's recommended to record audio to something like mini-disc -- from all I've read and seen, don't believe the audio circuitry on most DV cameras can compare. If I remember correctly, DV was designed originally more for consumer products, with perhaps less attention paid to audio handling.
@Blight, yes but dvb often delivers odd-sized, off aspect frames to save bandwidth, and as noted, getting files to your drive is often very iffy, especially with cable in the US.
RE: Carrier signals etc..., in the US the trend is (& has been) moving towards digital signals converted back to analog in the home. As HD technology becomes more common, I would expect the cable companies to follow the sat folks and switch everything over to digital, I'd imagine to some sort of mp4 variant, dropping analog entirely to save bandwidth.
@rgs_uk, Agree re: Vegas... Goldwave is an alternative that works well when the accumulated delay is steady, where the audio doesn't move in and out of sync -- just time stretch/shrink the playing time of the wav to match video duration.
mike - 06 12 05 - 15:55
As someone else said, a lot of this is way over my head. As for dropped frames, losing any frames from the picture (or having 'double' frames) is not acceptable for serious editing.
Mike - stretching/shrinking a long section in Vegas rarely seems to maintain satisfactory sync along the entire length. Also what is everyone's view on time stretching/shrinking in Vegas and Goldwave and its effect on audio quality? Does it reduce the quality? I'm thinking for archive purposes.
rgs_uk - 07 12 05 - 00:35
Virtualdubsync has always given me consistent audio delay. That's why I still use it over the current build. It's predictable. I just offset all captures by -80ms. The 1.6.x builds seem to give me varying results. I stopped capturing with 1.6.x as I already have a working method with vdubsync.
But I just wish the video frequency would stay put.
Rather than match the video to the audio, or the audio to the video, I'd rather have vdub adjust video fps to a fixed number first. Then match the audio to the corrected video. I'm talking about a nice round 25fps with no decimal point for example (PAL obviously). Even if the audio is offset as with vdubsync, that's easy to fix after capture.
And that would mean you could append files together perfectly.
I'd think if it were this simple it'd be done already so I may be barking up the wrong tree here. As I see it the hardware is inconsistent so the software must set the timing.
SonOfAdam - 07 12 05 - 09:01
But... I try to record from my TV card directly to DivX 5.02. Using Virtualdub 1.5.10 I can check "Lock video stream to audio" and it works. I can use other applications to push the CPU to 100 %, there will be distortions in the recording, but after that it synchronises again. Not so with VD 1.6! It stays out-of-sync for the rest of the recording (and unsynchronized sound almost always appears after a few minutes of recording even if no other process is on). I have an AMD XP 2500+ Barton which should be enough... I only capture at 284 x 188 (low resolution PAL). Is there any way to make the 1.6 work just as well as the 1.5 Virtualdub does? (I like the fact that VD1.6 automatically starts audio recording from AUX even if another application (my IP phone software) previously had changed recording input to microphone. But a non-synchronized video is not fun at all...)
Orvar Lyckholm - 14 12 05 - 13:44