VSync under Windows, revisited
Something I've been experimenting with a lot lately is field-based display of video on a computer screen. I've written about this before, but to review, regular analog video isn't composed of a series of sequential frames, but alternating half-frames called fields at twice the rate. That is, instead of updating the whole screen at 30Hz, even and odd scanlines update at 60Hz. This is called interlacing, and is an attempt to increase the quality of video by getting some of the benefits of higher resolution and smoother motion.
It's also a gigantic pain.
Frequently, such video is displayed on a computer screen by simply displaying pairs of fields as frames at 30Hz. The most common objectionable result of this is intermittent combing caused by pairing fields that don't match, which can be resolved by deinterlacing. If you're dealing with film-rate material that has been upsampled from 24 fps, this may look fine. However, for material that's actually been recorded at 60 fields/second, the difference can still be significant and the 30Hz output will be considerably less fluid. The right thing to do is to upsample the video from 60 fields/second to 60 frames/second. This isn't easy, nor is there one "right" way to do it, but the result is worth the effort... and it takes a lot of effort, at least on Windows.
I'll ignore all of the discussions about how to upsample field rate to frame rate, such as bob/weave/adaptive, because they're irrelevant here: just displaying video reliably at 60Hz is tricky under Windows. At that rate, you're at or very near the refresh rate of the monitor, and that makes the timing a lot tighter. At 30Hz, you have approximately two refreshes or 33ms between frames, which gives you a lot of room for jitter -- it doesn't matter if you're a little bit ahead or behind on a frame here or there. At 60Hz, however, you have only 16ms per frame and have to hit every refresh exactly. Miss one, and the result is a fairly noticeable glitch. Now, you might say that 16ms is a lot of time on modern CPUs, until you realize that the default thread scheduling granularity on Windows NT based platforms is 10ms and any disk access can easily take more than 100ms. There's not a whole lot of margin for error here. You can solve these two problems by doing a timeBeginPeriod(1) and, well, not doing disk access in your rendering thread. The third problem isn't so simple.
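As a rough illustration of how tight that budget is, here is a minimal sketch of the arithmetic. The function names are made up for this example, and the delay figures are simply the ones quoted above (10ms default scheduling granularity, roughly 1ms after timeBeginPeriod(1), 100ms for a slow disk access).

```cpp
#include <cstdio>

// Time available per displayed frame, in milliseconds, at a given rate in Hz.
constexpr double frame_period_ms(double rate_hz) {
    return 1000.0 / rate_hz;
}

// Slack left in one frame period after a single delay of delay_ms.
// A negative result means at least one refresh is missed.
constexpr double slack_ms(double rate_hz, double delay_ms) {
    return frame_period_ms(rate_hz) - delay_ms;
}

// Prints the slack for the two display rates and three delays discussed
// in the text. The delay values are illustrative, not measured.
void print_frame_budgets() {
    const double rates[] = {30.0, 60.0};
    const double delays[] = {1.0, 10.0, 100.0};
    for (double r : rates) {
        for (double d : delays) {
            std::printf("%2.0f Hz: period %.2f ms, slack after %5.1f ms delay: %+.2f ms\n",
                        r, frame_period_ms(r), d, slack_ms(r, d));
        }
    }
}
```

At 60Hz a single 10ms scheduling delay leaves under 7ms for everything else in the frame, while at 30Hz the same delay still leaves more than 23ms.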
Vertical sync (VSync) is the third problem, and the one I've had the most issues with. In order to have a smooth 60Hz display, you also need to make sure that updates are synchronized such that they never occur in the middle of the screen, which would cause objectionable tearing and thus jerky motion. The problem is that there doesn't seem to be any good way to do this on Windows. If you're in windowed mode, the NOTEARING flag on DirectDraw blits doesn't do anything, and both overlays in DirectDraw and INTERVAL_ONE in Direct3D block, which leads to problems with the main thread getting blocked and not being able to hit 60Hz with multiple windows (they alternate). I had resorted to polling the beam in VirtualDub to get this working, and in the end it wasn't that reliable -- there were too many ways in which other processing in the main thread would cause the display code to miss the blit window and skip a frame. In 1.7.2, I moved the display code to a separate high-priority thread to try to resolve this. Unfortunately, Direct3D being the stupid non-thread-savvy API that it is, I had to move the display window to another thread as well. That introduced more problems because the Win32 UI hates multithreaded window hierarchies, leading to problems such as DestroyWindow() not being usable, mouse input being blocked, and slow updates in the cropping dialog. In 1.7.3, I'm moving the display code back to fix these issues, which unfortunately makes the timing and blocking problems worse again.
Recently, I started experimenting with Direct3D full screen mode, which is a drastic but effective way to solve the problem. In full-screen mode, you can directly "flip" the screen by exchanging screen buffers instead of copying, which is more reliable because the display adapter usually has support for doing so asynchronously from drawing. After fixing a bunch of bugs like falling back to GDI after a display mode change (a problem when you've just changed the display mode to go full screen), I got it working, and the result is smoooooooth. You have to use the FLIP present mode instead of COPY to get good vsync lock, but after doing so you can present one frame per refresh reliably. Dialog boxes don't work correctly in this mode, which is not a big problem, since the real problem is that... the application doesn't accept input for seconds at a time now, because the message loop is blocked.
I mentioned that using overlays and INTERVAL_ONE with Direct3D in windowed mode had problems with blocking the message pump. Well, the same problem happens in full-screen mode. A message pump is something that every UI program in Windows must implement, and is a section of code where the program asks Windows if any UI events have arrived and processes them. It's a very common way of delivering events in GUI systems and ensures that events are processed in an orderly fashion. The downside is that if the application stops running its message pump for a while, its UI stops responding during that time. What all of the vsync techniques mentioned above have in common is that they block until vertical sync, which means that they sit in a polling loop until vertical sync arrives, during which time the message pump doesn't process messages. And since Windows prioritizes delivery of messages based on message type, it's possible that the application gets in a nasty cycle of spending most of its time waiting for vsync, the remaining time processing some internal events, and no time actually processing input. The result is an application that is happily displaying video and not responding to any mouse or keyboard events whatsoever -- which is what happens to VirtualDub since it receives video frames asynchronously from DirectShow.
There is a sneaky way to get around this problem, which is a bit called D3DPRESENT_DONOTWAIT. What this flag does, when passed to IDirect3DSwapChain9::Present(), is tell Direct3D not to block for vertical sync. Instead, it returns immediately with a status code. Upon discovering this problem again, I had constructed a framework that uses WM_TIMER messages to periodically poll until the flip can be queued again, which fixes the problem because WM_TIMER has a lower priority than input messages. The only problem is that this doesn't work, because Present() blocks anyway. I searched around a bit, and people have reported that it works on ATI cards, but NVIDIA and Intel happily block. Furthermore, the rumor is that this behavior may have been deliberately introduced to work around dumb behavior in the Direct3D runtime, which apparently does a polling loop at 100% CPU if you don't specify the DONOTWAIT flag, which is the common case. Sigh.
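The intended shape of that WM_TIMER framework can be modelled in a few lines of portable C++. This is only a sketch under assumptions, not VirtualDub's actual code: FlipPoller, PresentResult and the try_present callback are hypothetical names, and the callback stands in for a Present() call made with D3DPRESENT_DONOTWAIT, which is documented to return D3DERR_WASSTILLDRAWING instead of waiting. As described above, the whole scheme only helps on drivers that really do return immediately.

```cpp
#include <functional>
#include <utility>

// Stand-in for the two outcomes that matter here. With real Direct3D 9 these
// would correspond to D3D_OK and D3DERR_WASSTILLDRAWING from Present().
enum class PresentResult { Queued, StillDrawing };

// Hypothetical helper: holds at most one pending frame and retries the flip
// each time the timer handler runs, so the message loop is never blocked.
class FlipPoller {
public:
    // try_present is assumed to return immediately in both cases.
    explicit FlipPoller(std::function<PresentResult()> try_present)
        : try_present_(std::move(try_present)) {}

    // Called when a new frame has been rendered into the back buffer.
    void submit_frame() { pending_ = true; }

    // Called from the WM_TIMER handler. Returns true if a flip was queued.
    bool on_timer() {
        if (!pending_) return false;            // nothing to show
        if (try_present_() == PresentResult::StillDrawing)
            return false;                       // keep the frame, retry next tick
        pending_ = false;
        return true;
    }

    bool has_pending_frame() const { return pending_; }

private:
    std::function<PresentResult()> try_present_;
    bool pending_ = false;
};
```

Because the retry happens on timer ticks rather than in a loop, input messages get a chance to run between attempts, which is the whole point of the trick.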
(And before you say this problem is fixed on Windows Vista with the DWM, I seriously doubt the DWM can render the desktop at 60 fps when I'm taking more than half the fill rate for my shader processing.)
I actually first discovered this behavior while working on another program that reuses VirtualDub's display code, and "fixed" it in that application by splitting the message loop in half, processing input messages at higher priority. That sort of worked fine for that application, although it has the severe disadvantage of still eating up valuable frame time doing unnecessary polling. I can't use it in VirtualDub, however, because modal dialogs use their own message pump and could lock up anyway. I've been searching for a solution to this without success -- one idea I had was to use Direct3D event queries to watch the length of the frame queue, but that doesn't seem to work -- even waiting for the entire pipeline to flush still causes up to a 16ms stall on Present(). If anyone has ideas about how to work around this problem, I'm all ears.
Could you use a waitable timer to make sure you run your message loop for as long as possible each frame? If you want to flip every 16ms (say 10ms allowing for some overhead), a waitable timer would have the granularity to do that. Every frame run a PeekMessage/MsgWaitForMultipleObjects loop until the timer is signaled and then let D3D do its stuff until the next frame.
jon - 16 08 07 - 01:10
This may be completely useless, but how does OpenGL's SwapBuffers() behave with VSync? Some quick Googling shows everything from "swaps when it feels like it" to "swaps at vsync". Another thought would be triple buffering, as that seems to solve most of the problems of waiting for vsync.
One more idea: once you've found the vsync time, is it possible to schedule a timer such that it fires on each vsync? You could use that to schedule the rendering.
Thomas (link) - 16 08 07 - 04:39
I'm trying not to use solutions that involve modifications to the main message loop, as it's a pain and it destroys abstraction. That's why I went for the WM_TIMER trick. There's also the question of how it behaves with drivers that don't wait. Still, it's a possibility that I haven't tried yet.
SwapBuffers() normally doesn't specify vsync behavior, but you can control it via the WGL_EXT_swap_control extension, which gets you behavior similar to Direct3D swap intervals. Since there isn't an API for a non-blocking present, I'm pretty sure SwapBuffers() would be forced to block once all backbuffers were busy. Unfortunately, I don't believe there is a way to programmatically force triple buffering on or off -- it's up to the driver. If you're able to consistently run at refresh rate, though, the only difference between double buffering and triple buffering is that the latter takes more VRAM and adds more latency.
Synchronizing a timer to vblank to simulate a vertical blank interrupt is difficult on Windows for two reasons: there's no way to determine vblank besides polling (assuming the driver supports beam reading), and the best timer resolution you can get in the API is 100ns, via waitable timers. I suspect that on many systems the resolution is actually 1ms. At 60Hz, the frame period is actually around 16.67ms, so you accumulate error pretty quickly if you just let the timer free run. Generally the way around this, and the way that VBIs were simulated in DOS, was to set the timer to somewhat short of the vblank and poll for it on the interrupt. I really hate doing things like that, though, because it's pure burned CPU time.
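To put a number on "pretty quickly", here is a small sketch of the drift arithmetic. It assumes a display at exactly 60Hz and a free-running timer whose period has been rounded to a whole millisecond; real refresh rates and timer behavior will differ, and the function names are invented for the example.

```cpp
#include <cmath>
#include <limits>

// Error accumulated per tick, in ms, when a free-running timer with the given
// period stands in for a display refreshing at refresh_hz.
double drift_per_tick_ms(double refresh_hz, double timer_period_ms) {
    return timer_period_ms - 1000.0 / refresh_hz;
}

// Number of timer ticks before the accumulated error reaches one whole
// refresh period, i.e. before the timer is a full frame out of step.
double ticks_until_one_frame_off(double refresh_hz, double timer_period_ms) {
    const double refresh_ms = 1000.0 / refresh_hz;
    const double drift = std::fabs(drift_per_tick_ms(refresh_hz, timer_period_ms));
    if (drift == 0.0) return std::numeric_limits<double>::infinity();
    return refresh_ms / drift;
}
```

With a 16ms timer the error is about 0.67ms per tick and the timer is a full frame off after 25 ticks, less than half a second; a 17ms timer lasts 50 ticks.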
When the driver decides to wait is important, too. If you try waiting for vblank and it turns out the driver is looking for the start of vblank transition, the result would be the driver consistently waiting a full frame, which would be a disaster as then you'd be at 30 fps.
Phaeron - 16 08 07 - 05:55
I have hit this symptomatically from a user's perspective when playing back 60fps videos (pre-deinterlaced or computer generated). I don't know the technical details, but my situation seems much better: it kinda works just fine, with some minor caveats.
In Windows XP using DirectShow players with any video renderer, I always get V-Synch and never 100% CPU locks (at low resolutions with fast codecs, 5% CPU usage or less is typical). The problem is that each frame must be shown for at least one scan, so if the framerate of the video is almost the same as the refresh rate of the monitor and there's a video playback hitch for whatever reason, it won't catch-up with the audio (for a long time - 59.94 vs 60 leaves some room). If the framerate of the video is higher than that of the screen, then playback is impossible, as the video will play back at most as fast as the screen refresh rate. However, even in this situation, CPU usage doesn't spike.
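For a sense of how long "a long time" is, a minimal sketch of the catch-up arithmetic follows. It assumes nominal rates (video at exactly 59.94 fps, display at exactly 60Hz) and that each frame occupies at least one refresh; the function name is made up for the example.

```cpp
// Seconds needed to recover from being frames_behind frames late when each
// frame must be shown for at least one refresh. Returns a negative value if
// the video is at least as fast as the display, in which case catching up
// is impossible.
double catch_up_seconds(double video_fps, double refresh_hz, double frames_behind) {
    const double spare_frames_per_second = refresh_hz - video_fps;
    if (spare_frames_per_second <= 0.0) return -1.0;
    return frames_behind / spare_frames_per_second;
}
```

With those nominal figures the display has only 0.06 spare refreshes per second, so recovering from a single late frame takes roughly 16.7 seconds.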
In Windows XP with non-DirectShow players, such as VLC and mplayer, which use DirectDraw (directly) by default, it works fine (with V-synch and not 100% CPU) unless they have to do catch-up (they are in the same situation as the DS players: each frame must be shown one scan). When they enter catch-up, CPU goes up to 100%. However, they support framedropping, so they can keep A-V synch (and smooth looking playback), but they use 100% CPU continuously which is not acceptable. I guess you are in the same situation.
In Linux, I'm no expert, but it works perfectly in all players, using the "xv" output (uses hardware overlay). Always V-synch, never 100% CPU usage, never A-V desynch, videos with higher framerate than the monitor play just fine, etc... they got it perfect, at least for my configuration.
As for getting a smooth framerate, as I said, that's critical in Windows, but that doesn't seem to be a problem for me - even if the codec uses 80%+ of the CPU. I've played 2h+ videos without a frame arriving late. Windows is actually better for this, as setting the player's priority to realtime pretty much guarantees it'll always be on time, even under heavy system load. In Linux it's tricky because it seems the X server is involved in the process, so you have to raise its priority too, but then you're indirectly giving more priority to anyone drawing to the screen.
None of the cases I listed distinguish between windowed or fullscreen - fullscreen here just means the overlay covering all the screen.
John - 16 08 07 - 10:03
You mentioned that in 1.7.2, you moved the display code to a separate high-priority thread. This really is the ideal solution, but as you said, moving the display window to that (or another) thread causes all sorts of problems. I was wondering, what was the reason you needed to move the display window as well? Is there an alternative where you can move the DirectDraw code without moving the window handling as well? It's been a while, but I'm pretty sure I've done something similar in the past.
JohnP - 16 08 07 - 19:26
The wait in the driver isn't necessarily at 100% CPU, even though it blocks the main thread. I wouldn't be surprised if drivers that blocked deliberately did PAUSE loops and sleeps as necessary to reduce power consumption, which is important on laptops.
As for overlay, that's rapidly falling out of use for two reasons: it's unsupported and totally broken on Vista when WDDM drivers are used, and you can't do any sort of shader processing with them.
Possible, but I've never tried it, and I'd have to rearchitect my entire display API to do it. Not all DirectDraw/D3D functions can be called multithreaded, so it's necessary to split the routines into those that do and those that don't require window handle access. Frankly, I'm a bit tired of this "architect your whole application around DirectX" stuff -- it's a pain having to craft app-wide solutions for lost surfaces, needing to create only one primary surface, D3DX not being thread-safe, etc.
Full-screen is something I'd really like to get working well, as it's absolutely rock solid, vs. a single threaded windowed solution that sort of works and a multithreaded windowed solution that kinda works better.
Phaeron - 16 08 07 - 20:04
About overlays: as far as I know, on Nvidia hardware at least, there are no more hardware overlays ever since the NV4x - overlays are in fact 'emulated' in the driver (source: Nouveau free Nvidia driver).
Mitch 74 - 17 08 07 - 06:34
Seeing the whole matter under another point of view, perhaps it would be better to continue your investigations in a separate program and not virtualdub itself? I mean sure, it would be cool if virtualdub had rock-solid video rendering performance, but is it 100% needed for this kind of application?
I myself tend to use virtualdub as a video processor and not as a video player. On the other hand make a media player and I would certainly try it (and with what I read from this blog it would be something special :))
ggn - 20 08 07 - 02:17
Technically, it isn't required for you to be able to see video at all in order to process it. I prefer WYSIWYG, myself.
Phaeron - 20 08 07 - 03:16
If you don't want to use WYSIWYG, mplayer/mencoder gives you pretty much anything you need... Of course, reading the user's manual (its man page) is a byatch, especially if you want to use its XviD implementation - hardly documented, and what is there of it is often wrong and incomplete.
Too bad, because otherwise it rocks.
On Windows, avisynth allows you to do a lot of stuff.
Mitch 74 - 20 08 07 - 07:31