§ ¶Win32 timer queues are not suitable for high-performance timing
I read a suggestion on a blog that Win32 timer queues should be used instead of timeSetEvent(), so I decided to investigate.
First, a review of what timer queues do. A timer queue is used when you have a bunch of events that need to run at various times. What the timer queue does is maintain a sorted list of timers and repeatedly handle the timer with the nearest deadline. Not only is this more efficient when you have a bunch of timers because you don't have a bunch of pieces of code all maintaining their own timing, but it's also a powerful technique because it can allow you to multiplex a limited timing resource. It's especially good when you have a bunch of low-priority, long-duration timers like UI timers, where you don't want to spend a lot of system resources and precise timing is not necessary.
The classic timer queue API in Windows is SetTimer(). This is mainly intended for UI purposes, and as a result it's both cheap and imprecise. If you're trying to do multimedia timing, SetTimer() is not what you want. It's also a bit annoying to use because you need a message loop and there's no way to pass a (void *) cookie to the callback routine (a common infraction which makes C++ programmers see red). The newer timer API, however, is CreateTimerQueue(). This allows you to create your own timer queue without having to tie it to a message loop, and looks like it would be a good replacement for timeSetEvent().
Unfortunately, if you're in a high-performance timing scenario like multimedia, Win32 timer queues suck.
The first problem is the interface. CreateTimerQueueTimer() takes a parameter called DueTime, which specifies the delay until the timer fires for the first time. Delay relative to what? Well, when you call the CreateTimerQueueTimer() function. Or rather, some undetermined time between when you call the function and when it returns. The problem with an interface like this is that you have no idea if something sneaks in between and stalls your thread for a while, like another thread or a page fault. Therefore, you get a random amount of jitter in your start time. Another problem is that if you are creating a repeating timer, you can only set the period in milliseconds. That's not precise enough for a lot of applications. If you're trying to lock to a beam position on a 60Hz display, for instance, the 16.667ms frame time has to be rounded to 17ms, which forces you to take a third of a millisecond error per frame, or a 2% error.
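The arithmetic behind that 60Hz figure can be checked with a small sketch. The struct and function names here are made up for illustration and are not part of any Windows API; the code only rounds an ideal frame period to whole milliseconds and reports the resulting error.

```cpp
#include <cassert>
#include <cmath>

// Error incurred when an ideal period has to be rounded to whole milliseconds.
// (Hypothetical helper type for illustration only.)
struct RoundingError {
    long   roundedMs;     // period that a millisecond-only API would be given
    double errorMs;       // roundedMs minus the ideal period
    double errorPercent;  // errorMs as a percentage of the ideal period
};

RoundingError periodRoundingError(double refreshHz) {
    assert(refreshHz > 0.0);
    const double idealMs = 1000.0 / refreshHz;   // 16.667 ms at 60 Hz
    const long rounded = std::lround(idealMs);   // nearest whole millisecond
    const double err = rounded - idealMs;
    return { rounded, err, 100.0 * err / idealMs };
}
```

Calling periodRoundingError(60.0) gives a rounded period of 17 ms, an error of one third of a millisecond per frame, and a relative error of 2%.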
That's not the worst part, though. Let's say we just want a regular 100ms beat. That shouldn't be hard for a modern CPU to do. Well, here are the results:
The first number is the time offset in milliseconds, measured by timeGetTime(), which has approx. 1ms accuracy and precision. The second number is the delta from the last timer event. Notice a problem? The period is consistently longer than requested. In this case, we're 10% slower than intended. If you request a lower period, it gets much worse. Here are the results for a 47ms periodic timer:
The average period is about 63ms, which is about 30% off from our requested period. That's terrible!
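One plausible way to account for numbers like these is a model in which a timer can only fire on a system tick and each new deadline is measured from the moment the previous one actually fired. The sketch below is a guess at the mechanism, not a description of how Windows is implemented. The 15.625ms tick (64Hz) is an assumption, as are the function names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model: callbacks run only on tick boundaries, and the next
// deadline is computed from the time the previous callback actually ran.
// All times are in microseconds.
std::vector<int64_t> simulateRelativeRearm(int64_t periodUs, int64_t tickUs, int count) {
    assert(periodUs > 0 && tickUs > 0);
    std::vector<int64_t> fireTimes;
    int64_t now = 0;
    for (int i = 0; i < count; ++i) {
        const int64_t deadline = now + periodUs;
        // Round the deadline up to the next tick boundary.
        now = (deadline + tickUs - 1) / tickUs * tickUs;
        fireTimes.push_back(now);
    }
    return fireTimes;
}

// Average spacing between callbacks, counting from time zero.
double averagePeriodUs(const std::vector<int64_t>& fireTimes) {
    assert(!fireTimes.empty());
    return double(fireTimes.back()) / double(fireTimes.size());
}
```

Under these assumptions a 100ms request settles at 109.375ms per callback (seven ticks) and a 47ms request at 62.5ms (four ticks), which is in the same range as the measurements described above. Because every period is rounded up independently, the error never averages out.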
There's another factor, by the way: the timer queue API shares the same characteristic as many timing APIs in Windows of being dependent upon the resolution of the system timer interrupt and is thus also affected by timeBeginPeriod(). The reason I know is that the first time I tried this, I got results that still weren't great, but were a lot better than what you see above. The 47ms timer, for instance, turned in deltas of 48-49ms. Then I realized that I was playing a song in WinAmp in the background, and had to redo the test again.
After being somewhat depressed at the mediocre performance of the timer queue API, I remembered the CreateWaitableTimer() API. This is a different API where you create a timer object directly instead of as part of a timer queue, and it also runs the timer in your thread instead of a thread pool thread, which is much easier to deal with if you're trying to time work that requires APIs that must be called on a specific thread, particularly most DirectX APIs. As it turns out, the waitable timer API doesn't fare any better than the timer queue API for periodic timers, as it still takes the period in milliseconds and still has the same problems of significant, consistent error in period and sensitivity to the timer interrupt rate. However, the good side is that you specify the initial delay in 100ns units instead of milliseconds, and more importantly, you can specify an absolute deadline. This is very advantageous, because it means that you can compute timer deadlines internally in high precision off of a consistent time base, and although each individual timer may be imprecise, you can precisely control the average period.
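As a rough sketch of that idea, the function below computes the absolute deadline of the n-th tick in 100ns units from a fixed base time and a period expressed as a fraction of a second. The function name and the rational-period parameters are assumptions made for illustration. Only the arithmetic is shown, so the code stays portable.

```cpp
#include <cassert>
#include <cstdint>

// Absolute deadline of tick n, in 100 ns units, for a period of
// periodNum / periodDen seconds. Each deadline is derived from the base time
// and the tick index, so rounding error never accumulates from tick to tick.
// (Hypothetical helper for illustration only.)
int64_t deadline100ns(int64_t base100ns, int64_t n, int64_t periodNum, int64_t periodDen) {
    assert(n >= 0 && periodNum > 0 && periodDen > 0);
    const int64_t unitsPerSecond = 10000000;  // 100 ns units in one second
    // Rounded to the nearest unit; the intermediate product fits in 64 bits
    // for any realistic tick count.
    return base100ns + (n * periodNum * unitsPerSecond + periodDen / 2) / periodDen;
}
```

For a 1001/30000 second period (about 33.367ms), tick 1 lands at 333667 units and tick 3 at exactly 1001000 units past the base. On Windows, a value like this could be handed to SetWaitableTimer() as a positive due time, which that API treats as an absolute time, with the timer re-armed as a one-shot after each wait rather than relying on its millisecond period.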
Caveat: I've only tried this in Windows XP, so I don't know if the situation has improved in Windows Vista or Windows 7.
In current versions of VirtualDub, I don't use any of these methods for frame timing in preview mode. Instead, I have my own timer queue thread that uses timeGetTime() coupled with a Sleep() loop, with precision boosted by timeBeginPeriod() and with thread priority raised. You might think this is a bit barbaric, and it is, but it was the best way I could think of to get a reliable, precise timer on all versions of Windows. The trick to making this work is putting in feedback and adjusting the delay passed to Sleep() so that you don't accumulate error. As a result, I can do what I couldn't do with timeSetEvent(), which is to have a timer that has an average period of 33.367 ms. I suppose I could rewrite it on top of the waitable timer API, but it didn't seem worth the effort.
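A minimal sketch of the feedback idea follows. This is not VirtualDub's code. The clock and sleep functions are passed in as parameters (an assumption made so the logic can be exercised without Windows), and the names are invented for illustration. On Windows they would stand in for timeGetTime() and Sleep().

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Runs `ticks` periodic callbacks against absolute deadlines.
// nowUs:   returns the current time in microseconds (stand-in for timeGetTime())
// sleepMs: blocks for roughly the given number of milliseconds (stand-in for Sleep())
// onTick:  receives the time at which each tick was actually delivered
void runPeriodic(int64_t periodUs, int ticks,
                 const std::function<int64_t()>& nowUs,
                 const std::function<void(int64_t)>& sleepMs,
                 const std::function<void(int64_t)>& onTick) {
    assert(periodUs > 0);
    const int64_t start = nowUs();
    for (int n = 1; n <= ticks; ++n) {
        // The deadline comes from the start time, never from the last wake-up.
        const int64_t deadline = start + n * periodUs;
        for (;;) {
            const int64_t remaining = deadline - nowUs();
            if (remaining <= 0) break;
            // Feedback: the delay is recomputed from the measured clock, so a
            // late wake-up shortens the next sleep instead of shifting the schedule.
            sleepMs((remaining + 999) / 1000);
        }
        onTick(nowUs());
    }
}
```

Each tick can still be late by about a millisecond plus whatever the sleep overshoots, but the lateness does not carry over, so the average period converges on the requested value even when that value is not a whole number of milliseconds.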
Here is the test program output from Windows 7 RC for a 50ms timer:
It seems that Windows 7 does use a method that prevents accumulated error, thus giving an accurate period on average at the cost of consistency.
More interesting is when you give it a very short period, one that is shorter than that of the timing source:
In this case, the timer fires multiple times back to back.
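Both observations are consistent with a simple model in which deadlines sit on a fixed grid but callbacks can only run on a system tick. The sketch below is a hypothetical model and not a description of the Windows 7 implementation. The 15.625ms tick is an assumption, and the function name is invented.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model: deadline n is fixed at n * period, but a callback can
// only run on a tick boundary. Every deadline that has elapsed by a given
// tick fires on that tick. All times are in microseconds.
std::vector<int64_t> simulateAbsoluteDeadlines(int64_t periodUs, int64_t tickUs, int count) {
    assert(periodUs > 0 && tickUs > 0);
    std::vector<int64_t> fireTimes;
    for (int n = 1; n <= count; ++n) {
        const int64_t deadline = n * periodUs;
        // Round up to the tick on which this deadline is first noticed.
        fireTimes.push_back((deadline + tickUs - 1) / tickUs * tickUs);
    }
    return fireTimes;
}
```

With a 50ms period the model fires at 62.5ms, 109.375ms, and so on, reaching exactly 250ms on the fifth callback, so individual deltas wobble while the average stays on target. With a 5ms period the first three callbacks all land on the 15.625ms tick, which is the back-to-back firing pattern.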
Just curious, is there no way to sync playback precisely with the monitor refresh and/or audio playback rate? It always bothers me that I need special hardware (AJA/Avid/DeckLink/etc) to do broadcast-quality video output, where the main stumbling block to doing it directly with commodity hardware is not bandwidth but frame and audio synchronization.
This also makes me cringe when I see people driving high-end home theater equipment from a PC - how can you guarantee frame-for-frame synchronization of software video playback with the refresh of the TV or projector?
(I seem to recall that SGI's old multimedia API had some neat features where you could wait on frame refresh using select() or poll() on a UNIX file descriptor... is there anything like this on PCs today?)
Dan Maas - 04 09 09 - 17:55
The situation is very bad if you are in windowed mode, especially in Windows XP. Essentially, the only way you can do refresh-synced display in windowed mode is to poll. There isn't even a command to wait for vertical blank. (Polling with 100% of the CPU is an egregious sin.) You also can't get a frame counter or queue anything to the vblank, so you also have to stall the graphics accelerator so you can time the blit with the CPU. Direct3D has vblank-synced presentation modes in windowed mode, but internally on XP, it just does a sleep/poll/blit loop.
One alternative is to use a video overlay, but there are a few drawbacks. First, they're usually only YCbCr -- bad if you want to display rendered graphics. Second, they're buggy as all across video hardware and drivers, and you get everything from corrupted displays to bad chroma upsampling to whacked levels. Third, there's generally no option to render as fast as possible, so you typically get throttled to refresh rate. That's bad if the video you happen to be showing is running slightly faster than refresh, because you get latency problems as your render loop is stalled.
Full screen mode is also better, because at least there you get hardware-based, v-synced page flipping. The main problem there tends to be latency, because you don't actually know the exact number of frames that are queued up. There used to be problems with video drivers that could queue up too many frames when the GPU got backed up, like five or more frames. Some programs resort to hard-stalling the GPU or using events to control latency.
The situation is theoretically better on Windows Vista with a WDDM driver. I say theoretically just because I haven't tested it. The main advantage is that with a WDDM driver, everything is redirected to a single 3D pipe, which solves a lot of the stalling problems you get with mixed 2D and 3D in windowed mode on XP. Direct3D 9.L also provides much better status and control over frame queueing, such as the ability to set a latency limit, to query frame timing statistics, to actually wait for vertical blank (although I would have preferred an event), and to explicitly tell the driver to discard queued frames to catch up. I believe you lose the ability to query the exact beam position if you are using D3D10, but you don't care anyway if a composited desktop is used. Windows 7 improves things again slightly by reintroducing hardware overlay support, bypassing performance problems when the desktop can't recomposite quickly enough, but improved over XP's overlays in that now you can drive the overlay from D3D instead of DDraw and with more precise specification of color spaces.
Don't get me started with TV out, where you can't even control fields reliably. Sucks on most Linux systems, too, if my MythTV box is any indication. I hear that NVIDIA's VDPAU API is a huge improvement. Wish they'd port it to Windows.
What would really be useful is a reliable way to do timestamp-based presentation. The only built-in way to do this right now is DirectShow through VMR9/EVR. That's (a) a pain in the butt to deal with if you aren't doing decoding through DirectShow, and (b) it has only marginal success at avoiding tearing, depending on your system. It also has some quite annoying levels problems due to the programmers' annoying insistence that people whack their monitor settings for best color output, despite it being a huge pain to deal with and totally screwing up display of anything other than video.
Probably what pains me the most is that I could probably write a reliable, easy-to-use timestamp based presentation API on my Atari 800XL....
Phaeron - 04 09 09 - 22:33
Which blog was this?
asf - 05 09 09 - 04:18
> Which blog was this?
I'm a bit reluctant to say since I don't want to beat up on MS folks too much, but it was from this post:
I was reminded of it by a comment on a newer post, and later found out amusingly I'd already read it and commented on the timeGetTime() vs. GetTickCount() error (on which he's clarified correctly in a recent followup).
Annoyingly, I just checked again, and the docs for timeSetEvent() are still screwed up like they were four years ago....
Phaeron - 05 09 09 - 05:58
Doesn't this help (Frame Refresh)?
asd - 05 09 09 - 07:52
That's a previously internal function that takes a kernel handle you're not likely to have. Good luck calling it with a Direct3D object.
Phaeron - 05 09 09 - 09:42
Phaeron, what do you think of HPET? My understanding is that there is no OS support in anything less than Vista, but can it be used directly? There is a driver.
GrofLuigi - 05 09 09 - 10:01
HPET is long overdue, but not all BIOSes have it enabled. I have no idea to what extent Vista takes advantage of it. I suppose I should try redoing these tests in Windows 7 RC.
Phaeron - 05 09 09 - 10:25
I think it is used for at least QPC/QPF.
Yuhong Bao - 05 09 09 - 13:50
DirectSound could be used as a clock itself, just by watching the play position of the buffer and converting it to some reference time. I have never measured its precision though, or whether it beats timeGetTime at a normal 48 kHz. In DirectShow, playback is synced to that if there is audio.
Gabest - 06 09 09 - 10:45
If I remember correctly, on Win9x and WinNT5 at least, DirectSound's clock is based on the sound card's internal clock - so as to provide as much of a gapless and scratchless playback as possible. However, these clocks are on many sound cards quite imprecise - they are not suitable for a reliable timer.
On Vista, since DirectSound is entirely emulated in software, preventing hardware access, I think it is based on the system clock. Whether it uses the HPET if present or not, I don't know.
Mitch 74 (link) - 06 09 09 - 19:45
I seem to recall hearing that even on some cards that should have audio clocks that the driver still reports position to 10ms accuracy anyway. It might have something to do with avoiding an expensive trip to kernel mode to read the DMA counter.
It's been a long time since I did SB16 programming, but I wonder whether there may have been a problem with getting a stable readout of the DMA pointer. I don't remember whether the 8237 latches both bytes together or whether you can get mixed LSB/MSB values like with the timer chip. And then, of course, there's the problem of buggy chipsets....
Phaeron - 06 09 09 - 21:21
Even X-Fi Titanium has a problem with the internal clock. When I enabled Dolby Digital Live, I noticed random skipping watching movies. I was forced to use ReClock instead of the default DirectSound filter. ReClock uses the video card clock instead of the sound card clock.
Mirage - 08 09 09 - 11:03
Just remembered another clock source I had in my machine. Live digital broadcast has PCRs in transport stream, this is from a satellite (converted to 100ns for directshow):
The clock is 27MHz, so the resolution is about 0.037 µs. Buffering and other data transformation delays may introduce some jitter, of course. I could only find two PCRs per buffer at most (the buffer size was about 64k), but that depends on how loaded the frequency is with programs.
It would be quite crazy to just use the hardware for the clock though :P
Gabest - 08 09 09 - 15:13
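For reference, the conversion from a transport stream PCR to 100ns units is simple integer arithmetic. The sketch below assumes the standard MPEG-2 layout of a 90kHz base plus a 27MHz extension in the range 0 to 299. The function names are made up for illustration and do not come from DirectShow.

```cpp
#include <cassert>
#include <cstdint>

// Combine the two PCR fields into a single 27 MHz tick count.
// (Hypothetical helper names for illustration only.)
int64_t pcrTo27MHz(int64_t base90kHz, int64_t extension) {
    assert(base90kHz >= 0 && extension >= 0 && extension < 300);
    return base90kHz * 300 + extension;  // 300 ticks of 27 MHz per 90 kHz tick
}

// Convert 27 MHz ticks to 100 ns units: 27 ticks per microsecond means
// 2.7 ticks per unit, so multiply by 10 and divide by 27 (truncating).
int64_t pcrTo100ns(int64_t pcr27MHz) {
    assert(pcr27MHz >= 0);
    return pcr27MHz * 10 / 27;
}
```

One second of PCR time (a base of 90000 with a zero extension) comes out as 10000000 units. A single 27MHz tick lasts about 37ns, which is below the 100ns unit and therefore truncates to zero.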
That's only useful for video timing, though. It's not very useful if you're trying to do local timing, like trying to time the playback!
Also, that assumes that the timestamps are correct. It was a joyous day when I discovered that one particular popular MPEG-1 encoder had a tendency to randomly encode GOP timestamps that were a full minute off....
Phaeron - 08 09 09 - 15:17
You're definitely right that CreateTimerQueue isn't good for multimedia - its guarantee is that the timer will delay *at least* as long as you ask, not necessarily that it will delay *exactly* as long. If you set a timer on thread A and thread B is running when it goes off, we're not going to preempt thread B to instantly service thread A's timer function.
The best time source you've got in Windows is QueryPerformanceCounter; as for a high-resolution, thread preempting timer that will guarantee your code is running every n milliseconds on the dime, not even kernel mode has this :-/ What you could (must!) do is live with CreateTimerQueue's inaccuracy, but correct for it via QPC.
Paul Betts - 08 09 09 - 18:16
Thanks for the information. The error in a single callback is understandable, but the gross cumulative error in periodic timers on XP is less defensible (and arguably should have been documented). I much prefer the behavior in Windows 7.
As for QPC, I've actually found that it's not a good idea to use QPC in production code, because there are too many machines on which it is unreliable: everything from its rate changing based on processor scaling to random discontinuities due to TSC mismatches between processor cores to periodic large jumps in value. As a result, I now use timeGetTime() for anything that needs to be reliable -- it's lower precision and more expensive, but you can generally rely on it to be stable.
Phaeron - 08 09 09 - 18:40
Is NtSetTimerResolution of no use?
Mark (link) - 08 09 09 - 21:13
Ditch sleep, and switch to using select() for timing. Surprisingly, it does have a better-than-millisecond resolution. When I was researching timing, this was the best I could find that worked consistently well. (I put the select()ing thread on higher priority, perhaps even realtime -- don't remember, it was 5 years ago).
Timers are for wimps - 09 09 09 - 00:09
Can you post your test code?
Kevin - 09 09 09 - 08:12
"As for QPC, I've actually found that it's not a good idea to use QPC in production code, because there are too many machines on which it is unreliable: everything from its rate changing based on processor scaling to random discontinuities due to TSC mismatches between processor cores to periodic large jumps in value. "
I argued this before, and well, the Windows HAL is supposed to find a reliable timer source. Linux does this as well, with its timer= options.
"If the HAL can't use RDTSC, what does it use instead? Well, as I said, it's up to the HAL to find something suitable. "
I also said in http://www.virtualdub.org/blog/pivot/ent..
when I was comparing it with RDTSC that :
"Also, at least the OS can fix the problem, while RDTSC is just an assembly instruction that the OS cannot do anything about. The /usepmtimer switch is a good example. It would not be able to fix program that use RDTSC, but it will fix programs that use QPC()."
Yuhong Bao - 09 09 09 - 18:20
It doesn't matter if the HAL is supposed to pick a reliable source if there are significant numbers of systems out there where it doesn't. What matters is that QPC() is broken on some systems. Telling users to make system-wide configuration changes to address the problem is a non-starter, because (a) it's highly user unfriendly, (b) it doesn't always fix the problem, and (c) not all users have such access. Try shipping a commercial product where you have an insert in the manual to put /usepmtimer in boot.ini.
Phaeron - 09 09 09 - 18:35
And it is unfortunately less flexible than under Linux, where there is the clocksource= options.
Yuhong bao - 10 09 09 - 08:38
BTW, I think that particular AMD dual-core problem where /usepmtimer was required was fixed in Server 2003 SP2, and Vista of course doesn't have this problem.
Yuhong bao - 10 09 09 - 08:47
"Try shipping a commercial product where you have an insert in the manual to put /usepmtimer in boot.ini."
Fortunately, only some systems need this switch, not all systems, and it is automatically added when you install the AMD K8 processor driver from AMD.
"Not all at the same time. It is like claiming Netscape 8 is insecure because it expose you to the bugs of both IE and Firefox."
Actually, I think Netscape 8 runs IE and Firefox in the same process, so a crash in either will crash the entire Netscape 8 process.
Yuhong bao - 10 09 09 - 08:54