§ FPO and a callee-pops parameter passing convention make perfect stack walks impossible

There's a bit of discussion over at Larry Osterman's blog about the Frame Pointer Omission (FPO) optimization in the Visual C++ compiler and how it affects stack walking, which I've been participating in. I figured I'd expound on it a bit more here.

The basic problem to be solved when doing a stack walk is finding the locations of return addresses in the stack, which are also the locations of the stack pointer upon entry to each function in the call stack. If you can somehow determine how much local data is present at each stack frame, you can maintain a virtual stack pointer and hop from stack frame to stack frame until the call stack is determined. On x86, the steps involved are generally as follows:

  1. Obtain the instruction pointer (EIP) and the stack pointer (ESP) of the thread.
  2. Look up the current virtual EIP in debugging information to determine the current function.
  3. Obtain the base of the stack frame, either by reading the frame pointer on the stack, or offsetting ESP if there is no frame pointer. This is now the new virtual ESP.
  4. Read the return address into the virtual EIP.
  5. Go to step 2.
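
As a rough illustration of the loop above, here is a minimal sketch that walks a simulated stack. Everything in it is made up for the example: FunctionInfo, ThreadState, WalkStack, and in particular the assumption that each function has one fixed ESP-to-return-address distance. Real debug information is considerably richer than this.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical per-function debug record for this sketch.
struct FunctionInfo {
    uint32_t start;        // first code address of the function
    uint32_t end;          // one past the last code address
    uint32_t espToReturn;  // bytes between the virtual ESP and the return address
};

// Simulated thread state: EIP, ESP, and a sparse image of stack memory.
struct ThreadState {
    uint32_t eip;
    uint32_t esp;
    std::map<uint32_t, uint32_t> stack;  // byte address -> 32-bit value
};

// Step 2: map an instruction pointer to the function containing it.
const FunctionInfo* Lookup(const std::vector<FunctionInfo>& fns, uint32_t eip) {
    for (const auto& f : fns)
        if (eip >= f.start && eip < f.end) return &f;
    return nullptr;
}

// Returns the start address of each function on the call stack, innermost first.
std::vector<uint32_t> WalkStack(const std::vector<FunctionInfo>& fns,
                                const ThreadState& t) {
    std::vector<uint32_t> callStack;
    uint32_t eip = t.eip;  // step 1: start from the thread's EIP and ESP
    uint32_t esp = t.esp;
    for (int depth = 0; depth < 64; ++depth) {  // arbitrary cap against runaway walks
        const FunctionInfo* f = Lookup(fns, eip);
        if (!f) break;                      // unknown code address: stop
        callStack.push_back(f->start);
        esp += f->espToReturn;              // step 3: virtual ESP now points at the return address
        auto slot = t.stack.find(esp);
        if (slot == t.stack.end()) break;   // ran off the recorded stack
        eip = slot->second;                 // step 4: return address becomes the virtual EIP
        esp += 4;                           // step over the return address slot
    }
    return callStack;
}
```

The whole walk hinges on espToReturn being right for every frame; a single wrong value sends every later step to the wrong stack slot.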

The trick is trying to determine the base of each stack frame. When EBP frame pointers are present, this is easy -- just keep following the saved frame pointers next to each return address. What's not so easy is the FPO case, where ESP is used directly, because the offset from ESP to the return address depends on how much local variable space is allocated, and how many parameters for called functions are present.

I claim that it is impossible to reliably stack walk in the general case with the __stdcall or thiscall calling convention and FPO involved -- even with full debugging information! And no, the code doesn't have to be that weird.

Consider the following function disassembly:

  00000000: 8B 01              mov         eax,dword ptr [ecx]
  00000002: 6A 02              push        2
  00000004: 6A 01              push        1
  00000006: FF D0              call        eax
  00000008: FF D0              call        eax
  0000000A: C3                 ret

What would be the appropriate debug information for this function? Ideally, you would want to encode an ESP-to-return-address offset for each instruction, so based on the instruction pointer you could unambiguously determine the offset from every possible instruction that could crash. In some cases, you wouldn't even need to encode this information, if you could walk the instruction stream and update a virtual ESP based on executed instructions. This is frequently possible with compiler-generated code, since the compiler uses well-defined and simple patterns to maintain the stack. This is often done with RISC CPUs that have very easy to parse instruction streams. It's also done on X64, with the help of restrictions on prolog/epilog code and unwind bytecode. X86, however, has neither of these advantages.

Let's say the second CALL instruction in the above code crashes, due to EAX=0 -- which would mean that the first function call returned a null pointer. What would be the proper offset to add to ESP to get to the return address? You can't tell from the called function, because the call is indirect and you don't know which function was called.

If you answered 0, you were wrong. If you answered 8, you were wrong. In fact, no matter what value you picked, you would be wrong.

Here's the source code that produced the above machine code, when compiled with Visual Studio 2005 SP1 at /O2:

typedef void *(__stdcall *StateFunction0)();
typedef void *(__stdcall *StateFunction1)(int a);
typedef void *(__stdcall *StateFunction2)(int a, int b);
struct IState { virtual void RunState() = 0; };
struct State0 : public IState { virtual void RunState(); StateFunction0 fn; };
struct State1 : public IState { virtual void RunState(); StateFunction1 fn; };
void State0::RunState() { ((StateFunction2)fn())(1, 2); }
void State1::RunState() { ((StateFunction1)fn(1))(2); }

(The unusual casting is due to C++'s inability to create a recursive typedef. Returning a function pointer as a void* is common when programming state machines, for this reason.)

You might say, aren't there two methods there? Yes, but they compile to the exact same code, and the Visual C++ linker will collapse two functions that have the same code even if they do completely different things. Essentially, the correct ESP offset at the second CALL instruction can be either +8 or +4, depending on whether State0::RunState() or State1::RunState() was executing. Both of these are implementations of the same virtual call on the same interface, so knowing the parent function doesn't help; the only way you could tell is by examining the type of this by checking the vtable pointer, and unfortunately after the first CALL instruction this is no longer available (ECX is a volatile register in the thiscall calling convention). I'm pretty sure that this is unsolvable in the general case except by knowing the entire execution history of the program up to this point.
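
To make the arithmetic concrete, here is a small sketch that models the stack effect of the shared code sequence (push 2 / push 1 / call eax / call eax / ret). The helper names are hypothetical, and the model assumes 4-byte arguments and a __stdcall callee that pops exactly its own arguments.

```cpp
#include <cassert>
#include <cstdint>

// Bytes a __stdcall callee removes from the stack on return, assuming
// every argument occupies 4 bytes. Hypothetical helper for this sketch.
constexpr uint32_t StdcallPopBytes(uint32_t argCount) {
    return 4u * argCount;
}

// Distance from ESP to the function's own return address at the moment the
// second CALL is about to execute, given how many arguments the first
// callee takes (and therefore pops).
constexpr uint32_t OffsetAtSecondCall(uint32_t firstCalleeArgCount) {
    uint32_t offset = 0;  // on entry, ESP points at the return address
    offset += 4;          // push 2
    offset += 4;          // push 1
    offset -= StdcallPopBytes(firstCalleeArgCount);  // first callee pops its args
    return offset;
}

// State0::RunState(): the first callee takes no arguments.
constexpr uint32_t kState0Offset = OffsetAtSecondCall(0);
// State1::RunState(): the first callee takes one argument.
constexpr uint32_t kState1Offset = OffsetAtSecondCall(1);
```

Under these assumptions kState0Offset comes out as 8 and kState1Offset as 4, even though both values describe the same instruction address. A table keyed only by EIP can hold one of them, not both.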

Moral of this story: Callee-pops calling conventions are absolutely evil with regard to accurate stack walks.


§ HTTRANSPARENT is evil

I spent part of last weekend tracking down an annoying problem in 1.7.2's video display code. One of my current obsessions is field display in Windows -- now that I have a very small and convenient video capture device, it annoys me that most programs in Windows still display video as if it were progressive, which leads to a poor quality live display. For some reason, DScaler has abnormally high latency with my USB 2.0 device, so it's back to rolling my own. I also want to make use of 3D hardware acceleration, because (a) it's extremely CPU intensive to fill a 1920x1200 display at 60fps, and (b) I'm lazy and it's easier to experiment with pixel shaders than highly optimized SSE2 code.

(As I've said in the past, nearly all features in VirtualDub are tied to some sort of video game or anime series. The non-interlaced field display code got me through Lunar 2. Interlaced field display is for Valkyrie Profile 2.)

Now, the problem with doing 60 fps field display with 3D acceleration is that with a 60Hz refresh rate, you must hit every frame exactly, or at least close enough that the glitches are more than several seconds apart. This is very difficult when you take into account the need to avoid tearing, by not switching frames/fields in the middle of the screen, and it is especially so in windowed mode. DirectX is lame and doesn't give you any sort of vertical blank event or interrupt -- well, actually, it's IBM's fault for reportedly making the VBI optional for VGA -- and so the only option is to poll. I tried just letting Direct3D do this with D3DPRESENT_INTERVAL_ONE, and not only did it do a poor job of avoiding the beam in windowed mode, but it burned up a lot of CPU time doing so and also blocked my message loop for unacceptable periods, which caused the latency on the DirectShow graph to skyrocket. So, I had to resort to another method.

What I ended up with was moving the entire display window to another thread, so that it could poll in peace at high priority. A persistent problem that kept cropping up here was the display thread taking 100% of the CPU, even though I had a MsgWaitForMultipleObjects() loop with a 1ms timeout. I tracked the problem down to that function constantly returning WAIT_OBJECT_0, meaning that a message was available, without there actually being one -- meaning that PeekMessage() was getting called in a tight loop. I hacked in a Sleep(1) as a temporary workaround, but then I had the weird problem of the UI becoming totally unresponsive even though the CPU was idle 80-90% of the time -- but still repainting. Even weirder, when I took the Sleep() out, VTune showed an abnormally high amount of time being spent in the kernel (ring 0) in functions like "win32k!xxxWindowHitTest."

It wasn't until I looked at the ReactOS and Wine source code that I discovered the culprit.

The problem was a WM_NCHITTEST handler I had put in to accommodate the cropping UI. The cropping UI needs mouse clicks to go through the display, so the display code returns HTTRANSPARENT so that all mouse input propagates to the parent window. There is a warning in MSDN saying that this only applies to windows within the same thread, and it turns out that returning HTTRANSPARENT when your parent is on a different thread is indeed a very bad idea. What happens is that Windows has problems determining which window "owns" the mouse message, and keeps bouncing it back and forth between the threads, resending WM_NCHITTEST to the transparent window each time. In Wine, this is apparently caused by a WindowFromPoint() call after the thread hop, which apparently doesn't return faithful results for transparent windows. Somehow in the real Windows this doesn't cause the threads to lock together, so the threads do idle, but the loop still blocks input messages, giving you a set of windows that repaints properly but doesn't respond to input. This also likely explains the phantom returns from MsgWaitForMultipleObjects(), probably caused by some sort of internal callback.
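
One way to guard against this, sketched below, is to return HTTRANSPARENT only when the window and its parent live on the same thread. This is not the fix described in this post (which was simply to remove the handler); ChooseHitTestResult is a hypothetical helper, and the non-Windows stand-in definitions exist only so the decision logic compiles elsewhere.

```cpp
#include <cassert>
#include <cstdint>

#ifdef _WIN32
#include <windows.h>
#else
// Stand-ins for the Win32 types and hit-test codes used below, so the
// sketch builds off Windows. The values match the Windows headers.
using LRESULT = std::intptr_t;
using DWORD = std::uint32_t;
constexpr LRESULT HTTRANSPARENT = -1;
constexpr LRESULT HTCLIENT = 1;
#endif

// Hypothetical helper: pick the WM_NCHITTEST result for a child window,
// given the IDs of the threads that own the child and its parent.
inline LRESULT ChooseHitTestResult(DWORD childThreadId, DWORD parentThreadId) {
    // HTTRANSPARENT is only meaningful between windows on the same thread.
    // Across threads, claim the hit ourselves instead of bouncing it.
    return childThreadId == parentThreadId ? HTTRANSPARENT : HTCLIENT;
}
```

In a real window procedure the two IDs could come from GetCurrentThreadId() and GetWindowThreadProcessId(GetParent(hwnd), nullptr). Note that in the cross-thread case the window now receives the mouse input itself, so anything the parent needs would have to be forwarded by hand.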

Removing the WM_NCHITTEST handler gave silky smooth 60Hz video, which freed me to solve some evil jumping puzzles in VP2. :)

The next problem I have to solve is trying to come up with a pixel shader that does better than bicubic interpolation with motion-detection-based weave/bob switching and gamma correction, but that's less enigmatic, at least.


§ Is it too much to ask to have ONE good image display API in Windows?

Lately, I've been becoming increasingly frustrated with how difficult it is in Windows to reliably and efficiently blit an image to the screen with high quality. It shouldn't be that hard, but it is, because there are half a dozen different ways to do so and none of them meet all of the requirements. So I sat down and made a table of all of the ways to blit an image to the screen in Windows, and how they all suck in some fashion.

VirtualDub has code paths for GDI, DirectDraw blit, DirectDraw overlay, Direct3D, and OpenGL. GDI+ is here because it looks like a good API, until you discover that it has no useful hardware acceleration, has incorrect subpixel positioning for image blits, and is no longer being evolved. I put WPF (Avalon) here because I looked into it as a possible alternative when operating under DWM composition / Aero Glass on Vista, which is problematic since neither GDI nor DirectDraw are accelerated, and Direct3D in child windows seems very flaky. The huge problem with WPF is that it requires .NET managed code, since the API is in .NET and the underlying MIL API isn't documented (grumble); another problem is that it seems unusually slow and flickers a lot whenever windows are resized.

Anyway, the table of image blitting woe:

| | GDI | GDI+ | DirectDraw (blit) | DirectDraw (overlay) | Direct3D | OpenGL | WPF (Avalon) |
|---|---|---|---|---|---|---|---|
| Platform support | 95+, NT3.1+ | 98+, NT4+ [1] | 98+, NT4+ [2] | 98+, NT4+ [2] | 98+, NT4+ [2] | driver | XP+ [12] |
| Requires managed code | no | no | no | no | no | no | yes |
| Hardware accel w/o 3D HW | yes | no | yes | yes | no | no | no |
| Hardware accel with 3D HW | yes | no | yes | yes | yes | yes | yes |
| Software fallback | yes | yes | yes | no | yes [3] | yes [4] | yes [3] |
| Works with DWM composition | sw | sw | sw | no [5] | yes | yes | yes |
| Bilinear filtering | sw [13] | sw | yes [6] | yes [7] | yes | yes | yes |
| Bicubic filtering | no | sw | no | no | yes [8] | yes [8] | sw |
| Terminal Services | sw | sw | sw | no | no | no | sw [9] |
| Supports 256 color display | yes | yes | yes | yes | no | no | ? |
| RGB format conversion | yes | sw | no | no [10] | yes | yes | yes |
| YCbCr format conversion | no | no | no | yes | yes | yes | no |
| Beam detection | no | no | yes | yes | yes | no | no |
| Beam avoidance (vsync) | no [11] | no [11] | yes | yes | yes | yes | no [11] |

Explanations:

Notes:

  1. Requires redistributable prior to Windows XP.
  2. Requires redistributable for Windows 95.
  3. With RGBRast. (Refrast is not counted as it requires the SDK and is excruciatingly slow.)
  4. Microsoft's OpenGL 1.1 software implementation is available, but it is very slow.
  5. Not supported. Overlay creation succeeds, but the overlay never shows up.
  6. DirectDraw blits are point-sampled when DWM composition (Aero Glass) is active. Otherwise, filtering is up to the driver.
  7. Varies widely; some drivers don't interpolate vertically, and some only interpolate chroma.
  8. Requires custom implementation.
  9. Can be hardware accelerated between two Vista-based systems using Avalon Remoting.
  10. RGB overlays are possible, but I've never seen hardware that supported it.
  11. Automatic if DWM composition (Aero Glass) is enabled.
  12. Requires redistributable.
  13. Requires Windows NT; quite slow.