§ Dealing with crash reports

In an ideal world -- or, rather, an ideal world from the programmer's standpoint -- software is written, polished, completed, shoved out the door, and that's that. In the real world, it doesn't work that way. Software programs are now too large and complex to be pushed out reasonably bug-free the first time, at least for the waiting time and cost that consumers are willing to put up with. Even in the world of commercial packaged software, where releasing updates is an expensive process given the need to actually find and pull people who still know the code and regression test the software, you're still generally expected to release one patch. But how do you know what to patch?

Crashes are finicky. If you tell people to simply report crashes to you, you're going to get a lot of worthless reports, guaranteed. This doesn't mean your users are stupid or unmotivated, because they largely aren't, but most aren't trained as to what they should look for and report in a software failure, much less for your software. So the best thing you can do to improve your crash diagnostic capabilities is to rig your software to report what you want to know, and that can be done by having it write out a crash report. A crash report is a quick dump of the state of the program and the program's environment at the time of the failure. Since it is done programmatically, a lot of reporting bias is removed, and even if the report itself isn't useful, you can match it to other reports to obtain clues as to what is causing failures in your application.

That leaves the problem of what to put in the crash report.

I'll assume that you don't have one of the easy methods available, such as trucking the person and his machine in next to you (not generally feasible unless you're dealing with in-house QA), and taking a full memory image of the entire process, which takes forever to upload and wastes tremendous amounts of storage and bandwidth. What we're looking for here are items that (a) speed up identification of the most common failures, (b) are space efficient, and (c) are fairly easy to implement. You don't need a truckload of information in your crash report for it to be effective in helping reduce defects in subsequent releases.

What to put in the report

First, you need basic identification information: program name and version/build. You need enough information to pinpoint the exact build that failed, down to the executable with the exact checksum. If you can't determine the exact build, then it's going to be annoying to narrow down the possible changes causing a regression and even more annoying to do any sort of disassembly-based sleuthing. Ideally, this is fine-grained enough to also distinguish internal or local builds, so you can quickly throw out any reports that were inadvertently generated by pre-release or otherwise non-official builds.
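As a rough sketch of what such an identification line could look like, the code below checksums an executable image with a standard CRC-32 and formats it next to the version and build number. The BuildInfo struct, its field names, and the output layout are all hypothetical; any checksum that uniquely identifies the binary would do the same job.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Standard CRC-32 (reflected polynomial 0xEDB88320), computed bit by bit.
// Slow but table-free, which keeps the sketch short.
std::uint32_t Crc32(const std::uint8_t* data, std::size_t size) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < size; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

// Hypothetical build description; the field names are illustrative only.
struct BuildInfo {
    std::string program;
    std::string version;
    unsigned buildNumber;
    bool official;  // false for internal or local builds
};

// Produces the first line of a report, e.g.
// "MyApp 1.2.3 build 4567 [official] crc=CBF43926".
std::string FormatBuildLine(const BuildInfo& info, std::uint32_t imageCrc) {
    char tail[64];
    std::snprintf(tail, sizeof tail, " build %u [%s] crc=%08X",
                  info.buildNumber, info.official ? "official" : "internal",
                  static_cast<unsigned>(imageCrc));
    return info.program + " " + info.version + tail;
}
```

With the official/internal flag on the first line, reports from stray local builds can be filtered out before anyone looks at them.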

You should then also dump the type of failure. Access Violation is the most common type, but you should remember to dump the address and read/write flag so you can determine if it was a null pointer dereference. Other types that will pop up are: unhandled C++ exception, privilege violation (caused by an unaligned SSE access), stack overflow, not implemented (hitting the guard page of another thread's stack), and illegal instruction.
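To illustrate why the address and read/write flag matter, here is a minimal sketch that turns them into a one-line description and flags probable null pointer dereferences. The 64 KB cutoff is an assumption, chosen because Win32 keeps the lowest 64 KB of the address space unmapped; the function and type names are made up for this example.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

enum class AccessKind { Read, Write };

// Addresses below this are treated as "null pointer plus a small offset".
// The 64 KB figure is an assumption for this sketch.
constexpr std::uint32_t kNullRegionLimit = 0x00010000;

bool LooksLikeNullDereference(std::uint32_t address) {
    return address < kNullRegionLimit;
}

// Formats the failure line from the faulting address and read/write flag.
std::string DescribeAccessViolation(std::uint32_t address, AccessKind kind) {
    char buf[128];
    std::snprintf(buf, sizeof buf, "Access violation %s %08X%s",
                  kind == AccessKind::Write ? "writing" : "reading",
                  static_cast<unsigned>(address),
                  LooksLikeNullDereference(address)
                      ? " (likely null pointer dereference)" : "");
    return buf;
}
```

A write to 00000014, for example, is almost certainly a store to a member at offset 0x14 through a null object pointer.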

The next most important piece of information is the execution point of failure. Most failures are caused by stupid coding errors, and just knowing which function failed is often enough to spot the bug in the source. Therefore, you should at the very least dump the instruction pointer (EIP in x86), as well as your module's base address if you are in a DLL; if you have symbol names in the dump, or a post-process utility that quickly adds them to the report, you can speed up initial bug report triage. Line number helps even more, but don't throw away EIP in case the line number information is inaccurate.
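The post-process step mentioned above can be sketched as follows: subtract the module base from EIP to get a module-relative offset, then find the nearest preceding symbol. The Symbol table here is hypothetical; in practice it might be parsed from a linker MAP file, and it is assumed to be sorted by address.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// One entry from a hypothetical symbol table. Offsets are relative to the
// module's load address.
struct Symbol {
    std::uint32_t rva;
    std::string name;
};

struct ResolvedAddress {
    std::string name;      // empty if no symbol precedes the address
    std::uint32_t offset;  // distance from the start of that symbol
};

// `symbols` must be sorted by ascending rva. `moduleBase` is the address the
// module was actually loaded at, as recorded in the crash report.
ResolvedAddress Resolve(std::uint32_t eip, std::uint32_t moduleBase,
                        const std::vector<Symbol>& symbols) {
    if (eip < moduleBase)
        return {std::string(), 0};
    const std::uint32_t rva = eip - moduleBase;
    // Find the first symbol that starts after the address, then step back.
    auto it = std::upper_bound(
        symbols.begin(), symbols.end(), rva,
        [](std::uint32_t value, const Symbol& s) { return value < s.rva; });
    if (it == symbols.begin())
        return {std::string(), rva};
    --it;
    return {it->name, rva - it->rva};
}
```

Because the lookup works on offsets rather than absolute addresses, it still gives the right answer when a DLL was loaded somewhere other than its preferred base.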

A stack trace is also important: for the failures where the IP alone isn't enough, knowing the next few calls will often do the trick. Now, if you are on Win32, don't make the newbie mistake of relying on the DbgHelp StackWalk() function. It is not reliable. Win32 on x86 is such a squirrelly execution environment that it is not generally possible to determine the call stack by the disassembly alone, due to the presence of __stdcall and thiscall calling conventions, which rely on the called function to remove arguments... with the problem being that you can't determine the called function statically due to indirect calls. Even worse, the Visual C++ compiler doesn't generate nearly enough information to unwind the stack reliably from all failure points in optimized code, and in foreign code you will have no debug information at all. If you rely on StackWalk() for your stack, you have a high probability of missing critical calls near the point of failure, and even worse, not dumping enough information to manually reconstruct the correct stack. An EBP-based frame crawl is even more worthless considering that any good optimizer is going to omit the frame pointer. So what should you do?

Bite the bullet and dump the raw stack. Not only will this allow you to manually reconstruct the stack if required, you can also extract valuable parameters and local variables if you spend enough time with the disassembly. You don't need much; a few hundred bytes is frequently enough. I should note that VirtualDub didn't do this originally, and still dumps only a small portion around ESP. One reason for this is space, and another reason is that I need a crash report that is easily scannable without any additional tools. What it does do, however, is attempt to produce a call stack that doesn't lie: it scans the stack and identifies DWORDs that point to potential call sites -- executable memory with preceding data that looks like a CALL instruction. This produces some false positives, but never any false negatives, so the reported call stack is always a superset of the correct stack. More importantly, it can take advantage of information that isn't available to me, namely the instruction data of the other modules within the process, and the names and locations of DLL exports.
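The stack scan described above can be approximated like this. It is a simplified sketch rather than VirtualDub's actual code: executable memory is simulated by a CodeRegion struct (a stand-in for querying the real process), and only the two common near CALL encodings are recognized, E8 rel32 and the FF /2 indirect forms.

```cpp
#include <cstdint>
#include <vector>

// A simulated block of executable memory: `bytes` is mapped at `base`.
struct CodeRegion {
    std::uint32_t base;
    std::vector<std::uint8_t> bytes;

    bool Contains(std::uint32_t addr) const {
        return addr >= base && addr - base < bytes.size();
    }
    std::uint8_t At(std::uint32_t addr) const { return bytes[addr - base]; }
};

// True if the bytes just before `ret` could be a CALL instruction that
// would have pushed `ret` as its return address.
bool LooksLikeReturnAddress(const CodeRegion& code, std::uint32_t ret) {
    if (!code.Contains(ret))
        return false;
    const std::uint32_t avail = ret - code.base;  // bytes preceding ret

    // E8 rel32: direct near call, always 5 bytes long.
    if (avail >= 5 && code.At(ret - 5) == 0xE8)
        return true;

    // FF /2: indirect near call; the ModRM reg field (bits 5..3) is 2.
    // Plausible encoded lengths are 2, 3, 4, 6 and 7 bytes. The ModRM byte
    // is not checked against the length, so this errs toward false positives.
    static const std::uint32_t kLengths[] = {2, 3, 4, 6, 7};
    for (std::uint32_t len : kLengths) {
        if (avail >= len && code.At(ret - len) == 0xFF &&
            (code.At(ret - len + 1) & 0x38) == 0x10)
            return true;
    }
    return false;
}

// Scans raw stack DWORDs (lowest address first) and returns the candidates.
std::vector<std::uint32_t> ScanStack(const CodeRegion& code,
                                     const std::vector<std::uint32_t>& stack) {
    std::vector<std::uint32_t> hits;
    for (std::uint32_t value : stack)
        if (LooksLikeReturnAddress(code, value))
            hits.push_back(value);
    return hits;
}
```

Stale return addresses left behind by earlier calls will show up as extra entries, which is the price of not trusting frame pointers or unwind data.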

A disassembly, or instruction bytes that can be used to form a disassembly, can be useful, especially in conjunction with a register dump. It's not going to be useful except for the hard cases that require machine code level grunt work, though, and requires an experienced programmer with good knowledge of compiler code generation and assembly language. It is quite useful, however, if you are dealing with a crash that is caused by interaction with a third party module, for which you likely don't have the source code or even the binary. It's also useful in that it can help identify the code involved if you don't have the symbols for the build that failed -- an unchanged routine tends to compile to the same object code even if it has moved in location. The sticking point here is that good x86 disassemblers are hard to find and harder to implement. If you think they're easy to implement, I should remind you that some opcodes change mnemonics depending on prefix (JCXZ/JECXZ and MOVSB/D/W/Q), some are aliases (NOP is actually XCHG EAX, EAX special-cased), some old prefixes have been repurposed (many SSE instructions overload the REP and REPNE prefixes, and SSE2 overloads the 66h size override), and Intel just added three-byte opcodes with the SSSE3 instructions in the Core 2 Duo. I gave up trying to find nice patterns in the x86 decoding mess and just implemented a full-blown pattern matching engine for VirtualDub's disassembler. If you're looking into doing this and don't have a disassembler already, I recommend just dumping out raw bytes and hacking up a tool to abuse DUMPBIN /DISASM on your end.
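For the raw-bytes route, something as small as the following would do. This is only a sketch: the line format, including the '>' marker on the byte at EIP, is invented for this example, and the caller is assumed to have already copied the bytes safely out of the crashed code.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Formats `bytes`, which were copied from memory starting at `start`, as one
// line of hex. A '>' marks the byte located at `eip`.
std::string FormatCodeBytes(std::uint32_t start,
                            const std::vector<std::uint8_t>& bytes,
                            std::uint32_t eip) {
    char buf[16];
    std::snprintf(buf, sizeof buf, "%08X:", static_cast<unsigned>(start));
    std::string line = buf;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        const bool atEip = (start + static_cast<std::uint32_t>(i) == eip);
        std::snprintf(buf, sizeof buf, "%c%02X", atEip ? '>' : ' ',
                      static_cast<unsigned>(bytes[i]));
        line += buf;
    }
    return line;
}
```

Including a few bytes before EIP as well as after makes it easier to resynchronize the disassembly later, since x86 instructions are variable length.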

You should also consider a module (DLL) list for several reasons: identifying DLL version mismatches; spotting intrusive third-party applications that are known to interfere, particularly "window skinning" or applications that otherwise have global hooks; and identifying mystery code addresses within the report. The catch here is that the module list can be quite big.
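Identifying mystery addresses from a module list amounts to a range lookup, as in this sketch. The Module struct and the "name+offset" output format are assumptions for illustration; a real handler would fill the list from whatever module enumeration it has available.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// One line of a hypothetical module list from the report.
struct Module {
    std::string name;
    std::uint32_t base;
    std::uint32_t size;
};

// Returns "module+offset" if the address falls inside a listed module,
// otherwise the bare address in hex.
std::string DescribeAddress(std::uint32_t addr,
                            const std::vector<Module>& modules) {
    char buf[16];
    for (const Module& m : modules) {
        if (addr >= m.base && addr - m.base < m.size) {
            std::snprintf(buf, sizeof buf, "+%X",
                          static_cast<unsigned>(addr - m.base));
            return m.name + buf;
        }
    }
    std::snprintf(buf, sizeof buf, "%08X", static_cast<unsigned>(addr));
    return buf;
}
```

An address that resolves into an unfamiliar DLL is often the first hint that a hook or skinning application is involved.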

Don't forget to dump the version of Windows that the program was running on. If you have optimized code paths, consider dumping the CPU type, and if you are multithreaded, the number of logical hardware threads.

Finally, you might want to dump a machine identifier, like the computer name, so you can tell if it's a particular machine that is giving you grief. Bad memory can and does pop up in the wild, given that the amount of memory that machines have has gone way up, but error rates haven't decreased to compensate. Dumping a machine identifier in a public scenario may have privacy implications, though, even if the identifier is hashed.

The nitty-gritty details

Don't dump thread and process IDs unless you actually write a table of them somewhere or have other TID/PID values to compare against. Otherwise, they're useless, because they change between every run. What exactly is thread FFFFFF9C? The same goes for handle values, unless you're dumping them to determine if they're null or corrupted.

Segment registers are absolutely worthless on Windows NT, because they never change. The kernel changes the descriptors behind the selectors instead. I believe they can change on Windows 9x, but their values are still worthless.

ESP is useful, even if you have nothing to compare it against. Why? Because on Windows NT/2000/XP, the thread stack for an application with default linker settings always grows down from 00130000 to 00030000. From the value alone you can determine if it was the main thread that crashed, and whether deep recursion (or otherwise high stack usage) was occurring.
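Reading ESP this way is simple arithmetic. The sketch below hard-codes the 00130000 and 00030000 bounds quoted above, so it applies only to that configuration; the struct and function names are made up for the example.

```cpp
#include <cstdint>

// Main-thread stack range quoted in the text for Windows NT/2000/XP with
// default linker settings. Other systems or settings will differ.
constexpr std::uint32_t kMainStackTop = 0x00130000;
constexpr std::uint32_t kMainStackLimit = 0x00030000;

struct StackEstimate {
    bool mainThread;          // ESP lies inside the main thread's range
    std::uint32_t bytesUsed;  // distance from the top of the stack, else 0
};

StackEstimate EstimateStackUse(std::uint32_t esp) {
    if (esp <= kMainStackLimit || esp > kMainStackTop)
        return {false, 0};
    return {true, kMainStackTop - esp};
}
```

An ESP of 0012FE00 means 512 bytes in use on the main thread, while 00031000 means all but 4 KB of the 1 MB reservation is gone, which points at runaway recursion.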

You might as well dump all of EAX/EBX/ECX/EDX/ESI/EDI/EBP too. In a C++ method compiled by VC++, the this pointer is in ECX on method entry, and optimized code will often move it into EBX or ESI.

Floating point, MMX, and SSE registers aren't likely to be useful. You may want to consider dumping the FPU control and tag words, though. The FPU control word will help you determine if a floating-point exception was caused by an external module mucking with the thread's floating-point exception mode, as the Borland CRT is apt to do; the tag word will quickly indicate if a crash may have been caused by a missing EMMS/FEMMS instruction in MMX code.
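Both checks are a few bit tests, sketched below. The control word test relies on the six exception mask bits being bits 0 through 5. The tag word test assumes the full 16-bit form (two bits per register, 11 meaning empty), not the abridged 8-bit form that FXSAVE stores, and treats "no register is empty" as a hint of MMX state left behind. That last rule is a heuristic of this sketch, since x87 code with a completely full stack would look the same.

```cpp
#include <cstdint>

// x87 control word bits 0..5 mask the invalid, denormal, zero-divide,
// overflow, underflow and precision exceptions. A clear bit means that
// exception is unmasked and will trap.
constexpr std::uint16_t kFpuExceptionMasks = 0x003F;

bool HasUnmaskedFpuExceptions(std::uint16_t controlWord) {
    return (controlWord & kFpuExceptionMasks) != kFpuExceptionMasks;
}

// Counts registers tagged empty (11b) in a full 16-bit x87 tag word.
int CountEmptyFpuRegisters(std::uint16_t tagWord) {
    int empty = 0;
    for (int reg = 0; reg < 8; ++reg)
        if (((tagWord >> (reg * 2)) & 3) == 3)
            ++empty;
    return empty;
}

// Heuristic: MMX instructions mark all eight registers as in use and only
// EMMS/FEMMS marks them empty again, so "nothing empty" is suspicious.
bool MayBeMissingEmms(std::uint16_t tagWord) {
    return CountEmptyFpuRegisters(tagWord) == 0;
}
```

A control word of 027F, for instance, has all six exceptions masked, so a floating-point exception trap with that value in the report would be surprising.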

In code that makes heavy use of exception handling, dumping the FS:[0] SEH chain could theoretically be useful because it's one of the elements of the call stack that Doesn't Lie(tm). I don't think I've ever had enough nested scopes to make it worthwhile, however.
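If you do want to dump the chain, the walk is short. On 32-bit Windows each registration record is a pair of DWORDs on the stack, a pointer to the next record followed by the handler address, and the chain ends with a next pointer of FFFFFFFF. The sketch below walks a simulated stack image (the StackImage struct is a stand-in for reading the real stack) and stops at the first record that is misaligned, outside the image, or not at a higher address than the previous one, since a crashed process may have a corrupted chain. The 32-record cap is an arbitrary choice for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// A copy of part of the thread's stack: dwords[0] lives at address `base`.
struct StackImage {
    std::uint32_t base;
    std::vector<std::uint32_t> dwords;

    bool Read(std::uint32_t addr, std::uint32_t& out) const {
        if (addr < base || (addr & 3) != 0)
            return false;
        const std::uint32_t index = (addr - base) / 4;
        if (index >= dwords.size())
            return false;
        out = dwords[index];
        return true;
    }
};

// Returns the handler address of each record reachable from `head`.
std::vector<std::uint32_t> WalkSehChain(const StackImage& stack,
                                        std::uint32_t head,
                                        std::size_t maxRecords = 32) {
    std::vector<std::uint32_t> handlers;
    std::uint32_t rec = head;
    std::uint32_t prev = 0;
    while (rec != 0xFFFFFFFFu && handlers.size() < maxRecords) {
        std::uint32_t next = 0, handler = 0;
        if (rec <= prev)  // records must move toward higher addresses
            break;
        if (!stack.Read(rec, next) || !stack.Read(rec + 4, handler))
            break;
        handlers.push_back(handler);
        prev = rec;
        rec = next;
    }
    return handlers;
}
```

Each handler address can then be fed through the same symbol or module lookup as any other code address in the report.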

Application-specific data can be extremely helpful in debugging, but be careful: the greatest sin you can commit here is to crash again in the crash handler. Limit the data structures that you crawl, protect the code in exception handlers, and dump the app-specific info last so the rest of the report survives regardless. Don't forget to flush any I/O write buffers in the process.

Comments

Comments posted:


Just write a minidump-with-heap instead. Then to read it, you can use a real debugger (e.g. Visual Studio) which will load symbols, walk stacks, enum threads etc. For details see http://www.codeproject.com/debug/postmor..

Andy Pennell (link) - 12 10 06 - 12:30


I've used VD's crash handler in StepMania (@ SF), with an optional mode to allow automated crash reporting (dumps over HTTP). A mailing list shows me submitted crash reports, with the product name, revision and crash reason ("Access Violation", failed assertion condition, etc) in the subject. You'll want to start a new process to do this, though; don't try something as complicated as internet traffic under crash conditions. I re-execute the program itself with a special argument, and send crash information over a pipe. A lot of people will happily press "submit this automatically" where they wouldn't spend the time to send an email; they're anonymous, so I can't ask for more information, but the added info is worth it (and no matter how much you say "include this crash info with your email", people will ignore it and you'll have to spend time asking them for it again, and they'll inevitably paste it in some long-line-mangling fashion).

One section of those crash reports is a concise dump of hardware information, compiled by the engine as it starts up, and cumulatively dumped into a static buffer, which is checked for sanity by the crash handler before use. I try to keep this concise and easily skimmable.

Name your threads. I think VD's crash handler does this. It tells me the name of the thread that crashed ("Mixer thread", "Main thread"). I have a mechanism for naming threads I didn't create myself (including the main thread).

It's possible to dump a stack trace of all threads, but that becomes very long and I've never needed it. I've had mutexes that can sort of detect deadlocks (with a timeout), and made the crash reporter dump stack traces of both threads. I also name all mutexes (and other synchronization objects).

I like receiving readable crash dumps, which minidumps don't provide. I also use similar crash handlers for Linux and OSX.

Glenn Maynard - 12 10 06 - 19:04


Andy:
You do need at least VS.NET 2002 to open minidumps, and they have a few disadvantages. They're binary, which is a problem if you don't have an automated send mechanism. They're also generally bigger than a text report (~30-50K). Probably the biggest problem is that they're almost impossible to use without the modules in question -- if for some reason you don't have both the original EXE and PDB, you're stuffed. On a large project, the symbols can be much larger than the MAP file, which makes it difficult to archive the necessary information for a large number of internal builds. It is true that if you are set up to decode a minidump, though, you can dig quite a bit more information out than you might be able to do with a text report.

I'd also like to point out that Visual Studio's stack crawl is not very good for an optimized build -- it is frequently incorrect and omits vital entries or even shows false ones. The VC++ team has stated on their blog that for Orcas they are not considering any stack walk bugs that cannot be solved with /Oy-, which as far as I am concerned is a non-shipping configuration. I need a usable stack in an LTCG build. There have been many times where I have given up on VS's call stack and just attached WinDbg to do a brute force "dds esp L100" instead.

Glenn:
You've taken this a lot farther than I have, and I really like the idea of relaunching the executable. I was planning on just freezing the first one and having the second one build the report from scratch with ReadProcessMemory(), though. Besides time, one reason I haven't implemented such a system myself is the fear of slashdotting my web site. :)

Phaeron - 12 10 06 - 23:25
