§ ¶Avoiding burnout on the Internet
I've been a longtime reader and occasional commenter on Raymond Chen's blog, The Old New Thing. Raymond's an old-time programmer on the Windows team and has a lot of good experience and advice to share... but lately he's been getting increasingly frustrated with the comments on his blog, to the point that he's actually begun pre-emptively attacking people by name. Raymond freely admits to having the social skills of a thermonuclear device, but given the buildup I've been seeing over the past couple of weeks, starting with the Nitpicker's Corner, it seems to me that the fellow's getting a bit too close to blowing up. Sure, some of the comments on his blog are annoying or incendiary, but perhaps he should disable comments or take a break for a while.
One of the things I've learned about being on the Internet is that it's a really, really big place... global, in fact. That means you're really insignificant, and it's easy to be bowled over by the magnitude of it all -- especially if you attract a lot of attention, as Raymond has due to his skill and writing style. I've had to deal with this too in some ways, due to the popularity of VirtualDub. In order to avoid getting blown out myself, I've adopted some rules:
(Read more....)
§ ¶Troubles with _mm_loadl_epi64()
Alright, who was the dork who designed this SSE2 compiler intrinsic:
__m128i _mm_loadl_epi64(__m128i const *p);
What does this intrinsic do? It loads a 64-bit integer value from memory and stores it into the low 64 bits of an XMM register, zeroing the upper 64 bits. It's the compiler intrinsic version of the MOVQ instruction. MOVQ is fairly important for image processing routines in SSE2 for a couple of reasons: 64 bits is a very convenient amount of data to process at a time, since eight 8-bit samples can be loaded and expanded to 16-bit words in a 128-bit register, and unlike 128-bit memory accesses, 64-bit memory accesses are allowed to be misaligned.
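As a rough sketch of that load-and-expand pattern (the helper names LoadSamplesAsWords and WidenSamples are made up for illustration, and this is not the IDCT code discussed here), the intrinsic is typically paired with an unpack against zero:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstdint>

// Hypothetical helper: load eight 8-bit samples and widen them to eight
// 16-bit words in a single XMM register.
inline __m128i LoadSamplesAsWords(const std::uint8_t *src) {
    assert(src != nullptr);

    // MOVQ: fills the low 64 bits from memory and zeroes the high 64 bits.
    // The cast exists only to satisfy the intrinsic's __m128i pointer type;
    // as documented, just 8 bytes are read.
    __m128i bytes = _mm_loadl_epi64(reinterpret_cast<const __m128i *>(src));

    // PUNPCKLBW against zero: interleaves each low byte with a zero byte,
    // which zero-extends the eight samples to 16-bit words.
    return _mm_unpacklo_epi8(bytes, _mm_setzero_si128());
}

// Convenience wrapper that writes the widened samples to ordinary memory
// using an unaligned 128-bit store.
inline void WidenSamples(const std::uint8_t *src, std::uint16_t *dst) {
    _mm_storeu_si128(reinterpret_cast<__m128i *>(dst), LoadSamplesAsWords(src));
}
```

The awkward part is visible in the cast: the source is really a pointer to eight bytes, but the intrinsic's signature asks for a pointer to a full 128-bit vector.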
Anyway, I ran into this while porting my old AP-922 based IDCT routine to intrinsics in order to recompute the constants according to a tip I'd found in a whitepaper (folding column rounding into row pass, genetic algorithm to tune... don't ask). I figured, hey... maybe I'll try intrinsics again... couldn't hurt, right? Visual C++ tends not to do well with MMX intrinsics, i.e. it misgenerates code, so I first emulated the MMX instructions with scalar code. When that worked, I tried rewriting the wrappers with SSE2 for speed.
Only to have the routine utterly and completely blow up.
(Read more....)
§ ¶VirtualDub 1.8.0 Released
VirtualDub 1.8.0 is out -- this is a new experimental release that contains many changes I've been working on in the background for months. As this is an experimental release, it is recommended that you stick with 1.7.8 for production use. However, any feedback on changes in 1.8.0 is appreciated and will be used as the 1.8.x branch eventually becomes the new stable branch.
The main big change in 1.8.0 is enhanced audio support, including:
- Support for reading and writing VBR audio with correct sync.
- Input plugins can expose true VBR audio.
- Support for multiplexing in raw MP3 tracks.
- Support for selecting the source audio track when multiple audio tracks are present.
- Built-in audio decoding support for µ-law, A-law, MP2, and MP3 audio.
The VBR warning is still displayed by default, although it can be disabled in Preferences; turns out, some people were using it to detect files that were unlikely to play properly on their hardware players.
The video filter subsystem has also been overhauled for 1.8.0. A side effect of the changes is that some video filters -- in particular, those that use GDI to draw on video frames -- may run slightly slower. However, there are other changes which can allow the filter chain to run much faster as well. The changes:
- Video filters can now increase or decrease the frame rate. A frame rate doubling filter (bob doubler) has been added to exercise this capability.
- Individual filter entries can be temporarily enabled or disabled via checkmark.
- The filter chain can now run directly with YCbCr formats. Existing video filters are still supported via implicit conversion; new video filters can choose to support any subset of the available formats.
- Cropping is supported in YCbCr. The pipeline will, by default, convert YCbCr video to a format with higher chroma resolution if necessary to do the crop -- for instance, attempting to crop to odd pixel boundaries will force a conversion from 4:2:0 to 4:4:4. There is an option in the crop dialog to snap the crop boundaries instead of forcing a conversion.
- The "resize" video filter can now run directly in all supported YCbCr formats. In some cases, this can be significantly faster, as much as 50% faster when this also allows conversions to and from RGB to be omitted.
- The pipeline will automatically convert video formats as necessary. For instance, it is possible to run most of the filter chain in 4:2:0 and convert to 32-bit RGB later to run an existing video filter. The filter dialog indicates where conversions are taking place in the filter chain and can also display the exact formats involved.
- A "convert format" video filter has been added to force conversion to a specific format at a specific point in the filter chain.
- Capture mode has also been enhanced to take advantage of YCbCr filtering. The end conversion to 24-bit RGB is now optional, which means that YCbCr data can be filtered and fed directly to the video codec for enhanced speed.
Video filter authors interested in adding frame rate modification or YCbCr support to their filter should consult the VirtualDub Plugin SDK, version 0.7. The Plugin SDK is still pre-release, but comments and questions are welcome.
There are other miscellaneous changes in 1.8.0, as well as bug fixes that were too risky or extensive to push into 1.7.8.
As I write this, there is an issue on the SourceForge project servers that is preventing me from updating the download page for 1.8.0. If this is still an issue when you read this, visit the VirtualDub project page on SourceForge, and you should be able to download both 1.7.8 and 1.8.0. For those of you who are signed up for new release notifications, the file releases have now been split into stable and experimental packages, so you should subscribe to the virtualdub-experimental package if you wish to be notified when a new experimental release is available. I'd also encourage you to visit the Testing/Bug Reports section of the forum occasionally, as bleeding-edge test releases also appear there.
Changelist after the jump....
(Read more....)
§ ¶They called _what_ in the inner loop??
AMD just open sourced the AMD Performance Library as Framewave, which at least from my perspective seems like a good thing. Not that I'm going to attempt to use it, but I perused the source out of curiosity, and it looks like there are some useful goodies in there.
And then there's some... marginal stuff.
One thing that I wanted to look at was their 8x8 2D-IDCT source. The 8x8 2D inverse discrete cosine transform (IDCT) is popular and used in a number of video compression formats. There are a million ways to implement it quickly, and although everyone's seen Intel's AP-922 SSE2 algorithm for it by now, I hadn't seen one by AMD before. So I grab the source and dig around in the JPEG module, and I see this:
int IdctQuant_LS_SSE2(const Fw16s *pSrc, Fw8u *pDst, int dstStp, const Fw16u *pQuantInvTable) {
... pedx = (Fw16s *) fwMalloc(128); //64 array of Fw16s type
Who the #*@&*( calls malloc() in an optimized IDCT routine???
It looks like there are indeed a number of well-optimized SSE2 routines in the Framewave library, but after seeing things like the above a few times I was left scratching my head a bit....
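For a 128-byte temporary like that one, a stack buffer is the usual alternative. The following is only a sketch under assumptions: ScaleBlock is a made-up stand-in rather than Framewave's routine, and the actual transform passes are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical 8x8 block routine, NOT Framewave's code: it scales 64
// coefficients by a quantization table by way of a temporary buffer.
inline void ScaleBlock(const std::int16_t *src, const std::uint16_t *quant,
                       std::int16_t *dst) {
    assert(src && quant && dst);

    // 64 x 16-bit = 128 bytes: trivially small for the stack, 16-byte
    // aligned so SSE2 code could use aligned loads and stores on it, and
    // it costs no allocator call per block.
    alignas(16) std::int16_t scratch[64];

    for (std::size_t i = 0; i < 64; ++i)
        scratch[i] = static_cast<std::int16_t>(src[i] * quant[i]);

    // A real IDCT would run its row and column passes on scratch here.
    for (std::size_t i = 0; i < 64; ++i)
        dst[i] = scratch[i];
}
```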
Another ugliness I saw, which unfortunately isn't restricted to Framewave, is assembly language routines that have been translated to intrinsics. The result is a nasty C++ routine that has variables like "pedx" and "pesi," but has instruction names translated so that what used to be an understandable "paddw" is now "_mm_add_epi16." I know this was a hack job for portability, but the result sure is unreadable.
(Read more....)
§ ¶VirtualDub 1.7.8 released
1.7.8 is out -- fetch!
Not really anything new, just a bunch of bug fixes. Most of them are for bugs and crashes in the capture module, but there are a couple that hit the editing module as well.
Those of you who frequent the forums know that I've been putting a lot of changes into another release, which was labeled as 1.7.X in test builds. That build is also nearly completed and is going to become the new 1.8.0 experimental build after it passes final checks to make sure it has an acceptably low level of stupid bugs. Unlike the jump from 1.6.x to 1.7.x, there will be no jump in minimum system requirements for 1.8.x; pretty much the only time that happens is when I'm forced to do so, and I'm not switching to VS2008 this time.
Changelist after the jump.
(Read more....)
§ ¶The hidden danger of the Win32 TreeView
Random performance anecdote time.
I once had a bug filed on VirtualDub regarding a performance problem in its hex viewer on large files. (I have a habit of putting random features into my open-source tools; it never ceases to amaze me that people actually use them.) The problem turned out to be in a code fragment like this:
while(GetNextChunk(chunkInfo)) {
    TVINSERTSTRUCT tvItem;
    CreateTreeViewItem(tvItem, chunkInfo);
    TreeView_InsertItem(hwndTV, &tvItem);
}
I had expected that I'd done something stupid in the hex viewer code. When I profiled the routine under VTune, though, I discovered that for large files it was spending almost no time in VirtualDub.exe itself -- it was spending a huge amount of time in the TreeView_InsertItem() call. This is a call to the Win32 tree view control to insert an item. Investigation into the disassembly around the hotspot revealed that the Win32 tree view internally stores its nodes as a singly-linked list, and adding an item to the end takes time linear in the number of items already there. This meant that adding N items to the tree required on the order of N^2 steps in total, making the tree initialization quadratic time. In case you're not familiar with asymptotic complexity, here's my cheat sheet:
- O(1) / constant time: Wheeeeeeeee!!!!!
- O(N) / linear time: Very fast.
- O(N log N): Fairly fast, scales OK.
- O(N^2) / quadratic time: Go to lunch and hope it's done by the time you get back.
I ended up solving this problem in two ways: I changed the routine to insert items in reverse order at the beginning instead of in forward order at the end, and I split the chunk list into two levels to reduce the maximum child count within a tree node.
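The effect of the first fix can be seen in a toy model. The sketch below assumes a singly-linked list with no tail pointer, matching the behavior described above; ToyList, FillForward, and FillReversed are made-up names, and none of this is the real control's code.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Toy singly-linked list with no tail pointer (an assumed model only).
struct ToyList {
    struct Node {
        int value;
        std::unique_ptr<Node> next;
    };

    std::unique_ptr<Node> head;
    std::size_t steps = 0;  // nodes walked while inserting

    // Tear down iteratively so long lists don't recurse deeply.
    ~ToyList() {
        while (head)
            head = std::move(head->next);
    }

    // Append: must walk the whole list to find the end, so O(N) per insert.
    void InsertLast(int v) {
        std::unique_ptr<Node> *link = &head;
        while (*link) {
            link = &(*link)->next;
            ++steps;
        }
        *link = std::make_unique<Node>(Node{v, nullptr});
    }

    // Prepend: no walking at all, so O(1) per insert.
    void InsertFirst(int v) {
        head = std::make_unique<Node>(Node{v, std::move(head)});
    }

    std::vector<int> ToVector() const {
        std::vector<int> out;
        for (const Node *n = head.get(); n; n = n->next.get())
            out.push_back(n->value);
        return out;
    }
};

// Forward order, appending each item at the end: quadratic in total.
inline void FillForward(ToyList &list, const std::vector<int> &items) {
    for (int v : items)
        list.InsertLast(v);
}

// Reverse order, inserting each item at the front: same final order,
// linear in total.
inline void FillReversed(ToyList &list, const std::vector<int> &items) {
    for (auto it = items.rbegin(); it != items.rend(); ++it)
        list.InsertFirst(*it);
}
```

Both fill routines leave the items in the same order; only the number of nodes walked differs, growing as N(N-1)/2 for the forward version and staying at zero for the reversed one.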
Scalability problems are the worst kind of performance issues to deal with because the performance effects can be drastic and the fixes dangerously invasive, i.e. a rewrite. The main danger is that it's really easy to nest fairly fast operations and end up with a composite operation that is O(N^2) or worse. On more than one occasion I've seen people unnecessarily calling linear-time operations like strlen() in a loop, and that simple error ends up turning an ordinarily fast operation into a painfully slow one.
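A minimal illustration of the strlen() case, using made-up function names: the first version re-measures the string on every iteration, while the second measures it once. A compiler can sometimes hoist the call on its own, but only when it can prove the string isn't modified inside the loop, so it isn't something to rely on.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// strlen() in the loop condition: each call scans the whole string, so
// the loop as written is O(N^2) in the string length.
inline std::size_t CountSpacesSlow(const char *s) {
    assert(s != nullptr);
    std::size_t count = 0;
    for (std::size_t i = 0; i < std::strlen(s); ++i)
        if (s[i] == ' ')
            ++count;
    return count;
}

// Same result with the length computed once up front: O(N).
inline std::size_t CountSpacesFast(const char *s) {
    assert(s != nullptr);
    const std::size_t len = std::strlen(s);
    std::size_t count = 0;
    for (std::size_t i = 0; i < len; ++i)
        if (s[i] == ' ')
            ++count;
    return count;
}
```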
(Read more....)