What is VirtualDub?
VirtualDub is a video capture/processing utility for 32-bit Windows platforms (95/98/ME/NT4/2000/XP), licensed under the GNU General Public License (GPL). It lacks the editing power of a general-purpose editor such as Adobe Premiere, but is streamlined for fast linear operations over video. It has batch-processing capabilities for processing large numbers of files and can be extended with third-party video filters. VirtualDub is mainly geared toward processing AVI files, although it can read (not write) MPEG-1 and also handle sets of BMP images.
I basically started VirtualDub in college to do some quick capture-and-encoding that I wanted done; from there it's basically grown into a more general utility that can trim and clean up video before exporting to tape or processing with another program. I released it on the web and others found it useful, so I've been tinkering around with its code ever since. If you have the time, please download and enjoy.
§ ¶stdext::hash_set
A couple of days ago I found myself writing a breadth-first search routine in C++ for a particular puzzle (Klotski, to be specific). Since I was using Visual C++, I needed a set for the board database, and decided to use its stdext::hash_set.
stdext::hash_set... sucks.
See, Visual C++'s Standard Template Library (STL) is based on the implementation from Dinkumware, and has the following prototype:
template<class Key,
class Traits = hash_compare<Key, less<Key> >,
class Allocator = allocator<Key> >
class hash_set;
Notice two things about this hash set implementation: it takes a less-than predicate, and it doesn't take an equality predicate. What the frick kind of hash table requires a less-than predicate??? And we're not talking about a predicate for the hash code here, but a less-than comparison operator for the whole key. It defeats one of the main advantages of a hash container, since IMO it's often easier to write hash and equality predicates instead of a less-than predicate, as long as you aren't dealing with floats. I've never seen a hashed container that required this -- not in SGI-STL/STLport, not in Java, not in the .NET Framework. As you might expect, this means that VC8's hash_set is a pain in the butt to use and is slow. Thankfully, C++ TR1's unordered_set follows the SGI-STL/STLport path rather than Dinkumware's.
I haven't really had a need for a full blown hash_map or a hash_set in VirtualDub, but I'm thinking that if I do, I'll be better off writing my own....
(Read more....)
§ ¶SSE4.1, first impressions
Trimbo's been bugging me about SSE4.1 lately and my experiences with it.
Well, I've just starting playing around with it, now that I have the tools and build set up, and my experience has been mixed.
The main problem is alignment. To make good use of a Core 2+ and of SSE4.1, you have to go from mmx (64-bit) to xmm (128-bit) registers. The annoyance that comes along with this is that while 64-bit memory accesses can be unaligned, all 128-bit accesses on x86 currently have to be aligned or you fault. That means it isn't as trivial as just taking a 64-bit loop and processing pairs of pixels at a time. It's true that misaligned loads hurt performance, but there are two mitigating factors in 64-bit vector code. One is that on modern CPUs, not all misaligned accesses cause a penalty -- only ones that cross L1 cache line boundaries and trigger a DCU cache split do. This means that for sequential accesses only a quarter or less of your loads fault, which may be acceptable if the cost of avoiding the misalignment is higher. The other factor is that when processing pixels, it's common to have to expand 8-bit unsigned channels to 16-bit signed, which means that the loads are frequently 32-bit and thus may always be aligned. Going to 128-bit vectors and 64-bit loads spoils this. What this all means is that several algorithms that I looked at for rewriting in SSE4.1 looked great until I examined the load/store paths and realized that I was going to burn more time in dealing with misalignment than I would save in the faster ALU operations.
You might say that memory buffers should just be aligned, and yes, you can do that to some extent, particularly with temporary buffers. The gotcha is that you don't always control the buffers involved and at some point you simply can't control the alignment. For example, display buffers probably aren't guaranteed to be 16 byte aligned, nor would a GDI DIB. Similar problems occur at the ends of scanlines, related to the width of the image; it's lame and inflexible to just have your library require that all working images be multiples of 4/8/16 pixels. Working with .NET? Oh, sorry, the GC heap doesn't support alignment at all -- screwed. The compromise that I generally shoot for is to get a routine that will work with odd widths or non-optimal alignment, although it might be a little slower due to fixup.
There are also cases that simply don't scale to larger vectors. For example, in a scaling or rotation algorithm, I might need to pull 32-bit pixels at various locations and expand them to 64-bit for processing. What am I going to do with 128-bit vectors? I can't pull pairs of pixels from each location, because I only need one. I could process more pixels in parallel, except that I only have eight registers and having four source pointers is hard enough as it is. It's more doable in long mode with 16 GPRs, but I haven't even gotten to that yet.
In terms of instruction selection, the SSE4.1 instruction that seems most useful so far is PMOVZXBW, because it replaces the load/move+PUNPCKLBW that commonly use without eating an extra register for zero. PBLENDW is also looking useful for some alignment scenarios. Other than that, most of the other instructions I think I can abuse are actually from SSSE3, because I'm not that interested in the floating point part of SSE4.1 for VirtualDub. In SSSE3, PHADDD (packed horizontal add doubleword) is turning out quite useful, because I frequently do high precision integer dot products and that means PMADDWD followed by a horizontal add. PSHUFB is also promising, especially given the high speed on SSE4.1-capable CPUs, but that it requires the shuffle pattern to be in memory or in a register and that it works in-place are annoying. PALIGNR looks useful but often requires unrolling due to the immediate argument.
The 8-bit resamplers -- which are used in 1.8.0 when using the "resize" filter with YCbCr video -- got about 20-25% faster with SSE4.1 in my first attempt compared to the MMX version. Unfortunately, I don't know how much this is due to SSE4.1, or just due to the move to 128-bit vectors, since Core 2 is twice as fast at those and Enhanced Core 2 is even faster. I have some ideas for abusing PBLENDW and PSHUFB as well for optimizing conversion between RGB24 and RGB32, but the alignment issue is a bear. I've also been thinking about whether I can speed up the RGB<->YCbCr converters, but PMADDUBSW is the most promising there and the coefficient precision there would be marginal. I also got the idea to abuse MPSADBW for a fast box filter, although the fixed radius would be a bit restrictive and it'd only help horizontally, and I'm not sure what I would use it for.
Overall, I'm not seeing a revolution here compared to what I was doing with MMX and SSE2, but it is a bit nicer overall -- I'm not spending as much time doing register-to-register moves and unpacks as I did. My guess is that if you already have a CPU that's at least SSSE3 capable (Core 2), then you're already going to get most of the benefits instruction-wise, with the difference you're missing being in the microarchitecture and not in the lack of SSE4.1. I'm also beginning to see some strengths and weaknesses of SSSE3/SSE4.1 against AMD's SSE5, at least for image processing. The data movement capabilities of SSSE3/SSE4.1 look superior, but SSE5 has some really compelling ALU operations: PMADCSSWD (packed multiply, add and accumulate signed word to signed doubleword with saturation) looks perfect for what I do. The main question is how fast AMD can get it. I'd heard that the fused-multiply-add unit in Altivec-capable PPC chips was a problem in terms of gating clock speed and the solution was to pipeline it to the point that it became less compelling due to latency; we'll see what happens with SSE5.
(Read more....)