Trimbo's been bugging me about SSE4.1 lately and my experiences with it.
Well, I've just started playing around with it, now that I have the tools and build set up, and my experience has been mixed.
The main problem is alignment. To make good use of a Core 2+ and of SSE4.1, you have to go from MMX (64-bit) to XMM (128-bit) registers. The annoyance that comes along with this is that while 64-bit memory accesses can be unaligned, 128-bit accesses on x86 currently have to be aligned or you fault, unless you use the slower unaligned forms like MOVDQU. That means it isn't as trivial as just taking a 64-bit loop and processing pairs of pixels at a time.

It's true that misaligned loads hurt performance, but there are two mitigating factors in 64-bit vector code. One is that on modern CPUs, not all misaligned accesses cause a penalty -- only the ones that cross an L1 cache line boundary and trigger a DCU cache split do. This means that for sequential accesses only a quarter or less of your loads take the split penalty, which may be acceptable if the cost of avoiding the misalignment is higher. The other factor is that when processing pixels, it's common to have to expand 8-bit unsigned channels to 16-bit signed, which means that the loads are frequently 32-bit and thus naturally aligned. Going to 128-bit vectors and 64-bit loads spoils this.

What this all means is that several algorithms that I looked at for rewriting in SSE4.1 looked great until I examined the load/store paths and realized that I was going to burn more time dealing with misalignment than I would save in the faster ALU operations.
You might say that memory buffers should just be aligned, and yes, you can do that to some extent, particularly with temporary buffers. The gotcha is that you don't always control the buffers involved and at some point you simply can't control the alignment. For example, display buffers probably aren't guaranteed to be 16 byte aligned, nor would a GDI DIB. Similar problems occur at the ends of scanlines, related to the width of the image; it's lame and inflexible to just have your library require that all working images be multiples of 4/8/16 pixels. Working with .NET? Oh, sorry, the GC heap doesn't support alignment at all -- screwed. The compromise that I generally shoot for is to get a routine that will work with odd widths or non-optimal alignment, although it might be a little slower due to fixup.
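The fixup compromise can be sketched like this -- a hypothetical scanline routine (the names and the trivial per-pixel op are mine, not VirtualDub's) that runs a scalar head until the destination reaches 16-byte alignment, then a four-at-a-time body standing in for the XMM loop, then a scalar tail for odd widths:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-pixel op, standing in for the real filter kernel. */
static uint32_t process_pixel(uint32_t px) { return px ^ 0xFFFFFFFFu; }

/* Process a scanline of 32-bit pixels regardless of width or alignment. */
static void process_scanline(uint32_t *dst, const uint32_t *src, size_t n) {
    /* Scalar head: up to 3 pixels until dst sits on a 16-byte boundary. */
    while (n && ((uintptr_t)dst & 15)) {
        *dst++ = process_pixel(*src++);
        --n;
    }
    /* Aligned body: 4 pixels per iteration (where the XMM loop would go). */
    while (n >= 4) {
        for (int i = 0; i < 4; ++i)
            dst[i] = process_pixel(src[i]);
        dst += 4; src += 4; n -= 4;
    }
    /* Scalar tail: the remaining 0-3 pixels. */
    while (n--)
        *dst++ = process_pixel(*src++);
}
```

Note that only the destination is forced aligned here; the source may still be misaligned relative to it, which is where tricks like PALIGNR or PBLENDW come in.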
There are also cases that simply don't scale to larger vectors. For example, in a scaling or rotation algorithm, I might need to pull 32-bit pixels at various locations and expand them to 64-bit for processing. What am I going to do with 128-bit vectors? I can't pull pairs of pixels from each location, because I only need one. I could process more pixels in parallel, except that I only have eight registers and having four source pointers is hard enough as it is. It's more doable in long mode with 16 GPRs, but I haven't even gotten to that yet.
In terms of instruction selection, the SSE4.1 instruction that seems most useful so far is PMOVZXBW, because it replaces the load/move+PUNPCKLBW sequence that I commonly use without eating an extra register for zero. PBLENDW is also looking useful for some alignment scenarios. Other than that, most of the other instructions I think I can abuse are actually from SSSE3, because I'm not that interested in the floating point part of SSE4.1 for VirtualDub. In SSSE3, PHADDD (packed horizontal add doubleword) is turning out quite useful, because I frequently do high precision integer dot products, and that means PMADDWD followed by a horizontal add. PSHUFB is also promising, especially given its high speed on SSE4.1-capable CPUs, but it's annoying that it requires the shuffle pattern to be in memory or in a register and that it works in-place. PALIGNR looks useful but often requires unrolling due to the immediate argument.
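For anyone unfamiliar with the PMADDWD+horizontal-add pattern, here's a scalar model (my own illustration, not shipping code) of what the two steps compute on one 64-bit lane -- PHADDD collapses the final addition that otherwise takes a shuffle plus a PADDD:

```c
#include <stdint.h>

/* Scalar model of PMADDWD on one 64-bit lane: multiply four signed
   16-bit pairs, then add adjacent products into two 32-bit sums. */
static void pmaddwd4(const int16_t a[4], const int16_t b[4], int32_t out[2]) {
    out[0] = (int32_t)a[0] * b[0] + (int32_t)a[1] * b[1];
    out[1] = (int32_t)a[2] * b[2] + (int32_t)a[3] * b[3];
}

/* A 4-tap integer dot product then needs one horizontal add -- the
   step that PHADDD (or a shuffle+PADDD pair pre-SSSE3) performs. */
static int32_t dot4(const int16_t a[4], const int16_t b[4]) {
    int32_t t[2];
    pmaddwd4(a, b, t);
    return t[0] + t[1];
}
```

With 16-bit coefficients and the intermediate sums held at 32 bits, this is the shape of most of the high precision filter kernels in question.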
The 8-bit resamplers -- which are used in 1.8.0 when using the "resize" filter with YCbCr video -- got about 20-25% faster with SSE4.1 in my first attempt compared to the MMX version. Unfortunately, I don't know how much of this is due to SSE4.1, or just due to the move to 128-bit vectors, since Core 2 is twice as fast at those and Enhanced Core 2 is even faster. I have some ideas for abusing PBLENDW and PSHUFB as well for optimizing conversion between RGB24 and RGB32, but the alignment issue is a bear. I've also been thinking about whether I can speed up the RGB<->YCbCr converters, but PMADDUBSW is the most promising candidate there and its coefficient precision would be marginal. I also got the idea to abuse MPSADBW for a fast box filter, although the fixed radius would be a bit restrictive and it'd only help horizontally, and I'm not sure what I would use it for.
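To see why MPSADBW could work as a box filter: each of its eight results is the sum of absolute differences of four consecutive source bytes against a 4-byte reference group, so with a zero reference each result degenerates to the plain sum of four consecutive bytes -- an unnormalized 4-tap horizontal box sum at eight adjacent offsets per instruction. A scalar model of one such result (my illustration, not VirtualDub code):

```c
#include <stdint.h>

/* Scalar model of one MPSADBW result with a zero reference block:
   |src[i] - 0| == src[i], so the SAD collapses to a 4-byte sum,
   i.e. one tap position of an unnormalized 4-wide box filter. */
static unsigned box4(const uint8_t *src) {
    unsigned sum = 0;
    for (int i = 0; i < 4; ++i)
        sum += src[i];
    return sum;
}
```

The real instruction produces box4(src), box4(src+1), ..., box4(src+7) in one shot, which is where the speed would come from -- and also why the radius is fixed at the instruction's 4-byte group size.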
Overall, I'm not seeing a revolution here compared to what I was doing with MMX and SSE2, but it is a bit nicer overall -- I'm not spending as much time doing register-to-register moves and unpacks as I did. My guess is that if you already have a CPU that's at least SSSE3 capable (Core 2), then you're already going to get most of the benefits instruction-wise, with the difference you're missing being in the microarchitecture and not in the lack of SSE4.1. I'm also beginning to see some strengths and weaknesses of SSSE3/SSE4.1 against AMD's SSE5, at least for image processing. The data movement capabilities of SSSE3/SSE4.1 look superior, but SSE5 has some really compelling ALU operations: PMADCSSWD (packed multiply, add and accumulate signed word to signed doubleword with saturation) looks perfect for what I do. The main question is how fast AMD can get it. I'd heard that the fused-multiply-add unit in Altivec-capable PPC chips was a problem in terms of gating clock speed and the solution was to pipeline it to the point that it became less compelling due to latency; we'll see what happens with SSE5.
After thinking a bit about the SSE4.1 problem, I decided that the best route to take advantage of SSE4.1 would be through external assembly, and that the best way to do that was to ditch MASM. There are various workarounds to try to kludge MASM into assembling alternate opcodes, but after being unable to quickly find a workable SSE4.1 macro set and after a failed attempt to make one myself (as well as rediscovering the mess that is MASM's macro facilities), I decided to go for plan B. I'd been wanting to do this for a while for a few reasons. One reason is the half-hearted way in which Microsoft has been maintaining MASM, including:
- Making trivial and undocumented breaking changes to MASM that keep requiring me to modify my assembly language code, like changing the parameter sizes for the memory argument of MOVD and MOVQ.
- Chopping functionality out wholesale in ML64, the AMD64 (x64) version.
- Sorely incomplete documentation in MSDN. (OPATTR returns a low byte identical to .TYPE. For information on .TYPE, see OPATTR.)
The other reason is MASM's availability, since it's normally only available in Professional Edition or higher. Oh, but you can sometimes get a download for non-commercial use, when they remember to update it, and you can grab it out of the Windows SDK, but you have to make sure not to mix up the bin paths since the SDK compiler isn't always compatible with the VC++ headers. Ugh. MASM is the only thing that prevents the 32-bit version of VirtualDub from being built with VC++ Express, a restriction I've wanted to lift even though I don't use Express myself.
So I spent some time last night porting all of my assembly language to YASM.
The reason I chose YASM is that it has support for x64 Windows, and more importantly, for VC-compatible line number debugging information. I had looked at NASM before, but dropped the idea due to the lack of debug info support. YASM, on the other hand, does support it, and it looks like a lot of other work has been put into making it VC friendly, such as adding support for emitting errors in VC compatible form. YASM uses NASM syntax, and while it's far closer to Intel syntax than GAS syntax is, it's unfortunately gratuitously different enough to make the conversion non-trivial. I was able to do a lot of the conversion with an ugly Visual Studio macro (why oh why do I have to use Visual Basic?!?), but I still had to fix up a lot of assembly by hand. Among the changes I had to make:
- .model flat, .code, .686, .mmx, .xmm: bye bye, gone.
- PROC doesn't exist, so all instances of "_foo proc public" had to change to "global _foo / _foo:" and all other labels had to become local labels.
- "ptr" is forbidden: dword ptr [eax] -> dword [eax]
- Had to add brackets around all absolute memory accesses: mov eax, foo -> mov eax, [foo]
- xmmword -> oword
- Structures and macros had to be rewritten, although I have to say that NASM/YASM macros make a lot more sense.
- Commented out frame pointer omission (.FPO) statements, since YASM doesn't support emitting FPO debug records.
It took me about four hours to convert all 700K of assembly language, after which I actually did end up with a working build of VirtualDub. I even got it hooked in cleanly via a custom build rule. So far, so good. However, as you might have guessed, that means I now have an enormous test coverage problem, because that 700K of asm covers about fifty different features in VirtualDub and only a small fraction of them have test cases. Worse yet, there are about five different CPU levels involved (scalar, MMX, SSE, ISSE, SSE2). Writing all of the test cases necessary to get complete coverage would probably take a very long time, so for now what I'm probably going to do is just do a DUMPBIN /DISASM on both the MASM and YASM based builds and see if I can do verification by automated diff.
Haven't actually gotten to writing any SSE4.1 code yet, but I'm getting there....
I have the day off from work today, so I decided to sit down and try some SSE4.1 experiments... only to discover the following:
- Intel VTune 6.1 doesn't work at all on this laptop. (Yeah, it's ancient, but it's reliable, fast, and worked great with a Pentium M, SSE2, and VC8.)
- AMD CodeAnalyst 2.76 works in timer mode, but can't disassemble past SSE4.1 instructions.
- VC8 (VS2005 SP1) can't assemble SSE4.1 instructions in inline assembly.
- MASM 8 can't assemble SSSE3 or SSE4.1 instructions.
- The VS2005 SP1 toolchain can't disassemble SSSE3 or SSE4.1 instructions.
- VC9 (VS2008) handles SSE4.1... but I only have the Express version, which doesn't have MASM, and I can't switch to VS2008 anyway for VirtualDub 1.8.x due to system requirements issues.
- Agner Fog's pentopt tome hasn't been updated for Penryn, and I can already tell that a number of SSE2 instructions are quite a bit faster than on the original Core 2 Duo.
- Intel's x86 Optimization Guide has been updated for Penryn, but the formatting is almost unusable and they don't include µop breakdowns for memory ops -- which means that I can't tell, for instance, whether LDDQU and MOVDQU are improved.
It's not a dealbreaker, since using MASM macros isn't bad and I'm used to RDTSC profiling, but sheesh... talk about missing prerequisites.
Got this while trying to install Adaptec GameBridge video capture drivers on my new system:
If you write code to measure the CPU speed, do remember to put in a check for ridiculous values. At least there is a "Continue Anyway" button.
For some reason, all of the laptops I've owned have had a problem with volume. Specifically, the volume adjustment buttons don't have enough granularity, so the bottom five ticks correspond to silent, loud, louder, loudest, and 11. I can combat this by turning down the mixer source line volumes instead, but I've had the misfortune of discovering software vendors that decided to solve their customer support problems by making their programs shove the MIDI volume all the way up on startup. I miss my old SoundBlaster 16 ASP, which had a good old fashioned volume knob on the back that programs couldn't touch no matter how many I/O ports they tweaked.
Fortunately, a while ago, after reading a post from Larry Osterman about volume in Windows, I played around with some curves and came up with the following Registry patch to set a nicer volume taper:
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Multimedia\Audio\VolumeControl]
"EnableVolumeTable"=dword:00000001
"VolumeTable"=hex:00,00,00,00,30,00,00,00,65,00,00,00,9f,00,00,00,e0,00,00,00,\
  28,01,00,00,77,01,00,00,d0,01,00,00,31,02,00,00,9d,02,00,00,15,03,00,00,99,\
  03,00,00,2c,04,00,00,cf,04,00,00,82,05,00,00,4a,06,00,00,26,07,00,00,1b,08,\
  00,00,2a,09,00,00,55,0a,00,00,a1,0b,00,00,11,0d,00,00,a8,0e,00,00,6b,10,00,\
  00,5e,12,00,00,87,14,00,00,eb,16,00,00,91,19,00,00,80,1c,00,00,c0,1f,00,00,\
  59,23,00,00,54,27,00,00,be,2b,00,00,a1,30,00,00,0a,36,00,00,08,3c,00,00,aa,\
  42,00,00,03,4a,00,00,27,52,00,00,2a,5b,00,00,25,65,00,00,32,70,00,00,6f,7c,\
  00,00,fd,89,00,00,00,99,00,00,9f,a9,00,00,08,bc,00,00,6b,d0,00,00,ff,e6,00,\
  00,ff,ff,00,00
Usual disclaimers apply; you screw up your system, your fault. Unfortunately, I can't find the little C program I wrote to generate this taper, but I remember it was something like a 5th order polynomial. Someone can probably throw it into Excel and reverse engineer it or something. After all the Vista bashing from a couple of days ago, I should at least report that this wasn't necessary in Vista, which already had a much nicer taper with the x64 driver I installed.
One issue I haven't been able to crack is the volume of the emulated PC speaker. For some reason the beep tends to be VERY loud on laptops, even louder than the wave audio. If I could find a display as bright as the beep was loud I'd have a Windows XP driven stun grenade. I've only found two crappy solutions: one is to disable the beep service (which for some reason isn't in the services UI but can be accessed with "net stop beep"), and the other is the power saving options for the integrated sound, which on my new laptop has an option to disable the PC speaker.
Midway through the development of the VirtualDub 1.5.x series, one of the decisions that I made was to deprecate the old VBitmap-based blit library and write a new library based on what I call pixmap objects. These were new descriptor structures that solved a number of annoyances in the old VBitmap structure, such as:
- not being able to distinguish between 555 and 565 16-bit RGB
- not supporting YCbCr or multi-plane formats
- *#($&# upside down encoding
In addition, I started writing a unified blit interface (VDPixmapBlt) that would support orthogonal blitting between any two formats, with the only exception being paletted formats, which were source-only. This turned out to be a great decision, because now I don't have to worry about whether a conversion exists for a particular path and can just do blits as I need them. This isn't always advisable for performance or quality reasons -- Y8 is not a good intermediate format for conversion from 24-bit RGB to 32-bit RGB -- but it's wonderful for less critical paths such as UI and just migrating old code in general. It's the main reason that the number of RGB-only paths in VirtualDub has been shrinking over time and will continue to dwindle.
I've now begun to run up against the limits of this current scheme, and a while ago I started toying with a new scheme I call uberblit, which is a scanline-based scheme consisting of a tree of generators. The main motivations behind uberblit are to increase flexibility (dithering, chroma resampling options, more formats) and to eliminate the current table-based scheme where I have to initialize an O(N^2) table of blitters. It's turned out to be more complex than I like and hasn't yet become a good general purpose replacement for the existing blitter, but it has some advantages in terms of flexibility and data cache strategy, and it was useful enough that I pulled it into 1.8.0 for YCbCr resampling.
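The generator-tree idea, reduced to its bones (all names here are mine, not the actual uberblit interfaces): each node exposes a pull-style fetch that produces one output scanline on demand, recursively pulling the scanlines it needs from its source nodes:

```c
#include <stdint.h>

/* A minimal sketch of a scanline generator node. Leaves read from a
   bitmap; filter nodes pull from upstream and transform the row. */
typedef struct VDGenNode VDGenNode;
struct VDGenNode {
    /* Produce row y into dst (w samples), pulling from self->src. */
    void (*fetch_row)(VDGenNode *self, uint8_t *dst, int y, int w);
    VDGenNode *src;       /* upstream generator, NULL for a leaf   */
    const uint8_t *bits;  /* leaf only: source bitmap              */
    int pitch;            /* leaf only: bytes per source row       */
};

/* Leaf: copy one row straight out of the source bitmap. */
static void fetch_leaf(VDGenNode *n, uint8_t *dst, int y, int w) {
    for (int x = 0; x < w; ++x)
        dst[x] = n->bits[y * n->pitch + x];
}

/* Example filter node: invert each sample of the upstream row. */
static void fetch_invert(VDGenNode *n, uint8_t *dst, int y, int w) {
    n->src->fetch_row(n->src, dst, y, w);   /* pull upstream scanline */
    for (int x = 0; x < w; ++x)
        dst[x] = 255 - dst[x];
}
```

The appeal is that any conversion is just a chain (or tree, for multi-plane formats) of small nodes, so there's no O(N^2) table -- but the per-row indirect call through fetch_row is also exactly the overhead mentioned below.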
As part of some work to see if I could make it usable for general blitting, I wrote a benchmark to compare its performance against the existing blitters, and the result wasn't pretty. For conversions the uberblitter is almost always slower than the primary blitter, usually because there are extra conversions, optimized span routines aren't hooked up, or there's simply overhead in the row fetch paths (a few layers of virtual calls can cost more than the turbo SSE2 routine they wrap). After a little bit of tweaking I got depressed, since it was pretty obvious that it would take more than hooking up a couple of routines to get uberblit up to snuff. Still, being able to more easily handle a wider variety of formats, like YCbCr with PAL DV style chroma and float3/float4, is a compelling reason to keep trying.
Anyway, back to the subject... after getting my new laptop, I decided to compare the two quickly just to see what the performance of the new CPU was like. The old laptop is a 1.8GHz Pentium M, while the new laptop is a current generation Core 2 Duo with SSE4.1 support ("Penryn" core). The dual core part is irrelevant here as the blitters are single threaded, as are SSE3/SSSE3/SSE4.1 since I don't use them yet, but the Core 2 part is important as the extra execution unit and the single-clock SSE throughput are very applicable.
Results are after the jump (all results in megapixels/second), but the gist is that the Penryn-based system is a lot faster at running the existing blitters than I had expected given the GHz ratings. The most likely reason is memory and the front side bus being faster, because the direct blit (no conversion) cases are ridiculously faster and those definitely are not execute bound as they're memcpy() based. It's also possible that the Core 2 Duo is cheating due to larger caches, but it wins handily even in the cases where heavy computation is involved (RGB<->YCbCr conversion with chroma subsampling). The blitters are also designed to reduce this effect by interleaving calculations via scanline stripes whenever multipassing is required, so that data flows in and out of the caches smoothly instead of having performance drop off a cliff after a certain bitmap size.
Windows Vista is bloated.
No, I mean, it's bloated. Really bloated. Massively waterlogged needs-immediate-surgery bloated.
My current Windows XP SP2 installation takes 6GB. Of that, about 4GB of it is the fully patched XP installation itself, with the other ~2GB including an installation of Visual Studio 2005 SP1 and some miscellaneous tools such as OpenOffice. Therefore, I figured that devoting 15GB of my 120GB laptop hard drive to a Vista x64 system drive would be more than adequate, given a 2.5x multiplier. I installed Vista x64, ran Windows Update, and let it do its thing, which took a loooong time... only to find out that I was running low on disk space. Yes, I know about the hard links in the Windows folder blah blah blah, but this was an actual popup warning from Explorer itself and the drive did actually have less than 500MB free.
Let me repeat that again: I had a Vista x64 installation, with NO applications installed on it, that was running out of disk space just from patching. And we're not even talking about Vista SP1 -- this is pre-SP1 patching. Vista SP1 is said to require a minimum of 5GB free to install, which is a bit remarkable given that it's a 700MB patch... and filling an NTFS drive to the brim is not a good idea.
I dug around a bit on the C: drive and managed to do a couple of things to free up disk space -- got rid of the hiberfil.sys file, moved the page file onto another partition, nuked all volume shadow copies, etc. In the end, I still only ended up with about 5GB of space free, with over 9GB being taken up in the Windows directory, of which about 5GB is in WinSXS (which is basically the new DLL cesspool). And I still don't have any applications installed. None. Perhaps I should change the boot entry to read Windows Vista Capable, since it's good for nothing.
I finally broke down and bought a new laptop... not that my old one was broken or anything, but I wanted something a little lighter than the 17" Inspiron I had before, and also something with a bit more CPU support than SSE2. The new laptop is a Latitude D830 with a 2.5GHz dual-core Penryn, which means I can now play with SSE3, SSSE3, and SSE4.1.
Well, now that my development environment is set up.
Windows is now stable enough, and I install software infrequently enough, that I generally don't reinstall Windows except in the event of a major hardware change. As a result, new machines are an opportunity for me to wipe the slate clean. The problem in this case is that I had forgotten about the current mess that is the Windows SDK situation. Specifically, the migration of the DirectShow SDK from the DirectX SDK to the Platform SDK (now the Windows SDK) is still broken. VirtualDub now requires the DirectShow SDK to compile due to WDM capture support, so this is a bit of a showstopper. Unfortunately, that's not the only problem.
In a nutshell:
- dxtrans.h is missing from both the current Windows SDK (Windows Server 2008 + .NET Framework 3.5) and the current DirectX SDK (March 2008), and it's required by the DirectShow headers. Microsoft is considering fixing this in the Windows 7 SDK.
- Starting with the Vista version of the Windows SDK (v6), there is a conflict between intrinsics definitions in winnt.h and VS2005's intrin.h that causes a compile error if both are included.
After screwing around with this for about ten minutes, I got tired of it and just went back and installed the same Windows Server 2003 SP1 and DirectX August 2007 SDKs that I had on my old laptop. I shouldn't be changing base SDKs in a point release, anyway. Still, it bothers me that the Windows SDK versioning situation is so fragile; it's a bit annoying to explain to someone else how to build VirtualDub.
Of course, even after all of this, I haven't even gotten to checking the x64 build yet. My desktop has been on the fritz lately (nForce4 based motherboards suck), so I haven't actually been able to test the 64-bit build of VirtualDub in a while. In theory I can dual-boot into Vista64 now and get both my Vista and x64 testing done, but we'll see how that goes....