§ ¶VirtualDub 1.8.1 released
I've finally pushed out 1.8.1 as a stable release, which promotes the new features added in 1.8.0 to stable status. If you are interested in features that were previously only in the 1.8.0 experimental version or are having trouble with the previous 1.7.8 stable version, I recommend checking out 1.8.1 to see if your issue has been fixed. This version was supposed to go out last week, but I was stymied by a login problem in the SourceForge upload system.
There are a couple of new features in this release. The main one is distributed job support, which lets you use different instances of VirtualDub to create and run jobs, and you can even use multiple machines if you keep all the paths and plugins synchronized. It uses a filesystem-based sharing model, as it's designed for ease of setup with a few machines rather than a whole cluster farm. There is also now support for running the video compressor in a second thread for better dual-core performance, and now that I have a readily available 64-bit station again, I'm starting to bring the AMD64 build up to feature parity with the 32-bit version.
I've been really busy at work lately, so I'm afraid my progress on VirtualDub will be substantially slowed for the next few months (more so than usual), but I do have a few things already in flight for 1.8.2/1.8.3. Nothing big, but one of the things I've been doing is converting some of my assembly-only features to C so I have an easily maintained baseline and so that the features can be enabled for AMD64. I've also been doing some prototypes here and there for the next big change, but nothing concrete's surfaced yet.
Changelist after the jump.
§ ¶3D graphics acceleration over Remote Desktop
A friend of mine tipped me off to an interesting post on DIRECTXDEV about GPU acceleration over Remote Desktop. After some investigation with a couple of machines, I discovered that yes, it is actually true that you can run 3D apps over Remote Desktop (Terminal Server). The key is that the machine you are remoting into must be running Windows Vista with a WDDM driver. Beyond that, it actually works as expected, although a bit slow. Some details:
- Again, the machine you are logging into must be running Windows Vista.
- The machine you are remoting from (running mstsc.exe) does not have to run Windows Vista.
- You do not have to use version 6 of the terminal services client -- version 5 that comes with Windows XP works fine.
- This doesn't appear to have anything to do with Avalon remoting, which allows Aero Glass to be shown over RDP. Supposedly you need both machines to be running Vista to do that, and you definitely need at least Ultimate or Enterprise on the target machine to do it (I tried it with Business and it didn't work). 3D acceleration works fine with XP to Vista Business.
- You do not need any of the "experience" options enabled -- bitmap caching, desktop composition, effects, etc. They can all be disabled.
- Both OpenGL and Direct3D 9 work. Regular D3D9 is fine; D3D9Ex is not necessary.
- Direct3D 7 does not appear to work. I tried the old "I've got some balls" standby, and the only D3D driver that showed up was the software renderer (RGBRast).
- All 3D acceleration takes place on the server; only the image is sent to the client.
- Pretty much all D3D9 caps are intact, just as if the app were running locally. Even scanline queries work (!), although I'm not sure what it actually reads.
Now, the downside: it's slow. Really slow. So slow that I'd say it's basically unusable unless you're on a LAN, and a fast one at that. It looks like Terminal Services tries to send over the data from every Present(), and it blocks the app until that happens, instead of just skipping frames. It eats a lot of CPU in the process, too, instead of just waiting. I was just barely able to push 640x360 video at 24 fps with VirtualDub's D3D9 display minidriver, which was throwing about 10MB/sec across the gigabit LAN here -- probably about the best that the consumer-level hub can do. Readback doesn't seem to be a problem, at least not on the GeForce 6 and 8 cards I have here, because everything speeds back up if you cover enough of the 3D window.
Basically, this means that 3D support is usable with apps that just use it for 2D acceleration or otherwise static rendering, but anything dynamic like a video player or a game is going to be far too slow. It might be more viable if Microsoft had implemented frame dropping, but it looks like VNC may still be better for that. What this is definitely good for, though, is running GPU accelerated apps remotely. On XP, this isn't possible over Remote Desktop because as soon as you log in all your apps are pushed onto the software-only driver. On Vista, though, they could continue to run on the server-side GPU, and performance isn't a problem if the app isn't displaying the result continuously.
§ ¶A bit of an unfortunate icon mixup
Windows caches icons in several places, one being the shell, and another being in VRAM by the display driver. Whenever any of these caches goofs up, the wrong icons show up. This is a particularly unfortunate mixup:
![Firefox with IE icon](/blog/images/iefs8.png)
Looks like the browser war is escalating inside my computer....
§ ¶Does VirtualDub do a color conversion when converting I420 to YV12?
No, current versions of VirtualDub do lossless conversions between I420 and YV12.
I420 and YV12 are FOURCCs for uncompressed video formats that use the YCbCr color space with 4:2:0 sampling and with the images stored as back-to-back, non-interleaved planes. The only difference between them is that the Cb and Cr planes are swapped in the encoding order. Parts of VirtualDub's video processing pipeline were made YCbCr-aware starting in 1.6.0, and this was extended over later versions up to 1.8.0, which makes the video filter pipeline YCbCr-capable as well. The video pipeline will still do format conversions as necessary, but when converting between YCbCr formats it is capable of simply reinterleaving the data or resampling the chroma planes without touching the luma plane. In the specific case of I420 and YV12, the two video formats map to the same internal format (YUV420_Planar), but with the two secondary plane entries swapped. Conversion between the two simply involves swapping the planes or doing three plane copies.
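The plane swap can be sketched in a few lines. This is a minimal illustration and not VirtualDub's code: it assumes 8-bit samples, tightly packed planes with no row padding, and chroma dimensions rounded up for odd frame sizes. The names `plane_sizes` and `swap_chroma_planes` are made up for the example.

```python
def plane_sizes(width, height):
    """Return (luma_bytes, chroma_bytes) for an 8-bit 4:2:0 planar frame.

    Assumption: no row padding, chroma dimensions rounded up.
    """
    chroma_w = (width + 1) // 2
    chroma_h = (height + 1) // 2
    return width * height, chroma_w * chroma_h


def swap_chroma_planes(frame, width, height):
    """Convert I420 -> YV12 (or YV12 -> I420) by swapping the chroma planes.

    I420 stores Y, Cb, Cr; YV12 stores Y, Cr, Cb. No sample value changes,
    so the conversion is lossless and is its own inverse.
    """
    y_size, c_size = plane_sizes(width, height)
    if len(frame) != y_size + 2 * c_size:
        raise ValueError("frame size does not match a packed 4:2:0 layout")
    luma = frame[:y_size]
    first = frame[y_size:y_size + c_size]
    second = frame[y_size + c_size:]
    return luma + second + first


# A 4x2 frame: 8 luma bytes, then two 2-byte chroma planes.
i420 = bytes(range(8)) + b"\xAA\xAB" + b"\xCA\xCB"
yv12 = swap_chroma_planes(i420, 4, 2)
```

Applying `swap_chroma_planes` twice returns the original bytes, which is the sense in which the conversion is lossless.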
Sharp users will notice that the Video Color Depth dialog only allows you to choose YV12 and not I420. Well, that was a mistake on my part. When I originally reworked the video pipeline to allow YCbCr formats, I made the mistake of exposing the internal format as the setting in the pipeline configuration. As a result, when you select YV12 in that dialog, you're actually selecting the internal 4:2:0 planar format, and this later gets mapped in the back end to YUV420_Planar variant 0, which is YV12. I420 is variant 1, which you can't get to because the pipeline hardcodes variant 0. The same goes for Y800 vs. Y8. I suppose it wouldn't be too hard to push out the variant setting and add compatibility code in the script layer, but I haven't heard many requests for it (read: zero).
§ ¶A not so good way to decode Huffman codes
Having made a working Huffyuv decoder, I took a shot at making it faster than the 2.1.1 decoder... and failed.
I guess I'll present the algorithm here as a curiosity. Perhaps someone can suggest a variant that's actually worthwhile.
Huffyuv encodes its main bitstream as a series of interleaved Huffman-encoded codes, with the order being a repeating Y-Cb-Y-Cr for the YUY2 mode, and the output being a series of byte values or byte deltas, which then possibly go through the predictor. There are thus three variable length decoders involved. The traditional way to handle this is to break down each code into a series of tables and select the right table according to the next N bits in the stream, either by a branch tree of comparisons against the window value, or by some faster way of looking up the ranges. In particular, the Bit Scan Forward (BSF) and Bit Scan Reverse (BSR) instructions on x86 are attractive for this.
I decided to try a different tack, which was to build a state machine out of all three of the tables simultaneously, with the form: (current state, input byte) -> (next state, advance_input, advance_output, output_byte). There are a few advantages to doing this, one being that no bit manipulations are needed on the input stream since it is always consumed a byte at a time, and no branches at all are required. All three decoders live in the same state machine, so in the end, the decoding loop looks like this:
    decloop:
        movzx edx, byte ptr [ecx]   ;read next byte
        mov   eax, [eax + edx*4]    ;read next state
        mov   [ebx], al             ;write output byte
        and   eax, 0ffffff00h       ;remove output byte
        add   eax, eax              ;shift out input_advance and compute next_state*2
        adc   ecx, 0                ;advance input pointer
        add   eax, eax              ;shift out output_advance and compute next_state*4
        adc   ebx, 0                ;advance output pointer
        add   eax, edi              ;compute next state address
        cmp   ebx, ebp              ;check for end of decoding
        jne   decloop               ;if not, decode more bytes
Now, one of the annoying issues with trying to optimize a Huffman decoder, or a decoder for any sort of variable-length prefix code for that matter, is that it's an inherently serial procedure. You can't begin decoding the second code in parallel until you know where the first one ends. (Okay, you could if you had a parallel decoder for every bit position, but that's typically impractical.) That means the decoding speed is determined by the length of the critical dependency path, which in this case is 9 instructions (movzx, mov, mov, and, add, add, add, cmp, jne). I suspect I could trim that down a little bit if I hardcoded the base of the table and rearranged the code somewhat, but it turns out to be irrelevant. For the standard Huffyuv 2.1.1 YUY2 tables, the state machine turns out to be 1.8MB in size, and that means the decoding routine spends most of its time in cache misses on the instruction that fetches the state entry, the second instruction. Rats.
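For anyone who wants to play with the idea, here is a rough Python sketch of the same kind of state machine. It is not the decoder described above: it uses a single made-up four-symbol prefix code rather than three interleaved Huffyuv tables, and it defines a state as a (trie node, bit offset in the current byte) pair, which is only one way to build it. The entry layout (output byte in the low 8 bits, input advance in bit 31, output advance in bit 30, next state in between) is my reading of the assembly loop. `TRIE`, `build_table`, and `decode` are illustrative names.

```python
# Toy MSB-first prefix code (hypothetical, not a Huffyuv table):
#   'a' = 0, 'b' = 10, 'c' = 110, 'd' = 111
# TRIE[node][bit] is the next internal node index, or a 1-char symbol at a leaf.
TRIE = [
    ["a", 1],    # node 0: root
    ["b", 2],    # node 1: after "1"
    ["c", "d"],  # node 2: after "11"
]

ADV_IN = 1 << 31   # advance the input pointer after this step
ADV_OUT = 1 << 30  # advance the output pointer after this step


def build_table(trie):
    """Build table[state][byte] of packed 32-bit entries.

    State number = node * 8 + bit offset into the current input byte.
    """
    table = []
    for node in range(len(trie)):
        for offset in range(8):
            row = []
            for byte in range(256):
                n, pos, entry = node, offset, 0
                while pos < 8:
                    bit = (byte >> (7 - pos)) & 1  # MSB first
                    pos += 1
                    nxt = trie[n][bit]
                    if isinstance(nxt, str):       # completed one symbol
                        entry = ADV_OUT | ord(nxt)
                        n = 0
                        break
                    n = nxt
                if pos == 8:                       # byte fully consumed
                    entry |= ADV_IN
                    pos = 0
                row.append(entry | ((n * 8 + pos) << 8))
            table.append(row)
    return table


def decode(table, data, count):
    """Decode `count` symbols, one table lookup per step, no bit shifting of input."""
    out = bytearray(count)
    state = src = dst = 0
    while dst != count:
        entry = table[state][data[src]]
        out[dst] = entry & 0xFF            # always written, like the asm loop
        src += entry >> 31                 # input_advance flag
        dst += (entry >> 30) & 1           # output_advance flag
        state = (entry >> 8) & 0x3FFFFF    # next state index
    return bytes(out)


table = build_table(TRIE)
decoded = decode(table, bytes([0x5B, 0xA8]), 8)  # b"abcdabba"
```

Even this toy code needs 24 states of 256 entries each, or 24 KiB at four bytes per entry, which gives some feel for how the table outgrows the cache with real 256-symbol codes.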
In the end, it did work, but was around 20-30% slower than a SHLD+BSR based decoder, at least on a T9300. That doesn't count the bitstream swizzle and prediction passes, but those take almost no time in comparison. It might be more lucrative for a smaller sized variable length code or one where the minimum code length is 8 bits and the conditional input advance could be dropped.
In general, it seems pretty hard to beat a SHLD+BSR decoder. This is particularly the case on a PPro-based architecture where BSR is very fast, two clocks for PPro/P2/P3/PM/Core/Core2, and one clock for enhanced Core 2 (!). The P4s seem a bit screwed in general, because while BSR is slow, so's practically everything else. Athlons are a bit weird -- they have slow BSR/BSF like the original Pentium and slow scalar<->vector register moves, but they're fast at scalar ops. That probably explains why I saw a branch tree instead of BSR the last time someone sent me a crash dump in Huffyuv 2.2.0's decoder....
I'm tempted to try putting a first-level table check on the decoder to see if that helps. The way this works is that you pick a number of bits, k, and determine how many codes are of length k or less. You encode those directly in a table of size 2^k as (code, length) pairs and everything else goes to more tables or an alternate decoder. Ideally, the effectiveness of this table is determined simply by how many entries are encoded directly in it, e.g. if 1800/2048 entries can use the fast path then you'll hit the fast path about 88% of the time. In practice, this can vary depending on how well the distribution of the encoded data matches the distribution used to create the prefix code tree; they may not always match well when the tree is static, as is the case in vanilla Huffyuv. It's also questionable in this case because the advantage over the BSR is slight and the need for a fallback from the table requires adding an asymmetric but poorly predicted branch.
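A first-level table can be sketched as follows. This is only an illustration under assumptions: the five-symbol code, the choice of k = 2, and the names `build_first_level`, `peek`, `slow_decode`, and `decode` are all invented for the example, and the table stores (symbol, length) so that a hit yields the decoded value directly. The fallback here is a naive bit-at-a-time search rather than a second table or a BSR decoder.

```python
# Hypothetical MSB-first prefix code: symbol -> (code, length)
CODES = {
    "a": (0b0, 1),
    "b": (0b10, 2),
    "c": (0b110, 3),
    "d": (0b1110, 4),
    "e": (0b1111, 4),
}


def build_first_level(codes, k):
    """Table of 2^k entries; codes of length <= k fill every slot they prefix."""
    table = [None] * (1 << k)
    for sym, (code, length) in codes.items():
        if length <= k:
            base = code << (k - length)
            for pad in range(1 << (k - length)):
                table[base + pad] = (sym, length)
    return table


def peek(data, pos, n):
    """Return n bits starting at bit position pos, MSB first, zero-padded at the end."""
    value = 0
    for i in range(pos, pos + n):
        byte = data[i >> 3] if (i >> 3) < len(data) else 0
        value = (value << 1) | ((byte >> (7 - (i & 7))) & 1)
    return value


def slow_decode(codes, data, pos):
    """Fallback path: grow the code one bit at a time until it matches."""
    by_code = {v: s for s, v in codes.items()}
    code = 0
    for length in range(1, max(l for _, l in codes.values()) + 1):
        code = (code << 1) | peek(data, pos + length - 1, 1)
        if (code, length) in by_code:
            return by_code[(code, length)], length
    raise ValueError("invalid code")


def decode(codes, table, k, data, count):
    """Decode count symbols; also report how many took the fast path."""
    out, pos, fast_hits = [], 0, 0
    for _ in range(count):
        hit = table[peek(data, pos, k)]
        if hit is not None:
            fast_hits += 1
        else:
            hit = slow_decode(codes, data, pos)   # the poorly predicted branch
        sym, length = hit
        out.append(sym)
        pos += length
    return "".join(out), fast_hits


first_level = build_first_level(CODES, 2)
# "abacade" encodes to the bits 0 10 0 110 0 1110 1111 = 0x4C 0xEF.
text, fast_hits = decode(CODES, first_level, 2, bytes([0x4C, 0xEF]), 7)
```

In this toy case three of the four table slots are filled, but only four of the seven symbols in the sample input take the fast path, which is the mismatch between table coverage and the actual data distribution described above.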
As a final note, most of these optimizations are only possible when Huffman codes are placed in the bitstream starting from the MSB of each word, as Huffyuv does, and as many standard encodings do, including JPEG and MPEG. Deflate's a pain in the butt to decode in comparison because it places codes starting at the LSB, which means you can't easily treat the bitstream as a huge binary fraction and use numeric comparisons, as the methods I've described above do.
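To make the "binary fraction" point concrete, here is a small sketch of decoding by numeric comparison. It assumes a canonical prefix code in which shorter codes are numerically smaller when left-justified; that assumption is mine for the sake of the example and is not a statement about how Huffyuv assigns its codes. `WINDOW_BITS`, `build_limits`, `decode_one`, and `decode_all` are made-up names.

```python
WINDOW_BITS = 16  # width of the left-justified bit window


def build_limits(lengths):
    """lengths: symbol -> code length. Build per-length tables for a canonical code."""
    max_len = max(lengths.values())
    symbols = sorted(lengths, key=lambda s: (lengths[s], s))
    first_code, offset, limit = {}, {}, {}
    code = index = 0
    for length in range(1, max_len + 1):
        count = sum(1 for s in symbols if lengths[s] == length)
        first_code[length] = code       # numerically first code of this length
        offset[length] = index          # where its symbols start in `symbols`
        code += count
        index += count
        # Any window below this value starts with a code of <= length bits.
        limit[length] = code << (WINDOW_BITS - length)
        code <<= 1
    return symbols, first_code, offset, limit


def decode_one(window, tables):
    """Find the code length by comparing the window against per-length limits."""
    symbols, first_code, offset, limit = tables
    for length in sorted(limit):
        if window < limit[length]:
            index = (window >> (WINDOW_BITS - length)) - first_code[length]
            return symbols[offset[length] + index], length
    raise ValueError("window does not start with a valid code")


def decode_all(data, count, tables):
    """Decode count symbols from an MSB-first bitstream."""
    bits = int.from_bytes(data, "big")
    total = len(data) * 8
    out, pos = [], 0
    for _ in range(count):
        shift = total - pos - WINDOW_BITS
        window = (bits >> shift) if shift >= 0 else (bits << -shift)
        window &= (1 << WINDOW_BITS) - 1   # zero-padded past the end of data
        sym, length = decode_one(window, tables)
        out.append(sym)
        pos += length
    return "".join(out)


# Lengths 1, 2, 3, 3 give the canonical code a=0, b=10, c=110, d=111.
tables = build_limits({"a": 1, "b": 2, "c": 3, "d": 3})
result = decode_all(bytes([0x5B, 0xA8]), 8, tables)  # "abcdabba"
```

The comparisons only work because the first bit of each code is the most significant bit of the window. With LSB-first packing, as in Deflate, the first bits of the next code land at the low end of the word, so the window value no longer orders codes by their prefixes.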
§ ¶Gee, thanks a lot
It might surprise some of you to learn that I've never ripped a DVD. The main reason is that I don't really have an interest in doing so; my DVD collection is fairly small and I've already watched Alias and Slayers a thousand times. I don't travel much, either, so the portability issue hasn't cropped up.
I was tempted to learn today, though.
As I was walking out of the local neighborhood tech store, I spotted a DVD version of Office Space, which is a movie I like, but haven't seen in a while and never bothered to get, so I grabbed it and bought it along with the rest of the stuff I had. When I got home and started populating my new hard drive, I popped the DVD into the DVD player for background entertainment and waited for the standard FBI screen to expire so I could get to the main menu.
Only to be subjected to another clip comparing DVD piracy to shoplifting and other kinds of theft and some rather horrible music. That was practically a full minute long.
Now, I'm going to warn you that I really don't care to get into any arguments about whether copying a DVD is piracy or theft or copyright infringement or whether it is legal, ethical, moral, social, methyl, episcopal or whatever. I really don't care to have such a flamewar on my blog, and if anyone posts a comment about that I'm just going to delete it on the spot. I will say, though, that my thinking upon seeing this clip was more along these lines:
WHAT THE F*CK??? I BOUGHT YOUR DAMN DVD, THAT'S WHY I'M OUT TEN BUCKS! IF I HAD PIRATED YOUR STUPID MOVIE, THEN I'D BE WATCHING IT RIGHT NOW INSTEAD OF SITTING THROUGH YOUR PATRONIZING, UNSKIPPABLE PROPAGANDA VIDEO!!! WHY THE F*CK ARE YOU PUNISHING ME???
I haven't bought a DVD in a while and I had no idea they were now doing this. Putting in a 1-2 second FBI warning is one thing, but why should I buy a DVD ever again if they're going to force this irritating, condescending garbage down my throat every time I want to watch a movie that I legitimately purchased? Is it any wonder that people pirate movies now with this stupidity going on?
Update:
I received a request by email to tone down the language in this post. I refuse to do this. I am very offended by the contempt shown by the addition of this video and I want to make it very clear that I consider the addition of such a video to a product to be very inappropriate. A friend of mine brought up the very good point that he has videos intended for children that have this video prepended, which starts with a clip of a woman having her purse stolen. I don't condone piracy -- but this kind of treatment is NEVER called for.