Last time I talked about a faster way to do parallel averaging between 8-bit components in SSE with unequal weights, specifically the [1 7]/8 case. This time, I'll improve on the [1 7] case and also show how to do the [3 5] case.
To recap, the idea centers around using the SSE pavgb instruction to stay within bytes by rounding off LSBs correctly each time we introduce a new intermediate result with an extra significant bit. Previously, I did this by converting a subtraction to an addition and using a correction factor. It turns out that there's another way to do this that doesn't require a correction factor and requires fewer constants, which means more temp registers available.
I hereby declare today stupid assembly tricks day.
I'm working on optimizing some conversion code, and one of the tasks I need to do is quickly resample an 8-bit chroma plane from a 4:1:0 image up to 4:4:4, essentially removing the subsampling. This is a 4x enlargement in both horizontal and vertical directions. For the vertical direction, the scanlines of the 4:4:4 image fall between the chroma scanlines of the 4:1:0 image such that I need four filter phases:
- 7/8, 1/8
- 5/8, 3/8
- 3/8, 5/8
- 1/8, 7/8
This can be handled with a generic routine, but the question here is whether this can be done more efficiently with specialized routines. Looking at the above, the list is symmetrical, so we can immediately discard two of the entries: the bottom two filters can be implemented in terms of the top two filters by swapping the scanline inputs. That leaves two filters, a [7 1]/8 and a [5 3]/8.
Now, I'm dealing with 8-bit components here, and the most straightforward way of doing this is a good old-fashioned weighted sum. In MMX, that means unpacking the bytes to words with punpcklbw, scaling with pmullw, summing with paddw, and then shifting and packing back down to bytes.
In VirtualDub 1.8.5, there is a bug where if you try to edit the name of a job in the job control dialog, the job control dialog disappears as soon as you type any character.
Needless to say, I was rather confused when I received this bug report... partly because I didn't realize anyone actually used this feature, and partly because I couldn't think of what would cause it.
It turns out that this bug is caused by a somewhat poor choice of ID in the Windows ListView common control. When the user clicks on a ListView entry to edit its label, the ListView creates a child edit control. There are two problems with the way it does this:
- The ListView forwards most of the child edit control's WM_COMMAND notifications up to the parent.
- The edit control uses an ID of 1, which happens to be the same as IDOK.
The result is that the job control dialog receives an EN_CHANGE notification as a WM_COMMAND message with a control ID of IDOK and thinks it's time to close the dialog box. Whoops. Funny how they left this little detail out of the MSDN documentation....
Old versions of VirtualDub didn't have this problem because the job control dialog handled WM_COMMAND notifications in a different way. A little message filtering will fix the problem in 1.8.6.
My current technique for testing A/V sync in VirtualDub is to hook my capture device up to a PlayStation 2 and then play a game of Guitar Hero 2 or Rock Band through my laptop. How poorly I do indicates how much lag there is on the real-time display. (If I do well, it means that there is possibly also a problem with the universe.) I can then play back the capture file to see how well the resync engine performs and to compare VirtualDub's preview mode against DirectShow sync. And I get to hear some decent music, too.
Imagine my surprise when I started playing the capture AVI file and Media Player Classic started stuttering horribly.
My first instinct was that maybe something had gone horribly wrong with the interleaving code or that my disk was terribly fragmented, but after bringing up FileMon it was quickly apparent that the problem was actually Visual Studio. Specifically, Visual Studio had suddenly decided that it wanted to rebuild the Intellisense database for VirtualDub, and had started thrashing the disk, which in turn completely screwed my video playback. So then I had to wait a few moments while Visual Studio took its sweet time until I could actually play video again.
I'm now of the opinion that running tasks in the background is a bad idea and contributes to the perception that computers are slow and unpredictable. In this case, the Visual Studio programmers got the idea that Intellisense could be made low-impact and "invisible" by running it on a low-priority background thread, which ignores the fact that it still drains a huge amount of I/O bandwidth from the hard disk. This isn't the first time the Visual Studio team has done something like this -- I love it when the .NET Framework updates and suddenly a lot of CPU is taken up by "background" NGEN tasks for minutes at a time. If everyone did this, computers would be stuck at 100% CPU for hours at a time for no apparent reason.
We've now gotten to the point where computers are fast enough to perform many everyday computing tasks much faster than necessary, which unfortunately has given rise to the opinion that we don't need efficient code anymore. While it's true that there's no need to make a window come up in 0.5ms when it currently comes up in 5ms, I think the recent trend toward low-power, quieter, and lighter-weight devices is providing a new reason for efficient code. For many programs, it'll no longer be about making the program run faster, but about making it run more efficiently: letting the CPU and GPU idle more, allowing the hard drive and fans to spin down, and prolonging battery life on laptops. This is nothing new to people who worked on older platforms or on mobile devices, but this kind of thinking isn't as common in the desktop world. Heck, in the DOS days there wasn't even a way to idle the CPU... well, unless you cared about Windows 3.1 and issued the obscure INT 2F/AX=1607/BX=0018.
Now, whether people will actually care about this, I don't know. The current trend on Windows toward rendering everything in a UI every frame and/or in software rendering isn't heartening, and "efficient operation for prolonged battery life" isn't something I normally see as a bullet point in program advertisements. With the rate of hardware advancement slowing and the rise of lower powered "netbook" devices like the Asus Eee, though, I'm hoping to see a bit of a reversal here. I don't need my laptop to run hotter than it already does.
1.8.5 is out -- just a little quick bug fix release. The changelog says the 15th because I actually built it on Friday, but hadn't gotten a chance to publish it until today. There are just two fixes in this version, one for a crash issue that some people were seeing in capture mode, and another for a filter compatibility issue that affected my subtitler.
Build 29963 (1.8.5, stable): [August 15, 2008]
[regressions fixed]
* Capture: Fixed a possible crash when loading device settings.
* Video filters which used GDI rendering in in-place mode but only requested a DC for one buffer now work.
VirtualDub 1.8.4 is out -- it is a new stable release consisting mostly of bug fixes. Upgrading to 1.8.4 is recommended if you work with the distributed job list system, process DV type-1 videos, or use the capture system from the command line. There is also a fix for display issues if you use Direct3D display mode and work with videos that are larger than the screen resolution.
Changelist after the jump.
Windows Presentation Foundation (WPF) has an interesting feature called Remoting, whereby it can render over Remote Desktop by sending drawing primitives over the network instead of raw bitmaps. While I believe this already happened in XP for some GDI primitives, as far as I know this is the only way to send hardware-accelerated 3D across the wire, since regular Direct3D rendering gets remoted at the bitmap level instead of the primitive level. Well, according to a blog post from a WPF team member (http://blogs.msdn.com/jgoldb/archive/2008/05/15/what-s-new-for-performance-in-wpf-in-net-3-5-sp1.aspx), this got neutered not too long ago:
Although we have not improved this scenario, it is important to highlight some differences related to Remoting
On .Net Framework 3.0 and .Net Framework 3.5:
- Vista to Vista with DWM on, we Remoted content as Primitives (e.g. the channel protocol went over the network) (This is for the Remote Desktop case only, not Terminal Server)
- In other cases: we Remoted content as Bitmaps
On .Net Framework 3.5 SP1
- We now remote as bitmaps in ALL cases.
- The reason is that WPF 3.5 SP1 now uses a new graphics DLL (wpfgfx.dll) and certain changes could not be made to Vista's existing graphics DLL (milcore.dll) that is also used by DWM.
- Although this could be seen as a regression at first, depending on the complexity of the application scene (e.g. very rich scenes) this can actually improve performance in certain scenarios. Also, connections with reasonably high bandwidth and scenarios that don't involve a lot of animation or 3D, for instance, tend to remote just fine via bitmaps.
Ouch. I haven't looked into this deeply, but if I interpret this correctly, this means that the remoting advantage of WPF is effectively gone with .NET 3.5 SP1, because WPF apps are now going to be remoted just like any other app that's using a rendering method other than GDI, i.e. as an image. The comment about wpfgfx.dll vs. milcore.dll also implies that there isn't much, if any, advantage that WPF gives you over Direct3D API-wise. For a while I was wondering if it would be worth trying hacks to get WPF or at least MILCORE usable from native code without having to spin up the CLR, but this is looking increasingly less useful since I already have a reasonably powerful Direct3D9 layer and an image resampler that is probably faster than WPF's new shader JIT. Not to mention that, unless it's been added since the original API, WPF's support for offscreen rendering is a bit lacking.
I do wonder about the pixel shader JIT, though... as usual I take any claims about speed from Microsoft with skepticism, but on the other hand, it can't possibly be any slower than refrast. Anyone played around with the new JITter? Show me the assembly. :)
Microsoft WinDbg is part of the Debugging Tools for Windows package and is a fairly powerful, and free, debugger. I like to keep it around because it's much quicker to obtain and install than Visual Studio and is sometimes more helpful for debugging crashes in cases where Visual Studio acts oddly or is otherwise unable to extract the needed information. Although it has a GUI, it's mostly a command-line debugger, and as such is used somewhat differently than the Visual Studio debugger. I've found that WinDbg is less user friendly and harder to use for interactive source debugging, but much more powerful for difficult or post-mortem situations. One useful advantage is that if the Windows GUI gets wedged -- which tends to happen when hooks with global mutexes go awry -- you can still use CDB, the command-line version of WinDbg, because console windows are specially handled by CSRSS and often don't lock up with the rest of the UI.
Anyway, the cheat sheet:
Generally, the first step is to get symbols hooked up for your executable or crash dump so you're not flying blind. If one or both of these paths are missing, use .exepath and .sympath to set the executable and symbol paths. It's also very helpful to use .symfix+ to hook up the Microsoft symbol server so you get symbols and stack-unwinding information for Windows DLLs. If the debugger is stubborn and doesn't want to load symbols because it thinks they don't match, you can use .symopt+ 0x40 to set the "load anything" flag.
If you're not sure if you've got all symbols, the lm command will list all modules in the process and their symbol load status. lm -v will also display additional data including paths, timestamps, and versions, which can help you go hunt down the right symbols from the build archive.
Type ~ to dump a list of all threads. The tilde is also a prefix for thread selectors at the beginning of commands. Two useful selectors are ~n to select thread n temporarily, and ~* to select all threads. ~s changes the current thread, so to switch to thread 4, you'd type ~4s.
r will dump the current registers, in case you missed it from the attach -- and as WinDbg helpfully tells you, .ecxr will switch to the context frame of the exception that occurred, if it was captured in a minidump. If you attached the debugger because the application appears to have deadlocked, !locks will dump out a list of critical sections currently held and which threads hold them. This usually identifies the reference cycle immediately.
The command to dump call stacks is k. Generally the first thing you should do after attaching to a process or loading a crash dump and then setting up symbols is to type ~*k to dump call stacks for all threads. kb will also attempt to dump raw parameters for each call. Unfortunately, the call stack on an optimized x86 executable is often incorrect. Sometimes you can dump past a sticking point or uncover hidden calls by forcing a different starting stack address with =, e.g. k =12fc00.
db dumps bytes, dw dumps words, dd dumps dwords, etc. One of my favorite commands is dds, which dumps an array of dwords and attempts to decode each one as a pointer to a symbol. If I suspect that the automated call stack is incorrect, I do dds esp L20 and then try to reconstruct the correct call stack from any apparent return addresses in the output. It's also useful for finding vtable pointers of C++ objects. For strings, da dumps ASCII strings and du dumps Unicode ones.
When you have a valid call stack, you can use .frame to switch between the entries on the call stack, and dv to dump local variables... or you could just use the WinDbg UI for that. ? evaluates an expression using the default expression syntax (usually MASM), and ?? evaluates using C++ syntax. I prefer to use ?? instead of the watch window, because WinDbg unfortunately has a habit of checking symbols over the network a lot when evaluating expressions, and if you use the watch window it can repeatedly hang the debugger for long periods of time, whereas with ?? you have more precise control over when it happens.
Breakpoints can be set with bp, cleared with bc, and listed with bl. These are all boring PC-based breakpoints, though -- use ba to set data breakpoints, which IMO are highly underrated. t traces, and g goes (resumes execution). Ctrl+Break interrupts the application again. WinDbg doesn't bring the app to the foreground on resuming execution like Visual Studio does, which can avoid annoying loops where every time you hit go, it immediately repaints and hits your breakpoint again.
After you've gotten all the information you can locally, there are a few options. You can just give up with q, or you can try again with .restart, if you started the debuggee under WinDbg. If the program you attached to is just hung, you can use .detach and then attach Visual Studio to it. You can also use .dump to create a minidump for further analysis -- this is especially helpful since you can load the minidump offline in either WinDbg or Visual Studio, and if you create a "big minidump" with .dump /ma, you can pretty much see and do just about anything in the dump that you could locally, short of actually resuming execution.
That's about all I can think of for WinDbg essentials, although of course it comes with a nice help file that details everything I missed here. I highly recommend looking into WinDbg and adding it to your debugging arsenal, especially if you're one of those people who has to diagnose crash dumps sent by users. The last point I'd like to make is that if you're working on Windows XP, the NT system debugger (NTSD) is the ancestor of WinDbg; it's missing a lot of features in comparison, but it's always there. If you find yourself in a pinch and don't even have WinDbg available, you can attach ntsd.exe to the failed process, using -p to specify the process ID and -pv to force a non-intrusive attach if necessary, and then write a minidump with .dump.
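To tie these commands together, a typical post-mortem session with cdb might look like this -- the process ID, paths, and thread numbers are made up for illustration:

```
C:\> cdb -p 2040                  (attach to the hung process by PID)

0:000> .symfix+ c:\symbols        (hook up the Microsoft symbol server)
0:000> .sympath+ c:\myapp\sym     (add your own PDB directory)
0:000> .reload                    (reload symbols using the new paths)
0:000> ~*k                        (call stacks for every thread)
0:000> !locks                     (held critical sections, if deadlocked)
0:000> ~4s                        (switch to thread 4)
0:004> dds esp L20                (probe the stack for return addresses)
0:004> .dump /ma c:\hang.dmp      (write a full minidump for offline analysis)
0:004> qd                         (detach and quit)
```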