A buddy's been bugging me to release this, and I don't have any VirtualDub news at the moment, so what the heck. :)
Introducing Altirra, my emulator for Atari 8-bit computer systems:
(Update: Links to version 1.0 removed -- superseded by current version.)
Why did I write this? Well, I grew up with a number of 8-bit computer systems, and the one that I liked the most was the Atari 800. The hardware design seemed the most versatile, and furthermore, it was the hardware predecessor to the Amiga, another one of my favorite systems. One day I got struck with a particularly bad case of nostalgia for it, without access to my real Atari 800. Normally I would have just launched Atari800Win, but for some reason I wanted to write my own emulator instead, since I'd never done a real one before, and this eventually was the result. Those of you who have followed VirtualDub's evolution know I like to do everything myself and in my own way. During one phase of its development, I was told: "You have the worst case of feature creep I've ever seen."
Altirra is not the most polished or complete Atari 800/800XL/130XE emulator by far, but it does run quite a lot of software. The main game I wanted to get running was The Last Starfighter, a.k.a. Star Raiders II, which has cool music and some pretty good graphics -- and turned out to be rather complicated to emulate, due to a bankswitched ROM, a complex display list, and lots of display list interrupts. I have an interest in old computer hardware, and I like to find out how hardware actually works instead of just what it does or what it was supposed to be used for. When writing Altirra, I tried to understand how the original hardware was constructed, and craft the emulator in the same fashion. As a result, it's pretty close to cycle exact and supports most of the features of the hardware. It isn't quite up to the level of Atari800Win, which I consider the gold standard, but most of the demos and games I've thrown at it work. It even emulates a few things that A8W doesn't, such as the SIO disk transfer beeping that I remember a lot from my childhood.
Now, the reason why I was asked to release this emulator? Definitely not for the UI, which sucks. The reason is protected disk support: Altirra can boot from APE (.pro) and VAPI (.atx) disk images, which preserve the original copy protection on software. It even supports SIO (serial input/output) call acceleration and burst I/O with protected disks, which means it can boot them very quickly. At the time, neither format was documented and the patched Atari800Win version with VAPI support could not do accelerated reads, so had I released this earlier, it would have been a first; I believe the VAPI format documentation has since been released, so I guess it's a bit late. Still, it was interesting to see what had been done to inhibit copying, such as phantom sectors, CRC errors, and missing sectors, and to see the unmodified games boot. At one point I thought it would be cool to hook up Altirra's disk emulator to a serial port and an SIO2PC cable to see if the emulation was good enough to boot a real Atari, but I never got around to it.
You might laugh when I say that I wrote Altirra for educational purposes, but I swear it's true. I learned more about the Atari than I ever thought I would, particularly all of the undocumented hardware quirks. Among the goofball issues I had to debug were:
- Getting DOS 2.0S to boot with a custom SIO implementation (it replaces serial interrupt vectors and still calls through SIOV)
- A program that read from an unused hardware address and then wrote that value to another port to clear interrupts
- A game that set the display list interrupt flag on a "wait for vertical blank" byte (which causes the DLI to fire every scanline until VBLANK)
- A demo that changed the DLI vector in HBLANK without disabling interrupts and used RTS instead of RTI to exit the handler, leading to crazy control flow
- A demo that alternated the vertical scroll register between two illegal values to get a low-overhead, quarter-resolution 9-color display
- A demo that displayed eight player sprites on a scanline by doing five mid-scanline register writes with exact cycle timing
I have to say, though, that although the demos were an absolute pain to get working, the things that some of the folks in Europe did in the later years on the Atari were utterly mindblowing, like real-time bump mapping, 3D texture mapping, and digital audio. On a 1.79MHz 6502.
To get the information I needed to write the emulator, I spent a lot of time reading the Atari hardware manual and poring over a gate-level chip schematic of POKEY that had been released to the public by the Atari Historical Society. I also ended up disassembling the Atari 810 disk drive control ROM at one point, because I needed to know the exact rate at which it bit-banged out data over the SIO bus and what head stepping parameters it fed to the floppy drive controller, and analyzing the schematics of the Atari 410 tape drive to replicate how it read tapes (it's two bandpass filters and a comparator glued onto the SIO bus).
It's doubtful that I'll work much more on Altirra, especially considering that I haven't really worked on it for months. One of the last things I had been doing was trying to enhance the source-level debugger and cassette support, because I had the idea that it would be possible to write a Guitar Hero or Rock Band clone on the Atari using the tape drive, since it supports an audio track... but of course, that never went far. Working on something like this, you realize quickly that, no, those weren't really the good old days... not when load times were measured in minutes and writing anything really cool meant debugging some really hairy 6502 assembly language. Heck, Altirra essentially amounts to the equivalent of a hardware in-circuit emulator (ICE), and it's still a pain to track down bugs in a machine language program. So, basically, I'm throwing it out there just to see if anyone else has the same curiosities as me and finds it interesting. Source is, of course, available under the GPL, and if you've looked at the VirtualDub source, much of it will look familiar as I used the same base libraries. Enjoy!
I decided to work a bit more on my shader compiler and emulator last night, and ran into some unexpectedly ugly problems with floating point specials.
The first problem had to do with an innocent-looking expression: one that returns the magnitude of a vector. The conventional way to compute this is to compute the squared magnitude by taking the dot product of the vector with itself -- the vector analog of squaring a scalar -- and then taking the square root of the result. It turns out that shader hardware doesn't actually support a direct square root. The reciprocal square root, 1/sqrt(x), is easier to compute, and it's also useful in some other cases, most notably normalizing a vector. In this particular case, though, we need the normal square root, and therefore we need to invert it in the assembly:
dp3 r0.x, r_vec, r_vec ;compute dot(vec, vec)
rsq r0.x, r0.x ;compute reciprocal square root of squared magnitude
rcp r0.x, r0.x ;invert reciprocal square root
This little code fragment has a major surprise lurking in it, which may not be apparent until you try optimizing it. Both rsq and rcp are scalar instructions, so in cases where multiple magnitudes are being computed, the temptation is high to replace 1/rsqrt(x) with x*rsqrt(x) instead to take advantage of the much more available mul instruction:
dp3 r0.x, r_vec, r_vec
rsq r0.y, r0.x
mul r0.x, r0.x, r0.y
This works just fine... until you have the misfortune of trying to compute the length of a zero vector. In that case, the reciprocal square root operation returns +Infinity, and then the next thing that happens is the computation 0*+Infinity, which then returns a NaN (invalid result). Suck. Therefore, for the general case, that rcp has to stay there.
The real gotcha comes when you try implementing the rsq and rcp instructions in software. Reciprocal is a very slow instruction on most FPUs, being done with the divide unit and usually taking dozens of nonpipelineable clocks. Reciprocal square root may not even exist in full precision form, and 1/sqrt(x) is a horribly painful way to implement it. If you want to implement this fragment quickly in SSE, you need to take advantage of the rcpps and rsqrtps instructions, which are very fast and work in parallel on four values. They only provide limited precision up to about 2^-12, though. The WPF shader engine just goes ahead and uses the approximation result directly, which works and is accurate enough for half precision (10 bit mantissa), but technically it's not Direct3D compliant as 22 bits of precision are needed.
The usual way to improve the accuracy of the reciprocal and reciprocal square root operations is by iteration through Newton's Method. For the reciprocal, it looks like this:
x = reciprocal_approx(c);
x' = x * (2 - x * c);
...and for the reciprocal square root, it looks like this:
x = rsqrt_approx(c);
x' = 0.5 * x * (3 - x*x*c);
Assuming that you have a good estimate, these will tend to double the number of significant digits per iteration, which means that just one iteration will give us pretty good precision, and quickly, too. And unfortunately wrong, as I discovered when I implemented it. The problem is once again zero. In order for the zero case to work, we need rsq to transform 0 -> Inf and rcp to transform Inf -> 0, but thanks to the x*c expression in both of these iterations, you once again get 0 * Inf = NaN. The way I fixed it was to insert a couple of carefully placed min/max operations in the iteration, although I'm not quite sure they're 100% correct.
The specials struck again when I was trying to optimize the code generated for a gamma correction shader. Gamma correction is primarily a power operation, which expands as follows:
pow(x, y) = exp(y * log(x))
If you actually try gamma correcting an image in this manner, you'll be waiting a long time for the result. For limited precision (8-bit), you can get away with a lookup table, but that doesn't scale well to higher precision or vector computations, and definitely not to a shader where floating point is involved. Therefore, in order to get a faster version working, I had to implement the log and exp instructions, which compute the base 2 logarithm and exponential functions. SSE doesn't provide you any help to do this, so you're stuck implementing it from the ground up. It's a bit like an old BASIC interpreter, except at least you start with add and multiply. This triggered the following conversation with a friend:
"What are you doing?"
"I'm implementing the log() and exp() functions."
"Doesn't the runtime provide those?"
"This is the runtime."
Anyway, I ended up implementing exp2(x) = c*floor(x) + f(x - floor(x)) and log2(x) = c*exponent(x) + g(mantissa(x)). For the most part they're not too hard, as long as you find good polynomial expansions and make sure it's exact at the right values, i.e. you don't want exp(0) = 1.004. In the end, I used a fifth-order polynomial for exp2() and a first-order Padé approximation with a change of variables for log2()... but I digress.
As you have probably already guessed, zero rears its ugly head again here, because 0^y becomes exp2(y * log2(0)), and for this to work you need log2(0) = -Infinity and exp2(-Infinity) = 0. A couple of well-placed min/max operations in the expansions once again fixed the SSE version, but I unexpectedly ran into problems with the x87 version. I hadn't bothered to optimize the x87 version, so it simply called into expf() and scaled the result. Since I compile with /fp:fast /Oi, expf() ended up getting expanded in intrinsic form like this:
fsub st, st(1)
If you look closely, you'll see that this computes (x - round(x)), which in this case is -Inf - (-Inf) = NaN. There are two basic ways to fix this. One is to force the runtime library version of the function, which is faster for SSE2 but unfortunately quite slow for x87. The other way, which is what I did, is to just check for infinity and special-case the result.
Ordinarily I don't think much about floating point specials, and the last place I would have expected to find them is in a pixel shader. I have to say it's been a bit of a humbling (and frustrating) experience.
For a long time, I've been a single-monitor user, mainly because I've usually only had one monitor to use. This has only been reinforced by my transition to laptops. I'd tried dual monitors on a desktop before, and didn't like it at first, but once I figured out that you can run two monitors without spanning them -- which often requires a reboot to get working since monitor detection in Windows is lame -- I've gotten accustomed to it. Span mode sucks with two monitors because everything tends to pop up exactly between the monitors. With plain old two-monitor mode, though, windows maximize on a per monitor basis and everything works a bit better.
Except when you enable or disable monitors.
Applications like to save their window location on exit, because users don't like having to constantly resize the window on startup. (I grudgingly fixed this in VirtualDub a while back.) The tricky part is making sure that the application never ends up trying to restore a position off-screen, because it's almost impossible to get the window back. You're supposed to use SetWindowPlacement() to restore the window position, as it automatically adjusts the position if the window would otherwise end up off-screen. Frequently applications don't do this, though, and thus you get a bunch of likely failure cases, all of which result in the app restoring off-screen: the desktop resolution is lowered, a monitor is disabled or moved, or the app uses the wrong function and derives deep negative coordinates from a minimized window. As I tend to use Remote Desktop and do resolution testing a lot, I often hit these cases. WinAmp's a frequent victim.
There is a trick to rescuing such windows, though, as long as they appear on the taskbar or can otherwise be selected. I used to do Minimize All + Undo Minimize All (now Show the Desktop + Show Open Windows), but that wasn't very reliable. I've since found a better way to do it:
- Right-click on the window caption on the taskbar, or select it and use Alt+Space.
- If the Restore option is available, select it to pop the window out of minimized or maximized state.
- Choose the Move option.
- Hit an arrow key.
- Move the mouse.
The window should then pop on-screen and attach to the mouse, where it can then be dropped at a usable location via left-click.
The new VirtualDub Plugin SDK has been released and supersedes the old Filter SDK. You'll find it via the navigation links on the right, or on the main site, if you're reading this via a feed. This is intended for anyone writing plugins or even attempting to host plugins. The differences from the filter SDK are as follows:
- Describes plugin API functionality through the current version of VirtualDub (1.8.6).
- Contains new information about 64-bit plugins and input drivers.
- Reorganized and improved documentation in HTML Help (CHM) format.
- Includes project files for Visual C++ 6.0 and Visual Studio 2005 (easily convertible to VS2008).
- Includes new wrappers to make it easier to create video filters in object oriented form instead of procedural form.
If you've ever tried to write documentation like this before, you probably discovered like I did that it takes a lot of time. In fact, it often takes longer to document functionality than it takes to implement it, at least if you want to write docs that are a bit more useful and better formatted than slightly warmed over headers. Describing the meaning of symbols and calls in an API is one thing; actually writing coherent documentation that shows how they should be used to an end is another.
Another change in the documentation style is that the Plugin SDK omits most of the information from the Filter SDK about how to write optimized code, particularly the assembly/MMX portions. One reason is that it didn't fit well with the purpose of documenting the API, and another is that a lot of the advice in the filter SDK is outdated, given that it was written in the days of the 80486 and Pentium architectures, and it would take a lot of work to revise it. That isn't to say that the information isn't important, as the potential gains from vector optimization are as great today as they were years ago, but looking back at the old documentation I was dissatisfied with the treatment of the topic. That's probably a reflection of the way that I've evolved as a programmer over the years.