§ ¶How not to optimize with assembly language
Time to imitate Alex Papadimoulis.
While researching some algorithms on the web, I saw a forum poster "optimize" a routine written in a high-level language into assembly language. I won't post the whole function, but here's part of it:
; s1 = seed & 0xFFFF
xor eax, eax
add eax, seed
mov ebx, eax
and eax, 0FFFFh
mov eax, ebx
mov s1, eax
; s2 = (seed / 65536) & 0xFFFF
mov eax, seed
mov ebx, 10000h
xor edx, edx
div ebx
and eax, 0FFFFh
mov s2, eax
Now, I like assembly language, and the asm code is indeed correct compared to the original high-level code. However, it's a perfect example of a bad use of assembly language to optimize. Looking at it, there are a ton of missed opportunities, such as changing the divide to a shift, removing the and operation that's a no-op, and so on. Well, except that the statement immediately before the block is this:
mov seed, 1
Basically, the fragment above computes two constant expressions -- and no, there isn't a branch target in between. And actually, due to a bug, the code wouldn't actually work otherwise. (Hint: Errant mov.) Makes you wonder why the coder didn't just use this:
mov s1, 1
mov s2, 0
The rest of the translated function, by the way, was equally faithful and awful. The worst part was that the original source had a big comment block indicating how the algorithm could be easily sped up by an order of magnitude, which was of course ignored.
If you're going to go to assembly language for speed, your job is to optimize based on knowledge that the compiler lacks, such as restricted values in data, and not to act as a really slow compiler with no optimizer. Merely translating a routine verbatim into asm (and in this case, really bad asm) is a total waste of time.
Comments
Comments posted:
True, today even to compete with the optimizers, you need to have a very good grasp of asm and computer logic. Even switching certain small operations around so that you get the most cache hits can greatly speed up a procedure (this also applies to non-asm).
Now I just wish you could release your [very optimized] scaler routines (liniear/cubic/etc...) under Public Domain, or at least the BSD license so that they can be integrated into commercial applications.
Blight - 23 07 06 - 06:50
Er. Every production compiler these days will optimize out constant expressions, and every one will optimize a divide by 65536 to a shift (if the dividend is unsigned). The first step of assembly optimization: to look at the compiler's output ...
But "bad optimization strategies" is a well-beaten horse, and I think we're just picking on some random Intro to Assembly student. :)
Good optimization strategies are more interesting. For example, I've frequently dealt with code with intermittent hotspots, such as frame skips in a game, due to some piece taking longer than usual under some circumstances. I've long wondered about strategies for diagnosing this; I end up lacing the code with timers, which works but is very time-consuming. It's even worse when a slow function in one thread manifests as a skip in another thread (eg. mutex or CPU contention). Profilers are useless for all this; they tell you who takes the most time on average, but can't tell you about a function that takes 2ms most of the time and 10ms once every two minutes.
I want a profiler which I can programmatically control: profile particular blocks of code in detail, or "record the last interval of samples in full detail, and show me how it differs from the average" after detecting a skip. Given how few really good profilers there are (both in Windows and Linux), that seems like a pipe dream ...
Glenn Maynard - 23 07 06 - 14:43
@Blight:
I've worked on proprietary applications before, so I'm not opposed to the idea, and I have also relicensed portions of VirtualDub's source code before under less restrictive licenses on request. However, I am not in the business of providing libraries for commercial software, and I would need some incentive to relicense a major piece of code like the resampler. And no, getting my name out there is not enough -- someone tried that suggestion already. :)
If you want to write your own resampler, I can explain the theory and algorithms behind it, including the optimization techniques. There are some optimization techniques that I haven't applied to the existing code due to maintenance reasons.
@Glenn Maynard:
Everything is fair game in assembly language. :)
The simplest kind of profiler you could use for finding hitches is ye old frame graph -- one color bar per subsystem, stacked vertically with cumulative height being frame time, and scrolling horizontally once per frame. That lets you spot the hitches visually, and then to actually diagnose it you use a profiler that can log to a buffer that either gets dumped or saved each frame depending on how long the frame took.
Now, as for the instrumentation problem, _penter/_pexit comes to mind... the major downside being that they kill all inlining in the application, and thus massively slow it down. You can, however, do some neat tricks with those hooks, namely having _penter backpatch the jmp to which hook you want that function to call or even killing it, so that you don't have to dispatch in _penter. Another option is to get an instruction-level hook library like Detours, and use it to selectively instrument functions by name in the binary based on the symbol tables. That way you can hook uninstrumented functions, as long as they have a fairly tame prolog.
Phaeron - 23 07 06 - 15:42
If you want a reason to release it permissively, perhaps: to fill the gap, and let people stop reinventing that wheel. I recently had to write an audio resampler from scratch, merely because I couldn't find a permissively-licensed, reasonably fast, reasonable quality one, despite the fact that it's been done a hundred times. (They're all either copyleft or proprietary.) Image resizing (and conversion) has also been done and done again; a solid permissively-licensed implementation would let people stop doing that. (That's my philosophy, anyway, why I prefer to release reusable bits permissively: deliberate rewrites have their place, when code is unsalvagable, but license-forced rewrites are a waste.)
I have helpers for finding the frame skips themselves; that part is easy.
Every software profiler I've used slows things down substantially, which itself is a problem for this case: profiling frames that skipped at 60fps is much harder if the profiling is causing it to run at 20fps to begin with.
I havn't seen any profilers that can save per-frame logs (except some proprietary non-x86 hardware profilers), though the only ones I've used recently are gprof and VTune. If there's a profiler out there that won't slow the app down so much (even if it requires SMP), or can let me say "begin logging; end logging; discard last log/save last log", I'd love to know about it. (Even better: "begin; end; accumulate to log 2", to accumulate separate averages for the normal and skip cases.)
Glenn Maynard - 23 07 06 - 16:33
Over the years, I have released quite a bit of open-source code, but always under Public Domain. The reason (for me) was that certain bits of code that I wrote would benefit others while I myself don't care how it would get used. Especially super-optimized code where there's not much room for improvement, so expecting people to improve the code and send it back to me is not a major issue.
Also, it was up today on slashdot, so I think I'll mention it as well... The more optimized code you have out there, means the CPU works less, less energy is being used by the entire system, that engery is saved at the power plant and eventually, the entire planet uses less raw materials.
So, if my code "x" gets used by several applications, the overall global impact could be that I saved "x" tons of coal from being burnt. It's a round-about reasoning, but it's happening. Also, bringing up the level of all applications, commercial or otherwise is a good thing for the end user.
But were not leeches, I'll eMail you with a question directly.
Blight - 24 07 06 - 05:22