Current version

v1.10.4 (stable)

Navigation

Main page
Archived news
Downloads
Documentation
   Capture
   Compiling
   Processing
   Crashes
Features
Filters
Plugin SDK
Knowledge base
Donate
Contact info
Forum
 
Other projects
   Altirra

Search

Archives

01 Dec - 31 Dec 2013
01 Oct - 31 Oct 2013
01 Aug - 31 Aug 2013
01 May - 31 May 2013
01 Mar - 31 Mar 2013
01 Feb - 29 Feb 2013
01 Dec - 31 Dec 2012
01 Nov - 30 Nov 2012
01 Oct - 31 Oct 2012
01 Sep - 30 Sep 2012
01 Aug - 31 Aug 2012
01 June - 30 June 2012
01 May - 31 May 2012
01 Apr - 30 Apr 2012
01 Dec - 31 Dec 2011
01 Nov - 30 Nov 2011
01 Oct - 31 Oct 2011
01 Sep - 30 Sep 2011
01 Aug - 31 Aug 2011
01 Jul - 31 Jul 2011
01 June - 30 June 2011
01 May - 31 May 2011
01 Apr - 30 Apr 2011
01 Mar - 31 Mar 2011
01 Feb - 29 Feb 2011
01 Jan - 31 Jan 2011
01 Dec - 31 Dec 2010
01 Nov - 30 Nov 2010
01 Oct - 31 Oct 2010
01 Sep - 30 Sep 2010
01 Aug - 31 Aug 2010
01 Jul - 31 Jul 2010
01 June - 30 June 2010
01 May - 31 May 2010
01 Apr - 30 Apr 2010
01 Mar - 31 Mar 2010
01 Feb - 29 Feb 2010
01 Jan - 31 Jan 2010
01 Dec - 31 Dec 2009
01 Nov - 30 Nov 2009
01 Oct - 31 Oct 2009
01 Sep - 30 Sep 2009
01 Aug - 31 Aug 2009
01 Jul - 31 Jul 2009
01 June - 30 June 2009
01 May - 31 May 2009
01 Apr - 30 Apr 2009
01 Mar - 31 Mar 2009
01 Feb - 29 Feb 2009
01 Jan - 31 Jan 2009
01 Dec - 31 Dec 2008
01 Nov - 30 Nov 2008
01 Oct - 31 Oct 2008
01 Sep - 30 Sep 2008
01 Aug - 31 Aug 2008
01 Jul - 31 Jul 2008
01 June - 30 June 2008
01 May - 31 May 2008
01 Apr - 30 Apr 2008
01 Mar - 31 Mar 2008
01 Feb - 29 Feb 2008
01 Jan - 31 Jan 2008
01 Dec - 31 Dec 2007
01 Nov - 30 Nov 2007
01 Oct - 31 Oct 2007
01 Sep - 30 Sep 2007
01 Aug - 31 Aug 2007
01 Jul - 31 Jul 2007
01 June - 30 June 2007
01 May - 31 May 2007
01 Apr - 30 Apr 2007
01 Mar - 31 Mar 2007
01 Feb - 29 Feb 2007
01 Jan - 31 Jan 2007
01 Dec - 31 Dec 2006
01 Nov - 30 Nov 2006
01 Oct - 31 Oct 2006
01 Sep - 30 Sep 2006
01 Aug - 31 Aug 2006
01 Jul - 31 Jul 2006
01 June - 30 June 2006
01 May - 31 May 2006
01 Apr - 30 Apr 2006
01 Mar - 31 Mar 2006
01 Feb - 29 Feb 2006
01 Jan - 31 Jan 2006
01 Dec - 31 Dec 2005
01 Nov - 30 Nov 2005
01 Oct - 31 Oct 2005
01 Sep - 30 Sep 2005
01 Aug - 31 Aug 2005
01 Jul - 31 Jul 2005
01 June - 30 June 2005
01 May - 31 May 2005
01 Apr - 30 Apr 2005
01 Mar - 31 Mar 2005
01 Feb - 29 Feb 2005
01 Jan - 31 Jan 2005
01 Dec - 31 Dec 2004
01 Nov - 30 Nov 2004
01 Oct - 31 Oct 2004
01 Sep - 30 Sep 2004
01 Aug - 31 Aug 2004

Stuff

Powered by Pivot  
XML: RSS feed 
XML: Atom feed 

§ How not to optimize with assembly language

Time to imitate Alex Papadimoulis.

While researching some algorithms on the web, I saw a forum poster "optimize" a routine written in a high-level language into assembly language. I won't post the whole function, but here's part of it:

; s1 = seed & 0xFFFF
    xor   eax, eax
    add   eax, seed
    mov   ebx, eax
    and   eax, 0FFFFh
    mov   eax, ebx
    mov   s1, eax
; s2 = (seed / 65536) & 0xFFFF
    mov   eax, seed
    mov   ebx, 10000h
    xor   edx, edx
    div   ebx
    and   eax, 0FFFFh
    mov   s2, eax

Now, I like assembly language, and the asm code is indeed correct compared to the original high-level code. However, it's a perfect example of a bad use of assembly language to optimize. Looking at it, there are a ton of missed opportunities, such as changing the divide to a shift, removing the and operation that's a no-op, and so on. Well, except that the statement immediately before the block is this:

    mov   seed, 1

Basically, the fragment above computes two constant expressions -- and no, there isn't a branch target in between. And actually, due to a bug, the code wouldn't actually work otherwise. (Hint: Errant mov.) Makes you wonder why the coder didn't just use this:

    mov   s1, 1
mov s2, 0

The rest of the translated function, by the way, was equally faithful and awful. The worst part was that the original source had a big comment block indicating how the algorithm could be easily sped up by an order of magnitude, which was of course ignored.

If you're going to go to assembly language for speed, your job is to optimize based on knowledge that the compiler lacks, such as restricted values in data, and not to act as a really slow compiler with no optimizer. Merely translating a routine verbatim into asm (and in this case, really bad asm) is a total waste of time.

Comments

Comments posted:


True, today even to compete with the optimizers, you need to have a very good grasp of asm and computer logic. Even switching certain small operations around so that you get the most cache hits can greatly speed up a procedure (this also applies to non-asm).

Now I just wish you could release your [very optimized] scaler routines (liniear/cubic/etc...) under Public Domain, or at least the BSD license so that they can be integrated into commercial applications.

Blight - 23 07 06 - 06:50


Er. Every production compiler these days will optimize out constant expressions, and every one will optimize a divide by 65536 to a shift (if the dividend is unsigned). The first step of assembly optimization: to look at the compiler's output ...

But "bad optimization strategies" is a well-beaten horse, and I think we're just picking on some random Intro to Assembly student. :)

Good optimization strategies are more interesting. For example, I've frequently dealt with code with intermittent hotspots, such as frame skips in a game, due to some piece taking longer than usual under some circumstances. I've long wondered about strategies for diagnosing this; I end up lacing the code with timers, which works but is very time-consuming. It's even worse when a slow function in one thread manifests as a skip in another thread (eg. mutex or CPU contention). Profilers are useless for all this; they tell you who takes the most time on average, but can't tell you about a function that takes 2ms most of the time and 10ms once every two minutes.

I want a profiler which I can programmatically control: profile particular blocks of code in detail, or "record the last interval of samples in full detail, and show me how it differs from the average" after detecting a skip. Given how few really good profilers there are (both in Windows and Linux), that seems like a pipe dream ...

Glenn Maynard - 23 07 06 - 14:43


@Blight:
I've worked on proprietary applications before, so I'm not opposed to the idea, and I have also relicensed portions of VirtualDub's source code before under less restrictive licenses on request. However, I am not in the business of providing libraries for commercial software, and I would need some incentive to relicense a major piece of code like the resampler. And no, getting my name out there is not enough -- someone tried that suggestion already. :)

If you want to write your own resampler, I can explain the theory and algorithms behind it, including the optimization techniques. There are some optimization techniques that I haven't applied to the existing code due to maintenance reasons.

@Glenn Maynard:
Everything is fair game in assembly language. :)

The simplest kind of profiler you could use for finding hitches is ye old frame graph -- one color bar per subsystem, stacked vertically with cumulative height being frame time, and scrolling horizontally once per frame. That lets you spot the hitches visually, and then to actually diagnose it you use a profiler that can log to a buffer that either gets dumped or saved each frame depending on how long the frame took.

Now, as for the instrumentation problem, _penter/_pexit comes to mind... the major downside being that they kill all inlining in the application, and thus massively slow it down. You can, however, do some neat tricks with those hooks, namely having _penter backpatch the jmp to which hook you want that function to call or even killing it, so that you don't have to dispatch in _penter. Another option is to get an instruction-level hook library like Detours, and use it to selectively instrument functions by name in the binary based on the symbol tables. That way you can hook uninstrumented functions, as long as they have a fairly tame prolog.

Phaeron - 23 07 06 - 15:42


If you want a reason to release it permissively, perhaps: to fill the gap, and let people stop reinventing that wheel. I recently had to write an audio resampler from scratch, merely because I couldn't find a permissively-licensed, reasonably fast, reasonable quality one, despite the fact that it's been done a hundred times. (They're all either copyleft or proprietary.) Image resizing (and conversion) has also been done and done again; a solid permissively-licensed implementation would let people stop doing that. (That's my philosophy, anyway, why I prefer to release reusable bits permissively: deliberate rewrites have their place, when code is unsalvagable, but license-forced rewrites are a waste.)


I have helpers for finding the frame skips themselves; that part is easy.

Every software profiler I've used slows things down substantially, which itself is a problem for this case: profiling frames that skipped at 60fps is much harder if the profiling is causing it to run at 20fps to begin with.

I havn't seen any profilers that can save per-frame logs (except some proprietary non-x86 hardware profilers), though the only ones I've used recently are gprof and VTune. If there's a profiler out there that won't slow the app down so much (even if it requires SMP), or can let me say "begin logging; end logging; discard last log/save last log", I'd love to know about it. (Even better: "begin; end; accumulate to log 2", to accumulate separate averages for the normal and skip cases.)

Glenn Maynard - 23 07 06 - 16:33


Over the years, I have released quite a bit of open-source code, but always under Public Domain. The reason (for me) was that certain bits of code that I wrote would benefit others while I myself don't care how it would get used. Especially super-optimized code where there's not much room for improvement, so expecting people to improve the code and send it back to me is not a major issue.

Also, it was up today on slashdot, so I think I'll mention it as well... The more optimized code you have out there, means the CPU works less, less energy is being used by the entire system, that engery is saved at the power plant and eventually, the entire planet uses less raw materials.

So, if my code "x" gets used by several applications, the overall global impact could be that I saved "x" tons of coal from being burnt. It's a round-about reasoning, but it's happening. Also, bringing up the level of all applications, commercial or otherwise is a good thing for the end user.

But were not leeches, I'll eMail you with a question directly.

Blight - 24 07 06 - 05:22

Comment form


Please keep comments on-topic for this entry. If you have unrelated comments about VirtualDub, the forum is a better place to post them.
Name:  
Remember personal info?

Email (Optional):
Your email address is only revealed to the blog owner and is not shown to the public.
URL (Optional):
Comment: /

An authentication dialog may appear when you click Post Comment. Simply type in "post" as the user and "now" as the password. I have had to do this to stop automated comment spam.



Small print: All html tags except <b> and <i> will be removed from your comment. You can make links by just typing the url or mail-address.