
§ They called _what_ in the inner loop??

AMD just open-sourced the AMD Performance Library as Framewave, which at least from my perspective seems like a good thing. Not that I'm going to attempt to use it, but I perused the source out of curiosity, and it looks like there are some useful goodies in there.

And then there's some... marginal stuff.

One thing I wanted to look at was their 8x8 2D inverse discrete cosine transform (IDCT), which is used in a number of video compression formats. There are a million ways to implement it quickly, and although everyone's seen Intel's AP-922 SSE2 algorithm by now, I hadn't seen one from AMD before. So I grab the source and dig around in the JPEG module, and I see this:

int IdctQuant_LS_SSE2(const Fw16s *pSrc, Fw8u *pDst, int dstStp, const Fw16u *pQuantInvTable)
{
    ...
    pedx = (Fw16s *) fwMalloc(128); // array of 64 Fw16s

Who the #*@&*( calls malloc() in an optimized IDCT routine???

It looks like there are indeed a number of well-optimized SSE2 routines in the Framewave library, but after seeing things like the above a few times I was left scratching my head a bit....

Another ugliness I saw, which unfortunately isn't restricted to Framewave, is assembly language routines that have been mechanically translated to intrinsics. The result is a nasty C++ routine that has variables like "pedx" and "pesi," but has instruction names translated so that what used to be an understandable "paddw" is now "_mm_add_epi16." I know this was a hack job for portability, but the result sure is unreadable.

Comments

Well - their library seems to seriously lack optimization for static operations. Their resizers can't retain their coefficients, so there's no possibility of re-use.

A much bigger WTF is that their resizers split the image without overlap, so you get incorrect interpolation where the library splits it. You might also enjoy knowing that they call 4 mallocs per thread per resize. 16-bit resizes convert to and from float in the inner loop. Wow. ;)

Maintaining this beast seems horrible. The 8-bit unsigned _cubic_ resizer alone is 1488 lines of code - yes, 1488. It seems they just gave up on Lanczos and implemented it in plain C++.

Colorspace conversions have no interpolation whatsoever. No interlaced 4:2:0 RGB - nice - you have to separate fields yourself - but considering their lack of interpolation, that shouldn't be much of a concern.

These are the only ones I've looked at - and honestly - I would only use it for prototyping, since it's easy to use and (presumably) well tested.

Klaus Post (link) - 21 02 08 - 06:10


Is fwMalloc == malloc? Are they using a custom preallocated memory pool to ensure locality or something?

Nick S - 21 02 08 - 06:23


It would've been nice to actually explain what the problem is with calling malloc and what a better implementation would have been.

Calvin K - 21 02 08 - 07:11


Until we see what fwMalloc() actually does, there is no valid point in bashing it. If it's using some sort of fast pool allocator, I don't see anything wrong with it.

Bojan - 21 02 08 - 07:55


From the little digging I did, it seems to be as follows:

    void *STDCALL fwMalloc(int length) { return AlignedMalloc(length); }

where AlignedMalloc is platform-specific:

    Windows: return _aligned_malloc(length, 32);
    OS X:    void *memptr = malloc(length); return memptr;
    Linux:   void *memptr; posix_memalign(&memptr, 32, length); return memptr;
    Solaris: void *memptr = memalign(32, length); return memptr;

So it seems they all call memory allocators that return an aligned piece of memory. It does seem quite wasteful to call that in an otherwise tight routine, but I don't know nearly enough to say how else you could get aligned memory.

Jarkko Miettinen - 21 02 08 - 09:42


how else you can get aligned memory
=> allocate an oversized buffer and use only the aligned middle of it.

Tobias Rieper - 21 02 08 - 10:03


Tobias: Presumably, that's exactly what the system calls do, too.

Jake McArthur (link) - 21 02 08 - 10:24


I've seen stuff like that pulled only in research code so far. Anything you're going to use often that needs temporary buffers should allocate them once, in advance, or at least be gracious enough to let you specify the temp buffer manually to avoid crap like that.

Stefan - 21 02 08 - 11:18


malloc() is really expensive, for many reasons. It's an external routine, it may have to lock, and it has to deal with requests for variable memory sizes. In a case like this, where you have a fixed-size allocation with local scope, you can handle the allocation privately much, much faster than any sort of general purpose allocator.

I suppose it's not obvious to people who haven't written this kind of routine before, but for a relatively small buffer like this it's much faster to just allocate it on the stack as a local variable. There are a number of advantages, including: the allocation failure case is tightly controlled (generally impossible), it requires minimal code, the memory is likely already in L1 cache, you can control alignment against other buffers to reduce cache line address aliasing, and you don't need an extra register to point to it. C++ compilers generally provide a way to enforce alignment on local variables, such as __declspec(align(n)) in VC++, and in C# you can use stackalloc in an unsafe method. You can make the allocation call hurt less by implementing a fast allocator, but there's no way you'll beat stack allocation, even with a super-fast thread-local implementation or even a compacting garbage collector. The function call alone is significant in a routine as critical as an IDCT, which commonly executes in less than 2,000 clocks.

Phaeron - 21 02 08 - 22:53


I just took a look at the cubic and Lanczos resamplers that Klaus mentioned, and they're worse than he lets on -- in fact, they're implemented as direct 2D resamplers. For the cubic, this results in 16 multiplies / 16 adds per pixel, versus 8m7a for a row/column implementation. Memory bandwidth issues _might_ even that out. I have a hard time believing that holds for the Lanczos case, though, which uses a 6x6 kernel and where the difference is roughly 3:1 in ALU ops (36m35a vs. 12m11a).

Quality looks like it might be an issue as well. They're using pmullw to do the weighting, which means they had to use only 6 bits of fractional precision for the weights. (I don't remember what Avisynth uses, but VirtualDub uses 14 bits since it uses pmaddwd.) That's somewhat marginal for a 1D implementation, and really sketchy for a 2D kernel... and they're also converting the weights to fixed point using truncation instead of rounding. They do at least renormalize the weights _after_ conversion to integer; skipping that is a common mistake.

The data flow is also a bit complex for the 8bpp case. Rather than using slightly larger pre-aligned kernels, they use a scalar loop to copy the pixels into a temporary buffer and then load that into SSE2 registers. That's a bunch of load/store traffic and a series of store forwarding stalls. They seem to be doing this in order to compute 8 pixels in parallel and not have to do a horizontal add, but I wonder if this is actually faster than using pmaddwd and then a four-way, 32-bit horizontal add.

I'd like to point out that it wasn't my intention to rip on the Framewave library in general, just on the particular IDCT routine. It does look like a number of routines in the library are underoptimized, however. I think one goal of the library is to be API compatible with the Intel Performance Primitives (IPP) library, though, so we should give leeway where the original Intel API lacks flexibility. Does anyone know how well Intel IPP fares in comparison?

Phaeron - 22 02 08 - 00:37


For the record, Avisynth also uses 14-bit coefficients. I rewrote the vertical resizer as a dynamic SSSE3 version, which uses pmulhrsw, so the accumulator fraction is only 6 bits. It falls back to MMX if there are more than 4 taps.

Klaus Post - 22 02 08 - 15:41


"Another uglyness I saw, which isn't restricted to Framewave unfortunately, is assembly language routines that have been translated to intrinsics. The result is a nasty C++ routine that has variables like "pedx" and "pesi," but has instruction names translated so that what used to be an understandable "paddw" is now "_mm_add_epi16." I know this was a hack job for portability, but the result sure is unreadable."
Though I can understand why this was needed; the VC++ x64 compiler does not support inline assembly.

Yuhong Bao - 27 02 08 - 20:41


"Does anyone know how well Intel IPP fares in comparison?"

I haven't tested those particular functions but I hear there is a free trial of IPP :)

Igor Levicki (link) - 20 04 08 - 23:56
