§ Introducing "warp resize"

In an earlier blog entry on video shaders I introduced an algorithm called warpedge as an example. It's a hybrid warpsharp+resize algorithm that attempts to make edges as crisp as possible. The algorithm's output has proven interesting enough that I made a VirtualDub video filter out of it:

Warp resize, version 1.1
Warp resize, version 1.1 source code

It only works when enlarging a video, and it requires a lot of CPU power. Read on for the algorithm description.

The basic "warp sharp" algorithm

Warp sharp is the name of an algorithm that I originally found coded as a filter for an image editing application called "The GIMP." It's based on the idea that if you can identify the edges in an image, you can warp the image to shrink the edges and thus make them appear sharper. I don't remember how the original code worked, but here's one way to implement such an algorithm: build a bump map from the edge strength of the image, blur the bump map to reduce noise sensitivity, take the gradient of the blurred bump map as a displacement map, and warp the image along those displacement vectors so that pixels near an edge are filled from the flat regions on either side, narrowing the transition.
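One plausible implementation of warp sharp, sketched in Python. This is my own reconstruction from the description in this entry, not the GIMP filter's actual code; the central-difference gradients, the 3x3 box blur, and the `strength` constant are arbitrary choices.

```python
import math

def warp_sharp(img, strength=1.0, blur_passes=1):
    """Warp-sharpen a grayscale image given as a list of rows of numbers."""
    h, w = len(img), len(img[0])

    def pix(y, x):
        # clamp-to-edge fetch from the source image
        return img[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    def bval(b, y, x):
        # clamp-to-edge fetch from an auxiliary map
        return b[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    # Pass 1: bump map = gradient magnitude via central differences
    bump = [[math.hypot((pix(y, x + 1) - pix(y, x - 1)) * 0.5,
                        (pix(y + 1, x) - pix(y - 1, x)) * 0.5)
             for x in range(w)] for y in range(h)]

    # Pass 2: blur the bump map (3x3 box) to reduce noise sensitivity
    for _ in range(blur_passes):
        bump = [[sum(bval(bump, y + dy, x + dx)
                     for dy in (-1, 0, 1) for dx in (-1, 0, 1)) / 9.0
                 for x in range(w)] for y in range(h)]

    def bilinear(fy, fx):
        # straight bilinear sample with clamped coordinates
        fy = min(max(fy, 0.0), h - 1.0)
        fx = min(max(fx, 0.0), w - 1.0)
        y0, x0 = int(fy), int(fx)
        y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
        ty, tx = fy - y0, fx - x0
        top = pix(y0, x0) * (1 - tx) + pix(y0, x1) * tx
        bot = pix(y1, x0) * (1 - tx) + pix(y1, x1) * tx
        return top * (1 - ty) + bot * ty

    # Passes 3+4: displacement map = gradient of the bump map; warp the image
    # by sampling *away* from the edge center (downhill on the bump map), so
    # flat-region values close in on the edge and the transition narrows.
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            dx = (bval(bump, y, x + 1) - bval(bump, y, x - 1)) * 0.5
            dy = (bval(bump, y + 1, x) - bval(bump, y - 1, x)) * 0.5
            row.append(bilinear(y - dy * strength, x - dx * strength))
        out.append(row)
    return out
```

On a soft horizontal ramp this steepens the transition while leaving flat areas and the exact edge center untouched.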

Problems with warp sharp

The first problem with warp sharp is that the length of the displacement map vectors is dependent upon the height of the bump map, which is in turn dependent upon the contrast of the original image. This means that the amount of narrowing of edges varies with local contrast, which produces uneven edges.

A second problem has to do with the warp. If the warp is done using straight interpolation, the algorithm will never output a pixel that is not in the infinitely interpolated version of the original image. In practice this means you can get "bumpiness" in edges in the warped image, since the warp interpolator doesn't do edge-directed interpolation. You can often see this in near-horizontal or near-vertical lines, where the interpolator creates gray pixels when the line straddles a scan line boundary. Unfortunately, while there are edge-directed interpolation algorithms, they usually don't produce continuous output. For example, Xin Li's New Edge-Directed Interpolation (NEDI) algorithm only produces a rigid 2:1 enlargement. This doesn't help with a displacement-map based warp, which requires fine perturbations in the sampling location.

A third problem is that since warp sharp essentially narrows transition regions, it has a tendency to crystallize images into discrete regions, so that the result resembles a Voronoi diagram. Blurring the bump map decreases the filter's sensitivity to noise and mitigates this somewhat.

The warp resize algorithm

To solve the luminance sensitivity, warp resize normalizes the gradient vectors produced in the second pass. This has the downsides of amplifying noise and failing on (0,0) vectors, so the normalization scale factor is lerped toward 1.0 as the vector becomes very short. Also, the gradient vectors are computed on the full-size version of the image, so that the normalization occurs after resampling. Otherwise, the resampling operation would denormalize gradient vectors, which would introduce artifacts into the warp on original pixel boundaries.
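The normalization-with-lerp can be sketched as follows. This is a minimal sketch of the idea, not the filter's code; `fade`, the vector length below which the blend toward 1.0 kicks in, is a made-up knob, since the entry doesn't give the real constant.

```python
import math

def normalize_gradient(gx, gy, fade=0.25):
    """Normalize a gradient vector, but lerp the scale factor toward 1.0 as
    the vector becomes very short, so that noise isn't amplified and a (0,0)
    vector doesn't divide by zero."""
    mag = math.hypot(gx, gy)
    if mag == 0.0:
        return 0.0, 0.0                          # nothing to normalize
    t = min(mag / fade, 1.0)                     # 1 = fully normalized
    scale = (1.0 - t) * 1.0 + t * (1.0 / mag)    # lerp(1.0, 1/mag, t)
    return gx * scale, gy * scale
```

Long vectors come out at unit length; very short (noise-level) vectors pass through nearly unchanged instead of being blown up to unit length.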

The issue with the warp interpolation is corrected using the following ugly hack: we know that the problem is worst between pixel boundaries, where the interpolator is "most inaccurate" with regard to edges. So what we do is compute the difference between the warped pixel and the interpolated pixel, and re-add some of that difference to emphasize it. The difference is greater where the slope of the edge is higher, and thus this tends to flatten out edges and make borders crisper.
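Per pixel, the hack amounts to something like this sketch, where `k`, the re-added fraction, is a hypothetical strength constant (the entry doesn't give the real value):

```python
def emphasize(warped, interpolated, k=0.5):
    """Re-add a fraction of (warped - interpolated) to the warped pixel.
    Where the edge slope is high the difference is large, so edges come out
    crisper; in flat areas the two samples agree and nothing changes."""
    v = warped + k * (warped - interpolated)
    return min(max(v, 0.0), 255.0)               # clamp to the pixel range
```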

Throw in a liberal amount of high-powered, slightly-expensive interpolation to taste.

Thus, the CPU version of the algorithm runs as a series of passes: a high-quality enlargement, gradient computation and normalization on the full-size image, the warp, and the difference re-add.

The source code is a bit of a mess because it interleaves the first six passes together as a bank of row filters in order to increase cache locality. If you are looking into reimplementing the algorithm, I highly recommend either working from the description above or starting with the HLSL code for the original GPU filter, which is much easier to understand.

As you might guess, the warp resize algorithm is better suited to cartoon-style imagery than natural images, although it can help on the latter too. There is also an issue with the filter sometimes introducing a gap in the center of edges, which splits edges in two; I haven't figured out exactly what causes this, but it is sometimes caused by the displacement vectors being too strong. What's weird is that the effect is sometimes very limited, such as a hair-width gap being seen in a 20-pixel wide edge after 4x enlargement. I think this may have to do with the gradient vector swinging very close to (0,0) as the current pixel position crosses the true center of an edge, at which point the filter can't do anything to the image.

The CPU version is approximately one-sixth of the speed of the GPU version, even with MMX optimizations. This is comparing a Pentium M 1.86GHz against a GeForce Go 6800, so it's a little bit unbalanced; a Pentium 4 or Athlon 64 would probably narrow this gap a bit. Also, the GPU version does nasty hacks with bilinear texture sampling hardware to get the final three passes running fast enough and with a low enough instruction count to fit in pixel shader 2.0 limits, so its accuracy suffers compared to the CPU version. If you look closely, you can see some speckling and fine checkerboard patterns that aren't present in the output of the CPU filter. Increasing the precision of the GPU filter to match would narrow the gap further.


Comments posted:

Again, being my 2nd post since you deleted my 1st, I assume it was too long and boring ;).

Anyway, you can increase the GPU precision all the way up to the point where there's absolutely no difference between a software solution and a hardware solution.

Now, if you are using an ATI card, you're out of luck.

However, if you're using an NVIDIA card, you are in luck.

You set the mipmap bias to -15, force trilinear, force no optimizations for trilinear, and so on.
That's all there is to it.
Or you can use point-based filtering instead of trilinear, but stuff way, way away from you in, say, a 3D environment will look sort of nasty because of all the little dots...
AF filtering is useless.

I won't bother explaining how to do it again; you deleted it last time.

Anyway, there is another way:
Disable mipmapping completely.
This is a lot slower than the above; it gives the same results as using point filtering and a -15 mipmap bias.

Btw, I suggest using a Quadro for this, as using a GeForce can sometimes get unwanted results (still good results, but not always 100% perfect).

Just render as a texture, use the GPU, etc. etc.
This is the basics: apply your code to the texture and let the above do its job.

One more thing: disable the damned overlay before doing anything...

NEO - 23 12 05 - 19:28

Sorry, but a two-page registry patch doesn't really fit here. :)

Now, as for tweaking the parameters:

Setting mipmap LOD bias to -15 essentially forces level 0 in all but the most extreme cases (>32K:1 texel-to-pixel density). At that point you don't even have mipmapping, much less trilinear, so you will take similar performance hits for thrashing the texture cache at far distances. While this does give sharper textures at moderate distances, you will get aliasing farther away, which generally looks worse ("the sparklies"). I would highly recommend NOT doing such an extreme tweak. Now, this assumes that the 3D application is using good mipmaps, which box filters do not produce; but high-quality mipmaps with anisotropic filtering enabled should be able to beat turning off mipmaps.
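To see why a -15 bias pins the selector to level 0 until roughly 32K:1, here's a toy model of mip selection. This is my own simplification: real hardware derives the density from screen-space derivatives, and the rounding rule and level count here are arbitrary.

```python
import math

def mip_level(texels_per_pixel, lod_bias=0.0, num_levels=12):
    """Toy mipmap selection: LOD = log2(texel-to-pixel density) plus the
    bias, rounded and clamped to the available levels."""
    lod = math.log2(max(texels_per_pixel, 1e-9)) + lod_bias
    return min(max(int(lod + 0.5), 0), num_levels - 1)
```

With a -15 bias, level 0 is selected all the way up to a 2^15 = 32K:1 density; only beyond that does level 1 finally appear.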

Anisotropic is most certainly not worthless. It is useless when you are looking directly at a surface, as the aspect ratio of the sampling area is 1:1 in that case, but when you are looking at a surface obliquely, anisotropic filtering can significantly increase the quality of the texturing without incurring the aliasing of a negative LOD bias or the blurring of trilinear alone. I won't claim that NVIDIA's or ATI's aniso is that great (one is not very powerful and the other is way too angle-dependent), but you can't say that 8:1 on a road surface isn't noticeable.

The precision issues I'm talking about, though, would be completely unaffected by changing those settings. VirtualDub has no need for mipmaps when blitting video, and thus neither mipmap LOD bias nor trilinear has any effect. What does affect the quality is the number of bits of subpixel precision in the bilinear interpolator and the depth of the textures used for the bicubic lookup tables. The lookup tables can be improved by switching to FP16 or FP32 textures, but the subpixel precision is a hardware limitation. Software rasterizers clearly win here, as they regularly have 8-16 fractional bits whereas the hardware may only have 4-8. It is possible to do a higher quality bilinear fetch through manual interpolation, but it's expensive (4 texlds, 1 add, 3 lerps).
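A sketch of that manual fetch, in Python rather than shader code (the shader version spends four texlds, one add for the neighbor coordinate, and three lerps); `frac_bits` here is a stand-in for the hardware interpolator's limited subpixel precision, and the texture layout is just a list of rows:

```python
def lerp(a, b, t):
    return a + (b - a) * t

def bilinear(tex, fy, fx, frac_bits=None):
    """Manual bilinear fetch: four point fetches and three lerps. With
    `frac_bits` set, the fractional position is quantized first, mimicking
    a hardware interpolator's limited subpixel precision. Coordinates are
    assumed to be in range."""
    h, w = len(tex), len(tex[0])
    y0, x0 = int(fy), int(fx)
    ty, tx = fy - y0, fx - x0
    if frac_bits is not None:
        q = 1 << frac_bits                       # e.g. only 4-8 bits in hardware
        ty, tx = round(ty * q) / q, round(tx * q) / q
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    top = lerp(tex[y0][x0], tex[y0][x1], tx)     # two fetches + first lerp
    bot = lerp(tex[y1][x0], tex[y1][x1], tx)     # two fetches + second lerp
    return lerp(top, bot, ty)                    # third, vertical lerp
```

At a 0.3 fractional offset between texels of value 0 and 16, full precision returns 4.8, while a 2-bit interpolator snaps the offset to 0.25 and returns 4.0 instead; that kind of error is what shows up as speckling and checkerboard patterns.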

Phaeron - 23 12 05 - 20:02

Can the GPU filters run within the plugin subsystem? What I really mean is, can these image filters render out during processing using the GPU?
Having the GPU do all the hard work preprocessing the video, and the CPU the hard work of encoding, would make the whole process go a lot faster, one would think.

Matariel - 26 12 05 - 13:30

Definitely possible, but not at the moment, and it would take more work to implement. Doing high-speed readback requires some double-buffering tricks, and some additional logic is required to ensure that a temporarily lost device doesn't trash the output video.

Whether or not using the GPU would be faster is a different question. For many image processing tasks the GPU is unquestionably faster, but this has to be balanced against the CPU overhead of managing the GPU and the overhead of pulling data on and off the card. CPU overhead for managing the GPU is significant with Direct3D, unfortunately. The readback situation used to be pretty bad but has gotten significantly better in recent years; AGP cards now regularly get 100MB/sec+ and PCIe cards can go far above that.

Phaeron - 28 12 05 - 18:55

A very stupid, very unrelated question regarding Pentium-Ms:

How come, at most, they will execute 2 instructions per cycle? According to Intel documentation, they should be able to do 3 like the P-4s; but I have tried very hard, and while it's easy for me to write trivial, not-doing-any-real-work code that achieves more than 2 instructions/cycle on a P-4 or a K7, the maximum I've been able to hit on a P-M has been 1.99 (that is, 2).

Of course it's still the fastest thing cycle-per-cycle; but this "non-real-world" limitation seems strange. The docs say it has 3 decoders and can retire 3 instructions per cycle...

As a note of curiosity, measured using the built-in performance counter, it gets 1.6 instructions retired per cycle when encoding to XviD (default settings) from MPEG-1 using VirtualDub :D

PM - 02 01 06 - 22:24

It is possible, but tricky. The key is that only two execution units are capable of doing ALU operations; the other three units are the load unit, store unit, and store address unit. That means the only way to attain 3 insns/clock on a Pentium M is to have at least one load or store operation in each triplet. For instance, MOV EAX, [ESP] / ADD EBX, EBX / ADD ECX, ECX should decode and execute in one clock.

I believe there is also a one-clock penalty for crossing instruction fetch blocks (16 bytes), though. Also, the retirement stage that writes results back into architectural registers can only retire three micro-ops per clock, so you also need all three of those instructions in each triplet to be one micro-op each.

Agner Fog's optimization tome is my original source for this very exact and hard-to-find information. It doesn't directly cover the Pentium M, but the execution architecture is still very similar to the PPro/PII/PIII.

Phaeron - 02 01 06 - 22:56

Thanks for the info, that's a really excellent paper. I won't be able to test it until Sunday, but I could swear I tested code with load/store instructions in various fashions...

Also, I tried with just NOPs, and "normal" code (which already got 1.99 IPC on P-Ms and almost 3 on P-4s) with interleaved NOPs. Both hit 1.99 IPC on the P-M, so all decoding units can decode them and I don't think they need any execution units :D

PM - 03 01 06 - 19:13

Nope, they do. NOPs are listed as executing in either port 0 or 1 on P6 architectures, and I'm seeing this on my own Pentium M. I can think of a number of reasons why they couldn't have been absorbed in the decoding stage, such as a page fault on the instruction fetch, a breakpoint on the NOP, or TF=1.

Note that the Pentium M is improved in several ways over the P6, which obsoletes some of the data in Agner's document. For instance, the PIII is execution bottlenecked at 2 clocks/insn when executing a string of independent ANDPS instructions, but the P-M can execute them at one per clock.

Phaeron - 03 01 06 - 22:11

Yeah - I'm hitting 2.74 IPC now :D

I made some wrong assumptions - in fact I didn't know ports 0 and 1 handled FP/MMX/SSE too... shame :)

Thanks. Oh! And K7s do eat NOPs at 3/cycle in the decoding stage - I just assumed everybody else would do that too... They are "optimized" on Intel chips nevertheless, since they are faster than "XCHG EAX, EAX" (1 uop vs. 3).

PM - 08 01 06 - 18:23
