
§ And I thought my implementation of Deflate was bad

When I added PNG support to 1.7.0, I wrote my own routines to handle the Deflate compression algorithm. Deflate is also the compression algorithm underlying the well-known 'zip' format, and it is surprisingly complex, combining a sliding-window matching algorithm with three layers of Huffman trees. Writing a fast sliding-window compressor was quite an educational experience.
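The sliding-window side of such a compressor is usually built around a hash-chain match finder. A minimal sketch in Python (purely illustrative, not VirtualDub's actual code; Deflate's real match finder also inserts every position inside a match into the chains, which is skipped here for brevity):

```python
def find_matches(data, window=32768, min_len=3, max_len=258):
    """Greedy LZ77 tokenizer with a hash-chain match finder (Deflate-style limits)."""
    head = {}      # 3-byte prefix -> most recent position seen
    prev = {}      # position -> previous position with the same prefix
    tokens = []    # literals (int) or (distance, length) pairs
    i, n = 0, len(data)
    while i < n:
        best_len, best_dist = 0, 0
        if i + min_len <= n:
            key = data[i:i + min_len]
            pos = head.get(key)
            # Walk the chain of earlier positions inside the window.
            while pos is not None and i - pos <= window:
                length = 0
                while (length < max_len and i + length < n
                       and data[pos + length] == data[i + length]):
                    length += 1
                if length > best_len:
                    best_len, best_dist = length, i - pos
                pos = prev.get(pos)
            prev[i] = head.get(key)
            head[key] = i
        if best_len >= min_len:
            tokens.append((best_dist, best_len))
            i += best_len   # simplification: skipped positions are not hashed
        else:
            tokens.append(data[i])
            i += 1
    return tokens
```

Decoding is just the reverse walk: literals are copied through, and each (distance, length) pair copies bytes from the already-produced output, with overlapping copies allowed.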

Someone posted an interesting bug in the Visual Studio public bug database about the new zip file support enlarging files, so I decided to try .NET Framework's Deflate compressor:

using System;
using System.IO;
using System.IO.Compression;

namespace VCZipTest {
    class Program {
        static void Main(string[] args) {
            using (FileStream src = new FileStream(args[0], FileMode.Open)) {
                using (FileStream dst = new FileStream(args[1], FileMode.Create)) {
                    using (GZipStream def = new GZipStream(dst, CompressionMode.Compress, true)) {
                        byte[] buf = new byte[4096];
                        for (;;) {
                            int act = src.Read(buf, 0, 4096);
                            if (act <= 0)
                                break;
                            def.Write(buf, 0, act);
                        }
                    }

                    System.Console.WriteLine("{0} ({1} bytes) -> {2} ({3} bytes)",
                        args[0], src.Length, args[1], dst.Length);
                }
            }
        }
    }
}

D:\proj\win\VCZipTest\bin\Debug>vcziptest f:\shuffle.mp3 f:\test.bin.gz
f:\shuffle.mp3 (1439439 bytes) -> f:\test.bin.gz (2114778 bytes)

Yes, the .NET Framework implementation actually enlarged the file by 47%, and yes, gunzip was able to "decompress" the 2.1MB file correctly. I ran the file through VirtualDub's zip decompressor and found that the literal/length Huffman tree in the stream was pretty badly misbalanced, with most of the literal codes allocated 14 bits. Deflate should never enlarge a stream by more than a tiny fraction since it allows individual blocks to be stored.
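The stored-block escape hatch is easy to verify against zlib itself. A quick sketch (Python here for convenience; its zlib module wraps the same library gzip uses):

```python
import os
import zlib

data = os.urandom(1 << 20)   # 1 MiB of incompressible random bytes
packed = zlib.compress(data, 9)

# A conforming Deflate implementation falls back to stored blocks
# (a 5-byte header per block of up to 64 KiB of raw data), so even on
# random input the expansion is a tiny fixed overhead, nowhere near 47%.
overhead = len(packed) - len(data)
assert overhead < len(data) * 0.001
assert zlib.decompress(packed) == data
```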

In contrast, even "gzip -1" was able to compress the file from 1,439,439 bytes to 1,416,488 bytes, and gzip -9 got it down to 1,406,973 bytes.

Comments

Comments posted:


As per the MSDN docs:

"The compression functionality in DeflateStream and GZipStream is exposed as a stream. Data is read in on a byte-by-byte basis, so it is not possible to perform multiple passes to determine the best method for compressing entire files or large blocks of data. The DeflateStream and GZipStream classes are best used on uncompressed sources of data. If the source data is already compressed, using these classes may actually increase the size of the stream."

AndyC - 05 12 06 - 17:33


You are correct. However, a correctly functioning implementation of Deflate should never produce this amount of enlargement, even if it doesn't implement the stored block optimization, because the Huffman trees will only code what is used, and if there aren't any matches it simply won't allocate bits to those codes. In practice, just about any compressed file you grab will have a little bit of redundancy that Deflate can take advantage of to come out slightly ahead. Even if somehow there were a case where there were no matches and a completely balanced distribution, you'd still be able to build a literal/length tree consisting of 254 literals at 8 bits, one literal at 9 bits, and the end code at 9 bits, and be under 101% of original size. Based on the code statistics I was seeing in the decompressor, I'm pretty sure there's a bug somewhere in GZipStream's Huffman tree generator.
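The arithmetic behind that worst case is easy to check; using the tree described above (254 literals at 8 bits, one literal plus the end-of-block code at 9 bits) over a hypothetical flat byte distribution:

```python
# Worst case: no matches at all, perfectly flat byte distribution.
# Code lengths: 254 literals at 8 bits, 1 literal at 9, end-of-block at 9.
# Kraft sum: 254/256 + 2/512 < 1, so this is a valid prefix code.
n = 1_000_000                          # uncompressed size in bytes
avg_bits = (254 * 8 + 1 * 9) / 255     # ~8.004 bits per literal on average
payload_bits = n * avg_bits + 9        # data plus one end-of-block code
ratio = payload_bits / (n * 8)
assert 1.0 < ratio < 1.01              # under 101% even before block headers
```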

As for multiple passes, you don't need them, and zlib doesn't do them either. The way that a Deflate compressor generally works is that it builds a buffer of LZ tokens up to some threshold, builds Huffman trees based on the token statistics, encodes the tokens using the trees, and then flushes out the result. Repeat, rinse, lather until the source stream ends. During the process of building the Huffman trees it is simple to compute the number of bits that will result, and switch to either static trees or a stored block if it is a loss. In other words, you can't guarantee that there isn't a better set of blocks, but you CAN guarantee that you're never worse than the stored case. You do have to buffer the uncompressed portion of the data for that block in order to be able to write the stored block, but you need to maintain some buffer anyway in order to do the sliding window compression.
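That per-block decision reduces to a three-way cost comparison once the token statistics are in hand. A sketch with hypothetical helper names (the real logic lives in zlib's `_tr_flush_block`, where the bit counts come from the tree builder):

```python
def choose_block_type(raw_len, dyn_tree_bits, dyn_data_bits, static_data_bits):
    """Pick the cheapest Deflate block type for one buffered block.

    raw_len          - uncompressed bytes covered by this block
    dyn_tree_bits    - cost of transmitting the dynamic Huffman trees
    dyn_data_bits    - cost of the tokens coded with those trees
    static_data_bits - cost of the tokens coded with the fixed RFC 1951 trees
    """
    costs = {
        "stored": (5 + raw_len) * 8,               # 5-byte stored-block header
        "static": static_data_bits,
        "dynamic": dyn_tree_bits + dyn_data_bits,  # trees must be paid for too
    }
    return min(costs, key=costs.get)
```

This is exactly the guarantee described above: the stored case puts a hard ceiling on the output size no matter how badly the trees fit the data.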

Incidentally, the GZipStream implementation is also poorer on uncompressed data. I concatenated VirtualDub's entire source code into a 5.3MB text file, and ran it through both the above program and gzip on default settings. GZipStream compressed it to 1.3MB, while gzip got it to 1.1MB and also ran noticeably faster.

Phaeron - 05 12 06 - 23:39


The power of GNU software... But you could have a look at 7-zip's implementation of the Zip format (and their own 7z format is also quite impressive).

Mitch - 06 12 06 - 04:58


As I understand, VirtualDub has its own routines for things such as JPEG and PNG handling, instead of using already available ones. Is there a reason for this that we might know?

About the matter at hand - did you check if the trees that they are using are dynamic? Perhaps they always use the same tree, tuned for certain kind of data. How often is there a block? Perhaps they flush very often to avoid decoder latency, so you can use it for interactive stuff (a la SSH) without flushing manually.

Rich - 07 12 06 - 20:14


I wrote mine for a couple of reasons, one being to avoid dependencies, and another being to learn. I suspect the first is the main reason Microsoft didn't use zlib either.

GZipStream/DeflateStream always uses a single compressed block in "dynamic" tree mode. I checked the trees again, and guess what -- you're right! They're always the same trees, and in fact, they've been hardcoded for text, with control values getting 14 bits and spaces getting 6. This definitely explains why it performs so poorly on binary data, and why it's suboptimal for text as well, as it has lots of leaves for codes that aren't usually used, just in case they appear. Perhaps I missed it, but I didn't see this documented. zlib and gzip, of course, calculate tailored Huffman trees, and thus perform better.

So, moral of the story is, if you want to compress binary data, definitely find another facility. Heck, even P/Invoke to LZOpen() would probably be better.

Phaeron - 08 12 06 - 00:43
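For comparison with those hardcoded trees, the *fixed* trees that RFC 1951 (section 3.2.6) defines for static blocks show what a sane general-purpose allocation looks like; nothing gets anywhere near 14 bits. A small sketch (Python, purely illustrative):

```python
def fixed_litlen_lengths():
    """Fixed literal/length code lengths from RFC 1951, section 3.2.6."""
    lengths = []
    for sym in range(288):
        if sym <= 143:
            lengths.append(8)   # literals 0-143: 8 bits
        elif sym <= 255:
            lengths.append(9)   # literals 144-255: 9 bits
        elif sym <= 279:
            lengths.append(7)   # end-of-block and short lengths: 7 bits
        else:
            lengths.append(8)   # remaining length codes: 8 bits
    return lengths

lens = fixed_litlen_lengths()
# A complete prefix code satisfies the Kraft equality exactly.
assert sum(2.0 ** -l for l in lens) == 1.0
```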


After reading your article (which is outstanding), I'm adding binary files to my output file uncompressed, without difficulties, using a MemoryStream instead of a DeflateStream. For now I note the file extension (e.g. ".zip") to get the functionality I want. The header stores both CompressedSize and UncompressedSize, which I set equal for these uncompressed files, so that when decompressing, the bytes are pulled without going through DeflateStream. Huffman trees... great article.

John Askew - 22 01 07 - 15:29
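Storing already-compressed files raw like this maps directly onto the Zip container's compression method 0 ("stored"), where CompressedSize equals UncompressedSize by definition. For example, with Python's zipfile module:

```python
import io
import os
import zipfile

payload = os.urandom(4096)   # stand-in for an already-compressed file (JPEG, MP3, ...)

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    # Method 0 (stored): no Deflate pass, sizes in the header are equal.
    z.writestr("photo.jpg", payload, compress_type=zipfile.ZIP_STORED)

with zipfile.ZipFile(buf) as z:
    info = z.getinfo("photo.jpg")
    assert info.compress_size == info.file_size
    assert z.read("photo.jpg") == payload
```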


I was finding the zip system was inflating my zip files to a ridiculous degree when they contained JPG files. I really only wanted to use it as a convenient container, but a container that contains another 40% in packing peanuts is not convenient. Ended up using SharpZipLib instead, which actually managed to make the JPG files smaller, and had a super-convenient FastZip class that did it all in one line.

Vista's handling of zip files is hideously slow, too. Not sure if that's a UI issue, but whatever it is, it's depressing. They've had years to get it right . . .

Laurence Parry (link) - 29 01 07 - 10:51


ps: this problem has been reported to Microsoft: http://connect.microsoft.com/VisualStudi..

Cheeso (link) - 13 02 09 - 14:30


I have just released a new version of ZipStorer with Deflate support at: (removed)

Jaime Olivares - 30 07 09 - 15:13


I removed your link because as far as I can tell, your code only handles the Zip file format and delegates to DeflateStream to do the actual compression, which means you don't actually do anything about the problem described here. Am I incorrect?

Phaeron - 30 07 09 - 16:47


You get much better results if you push the whole stream through the compression stream at once instead of feeding it small blocks yourself.

Take a look here:

http://bytes.com/topic/net/answers/67948..

your mother - 28 09 09 - 02:11


If by better you mean "not abysmally bad," then yes. There's still no excuse for the PDF file growing by 20%. I'd also counter that the compression implementation should not be so highly sensitive to input block sizes; zlib certainly isn't.

Phaeron - 28 09 09 - 15:19
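zlib's insensitivity to write granularity is easy to demonstrate: feeding its streaming interface one byte at a time produces output identical to a single one-shot call, because no flushes occur mid-stream. A quick sketch using Python's zlib bindings (which wrap the same library):

```python
import zlib

data = b"The quick brown fox jumps over the lazy dog. " * 2000

# One-shot compression.
whole = zlib.compress(data, 6)

# Byte-at-a-time through the streaming interface, no intermediate flushes.
co = zlib.compressobj(6)
parts = [co.compress(data[i:i + 1]) for i in range(len(data))]
parts.append(co.flush())
chunked = b"".join(parts)

assert chunked == whole   # identical output regardless of write sizes
```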


"This definitely explains why it performs so poorly on binary data."

The problem that I ran into while compressing binary data is that when wrapping a BinaryWriter around the DeflateStream, the BinaryWriter's buffer is intrinsically small. So every write compresses to the stream instead of buffering a larger chunk of data. However, putting a BufferedStream with a decently sized buffer around the BinaryWriter/DeflateStream achieves better compression ratios.

I know this doesn't address the issue originally noted, but I thought someone might benefit from that knowledge.

Anon - 23 11 09 - 08:20
