The ARC file format

Finding comprehensive specs for a file format is often hard.

There are a few file formats that have really well thought out documentation, such as PNG. Then there's the archive format ARC, which I was trying to write a decoder for. The great thing about search engines on the Internet is their ability to magnify the popularity of a single document called "The ARC archive file format" that doesn't actually tell you most of what you need to know about the format, such as the bitstream ordering for bit packing.

Now, there is source code available for ARC extraction, but using source code isn't always the best idea. First, there can be legal issues if the licensing or origin of the code turns out to be unfavorable. Second, source code often isn't very good documentation and still needs to be reverse engineered a bit to extract the essential bits... assuming it's even comprehensive. As such, I avoid using source code as a reference if I can. Fortunately, the compression methods used here aren't crazy like arithmetic encoding and can be eyeballed in a hex editor for verification, so with some elbow grease the rest can be determined from existing archives and decoders.

Here's what I've determined:

Header fields

The ARC file format doesn't have a main file header, only individual headers for each file or block. The best doc I've found for the header is one off of Simtel:

http://cdn.simtel.net/pub/simtel/00/03/99/43/arc_file.inf

For the most part the header is documented pretty well, but this file has clarifications for several fields that are documented incorrectly or omitted elsewhere:

Compression mode 3 (packed)

A form of run-length encoding, where the byte 0x90 is used to denote a run. This means that like PCX encoding, it has the dubious honor of favoring some byte values over others. 0x90 0x00 is the escape for a literal 0x90, while any other follow byte is the total length of the run including the byte already emitted, i.e. the number of additional times to repeat the previously emitted byte, plus one. The wiki for The Unarchiver also documents this format as Rle90:

http://code.google.com/p/theunarchiver/wiki/Rle90Algorithm

The algorithm is also used in a pre-pass on many of the later compression algorithms.
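As a rough illustration of the scheme described above, here is a minimal Python sketch of a decoder. The function name rle90_decode is made up for this example. It assumes the follow byte counts the whole run including the byte already written, and that a run always repeats the last byte written to the output; how a run directly after an escaped 0x90 should behave isn't covered above, so that case in particular is a guess.

```python
def rle90_decode(data: bytes) -> bytes:
    """Decode a 0x90-marker run-length stream (sketch, see assumptions above)."""
    MARKER = 0x90
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        i += 1
        if b != MARKER:
            # Ordinary literal byte.
            out.append(b)
            continue
        if i >= len(data):
            raise ValueError("truncated stream: 0x90 with no follow byte")
        count = data[i]
        i += 1
        if count == 0:
            # 0x90 0x00 is the escape for a literal 0x90.
            out.append(MARKER)
        else:
            if not out:
                raise ValueError("run marker before any byte was emitted")
            # The previous byte is already in the output once, so add
            # count - 1 more copies (assumption: count is the total run length).
            out.extend(out[-1:] * (count - 1))
    return bytes(out)
```

For example, the three bytes 0x41 0x90 0x04 would expand to four copies of 0x41 under these assumptions.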

Compression mode 4 (squeezed)

A combination of run-length encoding (RLE) and Huffman coding. The run-length encoding is the same as Packed (mode 3) and is decoded after the Huffman decoder. The compressed format is as follows:

This form of tree encoding has the advantage that it's easy to read and can be used directly in decoding, but takes quite a bit of space, 1K for a full tree. Canonical Huffman tree encoding is a lot more efficient -- just using 4-bit lengths drops this to 128 bytes.
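To show why canonical encoding is so compact, here is a small Python sketch that rebuilds the codes from nothing but per-symbol bit lengths. This is the usual canonical assignment as used by DEFLATE, not anything ARC itself does, and the function name canonical_codes is just for illustration. With 256 symbols and 4 bits per length (so code lengths up to 15), the stored table is 256 * 4 / 8 = 128 bytes.

```python
def canonical_codes(lengths):
    """Assign canonical Huffman codes from per-symbol code lengths (0 = unused).

    Returns a dict mapping symbol index to its code as a bit string.
    """
    max_len = max(lengths)
    # Count how many symbols use each code length.
    bl_count = [0] * (max_len + 1)
    for n in lengths:
        if n:
            bl_count[n] += 1
    # Compute the smallest code value for each length.
    next_code = [0] * (max_len + 1)
    code = 0
    for bits in range(1, max_len + 1):
        code = (code + bl_count[bits - 1]) << 1
        next_code[bits] = code
    # Hand out consecutive codes to symbols of the same length, in symbol order.
    codes = {}
    for sym, n in enumerate(lengths):
        if n:
            codes[sym] = format(next_code[n], "0{}b".format(n))
            next_code[n] += 1
    return codes
```

Because the decoder can regenerate the same codes from the lengths alone, no explicit tree nodes need to be stored in the file.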

Compression mode 8 (crunched)

This is a combination of run-length encoding (RLE) and Lempel-Ziv-Welch (LZW) encoding. The run-length encoding is the same as Packed (mode 3) and is decoded after the LZW decoder. As usual, there are crucial details missing from most documents:

Comments

This blog was originally open for comments when this entry was first posted, but comments were later closed and then removed due to spam and a migration away from the original blog software. Unfortunately, it would have been a lot of work to reformat the comments to republish them. The author thanks everyone who posted comments and added to the discussion.