fcomp
Fox Compression 2: Size is everything
by Bluefox Icy

This algorithm is copyright Bluefox Icy under the LGPL as it exists on
April 28, 2003 (LGPL v2.0).

Definitions:

Base 0: A number that starts at 0. So 1-byte base 0 counters range from
        0 to 255.

Base 1: A number that starts at 1. So 1-byte base 1 counters range from
        1 to 256, with 0x00 being 1, 0x01 being 2, and 0xFF being 256.

Order of Redundancy: A numeric count of how many times a specific string
        is used to reference other data in a file's compressed output.

fcomp uses a compression algorithm that is focused on speed only. It uses
little RAM, and was intended for kernel-level RAM compression and for
packing executable files on 1 MHz 6502 processors (its creation purpose).

fcomp2 is quite different. fcomp2 uses compression and decompression
routines by Bluefox Icy. It was created literally with size in mind only.
fcomp2 compression should be sane, but there's always the warning that I
wrote the Boyer-Moore search myself, and someone really needs to FSA-prove
that it works. Replace it with KMP, brute force, or your own equivalent BM
if you want a 100% guarantee.

fcomp2 compressed data is a mixed backpointer/table data format.
Officially, fcomp2 is a floating-size backpointer format; the backpointer
size and maximum distance are capable of switching along the way. Most
implementations will want to use a fixed-size backpointer though, usually
24 bit (16 MB).

fcomp2 uses a pointer table and various pointer types. The pointer types
are quite diverse. A general pointer container looks like so:

TYPE   SIZE  DESC
Byte   1     Type of pointer
X      X     The pointer itself
3Byte  3     24 bit forward reference indicating the location of the next
             pointer. It is base 0, and a 0 indicates EOS (just like in
             fcomp).

Literal pointers (1) point back into the stream and may be 8, 16, 24, or
32 bit. These pointers are standard go-back-and-copy-N pointers, like
fcomp uses. They provide a pointer within the window that the scan is
using.

TYPE   SIZE  DESC
Byte   1     Size N in bytes of pointer data
NByte  N     Distance back to look, base 1 (0 == 1, range is 1..2^N)
NByte  N     Distance to copy, base 1 (0 == 1, range is 1..2^N)

Table pointers (2) are extra power. Table pointers are pointers that
reference a table which holds the absolute locations in the stream (not
including the table) of data that each pointer points to. In reality, all
back pointers can be replaced with table pointers. Please note that by
stream, we mean the actual pointer-and-data stream that is processed in
the compression and NOT the supplementary header or table.

TYPE   SIZE  DESC
Byte   1     Size N in bytes of pointer data. Increasing the size
             increases your range in the table, i.e. 1 byte can access
             entries 0..255, 2 bytes can access 0..65535, 3 bytes can
             access 0..16777215, and so on.
NByte  N     Index of the entry to copy
NByte  N     Number of bytes base 0 to skip from the beginning
NByte  N     Number of bytes base 1 to copy

The final type of pointer is a NULL (0) pointer. If the pointer is null,
it simply is blank, i.e. X = 0 in the general pointer container
definition. A null pointer is always 4 bytes long including the general
pointer container defined above, and 0 bytes long not including it. It
looks like this:

TYPE   SIZE  DESC

Yes, this is complete.

The stream format is:

fcomp2 stream:
TYPE     SIZE  DESC
3Byte    3     Length of X, base 0 (for consistency)
stream   X     A bunch of bytes to take as pure stream
Pointer  Y     A pointer inside a general pointer container

The last two repeat until EOS, which is where a general pointer container
has its next-pointer forward reference set to 0.
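To make the stream walk concrete, here is a minimal decompressor sketch in
C. It is an illustration of one reading of the layout above, not the
author's implementation: it assumes multi-byte fields are little-endian,
that the 24-bit forward reference is the base-0 offset of the next pointer
container from the start of the stream (with the bytes in between taken as
literals), and that table entries are offsets into the decompressed
output. Names are invented for the example, and all the bounds checking a
real decoder would need is omitted.

/*
 * fcomp2_decode.c -- minimal fcomp2 stream decoder sketch (assumptions
 * noted above).
 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read an n-byte little-endian unsigned value (n = 1..4). */
static uint32_t read_le(const uint8_t *p, size_t n)
{
    uint32_t v = 0;
    for (size_t i = 0; i < n; i++)
        v |= (uint32_t)p[i] << (8 * i);
    return v;
}

/* Decompress one fcomp2 stream into 'out', which the caller has sized to
 * the uncompressed length.  'table' is the fcomp2 table, already loaded.
 * Returns the number of bytes written. */
size_t fcomp2_decompress(const uint8_t *stream, const uint32_t *table,
                         uint8_t *out)
{
    size_t in = 0, o = 0;

    /* Leading 3-byte literal length (base 0), then that many raw bytes. */
    uint32_t lit = read_le(stream, 3);
    in += 3;
    memcpy(out + o, stream + in, lit);
    in += lit;
    o += lit;

    for (;;) {
        uint8_t type = stream[in++];       /* general container: type byte */

        if (type == 1) {                   /* literal (back) pointer */
            size_t n = stream[in++];       /* pointer data size N */
            uint32_t dist = read_le(stream + in, n) + 1;      /* base 1 */
            uint32_t len  = read_le(stream + in + n, n) + 1;  /* base 1 */
            in += 2 * n;
            for (uint32_t i = 0; i < len; i++)   /* overlap-safe LZ copy */
                out[o + i] = out[o - dist + i];
            o += len;
        } else if (type == 2) {            /* table pointer */
            size_t n = stream[in++];
            uint32_t idx  = read_le(stream + in, n);
            uint32_t skip = read_le(stream + in + n, n);          /* base 0 */
            uint32_t len  = read_le(stream + in + 2 * n, n) + 1;  /* base 1 */
            in += 3 * n;
            for (uint32_t i = 0; i < len; i++)
                out[o + i] = out[table[idx] + skip + i];
            o += len;
        }
        /* type 0: NULL pointer, no body at all. */

        /* 24-bit forward reference: offset of the next pointer container;
         * 0 means end of stream; the gap up to it is literal data. */
        uint32_t next = read_le(stream + in, 3);
        in += 3;
        if (next == 0)
            break;
        uint32_t gap = next - (uint32_t)in;
        memcpy(out + o, stream + in, gap);
        in += gap;
        o += gap;
    }
    return o;
}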
The stream should always end with a pointer whose forward reference is 0,
whether Literal, Table, or NULL.

The fcomp2 table is pretty simple. It holds offsets into the stream of
data, which are read by a decompressor and used to give further
compression/decompression performance. This works when things are out of
reach of backpointers. Also, a very good implementation may take the time
to optimize the table and stream; that is, it may move all the most
commonly used table entries to the beginning and use the smallest pointer
data size possible.

The fcomp2 table is in itself not built for size. Data of multiple
redundancy is initially entered into the table as a 32 bit (4 byte)
offset. The length is not stored; the compressor will find an offset from
this, within the reach of the next table entry, at compression time, while
the decompressor simply reads entry, skip, and copy. To do this more
easily, the compressor should keep the data table in the order the data
appears in the file until compression is finished, and then sort the table
based on order of redundancy (aka how many times each entry is used) from
greatest to least. As the table is sorted, the stream and table both may
be adjusted so as to squeeze out more space (i.e. the most used table
entries become reachable as 8 bit pointers, and changing the pointers to
reference them with 8 bit storage shrinks the compressed stream). Note
that altering the size of the compressed stream should not alter the
table, as it is in reference to the uncompressed stream.

fcomp2-table:
TYPE   SIZE  DESC
Dword  4     A 32 bit offset of the data in the uncompressed stream

An fcomp2 data file contains a few things. First are 2 longs: the table
and stream offsets in the file. Second is some data, the table, and the
stream. The "some data" can be anything (encryption flags, CRCs, etc).

fcomp2 data file:
TYPE           SIZE  DESC
DWord          4     32 bit offset base 0 (this is AT 0) of where the
                     table is in the file
DWord          4     32 bit base 0 offset of the fcomp2 stream
stream         X     Whatever you want until the table occurs
fcomp2-table   Y     The table
fcomp2-stream  Z     The stream. The beginning of this is exactly where a
                     pointer to offset 0 in the table would point to.
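A short sketch of how a decoder might load that container follows. The two
dword offsets are read exactly as laid out above; the assumption that the
table ends where the stream begins (so the entry count is (stream offset -
table offset) / 4), the little-endian byte order, and all the names are
mine, and error handling is minimal.

/* fcomp2_container.c -- sketch of reading the fcomp2 data file wrapper. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct fcomp2_file {
    uint32_t *table;      /* 32-bit offsets into the uncompressed stream */
    uint32_t  entries;    /* assumed: (stream_off - table_off) / 4 */
    uint8_t  *stream;     /* the fcomp2 pointer-and-data stream */
    long      stream_len; /* bytes from stream_off to end of file */
};

static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

int fcomp2_open(const char *path, struct fcomp2_file *f)
{
    FILE *fp = fopen(path, "rb");
    if (!fp) return -1;

    fseek(fp, 0, SEEK_END);
    long size = ftell(fp);
    rewind(fp);

    uint8_t *buf = malloc(size);
    if (!buf || fread(buf, 1, size, fp) != (size_t)size) {
        fclose(fp); free(buf); return -1;
    }
    fclose(fp);

    uint32_t table_off  = le32(buf);        /* dword at file offset 0 */
    uint32_t stream_off = le32(buf + 4);    /* dword at file offset 4 */

    /* Load the table entries (4 bytes each, up to the stream offset). */
    f->entries = (stream_off - table_off) / 4;
    f->table   = malloc(f->entries * sizeof(uint32_t));
    for (uint32_t i = 0; i < f->entries; i++)
        f->table[i] = le32(buf + table_off + 4 * i);

    /* The stream runs from its offset to the end of the file. */
    f->stream_len = size - stream_off;
    f->stream     = malloc(f->stream_len);
    memcpy(f->stream, buf + stream_off, f->stream_len);

    free(buf);
    return 0;
}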
I would encourage the open source community to experiment with
implementing this compression schema. It is VERY RAM intensive, but a good
implementation should be able to squeeze great compression ratios from it.
Here are the requirements for the best implementation:

1. Intelligent Table Generation

Table entries are each 4 bytes in size. A table entry should not be
generated for a point less than 4 bytes from the previous entry.

2. Optimized Table Sorting

The entries most used in the table should be placed at the beginning; the
table should be sorted in descending reference order. Then the pointers
must be adjusted (must, as in this will break the stream if you don't do
this) in the compressed stream to point to the new locations of the table
entries. In adjusting the pointers, the compressor should shrink them to
the smallest possible size given the index number they must reference and
the distance/vector they must skip and copy. For the best optimization,
this step should also take into account the distance/vector sizes in the
pointers, assess how many pointers would gain and lose size in each range
of index size (1/2/3/4 bytes), compute the total size gain (positive or
negative) for each pointer given the positions, and adjust according to
these final scores. (A sketch of this sorting and remapping pass appears
at the end of this document.)

3. Intelligent Table Discarding

When table entries are close enough together, the pointers which reference
them may be able to make do with a small subset of those entries. If this
is the case, the compressor should remove the redundant table entries and
readjust all related pointers. (A sketch of this check also appears at the
end of this document.)

If a compressor is written, it does not have to follow any of the above to
produce output compatible with fcomp2 binary files; however, the above
guidelines give the best compression ratios for fcomp2 compressed files.
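The following sketch illustrates the sorting and index-remapping pass from
requirement 2, under assumed in-memory structures (a use count per entry
and a list of table-pointer index fields to rewrite). It deliberately
leaves the entry offsets alone, since they refer to the uncompressed
stream, and it does not attempt the per-pointer size shrinking, which
would be a further pass over the stream.

/* fcomp2_tablesort.c -- sketch of the "optimized table sorting" step. */
#include <stdint.h>
#include <stdlib.h>

struct table_entry {
    uint32_t offset;    /* offset of the data in the uncompressed stream */
    uint32_t uses;      /* order of redundancy: reference count */
    uint32_t old_index; /* position before sorting */
};

struct table_ref {
    uint32_t index;     /* index field of one table pointer in the stream */
    /* ...plus wherever that field lives in the compressed stream... */
};

static int by_uses_desc(const void *a, const void *b)
{
    const struct table_entry *x = a, *y = b;
    if (x->uses != y->uses)
        return (x->uses < y->uses) ? 1 : -1;  /* most used first */
    return 0;
}

void fcomp2_sort_table(struct table_entry *tab, uint32_t n,
                       struct table_ref *refs, uint32_t nrefs)
{
    for (uint32_t i = 0; i < n; i++)
        tab[i].old_index = i;

    qsort(tab, n, sizeof(*tab), by_uses_desc);

    /* Build an old->new index map, then rewrite every table pointer so it
     * references its entry's new position.  Skipping this step breaks the
     * stream, as noted above. */
    uint32_t *remap = malloc(n * sizeof(uint32_t));
    for (uint32_t i = 0; i < n; i++)
        remap[tab[i].old_index] = i;

    for (uint32_t r = 0; r < nrefs; r++)
        refs[r].index = remap[refs[r].index];

    free(remap);
}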
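And a sketch of the feasibility check behind requirement 3: a table entry
can be discarded when every pointer that references it can reach the same
data through an earlier, nearby entry without its skip field overflowing.
The structures and the field-width rule (an N-byte base-0 field holds at
most 2^(8N) - 1) are assumptions of the example.

/* fcomp2_discard.c -- sketch of the "intelligent table discarding" test. */
#include <stdint.h>

struct tp_use {             /* one table pointer occurrence in the stream */
    uint32_t index;         /* table entry it references */
    uint32_t skip;          /* base-0 skip stored in the pointer */
    uint8_t  size;          /* N, the pointer data size in bytes */
};

/* Largest base-0 value an N-byte field can hold. */
static uint64_t max_field(uint8_t n)
{
    return (n >= 4) ? 0xFFFFFFFFull : ((1ull << (8 * n)) - 1);
}

/* Can every use of entry 'victim' be redirected to a keeper entry at a
 * smaller or equal offset?  If so, the caller may delete 'victim', add
 * (victim_off - keeper_off) to those pointers' skips, and re-run the
 * sorting/remapping pass. */
int can_fold_entry(uint32_t keeper_off, uint32_t victim_off, uint32_t victim,
                   const struct tp_use *uses, uint32_t nuses)
{
    uint32_t extra = victim_off - keeper_off;
    for (uint32_t i = 0; i < nuses; i++) {
        if (uses[i].index != victim)
            continue;
        uint64_t new_skip = (uint64_t)uses[i].skip + extra;
        if (new_skip > max_field(uses[i].size))
            return 0;   /* skip would overflow this pointer's field */
    }
    return 1;
}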