DATA COMPRESSION #2 ------------------- By Frankenstein Thanx to Spookryder In the previous issue The Gatekeeper (TG) already told you about lots of different methods. Read it first if you haven't done it already! CLEAR UP TIME! You poor guys read about LZW, packers, arcers and MAGNUS crunchers. For those who aren't familiar with this stuff, it could be very confusing. What you have to know is that it all has to do with data compression! Data compression can be divided into two groups: - Substitution methods - Statistic methods TG mentioned this already in issue #3. He was talking about 'dictionary based' methods, which we call substitution methods during this article. Substitution methods: - Run length encoding - Liniar pattern matching - Lempel-Ziv-Welch (LZW) (used by ARC and PKZIP) Since WosFilm wrote two articles about LZW already, we won't talk about this method here. Statistic methods: - STATIC HUFFMAN ENCODING or: minimal redundancy method - DYNAMIC HUFFMAN ENCODING also called adaptive Huffman (used by LHARC) - ARITHMETIC ENCODING Okidoki! (I like this phrase from M.A.S.H.) Let's begin.... RUN LENGTH ENCODING (RLE) ------------------------- This is a fairly simple method which is very useful for compressing pictures and other sequentialy repeating data. With run-length encoding we replace a sequence of the same bytes by a code. This code includes: - UNIQUE byte (code mark) [BYTE] - replace byte IMAGE [BYTE] - AMOUNT of replaced bytes [BYTE] So there has to be one byte which isn't used in the entire file. This byte will be used as a code mark for the unpacker. Because a code requires three bytes (UNIQUE, IMAGE and AMOUNT) it's only useful to replace sequences which are longer than three bytes. But what if there's no UNIQUE byte in the file? Well, a UNIQUE byte doesn't have to be a never occuring byte! In fact you only have to know which byte occures the least. Whenever a UNIQUE byte occures in the file, it should be replaced by a code. The replace byte IMAGE is the same as the UNIQUE byte in such case. The disadvantage of this system is that you loose two bytes every time a single UNIQUE byte appears in the file. But since the UNIQUE byte is the least occuring byte of the file, it won't be noticed in most cases. The method: - Create a table of frequences (for each byte a counter of how much that byte occures in the file). - The UNIQUE byte should now be set to the byte image of the lowest frequency in the table. - Now replace byte sequences (longer than three) by a code (UNIQUE, IMAGE, AMOUNT). - When a UNIQUE byte is found, it has to be replaced by a code (also if it's only one single byte). Mostly sequences of UNIQUE bytes never happen in a file. Therefore you could also choose for a system of replacing the UNIQUE bytes by a code of two bytes, namely: - UNIQUE byte - UNIQUE byte (byte image) When the UNPACKER finds a sequence of two UNIQUE bytes it shouldn't expect an AMOUNT anymore. Koala pictures are packed with RLE, but it's slightly different from the method described here. Mostly koala packes a few bytes better than the system with a unique byte. >> How come? << Ok, let's take a look.... Example data stream: 4,4,4,4,4,4,4,4,5,6 7,8,9,3,3,3,3,3,3,3 First eight bytes can be replaced. Then there are five bytes which can NOT be replaced, followed by a replaceable sequence of seven bytes. The packed file will be like this: [128+8,4], [0+5,5,6,7,8,9], [128+7,3] Indication byte: 128 (bit 7 set) = fill byte follows 0 (bit 7 clear) = not packable The amount is included in the indication byte thus it can replace sequences of 127 bytes (128+127=255). But also notice that it now can replace sequences of three bytes because it only uses two bytes for the code! LINEAIR PATTERN MATCHING ------------------------ This method also replaces strings by a code. It searches for strings which appear for the second time in the file. A buffer is used to keep track of the string being searched. The code is build as follows: - UNIQUE byte - POSITION - match LENGHT And now you expect me to tell you how this method works right? Wrong! I'm getting a little tired of this article. Next time you'll read more about this method. Then I'll also describe the static and dynamic Huffman methods. Also look out for the source code of the static Huffman depacker. Real freaks can try to rip this routine from the 'Watch Diz' demo! (I called it the GIGA unpacker). If you know more about the methods mentioned in this article, or maybe know about other methods, please write to Mega Magazine.