LZW DATA COMPRESSION
--------------------

Written by: WosFilm (Sweden)

-"OH NO!!" I hear all of you cry out.
-"It's that maniac WosFilm from Sweden writing a lot of crap again!"

But no, be calm! This is serious stuff! I don't know which techniques all of
you use when compressing data, but here is a rather good one called LZW
(Lempel-Ziv-Welch), which is best used to compress text, screens etc.
(highly redundant stuff...). I haven't made any programs, because right now
I haven't got the time (and I'm not a very good coder anyway...), but
hopefully I might get a chance to see what you folks can make out of it.

The algorithm is actually very simple: LZW compression replaces strings of
characters with single codes. It does not do any analysis of the incoming
characters. Instead, it just adds every new string of characters it sees to
a table of strings. Compression occurs when a single code is output in place
of a string of characters. Many other techniques use string tables too, but
they have to save the table along with the compressed data, whereas the LZW
string table is built up automatically during compression as well as during
decompression.

One of the problems when using this algorithm on an 8-bit computer is that
the codes generated during compression must have more bits than a single
character, since codes 0-255 are by default assigned to the standard
character set. The remaining codes are assigned to strings as the algorithm
proceeds. E.g. when running with 12-bit codes, 0-255 refer to individual
bytes and 256-4095 refer to strings from the table.

Here are the compress and decompress algorithms in a form everybody ought to
understand:

LZW COMPRESS:

  STRING = get input byte
  WHILE there are still input bytes DO
    CHAR = get input byte
    IF STRING+CHAR is in the table THEN
      STRING = STRING+CHAR
    ELSE
      output the code for STRING
      add STRING+CHAR to the table
      STRING = CHAR
    END of IF
  END of WHILE
  output the code for STRING

LZW DECOMPRESS:

  Read OLDCODE
  output OLDCODE
  WHILE there are still input bytes DO
    Read NEWCODE
    IF NEWCODE is not in the table THEN
      STRING = get string for OLDCODE
      STRING = STRING+CHAR       (CHAR from the previous pass)
    ELSE
      STRING = get string for NEWCODE
    END of IF
    output STRING
    CHAR = first character in STRING
    add OLDCODE+CHAR to the table
    OLDCODE = NEWCODE
  END of WHILE

A sample string (/WED/WE/WEE/WEB/WET) is used to demonstrate the algorithms:

Compression process:

Input string: /WED/WE/WEE/WEB/WET

  CHAR      Code      String
  input     output    table
  -------   -------   --------
  /
  W         /         256=/W
  E         W         257=WE
  D         E         258=ED
  /         D         259=D/
  W
  E         256       260=/WE
  /         E         261=E/
  W
  E
  E         260       262=/WEE
  /
  W         261       263=E/W
  E
  B         257       264=WEB
  /         B         265=B/
  W
  E
  T         260       266=/WET
            T

If you step through the start of the algorithm for this string, you can see
that in the first pass through the loop the system performs a check to see
if the string /W is in the table. Since it doesn't find the string, it
outputs the code for / and adds the string /W to the table. After the system
reads in the third character, E, the second string code, WE, is added to the
table, and the code for W is output. This process continues until, in the
second word, the characters /W are read in, matching string number 256. In
this case, the system outputs code 256 and adds a 3-character string, /WE,
to the table. And so it continues. As you can see, the table fills up
rapidly, and that's another problem with this technique (unless you have a
130XE, maybe...).
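To make the pseudocode above a little more concrete, here is a minimal
sketch of both routines in Python (not the 8-bit assembly you would actually
want; the names lzw_compress and lzw_decompress are just my own labels). It
follows the pseudocode directly: the table starts with codes 0-255 for
single bytes, new strings get codes from 256 upwards, and no limit is put on
the table size.

  # A minimal sketch of LZW, following the pseudocode above (Python 3).
  # Codes 0-255 stand for single bytes; new strings get codes from 256 up.
  # Purely illustrative: no 12-bit packing and no limit on the table size.

  def lzw_compress(data):
      table = {bytes([i]): i for i in range(256)}   # string -> code
      next_code = 256
      string = data[:1]                             # STRING = get input byte
      codes = []
      for byte in data[1:]:                         # WHILE there are input bytes
          char = bytes([byte])                      # CHAR = get input byte
          if string + char in table:
              string = string + char                # STRING = STRING+CHAR
          else:
              codes.append(table[string])           # output the code for STRING
              table[string + char] = next_code      # add STRING+CHAR to the table
              next_code += 1
              string = char                         # STRING = CHAR
      codes.append(table[string])                   # output the code for STRING
      return codes

  def lzw_decompress(codes):
      table = {i: bytes([i]) for i in range(256)}   # code -> string
      next_code = 256
      oldcode = codes[0]
      output = [table[oldcode]]                     # output OLDCODE
      for newcode in codes[1:]:                     # WHILE there are input codes
          if newcode not in table:                  # NEWCODE not in the table yet
              string = table[oldcode] + table[oldcode][:1]
          else:
              string = table[newcode]
          output.append(string)                     # output STRING
          char = string[:1]                         # CHAR = first character in STRING
          table[next_code] = table[oldcode] + char  # add OLDCODE+CHAR to the table
          next_code += 1
          oldcode = newcode
      return b"".join(output)

  data = b"/WED/WE/WEE/WEB/WET"
  codes = lzw_compress(data)
  print(codes)                           # the code column of the table above
  print(lzw_decompress(codes) == data)   # True

Run on the sample string, it prints the same code sequence as in the
compression table above (with the single characters shown as their byte
values), and the decompressor gets the original string back.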
Step through the decompression process below to see how the string table is
built up to be an exact duplicate of the table in the compression process.

Decompression process:

Input codes: / W E D 256 E 260 261 257 B 260 T

  NEW     OLD     STR.    CH.    STRING
  CODE    CODE    OUTP.          TABLE
  ------  ------  ------  ----   -------
  /       /       /
  W       /       W       W      256=/W
  E       W       E       E      257=WE
  D       E       D       D      258=ED
  256     D       /W      /      259=D/
  E       256     E       E      260=/WE
  260     E       /WE     /      261=E/
  261     260     E/      E      262=/WEE
  257     261     WE      W      263=E/W
  B       257     B       B      264=WEB
  260     B       /WE     /      265=B/
  T       260     T       T      266=/WET

Results of LZW compression:

As the examples show, the LZW technique works best on data streams that
contain repeated strings of any kind. Because of this, it does extremely
well when compressing text (compression levels of at least 50%) or highly
redundant database files (at best about 10% of the original size!). Other
data files, like programs, obviously vary a lot.

The technique does not do so well when compressing very large files, since
there is a fixed number of strings in the table (4096 strings for 12-bit
codes etc.), and once all the entries are filled the compression ratio
begins to degrade if the later sections of the file have different
characteristics from the section the string table was built from. One way of
solving this is to keep track of the compression ratio and, after a certain
amount of degradation, flush the table and rebuild it from scratch. Another
method is to monitor how frequently the strings are used and erase those
that are rarely used.

So the main problem with LZW compression is the management of the string
table: searching through it..., it fills up quickly..., and it does not have
a fixed size even though there is a fixed number of strings, since its size
depends on the total length of all the strings, and the individual strings
differ in length... But I'm sure some of you can get around these problems,
and I hope you do, because this technique seems to be very efficient even
though the algorithms are extremely simple. If any of you manages to make
something out of this, I'd be glad to know...

                              * WosFilm *

(examples taken from Dr. Dobb's Journal, October 1989)
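To make the "flush and rebuild from scratch" idea above a bit more concrete,
here is one possible variation of the compressor sketch from earlier (again
Python, purely illustrative). It reserves one code as a CLEAR signal, the
way the UNIX compress and GIF formats do; that choice, and the names
MAX_CODES, CLEAR_CODE, fresh_table and lzw_compress_flushing, are
assumptions of this sketch, not something described above. The trigger used
here is simply "table full", whereas monitoring the compression ratio, as
suggested above, would only change when the CLEAR code is emitted. The
matching decompressor throws its table away and starts over, exactly as at
the beginning of the stream, whenever it reads CLEAR_CODE.

  MAX_CODES  = 4096          # 12-bit codes: 0-4095
  CLEAR_CODE = 256           # reserved: "the table has been flushed"

  def fresh_table():
      # Single bytes only; string codes start at 257 since 256 is reserved.
      return {bytes([i]): i for i in range(256)}, 257

  def lzw_compress_flushing(data):
      table, next_code = fresh_table()
      string = data[:1]
      codes = []
      for byte in data[1:]:
          char = bytes([byte])
          if string + char in table:
              string = string + char
              continue
          codes.append(table[string])
          if next_code < MAX_CODES:
              table[string + char] = next_code
              next_code += 1
          else:
              # Table full: tell the decompressor to start over, then rebuild.
              codes.append(CLEAR_CODE)
              table, next_code = fresh_table()
          string = char
      codes.append(table[string])
      return codes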