LZW COMPRESSION AND DECOMPRESSION Copyright (C) 1993 Homer Wilson Smith Redistribution rights granted for non commercial purposes. The LZW compression algorithm is probably one of the most mind boggling algorithms to come along, but it is also one of the most brilliant, so it is worth understanding well. It's really quite mind opening. Ever program something you didn't fully understand, but because it worked fine you let it go, and then months later found yourself still wondering why your code worked? Well that's what happened to me with LZW compression. My programs worked, for a very long time they worked, until one day I went to send a large tiff to a color separator, and bang nothing would read it. Corel Draw wouldn't read it, CSHOW wouldn't read it, my own damn tiff program wouldn't read it, and that REALLY pissed me off. I had written my tiff programs with barely passable understanding of the algorithm, following along the pseudo code provided by others, with little personal comprehension myself. I had read a number of descriptions trying to make LZW compression and decompression clear to me, but basically all I ended up with was a partial understanding of the encoding process and no understanding at all of the decoding process. I kind of programmed it blind, and hoped it would work. It took me many hours to find the bug, it was one of those 'it crops up every hundred million years' kind of bug, and I didn't know what I was doing anyhow, so it wasn't easy. But from all bad things comes something good (I think) and this paper is the result. I finally understood both sides of the compression and decompression dichotomy, and I am reasonably sure my programs will work because I understand them and not because I followed someone else's code. The proof of course is in the pudding, and the pudding is whether or not I can explain this to others so that they get it the first time. The algorithm is really quite brilliant, and worth knowing just for the fun of it. This paper is not meant to replace everything else you might read on the subject of LZW encoding or TIFF and GIF files, but it is intended to make all that other stuff clear. This is what I wish someone had written for me when I first approached the subject. WHY DATA COMPRESSION The problem with data is it takes up so much damn space, and with the cost of hard disk space and the slowness of telecommunications, it behooves us to find a way to make data smaller, to make it take up less room, so that we can then use up the space we free up to fill it with more data! It actually takes less time to compress data than it does to transmit it, so its a good cost saving to compress the data first, then transmit it and uncompress it at the other end. RUN LENGTH ENCODING Various encoding schemes have been devised to take care of this need, one of the earliest most obvious ones being RUN LENGTH ENCODING. Run length encoding takes advantage of the fact that often many pixels right next to each other are the same pixel, so rather than sending them all one by one they are counted up into a RUN and two bytes are sent instead, the first byte telling how many pixels there were of one number, and the next byte the number it was. Thus a run of pixels that looked like this 5 5 5 3 3 3 9 9 9 9 9 1 1 1 1 would be run length encoded as 3 5 3 3 5 9 4 1 an obvious saving of space. The longer the runs are the more space is saved as each color run is represented by only two bytes, a count and a color. If your counts can go only up to 255, then if the run is longer you have to start a new run, but if your counts can go up to 32767, then you can handle longer runs before you have to start a new run. The problem comes in when the data does not consist of large expanses of identical pixels, who wants to look at a single patch of red anyhow. A worst case scenario of course is when every pixel is different such as, 2 3 4 5 6 7 8 9 etc. Encoding this as RLE would produce, 1 2 1 3 1 4 1 5 1 6 1 7 1 8 1 9 which is an obvious waste of space. One solution to this is to insert an INTRODUCER byte before each sequence telling whether it is a horizontal run, or a stretch of one byte per pixel. Thus runs may start with positive counts, and stretches of one byte per pixel may start with a negative count. Thus the sequence 3 3 3 3 3 1 2 3 4 5 6 would be compressed as 5 3 -6 1 2 3 4 5 6 giving some compression, but its obviously not an ideal solution. There is also tremendous complexity involved in programming the ideal compression for data that has a lot of 2 byte runs in it. For example, 3 3 4 5 3 3 4 5 3 3 4 5 is better put out as a single stretch of one byte per pixel -12 3 3 4 5 3 3 4 5 3 3 4 5 rather than as a mixture 2 3 -2 4 5 2 3 -2 4 5 2 3 -2 4 5 because of the the one byte overhead in switching from horizontal runs to one byte per pixel. One might make a simple rule that all two byte runs should be put out as one byte per pixel, except that this is only true if you have already opened a one byte per pixel stretch, otherwise 2 byte runs should be put out as a horizontal run (count,byte). In any case run length encoding (RLE) only produces good compressio when the data is even and unchanging. Fractals of course are the antithesis of unchanging data, so RLE does not work well on fractals or any other natural data like video frames or even written work. In other words if you have a patch of red to compress use RLE. Otherwise keep reading. The LZW Algorithm There are four concepts that need to be understood. They are, 1.) INPUT CODE 2.) TABLE INDEX 3.) TABLE CODE 4.) OUTPUT CODE The purpose of LZW compression is to turn the input code stream into an output code stream of lesser length, and then later to reconstruct the input code stream from the output code stream. This accomplishes the purpose of compression and decompression. The above four concepts can be visualized as follows: INPUT CODE TABLE OUTPUT CODE STREAM INDEX CODE STREAM ------------------------------------------------------------------------ CODE, CODE, CODE... (INDEX, CODE) CODE, CODE.... (INDEX, CODE) (INDEX, CODE) . . CODE. A code is a byte wide entity consisting of 8 bits and ranging in decimal value from 0 to 255. They can be considered as characters, or as integer numbers, it doesn't matter. They are NOT signed integers however. INPUT CODES, TABLE CODES and OUTPUT CODES are all of this format. INDEX. An index is a two byte wide SIGNED entity consisting of 16 bits and ranging in decimal value from -32768 to 32767. It is called an index because it refers to an earlier entry in the table. For example, if table entry number 400 consists of the (index,code) pair (312,5), then 312 is called a table index and points BACKWARDS to table entry 312, and 5 is a table code generated in the process of compression. Table entry 312 will itself have an (index,code) pair that will point to an even earlier table entry, which will point to an earlier table entry ad infinitum until you come back to the first table entries before which there are none. These first entries in the table are called the ROOT AREA. If a table entry has -1 for an INDEX, that means that that table entry does NOT refer to any earlier table entry. -1's appear as table indexes either 1.> when a table entry is still unused or 2.) when it resides in the ROOT AREA as described below. Table entries in the root area do not point to earlier table entries because there are no earlier entries. Table entries that haven't been used yet don't point to earlier table entries because they haven't been used yet! In either case the table index will be -1. 0 by the way is a valid table index which points to the very first table entry in the root area, table entry number 0. INPUT CODE STREAM. The input code stream is the sequence of bytes that you wish to compress into a hopefully shorter output code stream. OUTPUT CODE STREAM. The output code stream is the sequence of bytes that is the final product of compression, and is hopefully shorter than the input code stream and RECONSTRUCTIBLE BACK into the input code stream. It's no good compressing something if you can't uncompress it! TABLE. The table is a two column table of integer numbers of the following form: TABLE POSITION (INDEX,CODE) ------------------------------------------------------------------------ ----- 0 (-1 , 000) | 1 (-1 , 001) | 2 (-1 , 002) | . . ROOT AREA . . | 255 (-1 , 254) | 255 (-1 , 255) | 256 (-1 , 256) Clear Code (CC) ----- 257 (-1 , 257) End of Information Code (EOI) ----- 258 (INDEX,CODE) | 259 (INDEX,CODE) ADDED . ENTRY AREA . | 278 (INDEX,CODE) | 279 (INDEX,CODE) <---TABLE POINTER to latest entry | 270 (-1 , 0) As yet unused entries | 271 (-1 , 0) " | . . ----- 4095 (-1 , 0) End of table. The table consists of only the two columns of numbers, the table position numbers are NOT in the table and are only shown for clarity. The table consists of 2 areas, the ROOT area and the ADDED ENTRY AREA. ROOT AREA. The root area consists of the first 258 entries in the table numbered from 0 to 257. The entries from 0 to 255 are initialized at program start up to the numbers shown and do not change during the duration of the program. Entries number 256 and 257 are also set during program start up and are used to control the process of compression and decompression. Their CODE values are otherwise arbitrarily chosen. Because their values are greater than 255, table codes must use 16 bit wide integers, even though codes have only 8 significant bits. The table indexes in the root area are set to -1's because they do not point to any earlier table entry. The table codes in the root area are set to the basic alphabet of possible characters in the input code stream, in this case the integers from 0 to 255. The Codes for CLEARCODE and END OF INFORMATION CODE are arbitrarily set to 256 and 257. This leaves table position 258 as the first empty position to ADD new entries into the table when the compression process actually starts. ADDED ENTRY AREA. At program start up positions 0 through 257 are initialized as described above and a pointer is set to the last entered entry in the table, in this case 257. The remaining table entries are initialized to (-1,0) as shown and usually there are a total of 4096 entries in the table numbered 0 to 4095, including the root area. During the process of compression the program scans the input code stream and adds new entries to the table according to the LZW compression algorithm which we will discuss at length below. Each new table entry consists of an (INDEX,CODE) pair which is determined by the input code stream and the LZW algorithm. As each new (INDEX,CODE) pair is found the table pointer is incremented by by one and the new (INDEX,CODE) pair is inserted into the table at the position pointed to by the table pointer. In the following discussion we wish to prove the following points. 1.) The table is constructed from and totally determined by the input code stream. The table is usually shorter than the input code stream and so compression has taken place in the construction of the table. This is because each table entry can represent more than one character in the input code stream. 2.) From the table alone the input code stream can be reconstructed. Thus if you have the complete table you can reconstruct the input code stream. 3.) The right hand column of table codes can be reconstructed from the left hand column of table indexes alone. Thus if you know the left hand table indexes you can recompute from that information alone the right hand table codes and thus the input code stream. 4.) Thus the table indexes can act as the final output compressed data stream, or in other words, the output codes ARE the table indexes from the added entry area. Do not get table codes confused with the output codes. The table INDEXES become the output codes. Because the table indexes can go up to 4095, they need 12 bits to represent them. Although they are computed as 16 bit wide integers, when they are finally output to the output data stream, they are packed at 12 bits each. Thus an input data stream of 8 bit integers gets compressed and output as an output data stream of 12 bit integers. Since each table index can represent a unique string of varying length, overall compression takes place. For example if a single 12 bit wide table index represents the 3 character string of 'ABC' or 24 bits, that is a compression factor of 2:1. Three characters at 8 bits each are compressed into one character of 12 bits. 5.) The process of uncompressing the output code stream then works as follows. First you must rebuild the table indexes from the output code stream. The ROOT AREA of the table is trivial as it is always the same. The ADDED ENTRY AREA is almost as trivial. The table index column of the added entry area IS the sequence of output codes, as the output codes ARE the table indexes. The table code column of the added entry area is then recomputed from the table index column by means that will become obvious when the full algorithm is discussed, and that completes the rebuilding of the table from the output codes. From the completed table one can then reconstruct the original input code stream. So how does LZW work? The idea behind LZW is that not only do some pixels repeat themselves, but many SEQUENCES of pixels repeat themselves. For example take the input code stream (in bytes), ABCDEABCDEABCDEABCDEABCDE Such a sequence would not lend itself well to run length encoding because no single pixel forms a run. However the sequence of 5 pixels ABCDE certainly form a run as they appear one right after the other 5 times. One might even suggest a super run length encoding that could seek out repeating runs of SEQUENCES and perhaps encode a string like ABCDEABCDEABCDEABCDEFGHFGHFGHFGHFGH stream as something like, 5(ABCDE) 4(FGH) However LZW goes one step further and recognizes that the string ABCDE can appear anywhere in the input code stream, not just in consecutive runs, so why not just assign the string ABCDE a unique number, say 234, and then every time that string appears replace it with the single byte 234. So if 234 is assigned to ABCDE and 235 is assigned to FGH then the above 20 byte string could be compressed into the 9 bytes 234 234 234 234 234 235 235 235 235 Of course the number 236 could be assigned to the string ABCDEABCDE and 237 assigned to FGHFGHFGH, so now the same 20 byte input code stream could be output as 236 236 234 237 235 Obviously you stand to create a lot of compression this way. Just as obviously the whole scheme is utterly preposterous, as you could just assign one number, say 234, to the whole damn text and then output only that one number! The problem is clearly keeping track of what string each number stands for. That is where a string table comes in. STRING TABLES A string table is merely a table of strings of varying length and the code numbers that have been assigned to them. In this case the string is inserted into the table at the table entry position corresponding to the code number assigned to that string so that the code numbers do not also have to stored. For example let's say we have collected the following table of strings. TABLE TABLE POSITION STRING ENTRY ------------------------------------------------------------------------ 232 SDF 233 JHIKLNN 234 ABCDE 235 FGH 236 ABCDEABCDE 237 FGHFGHFGH With such a table in hand it would be easy enough to covert the output code stream, 236 236 234 237 235 back into the original input stream of ABCDEABCDE ABCDEABCDE ABCDE FGHFGHFGH FGH (Without the spaces of course.) Which is what we want. The problem of course with this method is that you have to have the table in order to know what the code numbers stand for, and the space you save by compressing your input code stream into the smaller sequence of bytes is mostly lost in having to carry the whole table along with you. This becomes obvious in the extreme example of assigning the whole text one number and putting the whole text in the table as one entry. You are actually up a byte from the original data, namely the damn code number. Not a good compression ratio. The key question is, can the table be reconstructed FROM the output code stream? If not then you have to carry the table with you to decode the output code stream, and that wastes space. If you CAN reconstruct the table from the output code stream then you can throw the tables away as you produce the output, and recover them later when you want to go in the other direction. LZW takes this one step further in that the output code stream IS the table, at least the left hand column of table indexes, and the right hand column of table codes can be easily computed from the table indexes. Once you have both you can recompute the input code stream. So how does it work? THE LZW ALGORITHM The algorithm depends on the idea that any given input string either is or is not in the table. If it is in the table it is not put in again. If the string is not in the tale it is immediately put in. But let's look at how the algorithm gets its strings. Assume we want to compress the input code stream ABABABAB Remember the characters A and B are bytes to the computer and so are really integer numbers somewhere between 0 and 255. In this case A is 61 and B is 62. The algorithm starts with the first character in the input code stream and continues to grab characters from the input stream building a string until it builds a string that is NOT in the table. The string table is first initialized with all possible single character strings, this forms the root area of the table. Let's simplify this and just assume for the moment that there are only 4 different single character strings rather than 255, and let's call them A, B, C, D. Let's also forget the clear code and the end of information code, so that our root area has just 4 entries in it, numbered 0 through 3. and the first added entry will go in position 4. Then the table will look like this after initialization. CODE STRING 0 A 1 B 2 C 3 D Since ALL possible single character strings ARE in the table in the root area the very first character that is taken from the input code stream will definitely be in the table. In this case that is the letter A. So nothing is added to the table and the next character is grabbed. This creates the string AB. AB is NOT in the table so we add it to the table at position 4. 0 A 1 B 2 C 3 D 4 AB We now do something very strange, something that will probably earn LZ&W their Nobel Prize in mathematics: we discard everything in the string but the last letter of the string and start anew. So now we have string B which is definitely in the table in the root area, so we go on and grab the next character in the output code stream which is another A. The string BA is NOT in the table so we add it to position 5. 0 A 1 B 2 C 3 D 4 AB 5 BA We discard all but the last letter of the string leaving us with A again which is in the table in the root area. We grab the next character in the input code stream forming AB. Now this time AB is already in the table from the last time we came across it, so we continue on to the next character. This gives us ABA. ABA is NOT in the table so we enter it in position 6. 0 A 1 B 2 C 3 D 4 AB 5 BA 6 ABA Keeping the last letter A, we grab the next character in the input code stream forming AB. This is in the table, so we get the next character forming ABA which is ALSO now in the table. Adding the next character we get ABAB which is NOT in the table so we add it at position 7. 0 A 1 B 2 C 3 D 4 AB 5 BA 6 ABA 7 ABAB At this point you can clearly see how to build the rest of the table from the input code stream. You do this until you run out of input code stream or table room at which point you clear the table and start over again with a new table. Let's assume we ran out of input codes, so as usual we discard all but the last character of the last string and that leaves B. Since there are no further codes to add to B, it becomes the last string. 0 A 1 B 2 C 3 D 4 AB 5 BA 6 ABA 7 ABAB 8 B And there we have our completed table. Now here is where the brilliance comes in. Each string is added to the table because it was not already in the table. But each string is built up one character at a time so each one of its predecessor strings HAD TO BE IN THE TABLE for that string to get as long as it did before it got put in the table. In other words its that LAST CHARACTER at the end of each string in the table that made that string NOT BE in the table. This means that if you take any string in the table and remove its last character you must have a string that was already in the table earlier. Thus each string can be split up into its last character and an index into an earlier entry in the table. And voila you have the (INDEX,CODE) form of the string table. TABLE TABLE IN TABLE IN POSITION STRING FORMAT INDEX,CODE FORMAT 0 A -1 A 1 B -1 B ROOT AREA 2 C -1 C 3 D -1 D 4 AB 0 B | 5 BA 1 A ADDED ENTRY 6 ABA 4 A | 7 ABAB 6 B 8 B 1 x x means doesn't matter. So the first thing we need to demonstrate is that the input code stream can be reconstructed from the table, particularly from the table indexes in the added entry area, 0 1 4 6 1. The first thing we must realize is that each table index refers to an earlier complete string already in the table. For example in position 7 there is a (6,B). The 6 refers to position 6 where there is a (4,A). The 4 refers to position 4 where there is a (0,B) and the 0 refers to position 0 where there is a (-1,A) The -1 means you are in the root area and the backwards search ends. So the (6,B) in position 7 means that position 7 contains a string which is a B preceded by the string in position 6. But the string in position 6 is an A preceded by the string in position 4. The string in position 4 is a B preceded by the string in position 0. And the string in position 0 is an A preceded by nothing. So what string does table index 6 refer to? This is the way you figure it out. The | means refers to. This says that index 6 (6,B) pos 7 refers to an A preceded by what's in position 4, | what's in position 4 is a B preceded by what's in (4,A) pos 6 position 0, and what's in position 0 is an A | preceded by nothing 'cuz its root. This gives (0,B) pos 4 us the string ABA for index 6. Looking in the | string form of the table we can see that is (-1,A) pos 0 right. The entire string in position 7 is (6,B) or ABAB. The index 6 alone refers to a prior string of ABA, specifically the string in POSITION 6, that's why its called index 6, because it refers backwards to position 6. In the same way you can compute the string values for each of the indexes in the output code stream 0 1 4 6 1. Table Table Position Index 4 INDEX 0 = A 5 INDEX 1 = B 6 INDEX 4 = AB 7 INDEX 6 = ABA 8 INDEX 1 = B Thus 0 1 4 6 1 = A B AB ABA B which is what we started with, minus the spaces which I added for clarity. GETTING THE TABLE CODES FROM THE TABLE INDEXES. So what happens if you don't have the table codes, but only the table indexes? Well you remember that when a string was built that was NOT in the table, it was then put in the table at the next available position. Then all characters before the last character were discarded and the last character was kept as the first character of the next string to be built. More accurately once a string is found that is NOT in the table, say ABAB, this is represented by an (index,code) pair such as (6,B). The part of the string corresponding to the index is then thrown away and the remaining single table code becomes the first character of the next string. Once a string is built starting with that first character it too is added to the table and the last character only is kept to become the first character of the next table entry. Therefore it is obvious that the last character of each table entry is the first character of the next table entry! Thus the table is always of the form, TABLE TABLE IN POSITION STRING FORMAT 0 A 1 B (ROOT AREA) 2 C 3 D 4 A...D I have used random characters here for 5 D...C demonstration purposes only. The last 6 C...B character of the string in position 4 is 7 B...D the first character of the string in 8 D position 5, etc. The ... means an arbitrary number of 0 or more intervening characters. So how does this help us find the codes from the indexes alone? It should be clear from an earlier discussion that from the indexes AND codes the original string represented by any (index,code) pair can be reconstructed by following the indexes back to root and accumulating the codes of each touched upon entry along the way. But if you don't have the codes how can you accumulate them?! Well the magic is although you may have no codes to accumulate, you always have the FIRST code of any string in the table because the first code of a string is in the root area whose codes are ALWAYS known. Thus from the indexes alone you can always find the FIRST character of any string represented by any index. Here's our complete table again without the codes. Remember root codes are always known. TABLE TABLE IN TABLE IN POSITION STRING FORMAT INDEX,CODE FORMAT 0 A -1 A 1 B -1 B ROOT AREA 2 C -1 C 3 D -1 D 4 AB 0 ? | 5 BA 1 ? ADDED ENTRY 6 ABA 4 ? | 7 ABAB 6 ? 8 B 1 x x means doesn't matter. For example, in table position 7 you have (6,x). Index 6 refers to position 6 which contains (4,?), which refers to position 4 which contains (0,?), which refers to position 0 which contains (-1,A). So the string in position 7 MUST start with an A! But this first character is the LAST character of the immediate preceding string in the table, in this case the string in position 6. By definition the last character of a string is that string's table CODE. So the first character of the string in position 7 is the last character of the string in position 6 and is thus also 6's table CODE. (Don't get table codes confused with output codes. A string's table code is the last character of the string. A string's output code is the table INDEX for that string!) Thus by finding the first character of any index string, you can find the table code of the immediately previous string. Going back to our previous example, let's assume we have only the indexes and no codes, except the root codes which are known. The missing codes are represented by ?'s in the below table. TABLE TABLE IN TABLE IN POSITION STRING FORMAT INDEX,CODE FORMAT 0 A -1 A 1 B -1 B ROOT AREA 2 C -1 C 3 D -1 D 4 AB 0 ? | 5 BA 1 ? ADDED ENTRY 6 ABA 4 ? | 7 ABAB 6 ? 8 B 1 x x means doesn't matter. In position 8 the index 1 points directly to position 1 in in the root area which has a known code of B. Thus B is the first character of the string in position 8 and therefore the last character of the string in position 7. Thus we know the code for position 7, its a B! In position 7, the index 6 points to position 6 whose index of 4 points to position 4 whose index of 0 points to position 0 which being in the root area has a known code of A. Thus A is the first character of the string in position 7 and so also the last character of the string in position 6. Thus we have the code for 6, its an A. Visually any index can be followed back to root (6,?) pos 7 just like we did before, the only difference being | that there are no known codes to accumulate until (4,?) pos 6 we get to root where there will be just one known | code. That code is the first character of the (0,?) pos 4 index we just traced back and the last character | of the index just before it in the table. (-1,A) pos 0 Thus by starting with the last added index in the table, and tracing it back to its first character in the root area, we can find the table CODE of the immediately prior index. Repeating the process for that index, we can then fill in the entire table. Once the table is filled in completely, we can then start a second pass and accumulate the complete strings for each index using the codes we just found, and then concatenate those strings together to form the original input code stream. Totally amazing, isn't it? Homer