LZSS
Encyclopedia
Lempel-Ziv-Storer-Szymanski (LZSS) is a lossless data compression
Lossless data compression
Lossless data compression is a class of data compression algorithms that allows the exact original data to be reconstructed from the compressed data. The term lossless is in contrast to lossy data compression, which only allows an approximation of the original data to be reconstructed, in exchange...

 algorithm
Algorithm
In mathematics and computer science, an algorithm is an effective method expressed as a finite list of well-defined instructions for calculating a function. Algorithms are used for calculation, data processing, and automated reasoning...

, a derivative of LZ77, that was created in 1982 by James Storer and Thomas Szymanski. LZSS was described in article "Data compression via textual substitution" published in Journal of the ACM (pp. 928-951).

LZSS is a dictionary encoding technique. It attempts to replace a string of symbols with a reference to a dictionary location of the same string.

The main difference between LZ77 and LZSS is that in LZ77 the dictionary reference could actually be longer than the string it was replacing. In LZSS, such references are omitted if the length is less than the "break even" point. Furthermore, LZSS uses one-bit flags to indicate whether the next chunk of data is a literal (byte) or a reference to an offset/length pair.

Example

Here is the beginning of Dr. Seuss
Dr. Seuss
Theodor Seuss Geisel was an American writer, poet, and cartoonist most widely known for his children's books written under the pen names Dr. Seuss, Theo LeSieg and, in one case, Rosetta Stone....

's Green Eggs and Ham
Green Eggs and Ham
Green Eggs and Ham is a best-selling and critically acclaimed book by Dr. Seuss, first published on August 12, 1960. As of 2001, according to Publishers Weekly, it was the fourth-best-selling English-language children's book of all time....

, with character numbers at the beginning of lines for convenience.


0: I am Sam
9:
10: Sam I am
19:
20: That Sam-I-am!
35: That Sam-I-am!
50: I do not like
64: that Sam-I-am!
79:
80: Do you like green eggs and ham?
112:
113: I do not like them, Sam-I-am.
143: I do not like green eggs and ham.


This text takes 177 bytes in uncompressed form. Assuming a break even point of 2 bytes (and thus 2 byte pointer/offset pairs), and one byte newlines, this text compressed with LZSS becomes 94 bytes long:


0: I am Sam
9:
10: (6,3) (0,4)
16:
17: That(5,4)-I-am!(19,16)I do not like
45: t(21,14)
49: Do you(58,5) green eggs and ham?
78: (49,14) them,(24,9).(112,16)(93,18).


Note: this does not include the 11 bytes of flags indicating whether the next chunk of text is a pointer or a literal. Adding it, the text becomes 105 bytes long, which is much shorter than the original 177 bytes.

Implementations

Many popular archivers like PKZip
PKZIP
PKZIP is an archiving tool originally written by Phil Katz and marketed by his company PKWARE, Inc. The common "PK" prefix used in both PKZIP and PKWARE stands for "Phil Katz".-History:...

, ARJ
ARJ
ARJ is a software tool designed by Robert K. Jung for creating high-efficiency compressed file archives. ARJ is currently on version 2.85 for DOS and 3.15 for Windows and supports 16-bit and 32-bit Intel architectures.ARJ was one of two mainstream archivers for DOS and Windows during early and...

, RAR, ZOO
Zoo (file format)
zoo is a data compression program and format developed by Rahul Dhesi in the mid 1980s. The format is based on the LZW compression algorithm and compressed files are identified by the .zoo file extension. It is no longer widely used. Program source code was originally published on the...

, LHarc use LZSS rather than LZ77 as the primary compression algorithm; the encoding of literal characters and of length-distance pairs varies, with the most common option being Huffman coding
Huffman coding
In computer science and information theory, Huffman coding is an entropy encoding algorithm used for lossless data compression. The term refers to the use of a variable-length code table for encoding a source symbol where the variable-length code table has been derived in a particular way based on...

. The Allegro library
Allegro library
Allegro is a free and open source software library for video game development.The functionality of the library includes support for basic 2D graphics, image manipulation, text output, audio output, midi music, input and timers, as well as additional routines for fixed-point and floating-point...

 can encode and decode an LZSS format,
and the Game Boy Advance
Game Boy Advance
The is a 32-bit handheld video game console developed, manufactured, and marketed by Nintendo. It is the successor to the Game Boy Color. It was released in Japan on March 21, 2001; in North America on June 11, 2001; in Australia and Europe on June 22, 2001; and in the People's Republic of China...

 BIOS can decode a slightly different LZSS format.

External links

The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK