LZMA
Encyclopedia
The Lempel–Ziv–Markov chain algorithm (LZMA) is an algorithm
Algorithm
In mathematics and computer science, an algorithm is an effective method expressed as a finite list of well-defined instructions for calculating a function. Algorithms are used for calculation, data processing, and automated reasoning...

 used to perform data compression
Data compression
In computer science and information theory, data compression, source coding or bit-rate reduction is the process of encoding information using fewer bits than the original representation would use....

. It has been under development since 1998 and was first used in the 7z
7z
7z is a compressed archive file format that supports several different data compression, encryption and pre-processing algorithms. The 7z format initially appeared as implemented by the 7-Zip archiver. The 7-Zip program is publicly available under the terms of the GNU Lesser General Public...

 format of the 7-Zip
7-Zip
7-Zip is an open source file archiver. 7-Zip operates with the 7z archive format, but can read and write several other archive formats. The program can be used from a command line interface, graphical user interface, or with Microsoft Windows shell integration. 7-Zip began in 1999 and is actively...

 archiver. This algorithm uses a dictionary compression
Dictionary coder
A dictionary coder, also sometimes known as a substitution coder, is a class of lossless data compression algorithms which operate by searching for matches between the text to be compressed and a set of strings contained in a data structure maintained by the encoder...

 scheme somewhat similar to LZ77 and features a high compression ratio (generally higher than bzip2
Bzip2
bzip2 is a free and open source implementation of the Burrows–Wheeler algorithm. It is developed and maintained by Julian Seward. Seward made the first public release of bzip2, version 0.15, in July 1996.-Compression efficiency:...

) and a variable compression-dictionary size (up to 4 GB
Gigabyte
The gigabyte is a multiple of the unit byte for digital information storage. The prefix giga means 109 in the International System of Units , therefore 1 gigabyte is...

).

Overview

LZMA uses a dictionary compression algorithm (a variant of LZ77), whose output is then encoded with a range encoder
Range encoding
Range encoding is a data compression method defined by G. Nigel N. Martin in a 1979 paper Range encoding is a form of arithmetic coding that was historically of interest for avoiding some patents on particular later-developed arithmetic coding techniques...

. The dictionary compressor produces a stream of literal symbols and phrase references, which is encoded one bit at a time by the range encoder, using a model to make a probability prediction of each bit. Prior to LZMA, most encoder models were byte-based (i.e. they coded each bit using a cascade of contexts to represent the dependencies on previous bits from the same byte). The main innovation of LZMA is that instead of a generic byte-based model, LZMA's model uses contexts specific to the bitfields in each representation of a literal or phrase. This is nearly as simple as a generic byte-based model, but gives much better compression because it avoids mixing unrelated bits together in the same context.

Algorithm

In LZMA compression, the compressed stream is a stream of bits, encoded using adaptive binary range coder. The stream is divided into packets, each packet describing either a single byte, or an LZ77 sequence with its length and distance implicitly or explicitly encoded. Each part of each packet is modelled with independent contexts, so the probability predictions for each bit are correlated with the values of that bit (and related bits from the same field) in previous packets of the same type.

There are 7 types of packets:
packed code (bit sequence) packet description
0 + byteCode A single byte encoded using an adaptive binary range coder. The range coder uses context based on some number of the most significant bits of the previous byte. Depending on the state machine, this can also be a single byte encoded as a difference from the byte at the last used LZ77 distance.
1+0 + len + dist A typical LZ77 sequence describing sequence length and distance.
1+1+0+0 A one-byte LZ77 sequence. Distance is equal to the last used LZ77 distance.
1+1+0+1 + len An LZ77 sequence. Distance is equal to the last used LZ77 distance.
1+1+1+0 + len An LZ77 sequence. Distance is equal to the second last used LZ77 distance.
1+1+1+1+0 + len An LZ77 sequence. Distance is equal to the third last used LZ77 distance.
1+1+1+1+1 + len An LZ77 sequence. Distance is equal to the fourth last used LZ77 distance.

The length is encoded as follows:
Length code (bit sequence) Description
0+ 3 bits The length encoded using 3 bits, gives the lengths range from 2 to 9.
1+0+ 3 bits The length encoded using 3 bits, gives the lengths range from 10 to 17.
1+1+ 8 bits The length encoded using 8 bits, gives the lengths range from 18 to 273.

The distance is encoded as follows:

First a distance class is encoded using 6 bits. The 5 other bits of the distance code encode the information about how many direct distance bits need to be extracted from the stream.

7-Zip reference implementation

The LZMA implementation extracted from 7-Zip
7-Zip
7-Zip is an open source file archiver. 7-Zip operates with the 7z archive format, but can read and write several other archive formats. The program can be used from a command line interface, graphical user interface, or with Microsoft Windows shell integration. 7-Zip began in 1999 and is actively...

 is available as LZMA SDK. It was placed by Igor Pavlov
Igor Pavlov (programmer)
Igor Pavlov is a Russian freelance programmer and is the creator and the maintainer of the file archiver 7-Zip and its toolset. He is also the creator of the 7z archive format...

 in the public domain
Public domain
Works are in the public domain if the intellectual property rights have expired, if the intellectual property rights are forfeited, or if they are not covered by intellectual property rights at all...

 on December 2, 2008, with the release of version 4.62. It was originally distributed under the terms of both the GNU LGPL
GNU Lesser General Public License
The GNU Lesser General Public License or LGPL is a free software license published by the Free Software Foundation . It was designed as a compromise between the strong-copyleft GNU General Public License or GPL and permissive licenses such as the BSD licenses and the MIT License...

 and Common Public License
Common Public License
In computing, the CPL is a free software / open-source software license published by IBM. The Free Software Foundation and Open Source Initiative have approved the license terms of the CPL....

, with a special exception for linked binaries.

LZMA2 compression, which is an improved version of LZMA, has been introduced in version 9.04 beta, of May 30, 2009.

The reference open source
Open source
The term open source describes practices in production and development that promote access to the end product's source materials. Some consider open source a philosophy, others consider it a pragmatic methodology...

 LZMA compression library is written in C++
C++
C++ is a statically typed, free-form, multi-paradigm, compiled, general-purpose programming language. It is regarded as an intermediate-level language, as it comprises a combination of both high-level and low-level language features. It was developed by Bjarne Stroustrup starting in 1979 at Bell...

 and has the following properties:
  • Compression speed: approximately 1 MB per second
    Second
    The second is a unit of measurement of time, and is the International System of Units base unit of time. It may be measured using a clock....

     on a 2 GHz CPU
    Central processing unit
    The central processing unit is the portion of a computer system that carries out the instructions of a computer program, to perform the basic arithmetical, logical, and input/output operations of the system. The CPU plays a role somewhat analogous to the brain in the computer. The term has been in...

  • Decompression speed: 10-20 MB per second on a 2 GHz CPU
  • Support for multi-threading
    Thread (computer science)
    In computer science, a thread of execution is the smallest unit of processing that can be scheduled by an operating system. The implementation of threads and processes differs from one operating system to another, but in most cases, a thread is contained inside a process...

    .


In addition to the original C++
C++
C++ is a statically typed, free-form, multi-paradigm, compiled, general-purpose programming language. It is regarded as an intermediate-level language, as it comprises a combination of both high-level and low-level language features. It was developed by Bjarne Stroustrup starting in 1979 at Bell...

, the LZMA SDK contains reference implementations of LZMA compression and decompression ported to ANSI C
C (programming language)
C is a general-purpose computer programming language developed between 1969 and 1973 by Dennis Ritchie at the Bell Telephone Laboratories for use with the Unix operating system....

, C#, and Java
Java (programming language)
Java is a programming language originally developed by James Gosling at Sun Microsystems and released in 1995 as a core component of Sun Microsystems' Java platform. The language derives much of its syntax from C and C++ but has a simpler object model and fewer low-level facilities...

. Python
Python (programming language)
Python is a general-purpose, high-level programming language whose design philosophy emphasizes code readability. Python claims to "[combine] remarkable power with very clear syntax", and its standard library is large and comprehensive...

 bindings also exist for the C++ library and ports of LZMA to Pascal
Pascal (programming language)
Pascal is an influential imperative and procedural programming language, designed in 1968/9 and published in 1970 by Niklaus Wirth as a small and efficient language intended to encourage good programming practices using structured programming and data structuring.A derivative known as Object Pascal...

 and Go
Go (programming language)
Go is a compiled, garbage-collected, concurrent programming language developed by Google Inc.The initial design of Go was started in September 2007 by Robert Griesemer, Rob Pike, and Ken Thompson. Go was officially announced in November 2009. In May 2010, Rob Pike publicly stated that Go was being...

.

The 7-Zip implementation uses several variants of hash chain
Hash chain
A hash chain is the successive application of a cryptographic hash function to a piece of data. In computer security, a hash chain is a method to produce many one-time keys from a single key or password...

s, binary tree
Binary tree
In computer science, a binary tree is a tree data structure in which each node has at most two child nodes, usually distinguished as "left" and "right". Nodes with children are parent nodes, and child nodes may contain references to their parents. Outside the tree, there is often a reference to...

s and Patricia tries as the basis for its dictionary search algorithm.

Decompression-only code for LZMA generally compiles to around 5 KB and the amount of RAM required during decompression is principally determined by the size of the sliding window used during compression. Small code size and relatively low memory overhead, particularly with smaller dictionary lengths, and free source code make the LZMA decompression algorithm well-suited to embedded
Embedded system
An embedded system is a computer system designed for specific control functions within a larger system. often with real-time computing constraints. It is embedded as part of a complete device often including hardware and mechanical parts. By contrast, a general-purpose computer, such as a personal...

applications.

External links

The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK