7z
Encyclopedia
7z is a compressed archive file format that supports several different data compression
Data compression
In computer science and information theory, data compression, source coding or bit-rate reduction is the process of encoding information using fewer bits than the original representation would use....

, encryption
Encryption
In cryptography, encryption is the process of transforming information using an algorithm to make it unreadable to anyone except those possessing special knowledge, usually referred to as a key. The result of the process is encrypted information...

 and pre-processing algorithms. The 7z format initially appeared as implemented by the 7-Zip
7-Zip
7-Zip is an open source file archiver. 7-Zip operates with the 7z archive format, but can read and write several other archive formats. The program can be used from a command line interface, graphical user interface, or with Microsoft Windows shell integration. 7-Zip began in 1999 and is actively...

 archiver. The 7-Zip program is publicly available under the terms of the GNU Lesser General Public License
GNU Lesser General Public License
The GNU Lesser General Public License or LGPL is a free software license published by the Free Software Foundation . It was designed as a compromise between the strong-copyleft GNU General Public License or GPL and permissive licenses such as the BSD licenses and the MIT License...

. The LZMA SDK 4.62 was placed in the public domain
Public domain
Works are in the public domain if the intellectual property rights have expired, if the intellectual property rights are forfeited, or if they are not covered by intellectual property rights at all...

 in December 2008. The latest stable version of 7-Zip and LZMA SDK is version 9.20.

The MIME
Internet media type
An Internet media type, originally called a MIME type after MIME and sometimes a Content-type after the name of a header in several protocols whose value is such a type, is a two-part identifier for file formats on the Internet.The identifiers were originally defined in RFC 2046 for use in email...

 type of 7z is application/x-7z-compressed.

The official 7z file format specification is distributed with 7-Zip's source code. The specification can be found in plain text format in the doc\ sub directory of the source code distribution.

Features and enhancements

The 7z format provides the following main features:
  • Open
    Open format
    An open file format is a published specification for storing digital data, usually maintained by a standards organization, which can therefore be used and implemented by anyone. For example, an open format can be implementable by both proprietary and free and open source software, using the typical...

    , modular architecture which allows any compression, conversion, or encryption method to be stacked.
  • High compression ratio
    Data compression ratio
    Data compression ratio, also known as compression power, is a computer-science term used to quantify the reduction in data-representation size produced by a data compression algorithm...

    s (depending on the compression method used)
  • Strong Rijndael/AES
    Advanced Encryption Standard
    Advanced Encryption Standard is a specification for the encryption of electronic data. It has been adopted by the U.S. government and is now used worldwide. It supersedes DES...

    -256 encryption
    Encryption
    In cryptography, encryption is the process of transforming information using an algorithm to make it unreadable to anyone except those possessing special knowledge, usually referred to as a key. The result of the process is encrypted information...

    .
  • Large file support (up to approximately 16 exbibyte
    Exbibyte
    The exbibyte is a standards-based binary multiple of the byte, a unit of digital information storage. The exbibyte unit symbol is EiB....

    s).
  • Unicode
    Unicode
    Unicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems...

     file names
  • Support for solid compression
    Solid compression
    In computing, solid compression refers to a method for data compression of multiple files, wherein all the compressed files are concatenated and treated as a single data block. Such an archive is called a solid archive. It is used natively in the 7z and RAR formats, as well as indirectly in...

    , where multiple files of like type are compressed within a single stream, in order to exploit the combined redundancy inherent in similar files.
  • Compression and encryption of archive headers.


The format's open architecture allows additional future compression methods to be added to the standard.

Compression methods

The following compression methods are currently defined:
  • LZMA
    LZMA
    The Lempel–Ziv–Markov chain algorithm is an algorithm used to perform data compression. It has been under development since 1998 and was first used in the 7z format of the 7-Zip archiver...

     – A variation of the LZ77
    LZ77 and LZ78
    LZ77 and LZ78 are the names for the two lossless data compression algorithms published in papers by Abraham Lempel and Jacob Ziv in 1977 and 1978. They are also known as LZ1 and LZ2 respectively. These two algorithms form the basis for most of the LZ variations including LZW, LZSS, LZMA and...

     algorithm, using a sliding dictionary up to 4 GB in length for duplicate string elimination. The LZ stage is followed by entropy coding using a Markov chain
    Markov chain
    A Markov chain, named after Andrey Markov, is a mathematical system that undergoes transitions from one state to another, between a finite or countable number of possible states. It is a random process characterized as memoryless: the next state depends only on the current state and not on the...

     based range coder
    Range encoding
    Range encoding is a data compression method defined by G. Nigel N. Martin in a 1979 paper Range encoding is a form of arithmetic coding that was historically of interest for avoiding some patents on particular later-developed arithmetic coding techniques...

     and binary tree
    Binary tree
    In computer science, a binary tree is a tree data structure in which each node has at most two child nodes, usually distinguished as "left" and "right". Nodes with children are parent nodes, and child nodes may contain references to their parents. Outside the tree, there is often a reference to...

    s.
  • LZMA2 - modified version of LZMA providing better multithreading support and less expansion of incompressible data.
  • Bzip2
    Bzip2
    bzip2 is a free and open source implementation of the Burrows–Wheeler algorithm. It is developed and maintained by Julian Seward. Seward made the first public release of bzip2, version 0.15, in July 1996.-Compression efficiency:...

     – The standard Burrows–Wheeler transform algorithm. Bzip2 uses two reversible transformations; BWT, then Move to front with Huffman coding
    Huffman coding
    In computer science and information theory, Huffman coding is an entropy encoding algorithm used for lossless data compression. The term refers to the use of a variable-length code table for encoding a source symbol where the variable-length code table has been derived in a particular way based on...

     for symbol reduction (the actual compression element).

  • PPMd – Dmitry Shkarin's 2002 PPMdH (PPMII/cPPMII) with small changes: PPMII is an improved version of the 1984 PPM compression algorithm
    PPM compression algorithm
    Prediction by partial matching is an adaptive statistical data compression technique based on context modeling and prediction. PPM models use a set of previous symbols in the uncompressed symbol stream to predict the next symbol in the stream....

     (prediction by partial matching).
  • DEFLATE
    DEFLATE
    Deflate is a lossless data compression algorithm that uses a combination of the LZ77 algorithm and Huffman coding. It was originally defined by Phil Katz for version 2 of his PKZIP archiving tool and was later specified in RFC 1951....

     – Standard algorithm based on 32 kB LZ77
    LZ77 and LZ78
    LZ77 and LZ78 are the names for the two lossless data compression algorithms published in papers by Abraham Lempel and Jacob Ziv in 1977 and 1978. They are also known as LZ1 and LZ2 respectively. These two algorithms form the basis for most of the LZ variations including LZW, LZSS, LZMA and...

     (LZSS actually) and Huffman coding
    Huffman coding
    In computer science and information theory, Huffman coding is an entropy encoding algorithm used for lossless data compression. The term refers to the use of a variable-length code table for encoding a source symbol where the variable-length code table has been derived in a particular way based on...

    . Deflate is found in several file formats including ZIP
    ZIP (file format)
    Zip is a file format used for data compression and archiving. A zip file contains one or more files that have been compressed, to reduce file size, or stored as is...

    , gzip
    Gzip
    Gzip is any of several software applications used for file compression and decompression. The term usually refers to the GNU Project's implementation, "gzip" standing for GNU zip. It is based on the DEFLATE algorithm, which is a combination of Lempel-Ziv and Huffman coding...

    , PNG and PDF. 7-Zip contains a from-scratch DEFLATE encoder that frequently beats the de facto standard zlib
    Zlib
    zlib is a software library used for data compression. zlib was written by Jean-Loup Gailly and Mark Adler and is an abstraction of the DEFLATE compression algorithm used in their gzip file compression program. Zlib is also a crucial component of many software platforms including Linux, Mac OS X,...

     version in compression size, but at the expense of CPU usage.


A suite of recompression tools called AdvanceCOMP
AdvanceCOMP
AdvanceCOMP is a set of cross-platform command line data compression tools. The utilities allow modifying an already-compressed file, with the intent of reducing the file-size by optimising the compressed representation...

 contains a copy of the DEFLATE encoder from the 7-Zip implementation; these utilities can often be used to further compress the size of existing gzip
Gzip
Gzip is any of several software applications used for file compression and decompression. The term usually refers to the GNU Project's implementation, "gzip" standing for GNU zip. It is based on the DEFLATE algorithm, which is a combination of Lempel-Ziv and Huffman coding...

, ZIP
ZIP (file format)
Zip is a file format used for data compression and archiving. A zip file contains one or more files that have been compressed, to reduce file size, or stored as is...

, PNG, or MNG files.

Pre-processing filters

The LZMA SDK comes with the BCJ / BCJ2 preprocessor included, so that later stages are able to achieve greater compression: For x86, ARM
ARM architecture
ARM is a 32-bit reduced instruction set computer instruction set architecture developed by ARM Holdings. It was named the Advanced RISC Machine, and before that, the Acorn RISC Machine. The ARM architecture is the most widely used 32-bit ISA in numbers produced...

, PowerPC
PowerPC
PowerPC is a RISC architecture created by the 1991 Apple–IBM–Motorola alliance, known as AIM...

 (PPC), IA-64 Itanium
Itanium
Itanium is a family of 64-bit Intel microprocessors that implement the Intel Itanium architecture . Intel markets the processors for enterprise servers and high-performance computing systems...

, and ARM Thumb processors, jump targets are normalized before compression by changing relative position into absolute values. For x86, this means that near jumps, calls and conditional jumps (but not short jumps and conditional jumps) are converted from the machine language "jump 1655 bytes backwards" style notation to normalized "jump to address 5554" style notation; all jumps to 5554, perhaps a common subroutine, are thus encoded identically, making them more compressible.
  • BCJ - Converter for 32-bit x86 executables. Normalise target addresses of near jumps and calls from relative distances to absolute destinations.
  • BCJ2 - Pre-processor for 32-bit x86 executables. BCJ2 is an improvement on BCJ, adding additional x86 jump/call instruction processing. Near jump, near call, conditional near jump targets are split out and compressed separately in another stream.
  • Delta encoding
    Delta encoding
    Delta encoding is a way of storing or transmitting data in the form of differences between sequential data rather than complete files; more generally this is known as data differencing...

     - delta filter, basic preprocessor for multimedia data.


Similar executable pre-processing technology is included in other software; the RAR compressor features displacement compression for 32-bit x86 executables and IA-64 executables, and the UPX
UPX
UPX, the Ultimate Packer for eXecutables, is a free and open source executable packer supporting a number of file formats from different operating systems.- Compression :...

 runtime executable file compressor includes support for working with 16-bit values within DOS
DOS
DOS, short for "Disk Operating System", is an acronym for several closely related operating systems that dominated the IBM PC compatible market between 1981 and 1995, or until about 2000 if one includes the partially DOS-based Microsoft Windows versions 95, 98, and Millennium Edition.Related...

 binary files.

Encryption

The 7z format supports encryption
Encryption
In cryptography, encryption is the process of transforming information using an algorithm to make it unreadable to anyone except those possessing special knowledge, usually referred to as a key. The result of the process is encrypted information...

 with the AES
Advanced Encryption Standard
Advanced Encryption Standard is a specification for the encryption of electronic data. It has been adopted by the U.S. government and is now used worldwide. It supersedes DES...

 algorithm with a 256-bit key. The key is generated from a user-supplied passphrase
Passphrase
A passphrase is a sequence of words or other text used to control access to a computer system, program or data. A passphrase is similar to a password in usage, but is generally longer for added security. Passphrases are often used to control both access to, and operation of, cryptographic programs...

 using an algorithm based on the SHA-256 hash function. The SHA-256 is executed 219 (524288) times which causes a significant delay on slow PCs before compression or extraction starts. This technique is called key stretching and is used to make a brute-force search
Brute-force search
In computer science, brute-force search or exhaustive search, also known as generate and test, is a trivial but very general problem-solving technique that consists of systematically enumerating all possible candidates for the solution and checking whether each candidate satisfies the problem's...

 for the passphrase more difficult.
The 7z format provides the option to encrypt the filenames of a 7z archive.

Limitations

The 7z format does not store UNIX owner/group permissions, and hence can be inappropriate for backup/archival purposes. A workaround for this is to convert data to a tar bitstream
Tar (file format)
In computing, tar is both a file format and the name of a program used to handle such files...

 before compressing with 7z. But it is worth noting that GNU tar (common in many UNIX environments) can also compress with the LZMA algorithm natively, without the use of 7z, and that in this case the suggested file extension for the archive is ".tar.lzma" (or just ".tlz"), and not ".tar.7z".

The 7z format does not allow extraction of some "broken files" — that is (for example) if one has the first segment of a series of 7z files, 7z cannot give the start of the files within the archive — it must wait until all segments are downloaded. The format 7z also lacks recovery records, which might be a problem when limited file corruption has occurred.

See also

  • Comparison of archive formats
    Comparison of archive formats
    There are many popular computer data archive formats for creating and maintaining archive files. The tables below compare many popular archive formats.-Purpose:The earliest use of archive formats was for backup, mobility, and archiving....

  • List of archive formats
  • Free file format
  • Open format
    Open format
    An open file format is a published specification for storing digital data, usually maintained by a standards organization, which can therefore be used and implemented by anyone. For example, an open format can be implementable by both proprietary and free and open source software, using the typical...


External links

  • 7z Format — General description about the 7z archive format.
  • 7-Zip on Sourceforge
    SourceForge
    SourceForge Enterprise Edition is a collaborative revision control and software development management system. It provides a front-end to a range of software development lifecycle services and integrates with a number of free software / open source software applications .While originally itself...

  • Welcome to the 7-Zip Home!
  • 7zip Manual
The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK