Tar (file format)
In computing, tar (derived from tape archive) is both a file format (in the form of a type of archive bitstream) and the name of a program used to handle such files. The format was created in the early days of Unix and standardized by POSIX.1-1988 and later POSIX.1-2001.
Initially developed to be written directly to sequential I/O devices for tape backup purposes, it is now commonly used to collect many files into one larger file for distribution or archiving, while preserving file system information such as user and group permissions, dates, and directory structures.
Compression and naming
Conventionally, uncompressed tar archive files have names ending in ".tar". Unlike ZIP archives, tar files (somefile.tar) are commonly compressed as a whole rather than piecemeal. Applying a compression utility such as gzip, bzip2, xz, lzip, lzma, or compress to a tar file produces a compressed tar file, typically named with an extension indicating the type of compression (e.g., somefile.tar.gz).
Popular tar programs like the BSD and GNU versions of tar support the command line options -z (gzip) and -j (bzip2) to automatically compress or decompress the archive file they are working with. GNU tar from version 1.20 onwards also supports --lzma (LZMA); 1.21 adds support for lzop via --lzop, 1.22 adds support for xz via --xz or -J, and 1.23 adds support for lzip via --lzip.
MS-DOS's 8.3 filename limitations, and a desire for brevity, resulted in additional popular conventions for naming compressed tar archives, though this practice has declined with FAT offering long filenames.
- .tgz is equivalent to .tar.gz
- .tbz and .tb2 are equivalent to .tar.bz2
- .taz is equivalent to .tar.Z
- .tlz is equivalent to .tar.lz
- .txz is equivalent to .tar.xz
A tar file or compressed tar file is commonly referred to as a tarball.
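As an illustration of whole-archive compression, the following is a minimal sketch using the tarfile module from the Python standard library. The helper names make_tar_gz and list_names are hypothetical, and the archive is built in memory only to keep the example self-contained.

```python
import io
import tarfile

def make_tar_gz(files):
    """Return the bytes of a gzip-compressed tar archive holding `files` (name -> bytes)."""
    buf = io.BytesIO()
    # Mode "w:gz" sends the whole tar stream through gzip, so the archive is
    # compressed as a single unit rather than file by file.
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

def list_names(archive_bytes):
    """List member names; mode "r:*" lets tarfile detect the compression itself."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as tar:
        return tar.getnames()
```

Writing the result to a file named somefile.tar.gz or somefile.tgz would follow the naming conventions listed above.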
Format details
A tar file is the concatenation of one or more files. Each file is preceded by a 512-byte header record. The file data is written unaltered except that its length is rounded up to a multiple of 512 bytes and the extra space is zero filled. The end of an archive is marked by at least two consecutive zero-filled records. (The origin of tar's record size appears to be the 512-byte disk sectors used in the Version 7 Unix file system.)
Many historic tape drives read and write variable-length data blocks, leaving significant wasted space on the tape between blocks (for the tape to physically start and stop moving). Some tape drives (and raw disks) only support fixed-length data blocks. Also, when writing to any medium such as a filesystem or network, it takes less time to write one large block than many small blocks. Therefore, the tar command writes data in blocks of many 512-byte records. The user can specify a blocking factor, which is the number of records per block; the default is 20, producing 10-kilobyte blocks (10,240 bytes, which was large when UNIX was invented but now seems rather small). The final block of an archive is padded out to full length with zero bytes.
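The rounding rules can be checked with a little arithmetic. This is a sketch in Python, assuming exactly one header record per file and exactly two end-of-archive records; padded_size and archive_size are hypothetical helper names.

```python
RECORD = 512  # size of one tar record in bytes

def padded_size(n):
    """Size of n bytes of file data after rounding up to whole 512-byte records."""
    return -(-n // RECORD) * RECORD  # ceiling division, then back to bytes

def archive_size(file_sizes, blocking_factor=20):
    """Total archive size in bytes for the given file sizes."""
    size = sum(RECORD + padded_size(n) for n in file_sizes)  # header + padded data
    size += 2 * RECORD  # end-of-archive marker: two zero-filled records
    block = blocking_factor * RECORD
    return -(-size // block) * block  # pad the final block to full length
```

With the default blocking factor, an archive holding a single one-byte file occupies a full 10,240-byte block: one header record, one data record, two end records, and sixteen records of padding.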
File header
The file header block contains metadata about a file. To ensure portability across different architectures with different byte orderings, the information in the header block is encoded in ASCII. Thus if all the files in an archive are text files, and have ASCII names, then the archive is essentially an ASCII text file (containing many NUL characters).
Subsequent extensions have diluted this original attribute (which was never particularly valuable, due to the presence of NULs, which are often poorly handled by text-manipulation programs).
The fields defined by the original Unix tar format are listed in the table below. The link indicator/file type table includes some modern extensions. When a field is unused it is filled with NUL bytes. The header is padded with NUL bytes to make it fill a 512 byte block.
Field Offset | Field Size | Field |
---|---|---|
0 | 100 | File name |
100 | 8 | File mode |
108 | 8 | Owner's numeric user ID |
116 | 8 | Group's numeric group ID |
124 | 12 | File size in bytes |
136 | 12 | Last modification time in numeric Unix time format |
148 | 8 | Checksum for header block |
156 | 1 | Link indicator (file type) |
157 | 100 | Name of linked file |
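The fixed offsets in the table map directly onto a binary layout description. The following Python sketch decodes one header record with the standard struct module; parse_header and the dictionary keys are hypothetical names chosen for illustration, and the sketch assumes plain octal fields with no later extensions.

```python
import struct

# Field widths follow the table above; "255x" skips the NUL padding that
# fills the rest of the 512-byte record. "=" disables alignment padding.
V7_HEADER = struct.Struct("=100s8s8s8s12s12s8sc100s255x")
FIELDS = ("name", "mode", "uid", "gid", "size", "mtime",
          "checksum", "link_indicator", "linkname")

def _text(raw):
    # Text fields end at the first NUL byte.
    return raw.split(b"\0", 1)[0].decode("ascii")

def _octal(raw):
    # Numeric fields hold octal ASCII digits padded with NULs or spaces.
    digits = raw.strip(b" \0")
    return int(digits, 8) if digits else 0

def parse_header(record):
    """Decode one 512-byte header record into a dict (hypothetical helper)."""
    if len(record) != V7_HEADER.size:
        raise ValueError("a header record is exactly 512 bytes")
    raw = dict(zip(FIELDS, V7_HEADER.unpack(record)))
    header = {key: _octal(raw[key])
              for key in ("mode", "uid", "gid", "size", "mtime", "checksum")}
    header["name"] = _text(raw["name"])
    header["linkname"] = _text(raw["linkname"])
    header["link_indicator"] = raw["link_indicator"]  # a single byte such as b"0"
    return header
```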
The Link indicator field can have the following values:
Value | Meaning |
---|---|
'0' | Normal file |
(ASCII NUL) | Normal file (now obsolete) |
'1' | Hard link |
'2' | Symbolic link |
'3' | Character special |
'4' | Block special |
'5' | Directory |
'6' | FIFO |
'7' | Contiguous file |
A directory is also indicated by having a trailing slash (/) in the name.
Numeric values are encoded in octal numbers using ASCII digits, with leading zeroes. For historical reasons, a final NUL or space character should be used. Thus although there are 12 bytes reserved for storing the file size, only 11 octal digits can be stored. This gives a maximum file size of 8 gigabytes on archived files. To overcome this limitation, star in 2001 introduced a base-256 coding that is indicated by setting the high-order bit of the leftmost byte of a numeric field. GNU tar and BSD tar followed this idea. Additionally, versions of tar from before the first POSIX standard from 1988 pad the values with spaces instead of zeroes.
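The two encodings can be sketched in Python as follows. The function names are hypothetical, and the base-256 branch assumes a non-negative value stored big-endian in the bytes after the marker bit; negative values are not handled here.

```python
def encode_octal(value, width):
    """Encode value as zero-padded octal digits plus a final NUL, filling `width` bytes."""
    digits = width - 1  # one byte is reserved for the terminator
    if not 0 <= value < 8 ** digits:
        raise ValueError("value does not fit in %d octal digits" % digits)
    return format(value, "o").zfill(digits).encode("ascii") + b"\0"

def decode_number(field):
    """Decode a numeric header field: octal ASCII, or base-256 if the top bit is set."""
    if field[0] & 0x80:
        # Base-256: clear the marker bit and read the field as a big-endian integer.
        return int.from_bytes(bytes([field[0] & 0x7F]) + field[1:], "big")
    digits = field.strip(b" \0")  # tolerates NUL, space, and space-padded fields
    return int(digits, 8) if digits else 0
```

For the 12-byte size field, 11 octal digits give 8**11 = 8,589,934,592 distinct values, so the largest size that can be stored in octal is one byte short of 8 GiB.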
The checksum is calculated by taking the sum of the unsigned byte values of the header block, with the eight checksum bytes taken to be ASCII spaces (decimal value 32). It is stored as a six-digit octal number with leading zeroes, followed by a NUL and then a space. Various implementations do not adhere to this format, so reading the first six digits after trimming white space yields better compatibility. In addition, some historic tar implementations treated bytes as signed. Implementations typically calculate the checksum both ways, and treat it as good if either the signed or the unsigned sum matches the stored checksum.
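A minimal Python sketch of this check, assuming the header is available as a 512-byte string with the checksum field at offset 148 as in the field table; header_sums and checksum_matches are hypothetical names.

```python
def header_sums(record):
    """Return (unsigned, signed) byte sums of a header with the checksum field blanked."""
    # The eight checksum bytes at offset 148 are counted as ASCII spaces.
    blanked = record[:148] + b" " * 8 + record[156:512]
    unsigned = sum(blanked)
    # Historic variant: bytes above 127 are treated as negative values.
    signed = sum(b - 256 if b > 127 else b for b in blanked)
    return unsigned, signed

def checksum_matches(record):
    """Accept the header if the stored value equals either the unsigned or signed sum."""
    digits = record[148:156].strip(b" \0")  # trim NULs and white space first
    try:
        stored = int(digits, 8)
    except ValueError:
        return False  # empty or non-octal checksum field
    return stored in header_sums(record)
```

The two sums differ only when the header contains bytes with the high bit set, for example in a non-ASCII file name.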
Unix filesystems support multiple links (names) for the same file. If several such files appear in a tar archive, only the first one is archived as a normal file; the rest are archived as hard links, with the "name of linked file" field set to the first one's name. On extraction, such hard links should be recreated in the file system.
UStar format
Most modern tar programs read and write archives in the UStar (Uniform Standard Tape Archive) format, introduced by the POSIX IEEE P1003.1 standard from 1988. It introduced additional header fields. Older tar programs will ignore the extra information, while newer programs will test for the presence of the "ustar" string to determine if the new format is in use. The UStar format allows for longer file names and stores extra information about each file.
Field Offset | Field Size | Field |
---|---|---|
0 | 156 | (as in old format) |
156 | 1 | Type flag |
157 | 100 | (as in old format) |
257 | 6 | UStar indicator "ustar" |
263 | 2 | UStar version "00" |
265 | 32 | Owner user name |
297 | 32 | Owner group name |
329 | 8 | Device major number |
337 | 8 | Device minor number |
345 | 155 | Filename prefix |
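The sketch below shows, in Python, how a reader might test for the "ustar" indicator and rebuild a long file name. It assumes the usual UStar convention that a non-empty prefix is joined to the name with a slash; ustar_path is a hypothetical helper name.

```python
def ustar_path(record):
    """Return the full path from a 512-byte header, joining the UStar prefix if present."""
    def text(raw):
        return raw.split(b"\0", 1)[0].decode("ascii")  # stop at the first NUL

    name = text(record[0:100])
    if record[257:262] != b"ustar":
        return name  # old-format header: only the 100-byte name field exists
    prefix = text(record[345:500])  # 155-byte filename prefix
    return prefix + "/" + name if prefix else name
```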
Tarpipe
Many old tar implementations (such as GNU tar) do not record extended attributes (xattrs) or ACLs.
In 2001, star introduced support for ACLs and extended attributes. Later, major Linux distributions created their own patched versions of GNU tar that fully support ACLs.
A tarpipe is the process of creating a tar archive on stdout and then, in another directory, extracting the tar file from the piped stdin. This is one way to copy directories and subdirectories, even if the directories contain special files, such as symlinks, and character or block devices. Example Tarpipe:
tar -c "$srcdir" | tar -C "$destdir" -xv
Tarpipes are inefficient because of the overhead of the second tar invocation. This is especially expensive on DOS/Windows when using command.com, because the pipe is buffered to a file; that setup is no longer common, since cmd.exe supports parallel pipes. Also, pipes may cause problems in automated scripts, as the success or failure of each command cannot be verified easily.
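The same idea can be sketched inside a single Python process, with a thread standing in for the first tar command and an operating system pipe between the two ends. This is an illustration under those assumptions rather than a replacement for the shell command; the function name tarpipe is hypothetical, and the safer "data" extraction filter is used only where the installed tarfile module offers it.

```python
import os
import tarfile
import threading

def tarpipe(src_dir, dest_dir):
    """Copy a directory tree by streaming a tar archive through a pipe."""
    read_fd, write_fd = os.pipe()

    def produce():
        # Mode "w|" writes a non-seekable tar stream, like tar writing to stdout.
        with os.fdopen(write_fd, "wb") as out, \
                tarfile.open(fileobj=out, mode="w|") as tar:
            tar.add(src_dir, arcname=".")

    producer = threading.Thread(target=produce)
    producer.start()
    try:
        # Mode "r|" reads the stream sequentially, like tar reading from stdin.
        with os.fdopen(read_fd, "rb") as inp, \
                tarfile.open(fileobj=inp, mode="r|") as tar:
            kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
            tar.extractall(dest_dir, **kwargs)
    finally:
        producer.join()
```

As with the shell version, a failure on the writing side is not reported to the caller in this sketch.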
Software distribution
The tar format continues to be used extensively on Linux and Windows, as the tar program is part of the GNU project and comes with every distribution of GNU/Linux. It features prominently in software distribution on Linux: most software source code is made available as .tar.gz (compressed tar) archives, and tar serves as the basis of other packaging and distribution formats, such as Debian's .deb archives, which are also used in Ubuntu.
Problems and limitations
The original tar format was created in the early days of UNIX, and despite current widespread use, many of its design features are considered dated.
In 1997, a method for adding unlimited extensions to the tar format was proposed by Sun and later accepted for the POSIX.1-2001 standard. This format is known as the extended tar format or pax format. The new format allows any type of vendor-tagged, vendor-specific enhancement to be added. The following enhancement tags are defined by the POSIX standard:
- all three time stamps of a file in arbitrary resolution (most implementations use nanosecond granularity)
- path names of unlimited length and character set coding
- symlink target names of unlimited length and character set coding
- user and group names of unlimited length and character set coding
- files with unlimited size (the historic tar format is limited to 8 GB)
- user ID and group ID without limitation (the historic tar format is limited to a maximum ID of 2097151)
- a character set definition for path names and user/group names
Star was the first tar implementation to support this enhanced archive format, in 2001; GNU tar added support in 2004.
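The user ID limit makes a convenient test case. The sketch below uses the tarfile module from the Python standard library, which can write both formats; roundtrip_uid is a hypothetical helper, and the behaviour described is that of Python's implementation rather than of star or GNU tar.

```python
import io
import tarfile

def roundtrip_uid(uid, fmt):
    """Write one empty member owned by `uid` in the given format and read the uid back."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=fmt) as tar:
        info = tarfile.TarInfo("example.txt")
        info.uid = uid
        tar.addfile(info)
    buf.seek(0)
    with tarfile.open(fileobj=buf, mode="r") as tar:
        return tar.getmember("example.txt").uid
```

The 8-byte ID field holds seven octal digits, hence the maximum of 8**7 - 1 = 2097151. Calling roundtrip_uid(3000000, tarfile.PAX_FORMAT) is expected to return the ID unchanged, because the value moves into an extended header, while the same call with tarfile.USTAR_FORMAT raises ValueError.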
Other formats have been created to address the shortcomings of tar, such as DAR (Disk Archiver) and rdiff-backup (see the Duplicity branch of the Savannah software site); these formats are, however, not part of any official standard.
Operating system support
Unix-like operating systems usually include tools to support tar files, as well as utilities commonly used to compress them, such as gzip and bzip2. In contrast, Microsoft Windows includes neither graphical nor command-line tools for reading or creating these formats, requiring the use of third-party tools like 7-Zip.
Tarbomb
A tarbomb is derogatory hacker slang used to refer to a tarball that does not follow the usual conventions, i.e. it contains many files that extract into the working directory. Such a tarball can create problems by overwriting files of the same name in the working directory, or mixing one project's files into another. It is almost always an inconvenience to the user, who is obliged to identify and delete a number of files scattered throughout the directory's contents. Such behavior is considered bad etiquette on the part of the archive's creator.
A related problem is the use of absolute paths or parent directory references when creating tarballs. Files extracted from such tarballs will often be created in unusual locations outside the working directory and, like a tarbomb, have the potential to overwrite existing files. GNU tar by default refuses to create or extract absolute paths, but is still vulnerable to parent-directory references. (The bsdtar program, which is available on a number of operating systems and is the default "tar" on Mac OS X v10.6, also ordinarily refuses to follow parent-directory references or symlinks.)
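Both kinds of problem show up in the member names alone, so they can be flagged before anything is extracted. The following Python sketch classifies a list of names; archive_problems is a hypothetical helper, and treating "more than one top-level entry" as a tarbomb is a simplifying assumption.

```python
import posixpath

def archive_problems(names):
    """Classify member names from an archive listing (hypothetical helper)."""
    absolute = [n for n in names if n.startswith("/")]
    parent_refs = [n for n in names if ".." in n.split("/")]
    # Collect the first path component of every relative name.
    top_levels = {posixpath.normpath(n).split("/")[0]
                  for n in names if not n.startswith("/")}
    return {
        "absolute": absolute,            # would be written outside the working directory
        "parent_refs": parent_refs,      # may climb out of the working directory
        "tarbomb": len(top_levels) > 1,  # members scatter into the working directory
    }
```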
A command-line user can avoid both of these problems by first examining a tar file with a command like:
tar -tf archive.tar
This does not extract any files, but displays the names of all files in the archive. If any are problematic, the user can create a new empty directory and extract the tarball into it, or avoid the tarball entirely. Most graphical tools can display the contents of the archive before extracting them.
Random access
Another weakness of the tar format compared to other archive formats is that there is no centralized location for information about the contents of the file (a "table of contents" of sorts). To list the names of the files in the archive, one must read through the entire archive and look for the places where files start. Likewise, to extract one small file, one cannot look up its offset in a table and go directly to that location, as with other archive formats; one has to read through the archive, looking for the place where the desired file starts. For large tar archives, this causes a significant performance penalty, making tar archives unsuitable for situations that often require random access to individual files.
A likely reason for the lack of a centralized index is that tar was originally meant for tapes, which are poor at random access anyway. If the table of contents (TOC) were at the start of the archive, creating it would require first calculating the positions of all files, which needs either a second pass over the data, a large cache, or rewinding the tape after writing everything in order to write the TOC. If, on the other hand, the TOC were at the end of the file (as is the case with ZIP files, for example), reading it would require winding the tape to the end, which takes time and degrades the tape through wear and tear. Compression further complicates matters: calculating compressed positions for a TOC at the start would require compressing everything before writing the TOC; a TOC with uncompressed positions is of little use, since everything has to be decompressed anyway to reach the right positions; and decompressing a TOC at the end of the file might also require decompressing the whole file.
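The sequential scan described in this section can be illustrated with a short sketch in Python. It assumes an uncompressed archive in the ustar layout, where each 512-byte header stores the member name in bytes 0 to 99 and the size as an octal string in bytes 124 to 135. Long names, extended headers and large-file encodings are deliberately ignored. The function names build_demo_archive, scan_headers and find_member are hypothetical.

```python
import io
import tarfile

RECORD = 512  # tar record size in bytes


def build_demo_archive():
    """Create a small uncompressed ustar archive in memory (demo data only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as archive:
        for name, payload in (("a.txt", b"hello"), ("dir/b.bin", b"\x01" * 1000)):
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def scan_headers(data):
    """Walk the archive header by header, yielding (name, data_offset, size)."""
    offset = 0
    while offset + RECORD <= len(data):
        header = data[offset:offset + RECORD]
        if header == bytes(RECORD):  # zero-filled record marks the end
            break
        name = header[0:100].split(b"\0", 1)[0].decode("utf-8")
        size = int(header[124:136].strip(b"\0 ").decode("ascii") or "0", 8)
        yield name, offset + RECORD, size
        # Skip the header plus the file data rounded up to whole records.
        offset += RECORD + (size + RECORD - 1) // RECORD * RECORD


def find_member(data, wanted):
    """Return a member's bytes; every earlier header is visited first."""
    for name, data_offset, size in scan_headers(data):
        if name == wanted:
            return data[data_offset:data_offset + size]
    return None


if __name__ == "__main__":
    archive_bytes = build_demo_archive()
    for entry in scan_headers(archive_bytes):
        print(entry)
```

There is no way to jump straight to a member: find_member has to step over every preceding header, and with a compressed archive the intervening data would also have to be decompressed.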
Key implementations
Historically, there have been many implementations of tar, and many general file archivers have at least partial support for tar (often using one of the implementations below). Most tar implementations can also read and create cpio and pax archives (the latter is actually the tar format with POSIX.1-2001 extensions).
- FreeBSD tar (also BSD tar) is the default tar on most Berkeley Software Distribution-based operating systems, including Mac OS X. The core functionality is available as libarchive for inclusion in other applications.
- GNU tar is the default on most Linux distributions. It is based on the public domain implementation pdtar, which started in 1987. It can use various formats, including the ustar, pax, GNU and v7 formats.
- Solaris tar is based on the original UNIX V7 tar and is the default on the Solaris operating system.
- star (unique standard tape archiver) was created in 1982 by Jörg Schilling and is published under the CDDL license. In 1999 it was found to be the fastest tar implementation.
Additionally, most pax implementations can read and create many types of tar files.
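The relationship between the pax and ustar formats mentioned above can be observed with Python's standard-library tarfile module, which is not one of the implementations listed here and is used only for illustration. In this sketch, LONG_NAME and inspect_pax_archive are hypothetical. The name is chosen to exceed the 100-byte name field of the ustar header, so the writer has to emit a pax extended header (type flag "x") in front of the ordinary header.

```python
import io
import tarfile

# Hypothetical member name, longer than the 100-byte ustar name field.
LONG_NAME = "d/" + "x" * 150


def inspect_pax_archive():
    """Write one member in pax format and inspect the first header record."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as archive:
        payload = b"data"
        info = tarfile.TarInfo(LONG_NAME)
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    data = buf.getvalue()

    first_type = data[156:157]  # type flag of the first header record
    magic = data[257:262]       # format magic shared with ustar
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:") as archive:
        names = archive.getnames()
    return first_type, magic, names


if __name__ == "__main__":
    print(inspect_pax_archive())
```

The first record carries the "ustar" magic like any other header, which is why a reader without pax support can still step through the archive, although it will not recover the full long name.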
Adoption rates
Some tar implementation is installed by default on most UNIX-like operating systems and is often required for system operation. For example, 99.99% of all Ubuntu Linux computers have GNU tar installed, with BSD tar and star at 0.05% and 0.04% respectively. libarchive, the reusable part of BSD tar, is installed on 76.4% of Ubuntu computers, since it is used by various applications including parts of GNOME, the default desktop environment on Ubuntu. On Debian, libarchive1 is installed on 35.8% of computers. On BSD and Solaris operating systems, these numbers can be expected to look completely different, since those systems include a different tar by default. As such, the numbers mostly reflect the defaults chosen by these operating systems and desktop environments.
Since most general file archivers on other platforms, including Windows, also support the tar archive format, the format can be considered usable on virtually any computer platform.
See also
- List of archive formats
- Comparison of file archivers
- List of Unix programs