Burrows-Wheeler transform
Encyclopedia
The Burrows–Wheeler transform (BWT, also called block-sorting compression), is an algorithm
used in data compression
techniques such as bzip2
. It was invented by Michael Burrows
and David Wheeler in 1994 while working at DEC Systems Research Center
in Palo Alto, California. It is based on a previously unpublished transformation discovered by Wheeler in 1983.
When a character string is transformed by the BWT, none of its characters change value. The transformation permutes
the order of the characters. If the original string had several substrings that occurred often, then the transformed string will have several places where a single character is repeated multiple times in a row. This is useful for compression, since it tends to be easy to compress a string that has runs of repeated characters by techniques such as move-to-front transform
and run-length encoding
.
For example:
The output is easier to compress because it has many repeated characters.
In fact, in the transformed string, there are a total of six runs of identical characters:
XX,
SS,
PP,
..,
II,
and
III, which together make 13 out of the 44 characters in it.
all rotations of the text in lexicographic order, then taking the last column. For example, the text "^BANANA@" is transformed into "BNN^AA@A" through these steps (the red @ character indicates the 'EOF
' pointer):
The following pseudocode
gives a simple but inefficient way to calculate the BWT and its inverse. It assumes that the input string
function BWT (string s)
create a table, rows are all possible rotations of s
sort rows alphabetically
return (last column of the table)
function inverseBWT (string s)
create empty table
repeat length(s) times
insert s as a column of table before first column of the table // first insert creates first column
sort rows of the table alphabetically
return (row that ends with the 'EOF' character)
To understand why this creates more-easily-compressible data, let's consider transforming a long English text frequently containing the word "the". Sorting the rotations of this text will often group rotations starting with "he " together, and the last character of that rotation (which is also the character before the "he ") will usually be "t", so the result of the transform would contain a number of "t" characters along with the perhaps less-common exceptions (such as if it contains "Brahe ") mixed in. So it can be seen that the success of this transform depends upon one value having a high probability of occurring before a sequence, so that in general it needs fairly long samples (a few kilobytes at least) of appropriate data (such as text).
The remarkable thing about the BWT is not that it generates a more easily encoded output—an ordinary sort would do that—but that it is reversible, allowing the original document to be re-generated from the last column data.
The inverse can be understood this way. Take the final table in the BWT algorithm, and erase all but the last column. Given only this information, you can easily reconstruct the first column. The last column tells you all the characters in the text, so just sort these characters to get the first column. Then, the first and last columns together give you all pairs of successive characters in the document, where pairs are taken cyclically so that the last and first character form a pair. Sorting the list of pairs gives the first and second columns. Continuing in this manner, you can reconstruct the entire list. Then, the row with the "end of file" character at the end is the original text. Reversing the example above is done like this:
s can make these algorithms run more efficiently without changing the output. In BWT, there is no need to represent the table in either the encoder or decoder. In the encoder, each row of the table can be represented by a single pointer into the strings, and the sort performed using the indices. Some care must be taken to ensure that the sort does not exhibit
bad worst-case
behavior: Standard library sort functions are unlikely to be appropriate. In the decoder, there is also no need to store the table, and in fact no sort is needed at all. In time proportional to the alphabet size and string length, the decoded string may be generated one character at a time from right to left. A "character" in the algorithm can be a byte, or a bit, or any other convenient size.
There is no need to have an actual 'EOF' character. Instead, a pointer can be used that remembers where in a string the 'EOF' would be if it existed. In this approach, the output of the BWT must include both the transformed string, and the final value of the pointer. That means the BWT does expand its input slightly. The inverse transform then shrinks it back down to the original size: it is given a string and a pointer, and returns just a string.
A complete description of the algorithms can be found in Burrows and Wheeler's paper, or in a number of online sources.
adding an 'EOF' marker to the input or, augmenting the output with information, such as an index,
that makes it possible to identify the input string from the class of all of its rotations.
There is a bijective version of the transform, by which the transformed string
uniquely identifies the original. In this version, every string has a unique inverse
of the same length.
The bijective transform is computed by first factoring the input into a non-increasing sequence of Lyndon word
s; such a factorization exists by the Chen–Fox–Lyndon theorem, and can be found in linear time. Then, the algorithm sorts together all the rotations of all of these words; as in the usual Burrows-Wheeler transform, this produces a sorted sequence of n strings.
The transformed string is then obtained by picking the final character of each of these strings in this sorted list.
For example, applying the bijective transform gives:
The bijective transform includes eight runs of identical
characters. These runs are, in order: XX,
II,
XX,
PP,
..,
EE,
..,
and
IIII.
In total, 18 characters
take part in these runs.
implementation sacrifices speed for simplicity: the program is short, but takes more than the linear time that would be desired in a practical implementation.
Using the null character
as the end of file marker, and using
The inverse transform repeatedly inserts
Here is another, more efficient method for the inverse transform. Although more complex, it increases the speed greatly when decoding lengthy strings.
BWT in bioinformatics
The advent of high-throughput sequencing (HTS) techniques at the end of the 2000 decade has led to another application of the Burrows–Wheeler transformation. In HTS, DNA
is fragmented into small pieces, of which the first few bases are sequenced
, yielding several millions of "reads", each 30 to 500 base pair
s ("DNA characters") long. In many experiments, e.g., in ChIP-Seq, the task is now to align these reads to a reference genome
, i.e., to the known, nearly complete sequence of the organism in question (which may be up to several billion base pairs long). A number of alignment programs, specialized for this task, were published, which initially relied on hashing (e.g., Eland, SOAP, or Maq). In an effort to reduce the memory requirement for sequence alignment, several alignment programs were developed (Bowtie, BWA, and SOAP2) which use the Burrows–Wheeler transform.
External links
Algorithm
In mathematics and computer science, an algorithm is an effective method expressed as a finite list of well-defined instructions for calculating a function. Algorithms are used for calculation, data processing, and automated reasoning...
used in data compression
Data compression
In computer science and information theory, data compression, source coding or bit-rate reduction is the process of encoding information using fewer bits than the original representation would use....
techniques such as bzip2
Bzip2
bzip2 is a free and open source implementation of the Burrows–Wheeler algorithm. It is developed and maintained by Julian Seward. Seward made the first public release of bzip2, version 0.15, in July 1996.-Compression efficiency:...
. It was invented by Michael Burrows
Michael Burrows
Michael Burrows is widely known as the creator of the Burrows–Wheeler transform. He also was, with Louis Monier, one of the two main creators of AltaVista. He did his first degree in Electronic Engineering with Computer Science at University College London...
and David Wheeler in 1994 while working at DEC Systems Research Center
DEC Systems Research Center
The Systems Research Center was a research laboratory created by Digital Equipment Corporation in 1984, in Palo Alto, California....
in Palo Alto, California. It is based on a previously unpublished transformation discovered by Wheeler in 1983.
When a character string is transformed by the BWT, none of its characters change value. The transformation permutes
Permutation
In mathematics, the notion of permutation is used with several slightly different meanings, all related to the act of permuting objects or values. Informally, a permutation of a set of objects is an arrangement of those objects into a particular order...
the order of the characters. If the original string had several substrings that occurred often, then the transformed string will have several places where a single character is repeated multiple times in a row. This is useful for compression, since it tends to be easy to compress a string that has runs of repeated characters by techniques such as move-to-front transform
Move-to-front transform
The move-to-front transform is an encoding of data designed to improve the performance of entropy encoding techniques of compression...
and run-length encoding
Run-length encoding
Run-length encoding is a very simple form of data compression in which runs of data are stored as a single data value and count, rather than as the original run...
.
For example:
Input | SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES |
---|---|
Output | TEXYDST.E.IXIXIXXSSMPPS.B..E.S.EUSFXDIIOIIIT |
The output is easier to compress because it has many repeated characters.
In fact, in the transformed string, there are a total of six runs of identical characters:
XX,
SS,
PP,
..,
II,
and
III, which together make 13 out of the 44 characters in it.
Example
The transform is done by sortingSorting
Sorting is any process of arranging items in some sequence and/or in different sets, and accordingly, it has two common, yet distinct meanings:# ordering: arranging items of the same kind, class, nature, etc...
all rotations of the text in lexicographic order, then taking the last column. For example, the text "^BANANA@" is transformed into "BNN^AA@A" through these steps (the red @ character indicates the 'EOF
End-of-file
In computing, end of file is a condition in a computer operating system where no more data can be read from a data source...
' pointer):
Transformation | |||
---|---|---|---|
Input | All Rotations |
Sorted List of Rotations |
Output Last Column |
^BANANA@ |
^BANANA@ @^BANANA A@^BANAN NA@^BANA ANA@^BAN NANA@^BA ANANA@^B BANANA@^ |
ANANA@^B ANA@^BAN A@^BANAN BANANA@^ NANA@^BA NA@^BANA ^BANANA@ @^BANANA |
BNN^AA@A |
The following pseudocode
Pseudocode
In computer science and numerical computation, pseudocode is a compact and informal high-level description of the operating principle of a computer program or other algorithm. It uses the structural conventions of a programming language, but is intended for human reading rather than machine reading...
gives a simple but inefficient way to calculate the BWT and its inverse. It assumes that the input string
s
contains a special character 'EOF' which is the last character, occurs nowhere else in the text, and is ignored during sorting.function BWT (string s)
create a table, rows are all possible rotations of s
sort rows alphabetically
return (last column of the table)
function inverseBWT (string s)
create empty table
repeat length(s) times
insert s as a column of table before first column of the table // first insert creates first column
sort rows of the table alphabetically
return (row that ends with the 'EOF' character)
To understand why this creates more-easily-compressible data, let's consider transforming a long English text frequently containing the word "the". Sorting the rotations of this text will often group rotations starting with "he " together, and the last character of that rotation (which is also the character before the "he ") will usually be "t", so the result of the transform would contain a number of "t" characters along with the perhaps less-common exceptions (such as if it contains "Brahe ") mixed in. So it can be seen that the success of this transform depends upon one value having a high probability of occurring before a sequence, so that in general it needs fairly long samples (a few kilobytes at least) of appropriate data (such as text).
The remarkable thing about the BWT is not that it generates a more easily encoded output—an ordinary sort would do that—but that it is reversible, allowing the original document to be re-generated from the last column data.
The inverse can be understood this way. Take the final table in the BWT algorithm, and erase all but the last column. Given only this information, you can easily reconstruct the first column. The last column tells you all the characters in the text, so just sort these characters to get the first column. Then, the first and last columns together give you all pairs of successive characters in the document, where pairs are taken cyclically so that the last and first character form a pair. Sorting the list of pairs gives the first and second columns. Continuing in this manner, you can reconstruct the entire list. Then, the row with the "end of file" character at the end is the original text. Reversing the example above is done like this:
Inverse Transformation | |||
---|---|---|---|
Input | |||
BNN^AA@A |
|||
Add 1 | Sort 1 | Add 2 | Sort 2 |
B N N ^ A A @ A |
A A A B N N ^ @ |
BA NA NA ^B AN AN @^ A@ |
AN AN A@ BA NA NA ^B @^ |
Add 3 | Sort 3 | Add 4 | Sort 4 |
BAN NAN NA@ ^BA ANA ANA @^B A@^ |
ANA ANA A@^ BAN NAN NA@ ^BA @^B |
BANA NANA NA@^ ^BAN ANAN ANA@ @^BA A@^B |
ANAN ANA@ A@^B BANA NANA NA@^ ^BAN @^BA |
Add 5 | Sort 5 | Add 6 | Sort 6 |
BANAN NANA@ NA@^B ^BANA ANANA ANA@^ @^BAN A@^BA |
ANANA ANA@^ A@^BA BANAN NANA@ NA@^B ^BANA @^BAN |
BANANA NANA@^ NA@^BA ^BANAN ANANA@ ANA@^B @^BANA A@^BAN |
ANANA@ ANA@^B A@^BAN BANANA NANA@^ NA@^BA ^BANAN @^BANA |
Add 7 | Sort 7 | Add 8 | Sort 8 |
BANANA@ NANA@^B NA@^BAN ^BANANA ANANA@^ ANA@^BA @^BANAN A@^BANA |
ANANA@^ ANA@^BA A@^BANA BANANA@ NANA@^B NA@^BAN ^BANANA @^BANAN |
BANANA@^ NANA@^BA NA@^BANA ^BANANA@ ANANA@^B ANA@^BAN @^BANANA A@^BANAN |
ANANA@^B ANA@^BAN A@^BANAN BANANA@^ NANA@^BA NA@^BANA ^BANANA@ @^BANANA |
Output | |||
^BANANA@ |
Optimization
A number of optimizationOptimization (computer science)
In computer science, program optimization or software optimization is the process of modifying a software system to make some aspect of it work more efficiently or use fewer resources...
s can make these algorithms run more efficiently without changing the output. In BWT, there is no need to represent the table in either the encoder or decoder. In the encoder, each row of the table can be represented by a single pointer into the strings, and the sort performed using the indices. Some care must be taken to ensure that the sort does not exhibit
bad worst-case
Best, worst and average case
In computer science, best, worst and average cases of a given algorithm express what the resource usage is at least, at most and on average, respectively...
behavior: Standard library sort functions are unlikely to be appropriate. In the decoder, there is also no need to store the table, and in fact no sort is needed at all. In time proportional to the alphabet size and string length, the decoded string may be generated one character at a time from right to left. A "character" in the algorithm can be a byte, or a bit, or any other convenient size.
There is no need to have an actual 'EOF' character. Instead, a pointer can be used that remembers where in a string the 'EOF' would be if it existed. In this approach, the output of the BWT must include both the transformed string, and the final value of the pointer. That means the BWT does expand its input slightly. The inverse transform then shrinks it back down to the original size: it is given a string and a pointer, and returns just a string.
A complete description of the algorithms can be found in Burrows and Wheeler's paper, or in a number of online sources.
Bijective variant
Since any rotation of the input string will lead to the same transformed string, the BWT cannot be inverted withoutadding an 'EOF' marker to the input or, augmenting the output with information, such as an index,
that makes it possible to identify the input string from the class of all of its rotations.
There is a bijective version of the transform, by which the transformed string
uniquely identifies the original. In this version, every string has a unique inverse
of the same length.
The bijective transform is computed by first factoring the input into a non-increasing sequence of Lyndon word
Lyndon word
In mathematics, in the areas of combinatorics and computer science, a Lyndon word is a string that is strictly smaller in lexicographic order than all of its rotations...
s; such a factorization exists by the Chen–Fox–Lyndon theorem, and can be found in linear time. Then, the algorithm sorts together all the rotations of all of these words; as in the usual Burrows-Wheeler transform, this produces a sorted sequence of n strings.
The transformed string is then obtained by picking the final character of each of these strings in this sorted list.
For example, applying the bijective transform gives:
Input | SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES |
---|---|
Lyndon words | SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES |
Output | STEYDST.E.IXXIIXXSMPPXS.B..EE..SUSFXDIOIIIIT |
The bijective transform includes eight runs of identical
characters. These runs are, in order: XX,
II,
XX,
PP,
..,
EE,
..,
and
IIII.
In total, 18 characters
take part in these runs.
Dynamic Burrows–Wheeler transform
Instead of reconstructing the Burrows–Wheeler transform of an edited text, Salson et al. propose an algorithm that deduces the new Burrows–Wheeler transform from the original one, doing a limited number of local reorderings in the original Burrows–Wheeler transform.Sample implementation
This PythonPython (programming language)
Python is a general-purpose, high-level programming language whose design philosophy emphasizes code readability. Python claims to "[combine] remarkable power with very clear syntax", and its standard library is large and comprehensive...
implementation sacrifices speed for simplicity: the program is short, but takes more than the linear time that would be desired in a practical implementation.
Using the null character
Null character
The null character , abbreviated NUL, is a control character with the value zero.It is present in many character sets, including ISO/IEC 646 , the C0 control code, the Universal Character Set , and EBCDIC...
as the end of file marker, and using
s[i:] + s[:i]
to construct the ith rotation of s
, the forward transform takes the last character of each of the sorted rows:The inverse transform repeatedly inserts
r
as the left column of the table and sorts the table. After the whole table is built, it returns the row that ends with null, minus the null.Here is another, more efficient method for the inverse transform. Although more complex, it increases the speed greatly when decoding lengthy strings.
BWT in bioinformatics
The advent of high-throughput sequencing (HTS) techniques at the end of the 2000 decade has led to another application of the Burrows–Wheeler transformation. In HTS, DNA
DNA
Deoxyribonucleic acid is a nucleic acid that contains the genetic instructions used in the development and functioning of all known living organisms . The DNA segments that carry this genetic information are called genes, but other DNA sequences have structural purposes, or are involved in...
is fragmented into small pieces, of which the first few bases are sequenced
DNA sequencing
DNA sequencing includes several methods and technologies that are used for determining the order of the nucleotide bases—adenine, guanine, cytosine, and thymine—in a molecule of DNA....
, yielding several millions of "reads", each 30 to 500 base pair
Base pair
In molecular biology and genetics, the linking between two nitrogenous bases on opposite complementary DNA or certain types of RNA strands that are connected via hydrogen bonds is called a base pair...
s ("DNA characters") long. In many experiments, e.g., in ChIP-Seq, the task is now to align these reads to a reference genome
Genome
In modern molecular biology and genetics, the genome is the entirety of an organism's hereditary information. It is encoded either in DNA or, for many types of virus, in RNA. The genome includes both the genes and the non-coding sequences of the DNA/RNA....
, i.e., to the known, nearly complete sequence of the organism in question (which may be up to several billion base pairs long). A number of alignment programs, specialized for this task, were published, which initially relied on hashing (e.g., Eland, SOAP, or Maq). In an effort to reduce the memory requirement for sequence alignment, several alignment programs were developed (Bowtie, BWA, and SOAP2) which use the Burrows–Wheeler transform.
External links
- Compression comparison of BWT based file compressors
- Article by Mark Nelson on the BWT
- A Bijective String-Sorting Transform, by Gil and Scott
- On Bijective Variants of the Burrows–Wheeler Transform, by Kufleitner
- Blog post and project page for an open-source compression program and library based on the Burrows–Wheeler algorithm