Hirschberg's algorithm
Encyclopedia
Hirschberg's algorithm, named after its inventor, Dan Hirschberg
Dan Hirschberg
Daniel S. Hirschberg is a full professor in Computer Science at University of California, Irvine. His research interests are in the theory of design and analysis of algorithms....

, is a dynamic programming
Dynamic programming
In mathematics and computer science, dynamic programming is a method for solving complex problems by breaking them down into simpler subproblems. It is applicable to problems exhibiting the properties of overlapping subproblems which are only slightly smaller and optimal substructure...

 algorithm
Algorithm
In mathematics and computer science, an algorithm is an effective method expressed as a finite list of well-defined instructions for calculating a function. Algorithms are used for calculation, data processing, and automated reasoning...

 that finds the least cost sequence alignment
Sequence alignment
In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are...

 between two string
String (computer science)
In formal languages, which are used in mathematical logic and theoretical computer science, a string is a finite sequence of symbols that are chosen from a set or alphabet....

s, where cost is measured as Levenshtein distance
Levenshtein distance
In information theory and computer science, the Levenshtein distance is a string metric for measuring the amount of difference between two sequences...

, defined to be the sum of the costs of insertions, replacements, deletions, and null actions needed to change one string to the other. Hirschberg's algorithm is simply described as a divide and conquer
Divide and conquer algorithm
In computer science, divide and conquer is an important algorithm design paradigm based on multi-branched recursion. A divide and conquer algorithm works by recursively breaking down a problem into two or more sub-problems of the same type, until these become simple enough to be solved directly...

 version of the Needleman-Wunsch algorithm
Needleman-Wunsch algorithm
The Needleman–Wunsch algorithm performs a global alignment on two sequences . It is commonly used in bioinformatics to align protein or nucleotide sequences. The algorithm was published in 1970 by Saul B. Needleman and Christian D...

. Hirschberg's algorithm is commonly used in computational biology
Computational biology
Computational biology involves the development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems...

 to find maximal global alignments of DNA
DNA
Deoxyribonucleic acid is a nucleic acid that contains the genetic instructions used in the development and functioning of all known living organisms . The DNA segments that carry this genetic information are called genes, but other DNA sequences have structural purposes, or are involved in...

 and protein
Protein
Proteins are biochemical compounds consisting of one or more polypeptides typically folded into a globular or fibrous form, facilitating a biological function. A polypeptide is a single linear polymer chain of amino acids bonded together by peptide bonds between the carboxyl and amino groups of...

 sequences.

Algorithm information

Hirschberg's algorithm is a generally applicable algorithm for finding an optimal sequence alignment. BLAST
BLAST
In bioinformatics, Basic Local Alignment Search Tool, or BLAST, is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences...

 and FASTA
FASTA
FASTA is a DNA and protein sequence alignment software package first described by David J. Lipman and William R. Pearson in 1985. Its legacy is the FASTA format which is now ubiquitous in bioinformatics.- History :...

 are suboptimal heuristics. If x and y are strings, where length(x) = n and length(y) = m, the Needleman-Wunsch algorithm
Needleman-Wunsch algorithm
The Needleman–Wunsch algorithm performs a global alignment on two sequences . It is commonly used in bioinformatics to align protein or nucleotide sequences. The algorithm was published in 1970 by Saul B. Needleman and Christian D...

 finds an optimal alignment in O
Big O notation
In mathematics, big O notation is used to describe the limiting behavior of a function when the argument tends towards a particular value or infinity, usually in terms of simpler functions. It is a member of a larger family of notations that is called Landau notation, Bachmann-Landau notation, or...

(nm) time, using O(nm) space. Hirschberg's algorithm is a clever modification of the Needleman-Wunsch Algorithm which still takes O(nm) time, but needs only O(min{n,m}) space.

One application of the algorithm is finding sequence alignments of DNA or protein sequences. It is also a space-efficient way to calculate the longest common subsequence
Longest common subsequence problem
The longest common subsequence problem is to find the longest subsequence common to all sequences in a set of sequences . Note that subsequence is different from a substring, see substring vs. subsequence...

 between two sets of data such as with the common diff
Diff
In computing, diff is a file comparison utility that outputs the differences between two files. It is typically used to show the changes between one version of a file and a former version of the same file. Diff displays the changes made per line for text files. Modern implementations also...

 tool.

Computation of the Levenshtein edit distance in linear space

The edit distance Edit(x,y) is the least cost of changing x into y by using the operations Insert, Substitute, and Delete, where the cost of each kind of operation is given. We write Ins(a) for the cost of
inserting a symbol a into a string, Sub(a,b) for the cost of substituting a symbol b for a in a string, and Del(a) for the cost of deleting a symbol a from a string. The "standard" choice of those costs is Ins(a) = Del(a) = 1 for each symbol a, Sub(a,a) = 0, and Sub(a,b) = 1 if a is not equal to b.
The Needleman-Wunsch algorithm computes Edit(x,y) by computing Edit(Prefix[x,i],Prefix[y,j]) for all i and j dynamically, where Prefix[x,i] denotes the prefix of x of length i. That algorithm requires O(nm) time and O(nm) space, where n = length(x) and m = length(y).

Algorithm organization

To understand Hirschberg's algorithm, it is important to first understand
that edit distances can be computed using linear space.

What we call the "forward subprogram" computes the values of
Edit(Prefix[x,i],Prefix[y,j]) for all i and j, just as the Needleman-Wunsch
and returns the array {Edit(x,Prefix[y,j])}0 ≤ j ≤ m. The
"backward subprogram" is similar, except that the dynamic program
is done in the opposite direction, i.e., starting from the right ends
of the strings. It returns the array {Edit(x,Suffix[y,j])}0 ≤ j ≤ m,
where Suffix[y,j] is the suffix of y of length j.

The forward and backward subprograms compute values in the matrix by using
previously computed values, but save space by erasing values that will no
longer be needed during that execution of the subprogram. Unfortunately,
the erased values will need to be recomputed later; thus, Hirschberg's
algorithm takes roughly twice as much time as Needleman-Wunsch.

Pseudocode

High-level description of the forwards subprogram

Forwards[x,y] is

1. n = length(x); m = length(y)
2. Edit[Prefix[0,0]] = 0;
3. For all j from 1 to m:
Edit[Prefix[x,0],Prefix[y,j]] = Edit[Prefix[x,0],Prefix[y,j-1]] + Ins(y_j)
4. For all i from 1 to n:
A. Edit[Prefix[x,i],Prefix[y,0]] = Edit[Prefix[x,i-1],Prefix[y,0]] + Del(x_i)
B. For all j from 1 to m, execute the following steps:
i. Edit[Prefix[x,i],Prefix[y,j]] =
min{Edit[Prefix[x,i-1],Prefix[y,j]] + Del(x_i),
Edit[Prefix[x,i-1],Prefix[y,j-1]] + Sub(x_i,y_j),
Edit[Prefix[x,i],Prefix[y,j-1]] + Ins(y_j)}
ii. Erase Edit[Prefix[x,i-1],Prefix[y,j-1]]
C. Erase Edit[Prefix[x,i-1],Prefix[y,m]]
5. RETURN Edit[x] %% an array of length m+1


High-level description of the backwards subprogram


Backwards[x,y] is

1. n = length(x); m = length(y)
2. Edit[Suffix[0,0]] = 0;
3. For all j from 1 to m:
Edit[Suffix[x,0],Suffix[y,j]] = Edit[Suffix[x,0].Suffix[y,j-1]] + Ins(y_{m-j+1});
4. For all i from 1 to n:
A. Edit[Suffix[x,i],Suffix[y,0]] = Edit[Suffix[x,i-1],Suffix[y,0]] + Del(x_{n-i-1})
B. For all j from 1 to m:
i. Edit[Suffix[x,i],Suffix[y,j]] =
min{Edit[Suffix[x,i-1],Suffix[y,j]] + Del(x_{n-i-1}),
Edit[Suffix[x,i-1],Suffix[y,j-1]] + Sub(x_{n-i-1},y_{m-j+1}),
Edit[Suffix[x,i],Suffix[y,j-1]] + Ins(y_{m-j+1})}
ii. Erase Edit[Suffix[x,i-1],Suffix[y,j-1]]
C. Erase Edit[Suffix[x,i-1],Suffix[y,m]]
5. RETURN Edit[x] %% an array of length m+1


High level description of Hirschberg's algorithm:


Hirschberg(x,y) is

1. n = length(x); m = length(y)
2. If n <= 1 or m <= 1:
OUTPUT Alignment(x,y) using Needleman-Wunsch.
Else:
A. mid = floor(n/2)
B. x_left = Prefix[x,mid]
C. x_right = Suffix[x,n-mid]
D. Edit[x_left] = Forwards(x_left,y) %% an array of length m+1
E. Edit[x_right] = Backwards(x_right,y) %% an array of length m+1
F. cut = ArgMin{Edit[x_left,Prefix[y,j]] + Edit[x_right,Suffix[y,m-j]]} %% j ranges from 1 to m-1
G. Hirschberg(x_left,Prefix[y,cut])
H. Hirschberg(x_right,Suffix[y,m-cut])

See also

  • Needleman-Wunsch algorithm
    Needleman-Wunsch algorithm
    The Needleman–Wunsch algorithm performs a global alignment on two sequences . It is commonly used in bioinformatics to align protein or nucleotide sequences. The algorithm was published in 1970 by Saul B. Needleman and Christian D...

  • Smith Waterman algorithm
    Smith Waterman algorithm
    The Smith–Waterman algorithm is a well-known algorithm for performing local sequence alignment; that is, for determining similar regions between two nucleotide or protein sequences...

  • Levenshtein distance
    Levenshtein distance
    In information theory and computer science, the Levenshtein distance is a string metric for measuring the amount of difference between two sequences...

  • Longest Common Subsequence
    Longest common subsequence problem
    The longest common subsequence problem is to find the longest subsequence common to all sequences in a set of sequences . Note that subsequence is different from a substring, see substring vs. subsequence...

The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK