Longest common substring problem - AbsoluteAstronomy.com

The longest common substring problem is to find the longest string

String (computer science)

In formal languages, which are used in mathematical logic and theoretical computer science, a string is a finite sequence of symbols that are chosen from a set or alphabet....

(or strings) that is a substring

Substring

A subsequence, substring, prefix or suffix of a string is a subset of the symbols in a string, where the order of the elements is preserved...

(or are substrings) of two or more strings. It should not be confused with the longest common subsequence problem

Longest common subsequence problem

The longest common subsequence problem is to find the longest subsequence common to all sequences in a set of sequences . Note that subsequence is different from a substring, see substring vs. subsequence...

. (For an explanation of the difference between a substring and a subsequence, see Substring vs. subsequence).

Example

The longest common substrings of the strings "ABAB", "BABA" and "ABBA" are the strings "AB" and "BA" of length 2. Other common substrings are "A", and "B".

ABAB
|||
BABA
||
ABBA

Problem definition

Given two strings,

of length

and

of length

, find the longest strings which are substrings of both

and

.

A generalisation is the k-common substring problem. Given the set of strings

, where

and Σ

. Find for each 2 ≤

≤

, the longest strings which occur as substrings of at least

strings.

Algorithms

One can find the lengths and starting positions of the longest common substrings of

and

with the help of a generalised suffix tree

Generalised suffix tree

A generalised suffix tree is a suffix tree for a set of strings. Given the set of strings D=S_1,S_2,\dots,S_d of total length n, it is a Patricia tree containing all n suffixes of the strings...

. Finding them by dynamic programming

Dynamic programming

In mathematics and computer science, dynamic programming is a method for solving complex problems by breaking them down into simpler subproblems. It is applicable to problems exhibiting the properties of overlapping subproblems which are only slightly smaller and optimal substructure...

costs

. The solutions to the generalised problem take

and

·...·

time.

Suffix tree

The longest common substrings of a set of strings can be found by building a generalised suffix tree

Generalised suffix tree

A generalised suffix tree is a suffix tree for a set of strings. Given the set of strings D=S_1,S_2,\dots,S_d of total length n, it is a Patricia tree containing all n suffixes of the strings...

for the strings, and then finding the deepest internal nodes which have leaf nodes from all the strings in the subtree below it. The figure on the right is the suffix tree for the strings "ABAB", "BABA" and "ABBA", padded with unique string terminators, to become "ABAB$0", "BABA$1" and "ABBA$2". The nodes representing "A", "B", "AB" and "BA" all have descendant leaves from all of the strings, numbered 0, 1 and 2.

Building the suffix tree takes

time (if the size of the alphabet is constant). If the tree is traversed from the bottom up with a bit vector telling which strings are seen below each node, the k-common substring problem can be solved in

time. If the suffix tree is prepared for constant time lowest common ancestor

Lowest common ancestor

The lowest common ancestor is a concept in graph theory and computer science. Let T be a rooted tree with n nodes. The lowest common ancestor is defined between two nodes v and w as the lowest node in T that has both v and w as descendants .The LCA of v and w in T is the shared ancestor of v...

retrieval, it can be solved in

time.

Dynamic programming

First find the longest common suffix for all pairs of prefixes of the strings. The longest common suffix is

For the example strings "ABAB" and "BABA":

	\|A	\|B	\|A	\|B
	0	0	0	0
B	0	1	0	1
A	1	0	2	0
B	0	2	0	3
A	1	0	3	0

The maximal of these longest common suffixes of possible prefixes must be the longest common substrings of S and T. These are shown on diagonals, in red, in the table. For this example, the longest common substrings are "BAB" and "ABA".

This can be extended to more than two strings by adding more dimensions to the table.

Pseudocode

The following pseudocode finds the set of longest common substrings between two strings with dynamic programming:

function LCSubstr(S[1..m], T[1..n])
L := array(1..m, 1..n)
z := 0
ret := {}
for i := 1..m
for j := 1..n
if S[i] = T[j]
if i = 1 or j = 1
L[i,j] := 1
else
L[i,j] := L[i-1,j-1] + 1
if L[i,j] > z
z := L[i,j]
ret := {}
if L[i,j] = z
ret := ret ∪ {S[i-z+1..z]}
else L[i,j]=0;
return ret

This algorithm runs in

time. The variable z is used to hold the length of the longest common substring found so far. The set ret is used to hold the set of strings which are of length z. The set ret can be saved efficiently by just storing the index i, which is the last character of the longest common substring (of size z) instead of S[i-z+1..z]. Thus all the longest common substrings would be, for each i in ret, S[(ret[i]-z)..(ret[i])].

The following tricks can be used to reduce the memory usage of an implementation:

Keep only the last and current row of the DP table to save memory ( instead of )
Store only non-zero values in the rows. This can be done using hash tables instead of arrays. This is useful for large alphabets.

External links

The source of this article is wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.