String metric
Encyclopedia
In mathematics
, string metrics (also known as similarity metrics) are a class of metric
that measure similarity or dissimilarity (distance
) between two text strings
for approximate string matching or comparison and in fuzzy string searching. For example the strings "Sam" and "Samuel" can be considered to be similar. A string metric provides a number indicating an algorithm-specific indication of similarity.
The most widely known string metric is a rudimentary one called the Levenshtein Distance
(also known as Edit Distance). It operates between two input strings, returning a score equivalent to the number of substitutions and deletions needed in order to transform one input string into another. Simplistic string metrics such as Levenshtein distance
have expanded to include phonetic, token, grammatical and character-based methods of statistical comparisons.
A widespread example of a string metric is DNA
sequence analysis
and RNA analysis, which are performed by optimised string metrics to identify matching sequences.
String metrics are used heavily in information integration
and are currently used in area including fraud detection, fingerprint analysis, plagiarism detection
, ontology merging
, DNA analysis, RNA analysis, image analysis
, evidence-based machine learning, database
data deduplication
, data mining
, Web interfaces, e.g. Ajax
-style suggestions as you type, data integration
, and semantic knowledge integration
.
Mathematics
Mathematics is the study of quantity, space, structure, and change. Mathematicians seek out patterns and formulate new conjectures. Mathematicians resolve the truth or falsity of conjectures by mathematical proofs, which are arguments sufficient to convince other mathematicians of their validity...
, string metrics (also known as similarity metrics) are a class of metric
Metric (mathematics)
In mathematics, a metric or distance function is a function which defines a distance between elements of a set. A set with a metric is called a metric space. A metric induces a topology on a set but not all topologies can be generated by a metric...
that measure similarity or dissimilarity (distance
Distance
Distance is a numerical description of how far apart objects are. In physics or everyday discussion, distance may refer to a physical length, or an estimation based on other criteria . In mathematics, a distance function or metric is a generalization of the concept of physical distance...
) between two text strings
String (computer science)
In formal languages, which are used in mathematical logic and theoretical computer science, a string is a finite sequence of symbols that are chosen from a set or alphabet....
for approximate string matching or comparison and in fuzzy string searching. For example the strings "Sam" and "Samuel" can be considered to be similar. A string metric provides a number indicating an algorithm-specific indication of similarity.
The most widely known string metric is a rudimentary one called the Levenshtein Distance
Levenshtein distance
In information theory and computer science, the Levenshtein distance is a string metric for measuring the amount of difference between two sequences...
(also known as Edit Distance). It operates between two input strings, returning a score equivalent to the number of substitutions and deletions needed in order to transform one input string into another. Simplistic string metrics such as Levenshtein distance
Levenshtein distance
In information theory and computer science, the Levenshtein distance is a string metric for measuring the amount of difference between two sequences...
have expanded to include phonetic, token, grammatical and character-based methods of statistical comparisons.
A widespread example of a string metric is DNA
DNA
Deoxyribonucleic acid is a nucleic acid that contains the genetic instructions used in the development and functioning of all known living organisms . The DNA segments that carry this genetic information are called genes, but other DNA sequences have structural purposes, or are involved in...
sequence analysis
Sequence analysis
In bioinformatics, the term sequence analysis refers to the process of subjecting a DNA, RNA or peptide sequence to any of a wide range of analytical methods to understand its features, function, structure, or evolution. Methodologies used include sequence alignment, searches against biological...
and RNA analysis, which are performed by optimised string metrics to identify matching sequences.
String metrics are used heavily in information integration
Information integration
Information integration is the merging of information from disparate sources with differing conceptual, contextual and typographical representations. It is used in data mining and consolidation of data from unstructured or semi-structured resources...
and are currently used in area including fraud detection, fingerprint analysis, plagiarism detection
Plagiarism detection
Plagiarism detection is the process of locating instances of plagiarism within a work or document. The widespread use of computers and the advent of the Internet has made it easier to plagiarize the work of others. Most cases of plagiarism are found in academia, where documents are typically essays...
, ontology merging
Ontology merging
Ontology merging defines the act of bringing together two conceptually divergent ontologies or the instance data associated to two ontologies. This is similar to work in database merging . This merging process can be performed in a number of ways, manually, semi automatically, or automatically...
, DNA analysis, RNA analysis, image analysis
Image analysis
Image analysis is the extraction of meaningful information from images; mainly from digital images by means of digital image processing techniques...
, evidence-based machine learning, database
Database
A database is an organized collection of data for one or more purposes, usually in digital form. The data are typically organized to model relevant aspects of reality , in a way that supports processes requiring this information...
data deduplication
Data deduplication
In computing, data deduplication is a specialized data compression technique for eliminating coarse-grained redundant data. The technique is used to improve storage utilization and can also be applied to network data transfers to reduce the number of bytes that must be sent across a link...
, data mining
Data mining
Data mining , a relatively young and interdisciplinary field of computer science is the process of discovering new patterns from large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics and database systems...
, Web interfaces, e.g. Ajax
Ajax (programming)
Ajax is a group of interrelated web development methods used on the client-side to create asynchronous web applications...
-style suggestions as you type, data integration
Data integration
Data integration involves combining data residing in different sources and providing users with a unified view of these data.This process becomes significant in a variety of situations, which include both commercial and scientific domains...
, and semantic knowledge integration
Knowledge integration
Knowledge integration is the process of synthesizing multiple knowledge models into a common model .Compared to information integration, which involves merging information having different schemas and representation models, knowledge integration focuses more on synthesizing the understanding of a...
.
List of string metrics
- Hamming distanceHamming distanceIn information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different...
- Levenshtein distanceLevenshtein distanceIn information theory and computer science, the Levenshtein distance is a string metric for measuring the amount of difference between two sequences...
and Damerau–Levenshtein distance - Needleman–WunschNeedleman–Wunsch algorithmThe Needleman–Wunsch algorithm performs a global alignment on two sequences . It is commonly used in bioinformatics to align protein or nucleotide sequences. The algorithm was published in 1970 by Saul B. Needleman and Christian D...
distance or Sellers' algorithm - Smith–Waterman distance
- FASTAFASTAFASTA is a DNA and protein sequence alignment software package first described by David J. Lipman and William R. Pearson in 1985. Its legacy is the FASTA format which is now ubiquitous in bioinformatics.- History :...
- BLASTBLASTIn bioinformatics, Basic Local Alignment Search Tool, or BLAST, is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences...
- FASTA
- Gotoh distance or Smith-Waterman-Gotoh distance
- Monge Elkan distance
- Block distance or L1 distance or City block distance
- Jaro–Winkler distance
- SoundexSoundexSoundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling. The algorithm mainly encodes consonants; a vowel will not be encoded unless...
distance metric - Matching coefficient
- Dice's coefficient
- Jaccard similarity or Jaccard coefficient or Tanimoto coefficient
- Tversky indexTversky indexThe Tversky index, named after Amos Tversky, is an asymmetric similarity measure that compares a variant to a prototype. The Tversky index can be seen as a generalization of Dice's coefficient and Tanimoto coefficient....
- Overlap coefficient
- Euclidean distanceEuclidean distanceIn mathematics, the Euclidean distance or Euclidean metric is the "ordinary" distance between two points that one would measure with a ruler, and is given by the Pythagorean formula. By using this formula as distance, Euclidean space becomes a metric space...
or L2 distance - Cosine similarityCosine similarityCosine similarity is a measure of similarity between two vectors by measuring the cosine of the angle between them. The cosine of 0 is 1, and less than 1 for any other angle. The cosine of the angle between two vectors thus determines whether two vectors are pointing in roughly the same...
- Variational distance
- Hellinger distanceHellinger distanceIn probability and statistics, the Hellinger distance is used to quantify the similarity between two probability distributions. It is a type of f-divergence...
or Bhattacharyya distanceBhattacharyya distanceIn statistics, the Bhattacharyya distance measures the similarity of two discrete or continuous probability distributions. It is closely related to the Bhattacharyya coefficient which is a measure of the amount of overlap between two statistical samples or populations. Both measures are named after A... - Information radius (Jensen–Shannon divergenceJensen–Shannon divergenceIn probability theory and statistics, the Jensen–Shannon divergence is a popular method of measuring the similarity between two probability distributions. It is also known as information radius or total divergence to the average. It is based on the Kullback–Leibler divergence, with the notable ...
) - Harmonic meanHarmonic meanIn mathematics, the harmonic mean is one of several kinds of average. Typically, it is appropriate for situations when the average of rates is desired....
- Skew divergence
- Confusion probability
- Tau metric, an approximation of the Kullback–Leibler divergenceKullback–Leibler divergenceIn probability theory and information theory, the Kullback–Leibler divergence is a non-symmetric measure of the difference between two probability distributions P and Q...
- Fellegi and Sunters metric (SFS)
- TFIDF or TF/IDF
- Maximal matches
- Lee distance
See also
- approximate string matching
- String matching
- SimMetricsSimMetricsSimMetrics is an open source extensible library of algorithms for calculating string metrics - measures of similarity or dissimilarity between two text strings...
- an implementation - Carnegie Mellon University open source library
External links
- http://www.dcs.shef.ac.uk/~sam/stringmetrics.html A fairly complete overview