SimMetrics
SimMetrics is an open source, extensible library of algorithms for calculating string metrics, that is, measures of similarity or dissimilarity between two text strings. SimMetrics was developed and released by Dr Sam Chapman at the University of Sheffield. It is licensed under the terms of the GNU General Public License.
The SimMetrics open source library includes the following metrics:
- Levenshtein distance,
- Block distance or city block distance or L1 distance,
- Cosine similarity,
- Needleman–Wunsch algorithm or Sellers algorithm,
- Smith–Waterman algorithm,
- Gotoh distance or Smith–Waterman–Gotoh distance,
- Monge–Elkan distance,
- Jaro–Winkler distance,
- Soundex,
- Matching coefficient,
- Dice's coefficient,
- Jaccard index, also known as Jaccard similarity, Jaccard coefficient or Tanimoto coefficient,
- Overlap coefficient,
- Euclidean distance,
- Q-gram distance,
- and more.
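As an illustration of the set-based entries in the list above, the following minimal sketch computes the Jaccard, Dice and overlap coefficients over word tokens. It is not SimMetrics code and does not use the SimMetrics API: the function names, the whitespace tokenisation and the handling of empty inputs are assumptions made for this example only.

```python
def tokens(text):
    """Lowercase the text and split it on whitespace into a set of tokens.

    Whitespace tokenisation is an assumption of this sketch; a library
    may use other tokenisers (for example q-grams).
    """
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard index: size of the intersection over size of the union."""
    x, y = tokens(a), tokens(b)
    if not x and not y:
        return 1.0  # assumption: two empty strings count as identical
    return len(x & y) / len(x | y)


def dice(a, b):
    """Dice's coefficient: twice the intersection over the summed set sizes."""
    x, y = tokens(a), tokens(b)
    if not x and not y:
        return 1.0  # assumption: two empty strings count as identical
    return 2 * len(x & y) / (len(x) + len(y))


def overlap(a, b):
    """Overlap coefficient: intersection over the size of the smaller set."""
    x, y = tokens(a), tokens(b)
    if not x or not y:
        return 1.0 if x == y else 0.0  # assumption for empty inputs
    return len(x & y) / min(len(x), len(y))


if __name__ == "__main__":
    s, t = "the quick fox", "the lazy fox"
    print(jaccard(s, t), dice(s, t), overlap(s, t))
```

All three functions return a value between 0.0 and 1.0, where 1.0 means the two token sets are identical (or, for the overlap coefficient, that one set contains the other).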
SimMetrics provides normalised floating-point similarity measures (from 0.0 to 1.0) between pairs of strings, as well as the unnormalised metric output.
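The difference between an unnormalised metric and a similarity score in the 0.0 to 1.0 range can be sketched with Levenshtein distance. The code below is an illustration only, not the SimMetrics implementation: the function names are hypothetical, and dividing by the length of the longer string is one common normalisation that is assumed here rather than taken from the library.

```python
def levenshtein(a, b):
    """Unnormalised Levenshtein distance: the minimum number of
    single-character insertions, deletions and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def levenshtein_similarity(a, b):
    """Similarity in [0.0, 1.0], where 1.0 means identical strings.

    Assumption: normalise by the length of the longer string.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))             # raw edit count
    print(levenshtein_similarity("kitten", "sitting"))  # normalised score
```

The raw distance grows with string length and is only comparable between pairs of similar size, whereas the normalised score can be compared across pairs or checked against a fixed threshold.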
SimMetrics has been reimplemented and expanded by the original authors as the new tool K-Integrate. K-Integrate is part of a commercial venture, the company Knowledge Now Limited. Unlike SimMetrics, this tool is available under a commercial license for commercial use.
External links
- SimMetrics: description of the SimMetrics project
- Second String: a similar library