Similarity matrix
Encyclopedia
A similarity matrix is a matrix
of scores which express the similarity between two data points. Similarity matrices are strongly related to their counterparts, distance matrices
and substitution matrices
.
. Higher scores are given to more-similar characters, and lower or negative scores for dissimilar characters.
Nucleotide
similarity matrices are used to align nucleic acid
sequences. Because there are only four nucleotides commonly found in DNA
(Adenine
(A), Cytosine
(C), Guanine
(G) and Thymine
(T)), nucleotide similarity matrices are much simpler than protein
similarity matrices. For example, a simple matrix will assign identical bases a score of +1 and non-identical bases a score of −1. A more complicated matrix would give a higher score to transitions (changes from a pyrimidine
such as C or T to another pyrimidine, or from a purine
such as A or G to another purine) than to transversions (from a pyrimidine to a purine or vice versa).
The match/mismatch ratio of the matrix sets the target evolutionary distance. The +1/−3 DNA matrix used by BLASTN is best suited for finding matches between sequences that are 99% identical; a +1/−1 (or +4/−4) matrix is much more suited to sequences with about 70% similarity. Matrices for lower similarity sequences require longer sequence alignments.
Amino acid
similarity matrices are more complicated, because there are 20 amino acids coded for by the genetic code
. Therefore, the similarity matrix for amino acids contains 400 entries (although it is usually symmetric). The first approach scored all amino acid changes equally. A later refinement was to determine amino acid similarities based on how many base changes were required to change a codon to code for that amino acid. This model is better, but it doesn't take into account the selective pressure of amino acid changes. Better models took into account the chemical properties of amino acids.
One approach has been to empirically generate the similarity matrices. The Dayhoff
method used phylogenetic trees and sequences taken from species on the tree. This approach has given rise to the PAM
series of matrices. PAM matrices are labelled based on how many nucleotide changes have occurred, per 100 amino acids.
While the PAM matrices benefit from having a well understood evolutionary model, they are most useful at short evolutionary distances (PAM10 - PAM120). At long evolutionary distances, for example PAM250 or 20% identity, it has been shown that the BLOSUM
matrices are much more effective.
The BLOSUM series were generated by comparing a number of divergent sequences. The BLOSUM series are labeled based on how much entropy remains unmutated between all sequences, so a lower BLOSUM number corresponds to a higher PAM number.
Matrix (mathematics)
In mathematics, a matrix is a rectangular array of numbers, symbols, or expressions. The individual items in a matrix are called its elements or entries. An example of a matrix with six elements isMatrices of the same size can be added or subtracted element by element...
of scores which express the similarity between two data points. Similarity matrices are strongly related to their counterparts, distance matrices
Distance matrix
In mathematics, computer science and graph theory, a distance matrix is a matrix containing the distances, taken pairwise, of a set of points...
and substitution matrices
Substitution matrix
In bioinformatics and evolutionary biology, a substitution matrix describes the rate at which one character in a sequence changes to other character states over time...
.
Use in sequence alignment
Similarity matrices are used in sequence alignmentSequence alignment
In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are...
. Higher scores are given to more-similar characters, and lower or negative scores for dissimilar characters.
Nucleotide
Nucleotide
Nucleotides are molecules that, when joined together, make up the structural units of RNA and DNA. In addition, nucleotides participate in cellular signaling , and are incorporated into important cofactors of enzymatic reactions...
similarity matrices are used to align nucleic acid
Nucleic acid
Nucleic acids are biological molecules essential for life, and include DNA and RNA . Together with proteins, nucleic acids make up the most important macromolecules; each is found in abundance in all living things, where they function in encoding, transmitting and expressing genetic information...
sequences. Because there are only four nucleotides commonly found in DNA
DNA
Deoxyribonucleic acid is a nucleic acid that contains the genetic instructions used in the development and functioning of all known living organisms . The DNA segments that carry this genetic information are called genes, but other DNA sequences have structural purposes, or are involved in...
(Adenine
Adenine
Adenine is a nucleobase with a variety of roles in biochemistry including cellular respiration, in the form of both the energy-rich adenosine triphosphate and the cofactors nicotinamide adenine dinucleotide and flavin adenine dinucleotide , and protein synthesis, as a chemical component of DNA...
(A), Cytosine
Cytosine
Cytosine is one of the four main bases found in DNA and RNA, along with adenine, guanine, and thymine . It is a pyrimidine derivative, with a heterocyclic aromatic ring and two substituents attached . The nucleoside of cytosine is cytidine...
(C), Guanine
Guanine
Guanine is one of the four main nucleobases found in the nucleic acids DNA and RNA, the others being adenine, cytosine, and thymine . In DNA, guanine is paired with cytosine. With the formula C5H5N5O, guanine is a derivative of purine, consisting of a fused pyrimidine-imidazole ring system with...
(G) and Thymine
Thymine
Thymine is one of the four nucleobases in the nucleic acid of DNA that are represented by the letters G–C–A–T. The others are adenine, guanine, and cytosine. Thymine is also known as 5-methyluracil, a pyrimidine nucleobase. As the name suggests, thymine may be derived by methylation of uracil at...
(T)), nucleotide similarity matrices are much simpler than protein
Protein
Proteins are biochemical compounds consisting of one or more polypeptides typically folded into a globular or fibrous form, facilitating a biological function. A polypeptide is a single linear polymer chain of amino acids bonded together by peptide bonds between the carboxyl and amino groups of...
similarity matrices. For example, a simple matrix will assign identical bases a score of +1 and non-identical bases a score of −1. A more complicated matrix would give a higher score to transitions (changes from a pyrimidine
Pyrimidine
Pyrimidine is a heterocyclic aromatic organic compound similar to benzene and pyridine, containing two nitrogen atoms at positions 1 and 3 of the six-member ring...
such as C or T to another pyrimidine, or from a purine
Purine
A purine is a heterocyclic aromatic organic compound, consisting of a pyrimidine ring fused to an imidazole ring. Purines, including substituted purines and their tautomers, are the most widely distributed kind of nitrogen-containing heterocycle in nature....
such as A or G to another purine) than to transversions (from a pyrimidine to a purine or vice versa).
The match/mismatch ratio of the matrix sets the target evolutionary distance. The +1/−3 DNA matrix used by BLASTN is best suited for finding matches between sequences that are 99% identical; a +1/−1 (or +4/−4) matrix is much more suited to sequences with about 70% similarity. Matrices for lower similarity sequences require longer sequence alignments.
Amino acid
Amino acid
Amino acids are molecules containing an amine group, a carboxylic acid group and a side-chain that varies between different amino acids. The key elements of an amino acid are carbon, hydrogen, oxygen, and nitrogen...
similarity matrices are more complicated, because there are 20 amino acids coded for by the genetic code
Genetic code
The genetic code is the set of rules by which information encoded in genetic material is translated into proteins by living cells....
. Therefore, the similarity matrix for amino acids contains 400 entries (although it is usually symmetric). The first approach scored all amino acid changes equally. A later refinement was to determine amino acid similarities based on how many base changes were required to change a codon to code for that amino acid. This model is better, but it doesn't take into account the selective pressure of amino acid changes. Better models took into account the chemical properties of amino acids.
One approach has been to empirically generate the similarity matrices. The Dayhoff
Margaret Oakley Dayhoff
Dr. Margaret Belle Dayhoff was an American physical chemist and a pioneer in the field of bioinformatics...
method used phylogenetic trees and sequences taken from species on the tree. This approach has given rise to the PAM
Point accepted mutation
Point accepted mutation , is a set of matrices used to score sequence alignments. The PAM matrices were introduced by Margaret Dayhoff in 1978 based on 1572 observed mutations in 71 families of closely related proteins...
series of matrices. PAM matrices are labelled based on how many nucleotide changes have occurred, per 100 amino acids.
While the PAM matrices benefit from having a well understood evolutionary model, they are most useful at short evolutionary distances (PAM10 - PAM120). At long evolutionary distances, for example PAM250 or 20% identity, it has been shown that the BLOSUM
BLOSUM
The BLOSUM matrix is a substitution matrix used for sequence alignment of proteins. BLOSUM matrices are used to score alignments between evolutionarily divergent protein sequences. They are based on local alignments. BLOSUM matrices were first introduced in a paper by Henikoff and Henikoff...
matrices are much more effective.
The BLOSUM series were generated by comparing a number of divergent sequences. The BLOSUM series are labeled based on how much entropy remains unmutated between all sequences, so a lower BLOSUM number corresponds to a higher PAM number.
See also
- Recurrence plotRecurrence plotIn descriptive statistics and chaos theory, a recurrence plot is a plot showing, for a given moment in time, the times at which a phase space trajectory visits roughly the same area in the phase space...
, a powerful visualisation tool of recurrences in dynamical (and other) systems. - Similarity testing