Multiple EM for Motif Elicitation
Encyclopedia
Multiple EM for Motif Elicitation or MEME is a tool for discovering motifs in a group of related DNA
or protein
sequences.
A motif
is a sequence pattern that occurs repeatedly in a group of related protein or DNA sequences. MEME represents motifs as position-dependent letter-probability matrices which describe the probability of each possible letter at each position in the pattern. Individual MEME motifs do not contain gaps. Patterns with variable-length gaps are split by MEME into two or more separate motifs.
MEME takes as input a group of DNA or protein sequences (the training set) and outputs as many motifs as requested. It uses statistical modeling techniques to automatically choose the best width, number of occurrences, and description for each motif.
However, one often doesn't know where the starting position is. Several possibilities exist:
Now one counts the number of nucleotides contained in all sequences:
Now one needs to sum up the total: 7+3+12+5 = 27; this gives us a "dividing factor" for each base, or the equivalent probability of each nucleotides.
A: 7/27 = 0.26
C: 3/27 = 0.11
G: 12/27 = 0.44
T: 5/27 = 0.19
Now one can "redo" the weight matrix (WM) by dividing it by the total number of sequences (in our case 3):
A: 0.33 0.66 0.00 0.00 0.00 0.66 0.66 0.00 0.00
C: 0.66 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.33
G: 0.00 0.33 1.00 1.00 0.00 0.33 0.00 1.00 0.33
T: 0.00 0.00 0.00 0.00 1.00 0.00 0.33 0.00 0.33
Next, one divides the entries of the WM at position xi with the probability of the base x.
A: 1.27 2.30 0.00 0.00 0.00 2.30 2.30 0.00 0.00
C: 6.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00
G: 0.00 0.75 2.27 2.27 0.00 0.75 0.00 2.27 0.75
T: 0.00 0.00 0.00 0.00 5.26 0.00 1.74 0.00 1.74
In general one would now multiply the probabilities. In our case one would have zero for every one. Due to this we take the logarithm and define log(0)=(-10):
This is our new weight matrix (WM). One is ready to use an example of a promoter sequence to determine its score. To do this, one has to add the numbers found at the position xi of the logarithmic WM.
For instance, if one takes the AGGCTGATC promoter:
0.10 - 0.1 + 0.36 - 10 + 0.72 - 0.1 + 0.36 - 10 + 0.48 = -18.18
This is then divided by the number of entries (in our case 9) yielding a score of -2.02.
DNA
Deoxyribonucleic acid is a nucleic acid that contains the genetic instructions used in the development and functioning of all known living organisms . The DNA segments that carry this genetic information are called genes, but other DNA sequences have structural purposes, or are involved in...
or protein
Protein
Proteins are biochemical compounds consisting of one or more polypeptides typically folded into a globular or fibrous form, facilitating a biological function. A polypeptide is a single linear polymer chain of amino acids bonded together by peptide bonds between the carboxyl and amino groups of...
sequences.
A motif
Sequence motif
In genetics, a sequence motif is a nucleotide or amino-acid sequence pattern that is widespread and has, or is conjectured to have, a biological significance...
is a sequence pattern that occurs repeatedly in a group of related protein or DNA sequences. MEME represents motifs as position-dependent letter-probability matrices which describe the probability of each possible letter at each position in the pattern. Individual MEME motifs do not contain gaps. Patterns with variable-length gaps are split by MEME into two or more separate motifs.
MEME takes as input a group of DNA or protein sequences (the training set) and outputs as many motifs as requested. It uses statistical modeling techniques to automatically choose the best width, number of occurrences, and description for each motif.
Definition
What the MEME algorithms actually does can be understood from two different perspectives. From a biological point of view, MEME identifies and characterizes shared motifs in a set of unaligned sequences. From the computer science aspect, MEME finds a set of non-overlapping, approximately matching substrings given a starting set of strings.Use
With MEME one can find similar biological functions and structures in different sequences. One has to take into account that the sequences variation can be significant and that the motifs are sometimes very small. It is also useful to take into account that the binding sites for proteins are very specific. This makes it easier to reduce wet-lab experiments (reduces costs and time). Indeed to better discover the motifs relevant for a biological point of view one has to carefully choose:- The best width of motifs.
- The number of occurrences in each sequence.
- The composition of each motif.
Algorithm Components
The algorithm uses several types of well know functions:- Expectation maximization (EM).
- EM based heuristic for choosing the EM starting point.
- Maximum likelihoodMaximum likelihoodIn statistics, maximum-likelihood estimation is a method of estimating the parameters of a statistical model. When applied to a data set and given a statistical model, maximum-likelihood estimation provides estimates for the model's parameters....
ratio based (LRT-based). Heuristic for determining the best number of model-free parameters. - Multi-start for searching over possible motif widths.
- Greedy search for finding multiple motifs.
However, one often doesn't know where the starting position is. Several possibilities exist:
- Exactly one motif per sequence.
- One or zero motif per sequence.
- Any number of motifs per sequence.
Example
In the following example, one has a weight matrix of 3 different sequences, without gaps.1: C G G G T A A G T |
2: A A G G T A T G C |
3: C A G G T G A G G |
Now one counts the number of nucleotides contained in all sequences:
A: 1 2 0 0 0 2 2 0 0 | 7 |
C: 2 0 0 0 0 0 0 0 1 | 3 |
G: 0 1 3 3 0 1 0 3 1 | 12 |
T: 0 0 0 0 3 0 1 0 1 | 5 |
Now one needs to sum up the total: 7+3+12+5 = 27; this gives us a "dividing factor" for each base, or the equivalent probability of each nucleotides.
A: 7/27 = 0.26
C: 3/27 = 0.11
G: 12/27 = 0.44
T: 5/27 = 0.19
Now one can "redo" the weight matrix (WM) by dividing it by the total number of sequences (in our case 3):
A: 0.33 0.66 0.00 0.00 0.00 0.66 0.66 0.00 0.00
C: 0.66 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.33
G: 0.00 0.33 1.00 1.00 0.00 0.33 0.00 1.00 0.33
T: 0.00 0.00 0.00 0.00 1.00 0.00 0.33 0.00 0.33
Next, one divides the entries of the WM at position xi with the probability of the base x.
A: 1.27 2.30 0.00 0.00 0.00 2.30 2.30 0.00 0.00
C: 6.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00
G: 0.00 0.75 2.27 2.27 0.00 0.75 0.00 2.27 0.75
T: 0.00 0.00 0.00 0.00 5.26 0.00 1.74 0.00 1.74
In general one would now multiply the probabilities. In our case one would have zero for every one. Due to this we take the logarithm and define log(0)=(-10):
A: 0.10 0.36 -10 -10 -10 0.36 0.36 -10 -10 |
C: 0.78 -10 -10 -10 -10 -10 -10 -10 0.48 |
G: -10 -0.1 0.36 0.36 -10 -0.1 -10 0.36 |
T: -10 -10 -10 -10 0.72 -10 0.24 -10 0.24 |
This is our new weight matrix (WM). One is ready to use an example of a promoter sequence to determine its score. To do this, one has to add the numbers found at the position xi of the logarithmic WM.
For instance, if one takes the AGGCTGATC promoter:
0.10 - 0.1 + 0.36 - 10 + 0.72 - 0.1 + 0.36 - 10 + 0.48 = -18.18
This is then divided by the number of entries (in our case 9) yielding a score of -2.02.
Shortcomings
The MEME algorithms has several drawbacks including:- Allowance for gaps/substitutions/insertions not included.
- Ability to test significance often not included.
- Erased input data each time a new motif is discovered (the algorithm assumes the new motif is correct).
- Limitation to two component case.
- Time complexity is high.
- Very pessimistic about alignment (which might lead to missed signals).
See also
- Sequence motifSequence motifIn genetics, a sequence motif is a nucleotide or amino-acid sequence pattern that is widespread and has, or is conjectured to have, a biological significance...
- Sequence alignmentSequence alignmentIn bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are...
- GPU Accelerated version of MEME
External links
- The MEME Suite — Motif-based sequence analysis tools