GeneMark
GeneMark is a family of gene prediction programs developed at the Georgia Institute of Technology in Atlanta. First developed in 1993, GeneMark was the first gene finding method recognized as an efficient and accurate tool for genome projects. GeneMark was used for annotation of the first completely sequenced bacterium, Haemophilus influenzae, and the first completely sequenced archaeon, Methanococcus jannaschii. The GeneMark algorithm uses species-specific inhomogeneous Markov chain models of protein-coding DNA sequence as well as homogeneous Markov chain models of non-coding DNA. Parameters of the models are estimated from training sets of sequences of a known type. The major step of the algorithm computes the a posteriori probability that a sequence fragment carries genetic code in one of six possible frames (including three frames on the complementary DNA strand) or is "non-coding".
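
The posterior computation described above can be illustrated with a minimal Python sketch. It is not the GeneMark implementation: it assumes first-order chains, parameters supplied by the caller, and a non-empty sequence over A, C, G, T. The function names, the parameter layout (`init`, `trans`) and the hypothesis labels (`+1` to `-3`, `non-coding`) are hypothetical choices made for this example.

```python
import math

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def log_lik_coding(seq, init, trans, frame):
    """Log-likelihood under a 3-periodic (inhomogeneous) first-order chain.

    init[p][b]     : probability of base b at codon position p (0, 1, 2)
    trans[p][a][b] : probability of base b at codon position p, given
                     that the previous base was a
    frame          : codon position assigned to seq[0]
    """
    pos = frame
    ll = math.log(init[pos][seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        pos = (pos + 1) % 3
        ll += math.log(trans[pos][prev][cur])
    return ll


def log_lik_noncoding(seq, init, trans):
    """Log-likelihood under a homogeneous first-order chain."""
    ll = math.log(init[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        ll += math.log(trans[prev][cur])
    return ll


def frame_posteriors(seq, coding, noncoding, priors):
    """Posterior probability of each of seven hypotheses via Bayes' rule.

    coding    : (init, trans) for the 3-periodic coding model
    noncoding : (init, trans) for the homogeneous non-coding model
    priors    : prior probability for each hypothesis label
    """
    rc = reverse_complement(seq)
    log_lik = {"non-coding": log_lik_noncoding(seq, *noncoding)}
    for f in range(3):
        # Three frames on the given strand, three on the complementary one.
        log_lik[f"+{f + 1}"] = log_lik_coding(seq, *coding, frame=f)
        log_lik[f"-{f + 1}"] = log_lik_coding(rc, *coding, frame=f)
    scores = {h: ll + math.log(priors[h]) for h, ll in log_lik.items()}
    top = max(scores.values())  # subtract the maximum for numerical stability
    norm = sum(math.exp(s - top) for s in scores.values())
    return {h: math.exp(s - top) / norm for h, s in scores.items()}
```

In practice such a function would be applied to a window sliding along the sequence, so that each position receives seven posterior values. The window length and step are further choices this sketch leaves open.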

Prokaryotic

The GeneMark.hmm algorithm was designed to improve gene prediction quality by finding exact gene starts. The idea was to integrate the GeneMark models into a naturally designed hidden Markov model framework, with gene boundaries modeled as transitions between hidden states. Additionally, a ribosome binding site model is used to make the gene-start predictions more accurate. In evaluations by different groups, GeneMark.hmm was shown to be significantly more accurate than GeneMark in exact gene prediction. Since 1998, GeneMark.hmm and its self-training version GeneMarkS have been the standard tools for gene identification in new prokaryotic genomic sequences, including metagenomes.
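
The idea of reading gene boundaries off transitions between hidden states can be shown with a toy Viterbi decoder in Python. This is a sketch under strong simplifying assumptions, not the GeneMark.hmm algorithm: each state emits single nucleotides, there is no explicit state-duration model, and no ribosome binding site model is included. The function names and the parameter layout are hypothetical.

```python
def viterbi(obs, states, log_start, log_trans, log_emit):
    """Most likely hidden-state path for an observed sequence.

    log_start[s]    : log-probability of starting in state s
    log_trans[q][s] : log-probability of moving from state q to state s
    log_emit[s][x]  : log-probability that state s emits symbol x
    """
    # best[s] is the best log-probability of any path ending in state s.
    best = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    back = []  # back[i][s] is the best predecessor of state s at step i + 1
    for sym in obs[1:]:
        prev = best
        best, ptr = {}, {}
        for s in states:
            q = max(states, key=lambda r: prev[r] + log_trans[r][s])
            best[s] = prev[q] + log_trans[q][s] + log_emit[s][sym]
            ptr[s] = q
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


def boundaries(path):
    """Positions where the decoded state changes, i.e. predicted boundaries."""
    return [i for i in range(1, len(path)) if path[i] != path[i - 1]]
```

With two states (for example one "coding" and one "non-coding" state), `boundaries(viterbi(...))` returns the positions at which the decoded path switches state, which is the sense in which an HMM framework yields exact start and end coordinates rather than per-window scores.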

Eukaryotic

After the prokaryotic version of GeneMark.hmm was developed, the approach was extended to eukaryotic genomes, where accurate prediction of protein-coding exon boundaries presented a major challenge. The hidden Markov model architecture of eukaryotic GeneMark.hmm consists of hidden states for initial, internal, and terminal exons, introns, intergenic regions, and single-exon genes located on both DNA strands. It also includes hidden states for the initiation site and termination site, as well as donor and acceptor splice sites. GeneMark.hmm has been frequently used for annotation of plant and animal genomes.

Heuristic Models

To find genes accurately in DNA sequences using computers, models of protein-coding and non-coding regions are required, derived either from experimentally validated training sets or from large amounts of anonymous DNA sequence. A heuristic method for deriving the parameters of inhomogeneous Markov models of protein-coding regions was proposed in 1999. This heuristic uses the observation that the parameters of the Markov models used in GeneMark can be approximated by functions of the sequence G+C content. Therefore, a short DNA sequence sufficient for estimation of the genome G+C content (a fragment longer than 400 nucleotides) is also sufficient for derivation of parameters of the Markov models used in GeneMark and GeneMark.hmm.
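
The first step of the heuristic, estimating G+C content from a short fragment and mapping it to model parameters, can be sketched in Python as follows. The text above says only that the parameters are approximated by functions of G+C content; the linear form used here, the `coeffs` layout and all function names are assumptions made for illustration, and no published coefficient values are reproduced.

```python
MIN_FRAGMENT_LENGTH = 400  # the text asks for a fragment longer than this


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of a DNA string."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no A, C, G or T")
    return sum(b in "GC" for b in bases) / len(bases)


def is_long_enough(seq):
    """True if the fragment exceeds the 400-nucleotide threshold."""
    return len(seq) > MIN_FRAGMENT_LENGTH


def positional_frequencies(gc, coeffs):
    """Approximate nucleotide frequencies at each codon position.

    coeffs[p][b] = (slope, intercept) is a HYPOTHETICAL linear fit of the
    frequency of base b at codon position p against G+C content; the
    caller must supply real fitted values.
    """
    model = []
    for pos_coeffs in coeffs:
        # Clamp to a small positive value so every frequency stays valid.
        raw = {b: max(slope * gc + intercept, 1e-6)
               for b, (slope, intercept) in pos_coeffs.items()}
        total = sum(raw.values())
        model.append({b: v / total for b, v in raw.items()})
    return model
```

The design point is that the only quantity measured from the input fragment is a single number, `gc`, which is why a few hundred nucleotides are enough; everything else comes from fits made once in advance.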

Models built by the heuristic approach can be used to find genes in small fragments of anonymous prokaryotic genomes, such as metagenomic sequences, as well as in genomes of organelles, viruses, phages and plasmids. This method can also be used for highly inhomogeneous genomes, where the Markov models must be adjusted to account for local DNA composition. The heuristic method provides evidence that the mutational pressure that shapes G+C content is the driving force in the evolution of codon usage patterns.
The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 