Nucleic acid structure prediction
Encyclopedia
Nucleic acid structure prediction is a computational method to determine nucleic acid secondary and tertiary structure
Nucleic acid structure
Nucleic acid structure refers to the structure of nucleic acids such as DNA and RNA It is often divided into four different levels:* Primary structure—the raw sequence of nucleobases of each of the component DNA strands;...

 from its sequence. Secondary structure can be predicted from a single or from several nucleic acid sequences. Tertiary structure can be predicted from the sequence, or by comparative modeling (when the structure of a homologous sequence is known).

The problem of predicting nucleic acid secondary structure is dependent mainly on base pair
Base pair
In molecular biology and genetics, the linking between two nitrogenous bases on opposite complementary DNA or certain types of RNA strands that are connected via hydrogen bonds is called a base pair...

ing and base stacking interactions; many molecules have several possible three-dimensional structures, so predicting these structures remains out of reach unless obvious sequence and functional similarity to a known class of nucleic acid molecules, such as transfer RNA
Transfer RNA
Transfer RNA is an adaptor molecule composed of RNA, typically 73 to 93 nucleotides in length, that is used in biology to bridge the three-letter genetic code in messenger RNA with the twenty-letter code of amino acids in proteins. The role of tRNA as an adaptor is best understood by...

 or microRNA, is observed. Many secondary structure prediction methods rely on variations of dynamic programming
Dynamic programming
In mathematics and computer science, dynamic programming is a method for solving complex problems by breaking them down into simpler subproblems. It is applicable to problems exhibiting the properties of overlapping subproblems which are only slightly smaller and optimal substructure...

 and therefore are unable to efficiently identify pseudoknots.

While the methods are similar, there are slight differences in the approaches to RNA and DNA structure prediction. In vivo, DNA structures are more likely to be duplexes with full complementarity
Complementarity (molecular biology)
In molecular biology, complementarity is a property of double-stranded nucleic acids such as DNA, as well as DNA:RNA duplexes. Each strand is complementary to the other in that the base pairs between them are non-covalently connected via two or three hydrogen bonds...

 between two strands, while RNA structures are more likely to fold into complex secondary and tertiary structures such as in the ribosome
Ribosome
A ribosome is a component of cells that assembles the twenty specific amino acid molecules to form the particular protein molecule determined by the nucleotide sequence of an RNA molecule....

, spliceosome
Spliceosome
A spliceosome is a complex of snRNA and protein subunits that removes introns from a transcribed pre-mRNA segment. This process is generally referred to as splicing.-Composition:...

, or tRNA. This is partially because the extra oxygen in RNA increases the propensity for hydrogen bond
Hydrogen bond
A hydrogen bond is the attractive interaction of a hydrogen atom with an electronegative atom, such as nitrogen, oxygen or fluorine, that comes from another molecule or chemical group. The hydrogen must be covalently bonded to another electronegative atom to create the bond...

ing in the nucleic acid backbone. The energy parameters
Nucleic acid thermodynamics
Nucleic acid thermodynamics is the study of the thermodynamics of nucleic acid molecules, or how temperature affects nucleic acid structure. For multiple copies of DNA molecules, the melting temperature is defined as the temperature at which half of the DNA strands are in the double-helical state...

 are also different for the two nucleic acids.

Single sequence structure prediction

A common problem for researchers working with RNA is to determine the three-dimensional structure of the molecule given just the nucleic acid sequence. However, in the case of RNA much of the final structure is determined by the secondary structure
Secondary structure
In biochemistry and structural biology, secondary structure is the general three-dimensional form of local segments of biopolymers such as proteins and nucleic acids...

 or intra-molecular base-pairing interactions of the molecule. This is shown by the high conservation of base-pairings
Base pair
In molecular biology and genetics, the linking between two nitrogenous bases on opposite complementary DNA or certain types of RNA strands that are connected via hydrogen bonds is called a base pair...

 across diverse species.

The most stable structure

Secondary structure of small RNA molecules is largely determined by strong, local interactions such as hydrogen bond
Hydrogen bond
A hydrogen bond is the attractive interaction of a hydrogen atom with an electronegative atom, such as nitrogen, oxygen or fluorine, that comes from another molecule or chemical group. The hydrogen must be covalently bonded to another electronegative atom to create the bond...

s and base stacking. Summing the free energy for such interactions should provide an approximation for the stability of a given structure. To predict the folding free energy of a given secondary structure, an empirical nearest-neighbor model is used. In the nearest neighbor model the free energy change for each motif depends on the sequence of the motif and of its closest base-pairs. The model and parameters of minimal energy for Watson–Crick pairs, GU pairs and loop regions were derived from empirical calorimetric experiments, the most up-to-date parameters were published in 2004, although most software packages use the previous set assembled in 1999.

The simplest way to find the lowest free energy structure would be to generate all possible structures and calculate the free energy for it, but the number of possible structures for a sequence increases exponentially with the length of RNA (Number of secondary structures = (1,8)N , N- number of nucleotides). For longer RNA molecules, the number of possible secondary structures is huge: a sequence of 100 nucleotides has more than 1025 possible secondary structures.

Dynamic programming algorithms

The first and the most popular method for finding the most stable structure is a dynamic programming
Dynamic programming
In mathematics and computer science, dynamic programming is a method for solving complex problems by breaking them down into simpler subproblems. It is applicable to problems exhibiting the properties of overlapping subproblems which are only slightly smaller and optimal substructure...

 algorithm. One of the first attempts to predict RNA secondary structure was made by Ruth Nussinov and co-workers who used the dynamic programming method for maximizing the number of base-pairs. However, there are several issues with this approach: most importantly, the solution is not unique. Nussinov et al. published an adaptation of their approach using a simple nearest-neighbour energy model in 1980. In 1981, Michael Zuker and Patrick Stiegler proposed using a slightly refined dynamic programming approach to modelling nearest neighbor energy interactions that directly incorporates stacking into the prediction.

Dynamic programming algorithms provide a means to implicitly check all variants of possible RNA secondary structures without explicitly generating the structures. First, the lowest conformational free energy is determined for each possible sequence fragment starting with the shortest fragments and then for longer fragments. For longer fragments, recursion on the optimal free energy changes determined for shorter sequences speeds the determination of the lowest folding free energy. Once the lowest free energy of the complete sequence is calculated, the exact structure of RNA molecule is determined.

Dynamic programming algorithms are commonly used to detect base pair
Base pair
In molecular biology and genetics, the linking between two nitrogenous bases on opposite complementary DNA or certain types of RNA strands that are connected via hydrogen bonds is called a base pair...

ing patterns that are "well-nested", that is, form hydrogen bond
Hydrogen bond
A hydrogen bond is the attractive interaction of a hydrogen atom with an electronegative atom, such as nitrogen, oxygen or fluorine, that comes from another molecule or chemical group. The hydrogen must be covalently bonded to another electronegative atom to create the bond...

s only to bases that do not overlap one another in sequence position. Secondary structures that fall into this category include double helices, stem-loop
Stem-loop
Stem-loop intramolecular base pairing is a pattern that can occur in single-stranded DNA or, more commonly, in RNA. The structure is also known as a hairpin or hairpin loop. It occurs when two regions of the same strand, usually complementary in nucleotide sequence when read in opposite directions,...

s, and variants of the "cloverleaf" pattern found in transfer RNA
Transfer RNA
Transfer RNA is an adaptor molecule composed of RNA, typically 73 to 93 nucleotides in length, that is used in biology to bridge the three-letter genetic code in messenger RNA with the twenty-letter code of amino acids in proteins. The role of tRNA as an adaptor is best understood by...

 molecules. These methods rely on pre-calculated parameters which estimate the free energy
Thermodynamic free energy
The thermodynamic free energy is the amount of work that a thermodynamic system can perform. The concept is useful in the thermodynamics of chemical or thermal processes in engineering and science. The free energy is the internal energy of a system less the amount of energy that cannot be used to...

 associated with particular types of base-pairing interactions, including Watson-Crick and Hoogsteen base pair
Hoogsteen base pair
A Hoogsteen base pair is a variation of base-pairing in nucleic acids such as the A•T pair. In this manner, two nucleobases on each strand can be held together by hydrogen bonds in the major groove...

s. Depending on the complexity of the method, single base pairs may be considered as well as short two- or three-base segments to incorporate the effects of base stacking. This method cannot identify pseudoknot
Pseudoknot
A pseudoknot is a nucleic acid secondary structure containing at least two stem-loop structures in which half of one stem is intercalated between the two halves of another stem. The pseudoknot was first recognized in the turnip yellow mosaic virus in 1982...

s, which are not well nested, without substantial algorithmic modifications that are extremely computationally expensive.

Suboptimal structures

The accuracy of RNA secondary structure prediction from a single sequence by free energy minimization is limited by several factors:
  1. The free energy value's list in nearest neighbor model is incomplete
  2. Not all known RNA folds in such a way as to conform with the thermodynamic minimum.
  3. Some RNA sequences have more than one biologically active conformation (e. i. Riboswitch
    Riboswitch
    In molecular biology, a riboswitch is a part of an mRNA molecule that can directly bind a small target molecule, and whose binding of the target affects the gene's activity. Thus, an mRNA that contains a riboswitch is directly involved in regulating its own activity, in response to the...

    es)

For this reason, the ability to predict structures which have similar low free energy would provide significant information. Such structures are termed suboptimal structures. MFOLD is one program that generates suboptimal structures.

Predicting pseudoknots

One of the issues when predicting RNA secondary structure is that the standard free energy minimization and statistical sampling methods can not find pseudoknot
Pseudoknot
A pseudoknot is a nucleic acid secondary structure containing at least two stem-loop structures in which half of one stem is intercalated between the two halves of another stem. The pseudoknot was first recognized in the turnip yellow mosaic virus in 1982...

s. The major problem is that the usual dynamic programing algorithms, when predicting secondary structure, consider only the interactions between the closest nucleotides, while pseudoknotted structures are formed due to interactions between distant nucleotides. Rivas and Eddy published a dynamic programming algorithm for predicting pseudoknots. However, this dynamic programming algorithm is very slow. The standard dynamic programming algorithm for free energy minimization scales O(N3) in time (N is the number of nucleotides in the sequence), while the Rivas and Eddy algorithm scales O(N6) in time. This has prompted several researchers to implement versions of the algorithm that restrict classes of pseudoknots, resulting in performance gains. For example, pknotsRG tool includes only the class of simple recursive pseudoknots and scales O(N4) in time.

Other approaches for RNA secondary structure prediction

Another approach for RNA secondary structure determination is to sample structures from the Boltzmann ensemble, as exemplified by the program SFOLD. The program generates a statistical sample of all possible RNA secondary structures. The algorithm samples secondary structures according to the Boltzmann distribution. The sampling method offers an appealing solution to the problem of uncertainties in folding.

Comparative secondary structure prediction

Sequence covariation methods rely on the existence of a data set composed of multiple homologous
Homology (biology)
Homology forms the basis of organization for comparative biology. In 1843, Richard Owen defined homology as "the same organ in different animals under every variety of form and function". Organs as different as a bat's wing, a seal's flipper, a cat's paw and a human hand have a common underlying...

 RNA sequences with related but dissimilar sequences. These methods analyze the covariation of individual base sites in evolution
Evolution
Evolution is any change across successive generations in the heritable characteristics of biological populations. Evolutionary processes give rise to diversity at every level of biological organisation, including species, individual organisms and molecules such as DNA and proteins.Life on Earth...

; maintenance at two widely separated sites of a pair of base-pairing nucleotides indicates the presence of a structurally required hydrogen bond between those positions. The general problem of pseudoknot prediction has been shown to be NP-complete
NP-complete
In computational complexity theory, the complexity class NP-complete is a class of decision problems. A decision problem L is NP-complete if it is in the set of NP problems so that any given solution to the decision problem can be verified in polynomial time, and also in the set of NP-hard...

.

In general, the problem of alignment and consensus structure prediction are closely related. Three different approaches to the prediction of consensus structures can be distinguished:
  1. Folding of alignment
  2. Simultaneous sequence alignment and folding
  3. Alignment of predicted structures

Align then fold

A practical heuristic approach is to use multiple sequence alignment
Multiple sequence alignment
A multiple sequence alignment is a sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences are assumed to have an evolutionary relationship by which they share a lineage and are descended from a common ancestor...

 tools to produce an alignment of several RNA sequences, to find consensus sequence and then fold it. The quality of the alignment determines the accuracy of the consensus structure model. Consensus sequences are folded using various approaches similarly as in individual structure prediction problem. The thermodynamic folding approach is exemplified by RNAalifold program. The different approaches are exemplified by Pfold and ILM programs. Pfold program implements a SCFGs
Stochastic context-free grammar
A stochastic context-free grammar is a context-free grammar in which each production is augmented with a probability...

. ILM (iterated loop matching) unlike the other algorithms for folding of alignments, can return pseudocnoted structures. It uses combination of thermodynamics and mutual information content scores.

Align and fold

Evolution
Evolution
Evolution is any change across successive generations in the heritable characteristics of biological populations. Evolutionary processes give rise to diversity at every level of biological organisation, including species, individual organisms and molecules such as DNA and proteins.Life on Earth...

 frequently preserves functional RNA structure better than RNA sequence. Hence, a common biological problem is to infer a common structure for two or more highly diverged but homologous RNA sequences. In practice, sequence alignments become unsuitable and do not help to improve the accuracy of structure prediction, when sequence similarity of two sequences is less than 50%.

Structure-based alignment programs improves the performance of these alignments and most of them are variants of the Sankoff algorithm. Basically, Sankoff algorithm is a merger of sequence alignment and Nussinov (maximal-pairing) folding dynamic programming method. Sankoff algorithm itself is a theoretical exercise because it requires extreme computational resources ( (O(n3m) in time, and O(n2m) in space, where n is the sequence length and m is the number of sequences). Some notable attempts at implementing restricted versions of Sankoff's algorithm are Foldalign, Dynalign, PMmulti/PMcomp, Stemloc
Stemloc
Stemloc is a program for pairwise RNA structural alignment based on probabilistic models of RNA structure known as Pair stochastic context-free grammars. Stemloc implements constrained versions of the Sankoff algorithms for simultaneous structure prediction and sequence alignment of multiple...

, and Murlet. In these implementations the maximal length of alignment or variants of possible consensus structures are restricted. For example, Foldalign focuses on local alignments and restricts the possible length of the sequences alignment.

Fold then align

A less widely used approach is to fold the sequences using single sequence structure prediction methods and align the resulting structures using tree-based metrics. The fundamental weakness with this approach is that single sequence predictions are often inaccurate, thus all further analyses are affected.

Tertiary structure prediction

Once secondary structure of RNA is known, the next challenge is to predict tertiary structure. The biggest problem is to determine the structure of regions between double stranded helical regions. Also RNA molecules often contain posttranscriptionally modified nucleosides, which because of new possible non-canonical interactions, cause a lot of troubles for tertiary structure prediction.

See also

  • RNA
    RNA
    Ribonucleic acid , or RNA, is one of the three major macromolecules that are essential for all known forms of life....

  • RNA structure
    RNA structure
    Biomolecular structure is the structure of biomolecules, mainly proteins and the nucleic acids DNA and RNA. The structure of these molecules is frequently decomposed into primary structure, secondary structure, tertiary structure, and quaternary structure. The scaffold for this structure is...

  • Non-coding RNA
    Non-coding RNA
    A non-coding RNA is a functional RNA molecule that is not translated into a protein. Less-frequently used synonyms are non-protein-coding RNA , non-messenger RNA and functional RNA . The term small RNA is often used for short bacterial ncRNAs...

  • List of RNA structure prediction software
  • List of nucleic acid simulation software

Further reading

  • Baker D and Sali A. Protein structure prediction and structural genomics. Science 2001; 294: 93-6.
  • Chiu, D.K. and Kolodziejczak, T. (1991) Inferring consensus structure from nucleic acid sequences. Comput. Appl. Biosci, . 7, 347–352
  • Do CB, Woods DA, Batzoglou S. (2006) CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics. 22(14):e90-8.
  • Gutell, R.R., et al. (1992) Identifying constraints on the higher-order structure of RNA: continued development and application of comparative sequence analysis methods. Nucleic Acids Res, . 20, 5785–5795
  • Leontis NB, Lescoute A, and Westhof E. The building blocks and motifs of RNA architecture. Curr Opin Struct Biol 2006; 16: 279-87.
  • Macke T, Case D: Modeling unusual nucleic acid structures. In Molecular Modeling of Nucleic Acids. Edited by Leontes N, SantaLucia JJ. Washington, DC: American Chemical Society; 1998:379-393.
  • Major F: Building three-dimensional ribonucleic acid structures. Comput Sci Eng 2003, 5:44-53.
  • Massire C, Westhof E: MANIP: an interactive tool for modelling RNA. J Mol Graph Model 1998, 16:197-205, 255–257.
  • Tuzet, H. & Perriquet, O., 2004. CARNAC: folding families of related RNAs. Nucleic Acids Research, 32(Web Server issue), W142-145.
  • Touzet, H., 2007. Comparative analysis of RNA genes: the caRNAc software. Methods in Molecular Biology (Clifton, N.J.), 395, 465-474.
  • ModeRNA: A program for comparative RNA modeling
The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK