Protein structure prediction
Protein structure prediction is the prediction of the three-dimensional structure of a protein from its amino acid sequence, that is, the prediction of its secondary, tertiary, and quaternary structure from its primary structure. Structure prediction is fundamentally different from the inverse problem of protein design. Protein structure prediction is one of the most important goals pursued by bioinformatics and theoretical chemistry; it is highly important in medicine (for example, in drug design) and biotechnology (for example, in the design of novel enzymes). Every two years, the performance of current methods is assessed in the CASP experiment (Critical Assessment of Techniques for Protein Structure Prediction).

Secondary structure

Secondary structure prediction is a set of techniques in bioinformatics that aim to predict the local secondary structures of proteins and RNA sequences based only on knowledge of their primary structure (amino acid or nucleotide sequence, respectively). For proteins, a prediction consists of assigning regions of the amino acid sequence as likely alpha helices, beta strands (often noted as "extended" conformations), or turns. The success of a prediction is determined by comparing it to the results of the DSSP algorithm applied to the crystal structure of the protein; for nucleic acids, it may be determined from the hydrogen bonding pattern. Specialized algorithms have been developed for the detection of specific well-defined patterns such as transmembrane helices and coiled coils in proteins, or canonical microRNA structures in RNA.
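The comparison against a reference assignment can be illustrated with a per-residue agreement score, often called Q3 when three states are used. The sketch below assumes that both the prediction and the reference (for example, a DSSP assignment) have already been reduced to the same three-letter alphabet; the function name and the example strings are hypothetical.

```python
def q3_accuracy(predicted: str, reference: str) -> float:
    """Fraction of residues whose predicted state matches the reference state."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("sequences must be non-empty")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Hypothetical three-state strings: H = helix, E = strand, C = everything else
reference = "CCHHHHHHCCEEEECC"
predicted = "CCHHHHHCCCEEECCC"
print(f"Q3 = {q3_accuracy(predicted, reference):.3f}")  # prints "Q3 = 0.875"
```

Here two of the sixteen residues disagree, both at the ends of a secondary structure element, which is where assignments tend to be least certain.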

The best modern methods of secondary structure prediction in proteins reach about 80% accuracy; this high accuracy allows the use of the predictions in fold recognition and ab initio protein structure prediction, classification of structural motifs, and refinement of sequence alignments. The accuracy of current protein secondary structure prediction methods is assessed in weekly benchmarks such as LiveBench and EVA.

Background

Early methods of secondary structure prediction, introduced in the 1960s and early 1970s, focused on identifying likely alpha helices and were based mainly on helix-coil transition models. Significantly more accurate predictions that included beta sheets were introduced in the 1970s and relied on statistical assessments based on probability parameters derived from known solved structures. These methods, applied to a single sequence, are typically at most about 60-65% accurate, and often underpredict beta sheets. The evolutionary conservation of secondary structures can be exploited by simultaneously assessing many homologous sequences in a multiple sequence alignment, by calculating the net secondary structure propensity of an aligned column of amino acids. In concert with larger databases of known protein structures and modern machine learning methods such as neural nets and support vector machines, these methods can achieve up to 80% overall accuracy in globular proteins. The theoretical upper limit of accuracy is around 90%, partly due to idiosyncrasies in DSSP assignment near the ends of secondary structures, where local conformations vary under native conditions but may be forced to assume a single conformation in crystals due to packing constraints. Limitations are also imposed by secondary structure prediction's inability to account for tertiary structure; for example, a sequence predicted as a likely helix may still be able to adopt a beta-strand conformation if it is located within a beta-sheet region of the protein and its side chains pack well with their neighbors. Dramatic conformational changes related to the protein's function or environment can also alter local secondary structure.

Chou-Fasman method

The Chou-Fasman method was among the first secondary structure prediction algorithms developed and relies predominantly on probability parameters determined from relative frequencies of each amino acid's appearance in each type of secondary structure. The original Chou-Fasman parameters, determined from the small sample of structures solved in the mid-1970s, produce poor results compared to modern methods, though the parameterization has been updated since it was first published. The Chou-Fasman method is roughly 50-60% accurate in predicting secondary structures.
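The idea of deriving parameters from relative frequencies can be sketched as follows. One common way to define a propensity is the fraction of residues in a given state that are a particular amino acid, divided by that amino acid's overall fraction, so that values above 1 indicate a preference for the state. The training data below are invented for illustration and are not the published Chou-Fasman parameters; the function name is hypothetical.

```python
from collections import Counter


def propensities(sequences, structures):
    """Return {(residue, state): propensity} from labeled training data.

    propensity = (fraction of `state` residues that are `residue`)
               / (fraction of all residues that are `residue`)
    """
    pair_counts = Counter()
    residue_counts = Counter()
    state_counts = Counter()
    for seq, ss in zip(sequences, structures):
        if len(seq) != len(ss):
            raise ValueError("sequence and structure strings must align")
        for aa, state in zip(seq, ss):
            pair_counts[(aa, state)] += 1
            residue_counts[aa] += 1
            state_counts[state] += 1

    total = sum(residue_counts.values())
    table = {}
    for aa, aa_count in residue_counts.items():
        overall = aa_count / total
        for state, state_count in state_counts.items():
            in_state = pair_counts[(aa, state)] / state_count
            table[(aa, state)] = in_state / overall
    return table


# Invented toy data: H = helix, C = coil
toy_sequences = ["AAEGPA", "GPAAEL"]
toy_structures = ["HHHCCH", "CCHHHH"]
table = propensities(toy_sequences, toy_structures)
print(table[("A", "H")])  # alanine is enriched in helix in this toy set
print(table[("G", "H")])  # glycine never appears in helix in this toy set
```

With a sample this small the numbers are meaningless in absolute terms, which mirrors the point above that parameters derived from few solved structures give poor results.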

GOR method

The GOR method, named for the three scientists who developed it (Garnier, Osguthorpe, and Robson), is an information theory-based method developed not long after Chou-Fasman. It uses the more powerful probabilistic technique of Bayesian inference. The method is a specific optimized application of mathematics and algorithms developed in a series of papers by Robson and colleagues. The GOR method is capable of continued extension by such principles, and has gone through several versions. The GOR method takes into account not only the probability of each amino acid having a particular secondary structure, but also the conditional probability of the amino acid assuming each structure given the contributions of its neighbors (it does not assume that the neighbors have that same structure). The approach is both more sensitive and more accurate than that of Chou and Fasman because amino acid structural propensities are only strong for a small number of amino acids such as proline and glycine. Weak contributions from each of many neighbors can add up to a strong effect overall. The original GOR method was roughly 65% accurate and was dramatically more successful in predicting alpha helices than beta sheets, which it frequently mispredicted as loops or disorganized regions. Later GOR methods also considered pairs of amino acids, significantly improving performance. The major difference from the machine learning techniques described next is perhaps that the weights in an implied network of contributing terms are assigned a priori, from statistical analysis of proteins of known structure, rather than by feedback to optimize agreement with a training set of such structures.
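How weak neighbor contributions accumulate can be shown with a simplified window sum in the spirit of this approach. This is only a sketch: the information values are invented rather than taken from any published GOR table, the default half-window of 8 residues is an assumption, and the function and variable names are hypothetical.

```python
def predict_states(sequence, info, states="HEC", half_window=8):
    """Assign each residue the state with the largest summed window score.

    `info` maps (state, offset, residue) to an information value describing
    how a residue at `offset` from the central position favours `state` at
    the central position. Missing entries contribute 0.
    """
    prediction = []
    for i in range(len(sequence)):
        scores = {}
        for state in states:
            total = 0.0
            for offset in range(-half_window, half_window + 1):
                j = i + offset
                if 0 <= j < len(sequence):
                    total += info.get((state, offset, sequence[j]), 0.0)
            scores[state] = total
        # Ties are resolved in favour of the state listed first in `states`
        prediction.append(max(states, key=scores.get))
    return "".join(prediction)


# Invented values: alanine weakly favours helix at the centre and at nearby
# offsets, while glycine favours coil only at the central position.
toy_info = {("H", offset, "A"): 0.3 for offset in range(-2, 3)}
toy_info[("C", 0, "G")] = 1.0

print(predict_states("GAAAAG", toy_info, half_window=2))  # prints "CHHHHC"
```

No single alanine term exceeds the glycine coil term, yet the summed alanine contributions dominate at the interior positions.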

Machine learning

Neural network methods use training sets of solved structures to identify common sequence motifs associated with particular arrangements of secondary structures. These methods are over 70% accurate in their predictions, although beta strands are still often underpredicted due to the lack of three-dimensional structural information that would allow assessment of hydrogen bonding patterns that can promote formation of the extended conformation required for the presence of a complete beta sheet.

Support vector machines have proven particularly useful for predicting the locations of turns, which are difficult to identify with statistical methods. Their requirement of relatively small training sets has also been cited as an advantage that helps avoid overfitting to existing structural data.

Extensions of machine learning techniques attempt to predict more fine-grained local properties of proteins, such as backbone dihedral angles in unassigned regions. Both SVMs and neural networks have been applied to this problem.
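For readers unfamiliar with the quantity being predicted, a dihedral (torsion) angle is defined by four consecutive points. The sketch below computes it from Cartesian coordinates using the standard atan2 formulation; the function names and the coordinates are hypothetical, and no real backbone geometry is implied.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral_degrees(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 bond."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane through p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Four made-up points with the central bond along the z axis
p0, p1, p2 = (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)
print(dihedral_degrees(p0, p1, p2, (1.0, 0.0, 1.0)))  # eclipsed arrangement
print(dihedral_degrees(p0, p1, p2, (0.0, 1.0, 1.0)))  # quarter turn
```

For a protein backbone, the four points would be consecutive backbone atoms, which is an application detail left out of this sketch.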

Other improvements

Secondary structure formation is reported to depend on factors other than the protein sequence alone. For example, secondary structure tendencies are reported to depend also on local environment, solvent accessibility of residues, protein structural class, and even the organism from which the proteins are obtained. Based on such observations, some studies have shown that secondary structure prediction can be improved by adding information about protein structural class, residue accessible surface area, and contact number.

Sequence covariation methods rely on the existence of a data set composed of multiple homologous RNA sequences with related but dissimilar sequences. These methods analyze the covariation of individual base sites in evolution; maintenance at two widely separated sites of a pair of base-pairing nucleotides indicates the presence of a structurally required hydrogen bond between those positions. The general problem of pseudoknot prediction has been shown to be NP-complete.
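One simple way to quantify covariation between two alignment columns is their mutual information, sketched below. This is one possible measure rather than the specific statistic of any particular method; it assumes an ungapped alignment of equal-length sequences, and the alignment shown is invented.

```python
import math
from collections import Counter


def column_mutual_information(alignment, i, j):
    """Mutual information, in bits, between alignment columns i and j."""
    n = len(alignment)
    col_i = Counter(seq[i] for seq in alignment)
    col_j = Counter(seq[j] for seq in alignment)
    pairs = Counter((seq[i], seq[j]) for seq in alignment)
    mi = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n
        p_a = col_i[a] / n
        p_b = col_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Invented alignment: columns 0 and 4 always change together so that they
# stay complementary (G-C, C-G, A-U, U-A); columns 1 to 3 never change.
toy_alignment = [
    "GACUC",
    "CACUG",
    "AACUU",
    "UACUA",
]
print(column_mutual_information(toy_alignment, 0, 4))  # 2.0 bits
print(column_mutual_information(toy_alignment, 1, 3))  # 0.0 bits
```

Mutual information only detects that two columns vary together; checking that the co-occurring bases can actually pair is a separate step that this sketch omits. Fully conserved columns score zero, which is why the sequences need to be related but dissimilar.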

Tertiary structure

The practical role of protein structure prediction is now more important than ever. Massive amounts of protein sequence data are produced by modern large-scale DNA sequencing efforts such as the Human Genome Project. Despite community-wide efforts in structural genomics, the output of experimentally determined protein structures (typically by time-consuming and relatively expensive X-ray crystallography or NMR spectroscopy) is lagging far behind the output of protein sequences.

Protein structure prediction remains an extremely difficult and unresolved undertaking. The two main problems are calculation of protein free energy and finding the global minimum of this energy. A protein structure prediction method must explore the space of possible protein structures, which is astronomically large. These problems can be partially bypassed in "comparative" or homology modeling and fold recognition methods, in which the search space is pruned by the assumption that the protein in question adopts a structure that is close to the experimentally determined structure of another homologous protein. On the other hand, the de novo or ab initio protein structure prediction methods must explicitly resolve these problems. The progress and challenges in protein structure prediction have been reviewed in Zhang 2008.

Ab initio protein modelling

Ab initio or de novo protein modelling methods seek to build three-dimensional protein models "from scratch", i.e., based on physical principles rather than (directly) on previously solved structures. There are many possible procedures that either attempt to mimic protein folding or apply some stochastic method to search possible solutions (i.e., global optimization of a suitable energy function). These procedures tend to require vast computational resources, and have thus only been carried out for tiny proteins. Predicting protein structure de novo for larger proteins will require better algorithms and larger computational resources like those afforded by either powerful supercomputers (such as Blue Gene or MDGRAPE-3) or distributed computing (such as Folding@home, the Human Proteome Folding Project and Rosetta@Home). Although these computational barriers are vast, the potential benefits of structural genomics (by predicted or experimental methods) make ab initio structure prediction an active research field.
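The stochastic search idea can be illustrated with simulated annealing using the Metropolis acceptance rule, which is one of many possible global optimization strategies. The energy function below is a made-up rugged landscape over a handful of torsion-like angles, not a physical force field, and all names and parameter values are assumptions chosen for illustration.

```python
import math
import random


def toy_energy(angles):
    """Made-up rugged energy over torsion-like angles given in radians."""
    return sum(1.0 + math.cos(3.0 * a) + 0.5 * math.cos(a - 1.0) for a in angles)


def anneal(energy, n_angles=5, steps=5000, t_start=2.0, t_end=0.01, seed=0):
    """Return (best_angles, best_energy) found by simulated annealing."""
    rng = random.Random(seed)
    current = [rng.uniform(-math.pi, math.pi) for _ in range(n_angles)]
    current_e = energy(current)
    best, best_e = list(current), current_e
    for step in range(steps):
        # Geometric cooling schedule from t_start down to t_end
        fraction = step / max(steps - 1, 1)
        temperature = t_start * (t_end / t_start) ** fraction
        # Propose a small random change to one randomly chosen angle
        trial = list(current)
        trial[rng.randrange(n_angles)] += rng.gauss(0.0, 0.5)
        trial_e = energy(trial)
        delta = trial_e - current_e
        # Metropolis rule: always accept downhill, sometimes accept uphill
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_e = trial, trial_e
            if current_e < best_e:
                best, best_e = list(current), current_e
    return best, best_e


best_angles, best_energy = anneal(toy_energy)
print(f"lowest energy found: {best_energy:.3f}")
```

Even in this toy setting the search offers no guarantee of reaching the global minimum, and the number of energy evaluations grows quickly with the number of degrees of freedom, which hints at why real proteins demand so much computation.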

As an intermediate step towards predicted protein structures, contact map predictions have been proposed.
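A contact map is a binary matrix whose entry for residues i and j is 1 when the two residues are closer than a chosen distance threshold and 0 otherwise. The sketch below computes one from coordinates; the 8 angstrom threshold and the use of one representative point per residue are assumptions, since definitions vary, and the coordinates are invented.

```python
import math


def contact_map(coords, threshold=8.0):
    """Binary contact matrix from one representative 3D point per residue."""
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if math.dist(coords[i], coords[j]) < threshold:
                cmap[i][j] = 1
    return cmap


# Invented coordinates: four residues spaced 3.8 units apart on a line
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
for row in contact_map(toy_coords):
    print(row)
```

The matrix is symmetric with ones on the diagonal; in this example only the first and last residues are too far apart to be in contact.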

Comparative protein modelling

Comparative protein modelling uses previously solved structures as starting points, or templates. This is effective because it appears that although the number of actual proteins is vast, there is a limited set of tertiary structural motifs to which most proteins belong. It has been suggested that there are only around 2,000 distinct protein folds in nature, though there are many millions of different proteins.

These methods may also be split into two groups:

Homology modeling: based on the reasonable assumption that two homologous proteins will share very similar structures. Because a protein's fold is more evolutionarily conserved than its amino acid sequence, a target sequence can be modeled with reasonable accuracy on a very distantly related template, provided that the relationship between target and template can be discerned through sequence alignment. It has been suggested that the primary bottleneck in comparative modelling arises from difficulties in alignment rather than from errors in structure prediction given a known-good alignment. Unsurprisingly, homology modelling is most accurate when the target and template have similar sequences.

Protein threading: scans the amino acid sequence of an unknown structure against a database of solved structures. In each case, a scoring function is used to assess the compatibility of the sequence to the structure, thus yielding possible three-dimensional models. This type of method is also known as 3D-1D fold recognition due to its compatibility analysis between three-dimensional structures and linear protein sequences. This method has also given rise to methods performing an inverse folding search by evaluating the compatibility of a given structure with a large database of sequences, thus predicting which sequences have the potential to produce a given fold.

Side-chain geometry prediction

Accurate packing of the amino acid side chains represents a separate problem in protein structure prediction. Methods that specifically address the problem of predicting side-chain geometry include dead-end elimination and the self-consistent mean field methods. The side chain conformations with low energy are usually determined on the rigid polypeptide backbone and using a set of discrete side chain conformations known as "rotamers." The methods attempt to identify the set of rotamers that minimize the model's overall energy.
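The rotamer-selection problem described above can be illustrated with a small sketch of dead-end elimination. It uses the simplest elimination criterion: a rotamer r at a position is discarded if its best-case energy is still worse than the worst-case energy of some other rotamer t at the same position. The energy tables below are arbitrary toy values (three positions, three rotamers each), not derived from any force field, and all names are chosen for illustration.

```python
from itertools import product

# Hypothetical toy energies in arbitrary units.
# SELF_E[i][r]: energy of rotamer r at position i against the fixed backbone.
SELF_E = [[0.0, 5.0, 1.0], [0.0, 0.5, 6.0], [2.0, 0.0, 7.0]]
# PAIR_E[(i, j)][r][s]: interaction of rotamer r at i with rotamer s at j (i < j).
PAIR_E = {
    (0, 1): [[0.2, -0.5, 0.1], [0.0, 0.3, -0.2], [-0.8, 0.4, 0.0]],
    (0, 2): [[-0.3, 0.6, 0.0], [0.1, 0.1, 0.1], [0.5, -0.9, 0.2]],
    (1, 2): [[0.0, -0.4, 0.3], [0.7, -0.6, 0.0], [0.1, 0.2, -0.1]],
}


def pair_energy(i, r, j, s):
    """Pairwise energy between rotamer r at position i and rotamer s at j."""
    return PAIR_E[(i, j)][r][s] if i < j else PAIR_E[(j, i)][s][r]


def total_energy(assignment):
    """Total energy of one rotamer choice per position."""
    n = len(assignment)
    energy = sum(SELF_E[i][assignment[i]] for i in range(n))
    energy += sum(pair_energy(i, assignment[i], j, assignment[j])
                  for i in range(n) for j in range(i + 1, n))
    return energy


def dead_end_elimination():
    """Repeatedly prune rotamers that cannot be part of the global minimum."""
    n = len(SELF_E)
    alive = [set(range(len(e))) for e in SELF_E]
    changed = True
    while changed:
        changed = False
        for i in range(n):
            for r in sorted(alive[i]):
                # Lowest energy r could possibly contribute.
                best_case_r = SELF_E[i][r] + sum(
                    min(pair_energy(i, r, j, s) for s in alive[j])
                    for j in range(n) if j != i)
                for t in sorted(alive[i] - {r}):
                    # Highest energy t could possibly contribute.
                    worst_case_t = SELF_E[i][t] + sum(
                        max(pair_energy(i, t, j, s) for s in alive[j])
                        for j in range(n) if j != i)
                    if best_case_r > worst_case_t:
                        alive[i].discard(r)  # r is a dead end
                        changed = True
                        break
    return alive


survivors = dead_end_elimination()
# Brute-force search, feasible only for tiny examples, used as a check.
brute_force_best = min(product(*[range(len(e)) for e in SELF_E]),
                       key=total_energy)
```

Because the criterion only removes rotamers that provably cannot belong to the lowest-energy combination, the brute-force optimum always remains among the survivors, while the search space for any subsequent enumeration shrinks.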

These methods use rotamer libraries, which are collections of favorable conformations for each residue type in proteins. Rotamer libraries may contain information about the conformation, its frequency, and the standard deviations about mean dihedral angles, which can be used in sampling. Rotamer libraries are derived from structural bioinformatics or other statistical analysis of side-chain conformations in known experimental structures of proteins, such as by clustering the observed conformations for tetrahedral carbons near the staggered (60°, 180°, -60°) values.
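As a rough illustration of how such a library might be tabulated, the sketch below assigns observed dihedral angles to the nearest staggered value and reports, for each bin, the frequency, the mean angle and the standard deviation. The input angles are made-up numbers, the function names are hypothetical, and real libraries are built from large curated structure sets with more careful statistics.

```python
from statistics import mean, pstdev

# Staggered dihedral values (degrees) used as bin centres.
CENTRES = (60.0, 180.0, -60.0)


def signed_difference(angle, centre):
    """Signed angular difference angle - centre, wrapped to [-180, 180)."""
    return (angle - centre + 180.0) % 360.0 - 180.0


def assign_bin(angle):
    """Return the staggered centre closest to the given dihedral angle."""
    return min(CENTRES, key=lambda c: abs(signed_difference(angle, c)))


def build_library(angles):
    """Summarise observed dihedral angles per staggered bin."""
    deviations = {c: [] for c in CENTRES}
    for a in angles:
        c = assign_bin(a)
        deviations[c].append(signed_difference(a, c))
    library = {}
    for c, devs in deviations.items():
        if not devs:
            continue  # no observations near this centre
        library[c] = {
            "frequency": len(devs) / len(angles),
            # Averaging deviations from the centre avoids wrap-around errors.
            "mean": c + mean(devs),
            "std": pstdev(devs),
        }
    return library


# Made-up observations for demonstration only.
observed = [55.0, 65.0, 60.0, 175.0, -175.0, -62.0]
library = build_library(observed)
```

Working with deviations from each bin centre, rather than with raw angles, keeps values such as 175° and -175° in the same bin and gives them a sensible mean of 180°.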

Rotamer libraries can be backbone-independent, secondary-structure-dependent, or backbone-dependent. Backbone-independent rotamer libraries make no reference to backbone conformation, and are calculated from all available side chains of a certain type (for instance, the first example of a rotamer library, done by Ponder and Richards at Yale in 1987). Secondary-structure-dependent libraries present different dihedral angles and/or rotamer frequencies for α-helix, β-sheet, or coil secondary structures. Backbone-dependent rotamer libraries present conformations and/or frequencies dependent on the local backbone conformation as defined by the backbone dihedral angles φ and ψ, regardless of secondary structure.

The modern versions of these libraries as used in most software are presented as multidimensional distributions of probability or frequency, where the peaks correspond to the dihedral-angle conformations considered as individual rotamers in the lists. Some versions are based on very carefully curated data and are used primarily for structure validation, while others emphasize relative frequencies in much larger data sets and are the form used primarily for structure prediction, such as the Dunbrack rotamer libraries.

Side-chain packing methods are most useful for analyzing the protein's hydrophobic core, where side chains are more closely packed; they have more difficulty addressing the looser constraints and higher flexibility of surface residues, which often occupy multiple rotamer conformations rather than just one.

Prediction of structural classes

Statistical methods have been developed for predicting structural classes of proteins based on their amino acid composition, pseudo amino acid composition and functional domain composition.
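To make the notion of amino acid composition concrete, the following sketch converts a sequence into a 20-dimensional vector of residue fractions and assigns it to the class whose average composition is closest. The nearest-centroid rule is just one simple stand-in for the statistical methods mentioned above, and the training sequences and class names are arbitrary placeholders rather than real data.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence):
    """Fraction of each of the 20 standard amino acids in the sequence."""
    sequence = sequence.upper()
    counts = [sequence.count(a) for a in AMINO_ACIDS]
    total = sum(counts)  # non-standard letters are ignored
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [c / total for c in counts]


def centroid(vectors):
    """Component-wise mean of a list of composition vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def nearest_class(sequence, centroids):
    """Name of the class whose centroid is closest in squared distance."""
    v = composition(sequence)

    def distance(name):
        return sum((x - y) ** 2 for x, y in zip(v, centroids[name]))

    return min(centroids, key=distance)


# Placeholder training sequences; the class names carry no biological meaning.
TRAINING = {
    "class_1": ["AAALLLEEK", "ALEKALEK"],
    "class_2": ["VTVTYVSV", "TVYVTSG"],
}
CENTROIDS = {name: centroid([composition(s) for s in seqs])
             for name, seqs in TRAINING.items()}
predicted = nearest_class("LAEKLAEL", CENTROIDS)
```

Pseudo amino acid composition extends this plain vector with additional terms that capture sequence-order information, which the sketch above does not attempt to model.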

Quaternary structure

In the case of complexes of two or more proteins, where the structures of the proteins are known or can be predicted with high accuracy, protein–protein docking methods can be used to predict the structure of the complex. Information on the effect of mutations at specific sites on the affinity of the complex helps in understanding the complex structure and in guiding docking methods.

Software

I-TASSER was ranked the best server for protein structure prediction in the 2006-2010 CASP experiments (CASP7, CASP8 and CASP9).

RaptorX excels at aligning hard targets according to the 2010 CASP9 experiments. RaptorX generates significantly better alignments for the hardest 50 CASP9 template-based modeling targets than other servers, including those using consensus and refinement methods. The RaptorX server is available online.

MODELLER is a popular software tool for producing homology models using methodology derived from NMR spectroscopy data processing. SwissModel provides an automated web server for basic homology modeling.

HHpred, bioinfo.pl and Robetta are widely used servers for protein structure prediction. HHsearch is a free software package for protein threading and remote homology detection.

SPARKSx is one of the top-performing servers in CASP for remote fold recognition.

PEP-FOLD is a de novo approach aimed at predicting peptide structures from amino acid sequences, based on an HMM structural alphabet.

Phyre and Phyre2 are amongst the top-performing servers in the CASP international blind trials of structure prediction in homology modeling and remote fold recognition, and are designed with an emphasis on ease of use for non-experts.

RAPTOR is protein threading software based on integer programming. The basic algorithm for threading has been described in the literature and is fairly straightforward to implement.

QUARK is an on-line server suitable for ab initio protein structure modeling.

Abalone is a molecular dynamics program for folding simulations with explicit or implicit water models.

TIP is a knowledgebase of STRUCTFAST models and precomputed similarity relationships between sequences, structures, and binding sites. Several distributed computing projects concerning protein structure prediction have also been implemented, such as Folding@home, Rosetta@home, the Human Proteome Folding Project, Predictor@home, and TANPAKU.

The Foldit program seeks to investigate the pattern-recognition and puzzle-solving abilities inherent to the human mind in order to create more successful computer protein structure prediction software.

Computational approaches provide a fast alternative route to antibody structure prediction. Recently developed high-resolution structure prediction algorithms for the antibody Fv region, such as RosettaAntibody, have been shown to generate high-resolution homology models that have been used for successful docking.

Reviews of software for structure prediction are available in the literature.

Automatic structure prediction servers

CASP, which stands for Critical Assessment of Techniques for Protein Structure Prediction, is a community-wide experiment for protein structure prediction that has taken place every two years since 1994. CASP provides users and research groups with an opportunity to assess the quality of available methods and automatic servers for protein structure prediction. Official results for automatic structure prediction servers in the CASP7 benchmark (2006) are discussed by Battey et al. Official CASP8 results are available for automatic servers and for human and server predictors. Unofficial results for automatic servers of the 2008 CASP8 benchmark are summarized on several lab websites and ranked according to slightly varying criteria: Zhang lab, Grishin lab, McGuffin lab, Baker lab, and Cheng lab.

See also

  • Protein design
  • Protein structure prediction software
  • De novo protein structure prediction
  • Molecular design software
  • Molecular modeling software
  • Modelling biological systems
  • Fragment libraries
  • Lattice proteins
  • Statistical potential


External links

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 