Protein-protein interaction prediction
Protein–protein interaction prediction is a field combining bioinformatics and structural biology in an attempt to identify and catalog physical interactions between pairs or groups of proteins. Understanding protein–protein interactions is important for investigating intracellular signaling pathways, modelling protein complex structures and gaining insights into various biochemical processes. Physical interactions between pairs of proteins can be inferred from a variety of experimental techniques, including yeast two-hybrid systems, protein-fragment complementation assays (PCA), affinity purification/mass spectrometry, protein microarrays, fluorescence resonance energy transfer (FRET), and Microscale Thermophoresis (MST). Efforts to experimentally determine the interactome of numerous species are ongoing, and a number of computational methods for interaction prediction have been developed in recent years.

Methods

Proteins that interact are more likely to co-evolve; it is therefore possible to make inferences about interactions between pairs of proteins based on their phylogenetic distances. It has also been observed in some cases that pairs of interacting proteins have fused orthologues in other organisms. In addition, a number of bound protein complexes have been structurally solved and can be used to identify the residues that mediate the interaction, so that similar motifs can be located in other organisms.

Phylogenetic profiling

Phylogenetic profiling finds pairs of protein families with similar patterns of presence or absence across large numbers of species. This method identifies pairs likely to act in the same biological process, but does not necessarily imply physical interaction.
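
The comparison of presence/absence patterns can be sketched as follows. This is a minimal illustration, not a published implementation: the Jaccard score, the 0.8 threshold, and the family names and profiles are all assumptions chosen for the example, and other similarity measures could be substituted.

```python
def jaccard_similarity(profile_a, profile_b):
    """Jaccard similarity between two presence/absence profiles (lists of 0/1)."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same species")
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    return both / either if either else 0.0


def similar_profile_pairs(profiles, threshold=0.8):
    """Return (family_1, family_2, score) for pairs scoring at or above threshold."""
    names = sorted(profiles)
    hits = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            score = jaccard_similarity(profiles[first], profiles[second])
            if score >= threshold:
                hits.append((first, second, score))
    return hits


# Hypothetical profiles over six species (1 = family present, 0 = absent).
example_profiles = {
    "famA": [1, 1, 0, 1, 0, 1],
    "famB": [1, 1, 0, 1, 0, 1],
    "famC": [0, 1, 1, 0, 1, 0],
}
print(similar_profile_pairs(example_profiles))
```

Note that the Jaccard score counts only joint presence; a measure that also rewards joint absence may suit some data sets better.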

Prediction of co-evolved protein pairs based on similar phylogenetic trees

This method involves using a sequence search tool such as BLAST to find homologues of a pair of proteins, then building multiple sequence alignments with alignment tools such as Clustal. From these multiple sequence alignments, phylogenetic distance matrices are calculated for each protein in the hypothesized interacting pair. If the matrices are sufficiently similar (as measured by their Pearson correlation coefficient), the two proteins are deemed likely to interact.
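
The final comparison step can be sketched as below, assuming the two distance matrices have already been computed over the same ordered set of species. The matrices shown are hypothetical values, and the homology search and alignment steps are not reproduced here.

```python
import math


def upper_triangle(matrix):
    """Flatten the entries above the diagonal of a square distance matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def matrix_correlation(dist_a, dist_b):
    """Correlate two distance matrices built over the same ordered species."""
    return pearson(upper_triangle(dist_a), upper_triangle(dist_b))


# Hypothetical distances for proteins A and B across the same four species.
dist_a = [
    [0.0, 0.2, 0.5, 0.9],
    [0.2, 0.0, 0.4, 0.8],
    [0.5, 0.4, 0.0, 0.6],
    [0.9, 0.8, 0.6, 0.0],
]
dist_b = [
    [0.0, 0.3, 0.6, 1.0],
    [0.3, 0.0, 0.5, 0.9],
    [0.6, 0.5, 0.0, 0.7],
    [1.0, 0.9, 0.7, 0.0],
]
print(matrix_correlation(dist_a, dist_b))
```

Only the upper triangle is used because the matrices are symmetric with a zero diagonal, so the remaining entries carry no additional information.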

Identification of homologous interacting pairs

This method searches a database of known complex structures to determine whether the two query sequences have homologues that form a complex. The domains are identified by sequence searches against domain databases such as Pfam using BLAST. If more than one complex of Pfam domains is identified, the query sequences are aligned, using a hidden Markov model tool called HMMER, to the closest identified homologues whose structures are known. The alignments are then analysed to check whether the contact residues of the known complex are conserved.
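
The last step, checking whether contact residues are conserved, might look like the sketch below. It assumes a pairwise alignment between the template and the query is already available and treats only exact identity as conservation; the sequences and contact positions are hypothetical, and a real analysis would likely accept conservative substitutions as well.

```python
def contact_conservation(template_aln, query_aln, contact_positions):
    """Fraction of template contact residues that are identical in the query.

    template_aln, query_aln: equal-length aligned sequences ('-' marks a gap).
    contact_positions: 0-based indices into the ungapped template sequence.
    """
    if len(template_aln) != len(query_aln):
        raise ValueError("aligned sequences must have the same length")
    contacts = set(contact_positions)
    if not contacts:
        raise ValueError("no contact positions given")
    conserved = 0
    template_index = -1
    for t_res, q_res in zip(template_aln, query_aln):
        if t_res == "-":
            continue  # gap in the template: no template residue in this column
        template_index += 1
        if template_index in contacts and t_res == q_res:
            conserved += 1
    return conserved / len(contacts)


# Hypothetical alignment; contacts refer to K (1), V (3) and D (4) of "MKLVDE".
score = contact_conservation("MK-LVDE", "MKALIDE", [1, 3, 4])
print(round(score, 3))
```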

Identification of structural patterns

This method builds a library of known protein–protein interfaces from the PDB, where an interface is defined as a pair of polypeptide fragments separated by less than a distance threshold slightly larger than the van der Waals radii of the atoms involved. The sequences in the library are then clustered based on structural alignment, and redundant sequences are eliminated. Residues that occur at a given position with a high frequency (generally >50%) are considered hotspots. This library is then used to identify potential interactions between pairs of targets, provided that they have a known structure (i.e. are present in the PDB).
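
The hotspot criterion can be illustrated with a short sketch. It assumes the interface fragments of one cluster have already been structurally aligned to equal length; the fragment strings are invented for the example, and the 50% cutoff follows the figure quoted above.

```python
from collections import Counter


def find_hotspots(aligned_fragments, min_frequency=0.5):
    """Return {position: residue} where one residue exceeds min_frequency.

    aligned_fragments: equal-length strings from one structural cluster,
    with '-' marking alignment gaps.
    """
    if not aligned_fragments:
        return {}
    length = len(aligned_fragments[0])
    if any(len(frag) != length for frag in aligned_fragments):
        raise ValueError("fragments must be aligned to the same length")
    hotspots = {}
    for pos in range(length):
        counts = Counter(frag[pos] for frag in aligned_fragments if frag[pos] != "-")
        if not counts:
            continue  # column contains only gaps
        residue, count = counts.most_common(1)[0]
        if count / len(aligned_fragments) > min_frequency:
            hotspots[pos] = residue
    return hotspots


# Hypothetical cluster of four aligned interface fragments.
cluster = ["WLYA", "WKYG", "WLFA", "WRYS"]
print(find_hotspots(cluster))
```

In this example positions 0 and 2 qualify, while positions 1 and 3 do not because their most common residue reaches exactly 50% rather than exceeding it.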

Bayesian network modelling

Bayesian methods integrate data from a wide variety of sources, including both experimental results and prior computational predictions, and use these features to assess the likelihood that a particular potential protein interaction is a true positive result. These methods are useful because experimental procedures, particularly the yeast two-hybrid experiments, are extremely noisy and produce many false positives, while the previously mentioned computational methods can only provide circumstantial evidence that a particular pair of proteins might interact.
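
One simple way to combine evidence in this spirit is a naive Bayes calculation on odds, sketched below. This is an illustration rather than the model used by any particular study: it assumes the evidence sources are conditionally independent, and the prior and likelihood ratios in the example are made-up numbers.

```python
def posterior_probability(prior, likelihood_ratios):
    """Combine independent evidence into a posterior interaction probability.

    prior: prior probability that a randomly chosen protein pair interacts.
    likelihood_ratios: for each piece of evidence, the ratio
        P(evidence | interacting) / P(evidence | not interacting).
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    odds = prior / (1.0 - prior)
    for ratio in likelihood_ratios:
        if ratio < 0:
            raise ValueError("likelihood ratios cannot be negative")
        odds *= ratio  # naive Bayes: multiply the odds by each ratio in turn
    return odds / (1.0 + odds)


# Hypothetical pair: weak prior, supported by three evidence sources.
print(posterior_probability(0.001, [20.0, 5.0, 10.0]))
```

The example shows why integration helps: no single source is decisive on its own, but together they can raise a very low prior to a posterior near one half.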

3D template-based protein complex modelling

This method makes use of known protein complex structures to predict, as well as structurally model, interactions between query protein sequences. The prediction process generally starts by employing a sequence-based method (e.g. Interolog) to search for protein complex structures that are homologous to the query sequences. These known complex structures are then used as templates to structurally model the interaction between the query sequences. This method has the advantage of not only inferring protein interactions but also suggesting models of how the proteins interact structurally, which can provide some insight into the atomic-level mechanism of the interaction. On the other hand, the ability of this method to make a prediction is limited by the relatively small number of known protein complex structures.

Supervised learning problem

The problem of PPI prediction can be framed as a supervised learning problem. In this paradigm the known protein interactions supervise the estimation of a function that can predict whether an interaction exists or not between two proteins given data about the proteins (e.g., expression levels of each gene in different experimental conditions, location information, phylogenetic profile, etc.).
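
A minimal sketch of this framing is given below, using logistic regression trained by batch gradient descent. The choice of classifier, the three features (expression correlation, phylogenetic profile similarity, shared cellular location) and all training values are assumptions made for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(features, labels, learning_rate=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for k in range(n_features):
                grad_w[k] += error * x[k]
            grad_b += error
        weights = [w - learning_rate * g / len(features) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(features)
    return weights, bias


def predict_probability(weights, bias, x):
    """Estimated probability that the pair described by x interacts."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Hypothetical pairs: [expression correlation, profile similarity, same location].
train_x = [
    [0.9, 0.8, 1.0], [0.8, 0.9, 1.0], [0.7, 0.7, 1.0],  # known interactions
    [0.1, 0.2, 0.0], [0.2, 0.1, 0.0], [0.3, 0.2, 1.0],  # assumed non-interactions
]
train_y = [1, 1, 1, 0, 0, 0]
model_w, model_b = train_logistic(train_x, train_y)
print(predict_probability(model_w, model_b, [0.85, 0.85, 1.0]))
```

In practice the choice of negative examples matters a great deal, since confirmed non-interacting pairs are rarely available and are usually approximated.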

Relationship to docking methods

The field of protein–protein interaction prediction is closely related to the field of protein–protein docking, which attempts to use geometric and steric considerations to fit two proteins of known structure into a bound complex. Docking is a useful mode of inquiry when both proteins in the pair have known structures and are known (or at least strongly suspected) to interact. Because so many proteins lack experimentally determined structures, however, sequence-based interaction prediction methods are especially useful in conjunction with experimental studies of an organism's interactome.

See also

  • Interactome
  • Protein–protein interaction
  • Macromolecular docking
  • Protein–DNA interaction site predictor
  • Two-hybrid screening
  • Protein structure prediction software
  • Overview of protein interaction databases

Dynamics Method

Simple brute-force approach:

The Dynamics Method performs protein–protein interaction prediction (PPIP) using the same rules as the real system: it simulates the dynamics of every force on every atom in two proteins of interest in order to predict first folding and then interaction. It then does the same for every potential protein pair in the genome.
Advantages and disadvantages
  • Hypothetically accurate.
  • Infeasible in practice because of the massive computational requirements.

Folding and Docking

The unworkable Dynamics Method can be broken up into two smaller sub-problems, Folding and Docking, to avoid or minimise the computation of dynamics:

The most effective folding prediction methods predict folded protein structures in a reasonable amount of computational time by using statistical substitution followed by tweaking. Statistical substitution folds a small number of amino acid residues at a time by using the statistically dominant configuration previously observed for them. Tweaking is similar to heating the structure, in that it introduces small random changes and selects those that have the lowest energy states.
Advantages and disadvantages
  • Reasonable results for individual predictions.
  • Accuracy improves as more folding conformations are verified.
  • Too slow to run on a genome-wide scale.
  • Not accurate with atypical structures.
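
The tweaking step described above can be illustrated with a toy search. The sketch below is not a folding program: the conformation is reduced to a list of angles, the energy function is a hypothetical stand-in with a known minimum, and the statistical substitution step is not modelled at all.

```python
import math
import random


def toy_energy(angles, preferred):
    """Hypothetical energy: zero when every angle matches its preferred value."""
    return sum(1.0 - math.cos(a - p) for a, p in zip(angles, preferred))


def tweak(angles, preferred, steps=5000, max_shift=0.1, seed=0):
    """Apply small random changes, keeping only those that lower the energy."""
    rng = random.Random(seed)
    current = list(angles)
    current_energy = toy_energy(current, preferred)
    for _ in range(steps):
        candidate = list(current)
        index = rng.randrange(len(candidate))
        candidate[index] += rng.uniform(-max_shift, max_shift)
        candidate_energy = toy_energy(candidate, preferred)
        if candidate_energy < current_energy:
            current, current_energy = candidate, candidate_energy
    return current, current_energy


# Hypothetical starting conformation and preferred angles (radians).
start = [0.0, 0.0, 0.0]
target = [1.0, -0.5, 2.0]
final_angles, final_energy = tweak(start, target)
print(toy_energy(start, target), final_energy)
```

Because only energy-lowering changes are accepted, this greedy variant can become trapped in local minima on a realistic energy surface, which is one reason such methods struggle with atypical structures.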

Once protein folding has been successfully modeled, protein docking is the next logical step. To simplify the dynamics of docking, binary docking methods find potentially active sites on a single folded protein structure and match them to active sites on a second protein using pattern recognition software or geometric hashing algorithms. Conserved domains are observed [52] and used to imply potential binding partners because surface complementarity between interacting protein sites is high.
Advantages and disadvantages
  • Dockings of multiple proteins are also being predicted accurately.
  • Relies on folding information that is not available for much of the genome.
  • Too slow for a genome-wide tool.
  • Low reliability.

Sequence Method

The Sequence Method is an attempt to avoid the modelling of folding and docking altogether by using direct pattern recognition of the binding sequences.
Advantages and disadvantages
  • Fast enough to be used as a genome-wide tool.
  • Oversimplification is possible.

Graph Learning Method

The Graph Learning Method addresses the shortcomings of the Sequence Method by having a computer learn which attributes are important for PPIP from patterns in observed interactions. It then uses these attribute patterns for PPIP.
Advantages and disadvantages
  • Fast genome-wide tool.
  • Good reliability.

Vector Learning Method

The Vector Learning Method is an alternative to the Graph Learning Method and competes with it for the title of most efficient method; the two machine learning methods probably have equal potential. A training set is mapped to an n-dimensional space in which successful combinations of residues are represented as points. Each piece of the pattern, or residue attribute, is mapped to a separate dimension, a process called "vectorization". Unlike ordinary two-dimensional (latitude and longitude) city maps, protein pattern maps are most effective when they use more than 20 dimensions. If a potential protein pair lies within the region identified as successful, an interaction is predicted.
Advantages and disadvantages
  • Fast genome-wide tool.
  • Good reliability.
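
The vectorization idea can be sketched as follows. Everything here is an illustrative assumption: each protein is reduced to its 20 amino-acid frequencies, a pair becomes a 40-dimensional point, and a simple nearest-centroid rule stands in for whatever decision region a real learning method would fit. The sequences are invented.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence):
    """20-dimensional vector of amino-acid frequencies for one protein."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    return [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]


def pair_vector(seq_a, seq_b):
    """Concatenate both composition vectors into one 40-dimensional point."""
    return composition(seq_a) + composition(seq_b)


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def distance(u, v):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def predict_interaction(candidate, positives, negatives):
    """Predict True when the candidate is closer to the interacting centroid."""
    return distance(candidate, centroid(positives)) < distance(candidate, centroid(negatives))


# Hypothetical training pairs and one candidate pair.
positives = [pair_vector("KKKKRR", "DDDDEE"), pair_vector("KRKRKK", "EDEDDD")]
negatives = [pair_vector("KKKKRR", "KKRRKK"), pair_vector("DDEEDD", "EDEDDD")]
candidate = pair_vector("RRKKRK", "EEDDEE")
print(predict_interaction(candidate, positives, negatives))
```

One design point worth noting is that concatenation makes the vector order-dependent, so the pair (A, B) and the pair (B, A) map to different points unless the training set includes both orderings.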

Evolutionary Method

Because a large amount of work has been done on interactomes, the Evolutionary Method is becoming a practical shortcut. It uses data from PPIPs or experimentally verified interaction maps to infer protein interactions in evolutionarily related organisms.
Advantages and disadvantages
  • Relies on interactomes.
  • Fast genome-wide tool.
  • Good reliability.
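
Transferring interactions between related organisms can be sketched as a simple mapping through orthologs. The protein identifiers and the ortholog table below are hypothetical, and the sketch assumes that ortholog assignments have already been made by some other means.

```python
def transfer_interactions(known_interactions, orthologs):
    """Map known interactions in a source species onto a target species.

    known_interactions: iterable of (protein_a, protein_b) in the source species.
    orthologs: dict mapping each source protein to a list of target orthologs.
    Returns a set of sorted (target_a, target_b) tuples.
    """
    predicted = set()
    for prot_a, prot_b in known_interactions:
        for target_a in orthologs.get(prot_a, []):
            for target_b in orthologs.get(prot_b, []):
                predicted.add(tuple(sorted((target_a, target_b))))
    return predicted


# Hypothetical source interactions and ortholog assignments.
known = [("yA", "yB"), ("yB", "yC")]
ortholog_table = {"yA": ["hA"], "yB": ["hB1", "hB2"]}
print(sorted(transfer_interactions(known, ortholog_table)))
```

Interactions whose partners lack a known ortholog, such as the second pair in this example, yield no prediction, which illustrates the method's dependence on existing interactome and orthology data.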

Validation

Predictions must be validated experimentally. However, all experimental methods are costly and carry unavoidable error, producing both false negatives (FN) and false positives (FP). Choosing and understanding superior methods of verification is therefore very important.

Significant results

Many new drugs and biological insights are developed by starting with PPIP before moving on to experimental methods, potentially saving time and millions of dollars in the process.

PPIP produces results that need biological verification and further exploration before they can contribute to new drugs or a better understanding of disease. The results are used heavily as a starting point for biological research where most of the metabolic pathway of interest is unknown.

Interpreting the results of PPIP can be problematic because of the volume of data generated; the data is therefore often organised in a hierarchical manner, or as an interactome. Two approaches work best. The first is to display only one or two interaction links of the hierarchy at a time. The second is to assign the highly interactive (hub) proteins as the roots of the interaction trees, creating groupings of functionally and spatially related proteins.
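
Both display strategies can be sketched on a small network: picking the most connected proteins as roots, and listing only the proteins within one or two links of a root. The interaction list is hypothetical, and the network is treated as undirected for simplicity.

```python
from collections import deque


def build_network(interactions):
    """Undirected adjacency map from a list of (protein_a, protein_b) pairs."""
    network = {}
    for a, b in interactions:
        network.setdefault(a, set()).add(b)
        network.setdefault(b, set()).add(a)
    return network


def hubs(network, top_n=1):
    """Proteins with the most interaction partners (ties broken by name)."""
    return sorted(network, key=lambda p: (-len(network[p]), p))[:top_n]


def neighbourhood(network, root, max_depth=2):
    """Breadth-first search returning {protein: depth} within max_depth links."""
    depths = {root: 0}
    queue = deque([root])
    while queue:
        protein = queue.popleft()
        if depths[protein] == max_depth:
            continue  # do not expand beyond the requested depth
        for partner in sorted(network.get(protein, ())):
            if partner not in depths:
                depths[partner] = depths[protein] + 1
                queue.append(partner)
    return depths


# Hypothetical predicted interactions.
pairs = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P4", "P5"), ("P5", "P6")]
net = build_network(pairs)
root = hubs(net)[0]
print(root, neighbourhood(net, root, max_depth=2))
```

Here P1 is chosen as the root because it has the most partners, and P6 is left out of the view because it lies three links away.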

The main goal of proteomics is to predict the structures, interactions and functions of proteins. Specific function is only found through interactions, so the prediction of protein–protein interactions is of vital interest in proteomics.

See also

  • Support vector machine
  • Parallel computation
  • Proteomics
    • Molecular docking
    • Protein interactions
      • Protein-protein docking
  • Structural bioinformatics
    • Protein structure
      • Structural domain
    • Protein folding
    • Threading (protein sequence)
  • Computational genomics
  • Biological database
  • WikiProject Molecular and Cellular Biology

The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 