Bioinformatics
Encyclopedia
Bioinformatics is the application of computer science
and information technology
to the field of biology
and medicine
. Bioinformatics deals with algorithms, databases and information systems, web technologies, artificial intelligence and soft computing, information and computation theory, software engineering, data mining, image processing, modeling and simulation, signal processing, discrete mathematics, control and system theory, circuit theory, and statistics, for generating new knowledge of biology and medicine, and improving & discovering new models of computation (e.g. DNA computing, neural computing, evolutionary computing, immuno-computing, swarm-computing, cellular-computing).
Commonly used software tools and technologies in this field include Java
, XML
, Perl
, C
, C++
, Python
, R
, SQL
, CUDA
and MATLAB
.
In order to study how normal cellular activities are altered in different disease states, the biological data must be combined to form a comprehensive picture of these activities. Therefore, the field of bioinformatics has evolved such that the most pressing task now involves the analysis and interpretation of various types of data, including nucleotide and amino acid sequences, protein domains, and protein structures. The actual process of analyzing and interpreting data is referred to as computational biology. Important sub-disciplines within bioinformatics and computational biology include:
The primary goal of bioinformatics is to increase the understanding of biological processes. What sets it apart from other approaches, however, is its focus on developing and applying computationally intensive techniques (e.g., pattern recognition
, data mining
, machine learning
algorithms, and visualization
) to achieve this goal. Major research efforts in the field include sequence alignment
, gene finding, genome assembly, drug design
, drug discovery
, protein structure alignment, protein structure prediction
, prediction of gene expression
and protein–protein interactions, genome-wide association studies and the modeling of evolution
.
Interestingly, the term bioinformatics was coined before the "genomic revolution". Paulien Hogeweg
and Ben Hesper introduced the term in 1978 to refer to "the study of information processes in biotic systems"., a definition that placed bioinformatics as a field parallel to biophysics or biochemistry (e.g. biochemistry is the study of chemical processes in biological systems) . However, its primary use since at least the late 1980s has been to describe the application of computer science and information sciences to the analysis of biological data, particularly in those areas of genomics involving large-scale DNA sequencing
.
Bioinformatics now entails the creation and advancement of databases, algorithms, computational and statistical techniques and theory to solve formal and practical problems arising from the management and analysis of biological data.
Over the past few decades rapid developments in genomic and other molecular research technologies and developments in information technologies have combined to produce a tremendous amount of information related to molecular biology. It is the name given to these mathematical and computing approaches used to glean understanding of biological processes.
Common activities in bioinformatics include mapping and analyzing DNA
and protein sequences, aligning different DNA and protein sequences to compare them and creating and viewing 3-D models of protein structures.
There are two fundamental ways of modelling a Biological system (e.g. living cell) both coming under Bioinformatic approaches.
A broad sub-category under bioinformatics is structural bioinformatics
.
in 1977, the DNA sequence
s of thousands of organisms have been decoded and stored in databases. This sequence information is analyzed to determine genes that encode polypeptides (proteins), RNA genes, regulatory sequences, structural motifs, and repetitive sequences. A comparison of genes within a species
or between different species can show similarities between protein functions, or relations between species (the use of molecular systematics to construct phylogenetic tree
s). With the growing amount of data, it long ago became impractical to analyze DNA sequences manually. Today, computer program
s such as BLAST
are used daily to search sequences from more than 260 000 organisms, containing over 190 billion nucleotides. These programs can compensate for mutations (exchanged, deleted or inserted bases) in the DNA sequence, to identify sequences that are related, but not identical. A variant of this sequence alignment
is used in the sequencing process itself. The so-called shotgun sequencing
technique (which was used, for example, by The Institute for Genomic Research
to sequence the first bacterial genome, Haemophilus influenzae
) does not produce entire chromosomes, but instead generates the sequences of many thousands of small DNA fragments (ranging from 35 to 900 nucleotides long, depending on the sequencing technology). The ends of these fragments overlap and, when aligned properly by a genome assembly program, can be used to reconstruct the complete genome. Shotgun sequencing yields sequence data quickly, but the task of assembling the fragments can be quite complicated for larger genomes. For a genome as large as the human genome
, it may take many days of CPU time on large-memory, multiprocessor computers to assemble the fragments, and the resulting assembly will usually contain numerous gaps that have to be filled in later. Shotgun sequencing is the method of choice for virtually all genomes sequenced today, and genome assembly algorithms are a critical area of bioinformatics research.
Another aspect of bioinformatics in sequence analysis is annotation, which involves computational gene finding to search for protein-coding genes, RNA genes, and other functional sequences within a genome. Not all of the nucleotides within a genome are part of genes. Within the genome of higher organisms, large parts of the DNA do not serve any obvious purpose. This so-called junk DNA may, however, contain unrecognized functional elements. Bioinformatics helps to bridge the gap between genome and proteome
projects — for example, in the use of DNA sequences for protein identification.
, annotation is the process of marking the genes and other biological features in a DNA sequence. The first genome annotation software system was designed in 1995 by Dr. Owen White, who was part of the team at The Institute for Genomic Research
that sequenced and analyzed the first genome of a free-living organism to be decoded, the bacterium Haemophilus influenzae
. Dr. White built a software system to find the genes (places in the DNA sequence that encode a protein), the transfer RNA, and other features, and to make initial assignments of function to those genes. Most current genome annotation systems work similarly, but the programs available for analysis of genomic DNA are constantly changing and improving.
, as well as their change over time. Informatics
has assisted evolutionary biologists in several key ways; it has enabled researchers to:
Future work endeavours to reconstruct the now more complex tree of life.
The area of research within computer science
that uses genetic algorithm
s is sometimes confused with computational evolutionary biology, but the two areas are not necessarily related.
The area of research draws from statistics
and computational linguistics
.
of many genes can be determined by measuring mRNA
levels with multiple techniques including microarrays
, expressed cDNA sequence tag
(EST) sequencing, serial analysis of gene expression
(SAGE) tag sequencing, massively parallel signature sequencing
(MPSS), RNA-Seq
, also known as "Whole Transcriptome Shotgun Sequencing" (WTSS), or various applications of multiplexed in-situ hybridization. All of these techniques are extremely noise-prone and/or subject to bias in the biological measurement, and a major research area in computational biology involves developing statistical tools to separate signal from noise
in high-throughput gene expression studies. Such studies are often used to determine the genes implicated in a disorder: one might compare microarray data from cancerous epithelial cells to data from non-cancerous cells to determine the transcripts that are up-regulated and down-regulated in a particular population of cancer cells.
and leading to an increase or decrease in the activity of one or more protein
s. Bioinformatics techniques have been applied to explore various steps in this process. For example, promoter analysis involves the identification and study of sequence motif
s in the DNA surrounding the coding region of a gene. These motifs influence the extent to which that region is transcribed into mRNA. Expression data can be used to infer gene regulation: one might compare microarray
data from a wide variety of states of an organism to form hypotheses about the genes involved in each state. In a single-cell organism, one might compare stages of the cell cycle
, along with various stress conditions (heat shock, starvation, etc.). One can then apply clustering algorithms to that expression data to determine which genes are co-expressed. For example, the upstream regions (promoters) of co-expressed genes can be searched for over-represented regulatory elements.
s and high throughput (HT) mass spectrometry
(MS) can provide a snapshot of the proteins present in a biological sample. Bioinformatics is very much involved in making sense of protein microarray and HT MS data; the former approach faces similar problems as with microarrays targeted at mRNA, the latter involves the problem of matching large amounts of mass data against predicted masses from protein sequence databases, and the complicated statistical analysis of samples where multiple, but incomplete peptides from each protein are detected.
s in a variety of gene
s in cancer
. Bioinformaticians continue to produce specialized automated systems to manage the sheer volume of sequence data produced, and they create new algorithms and software to compare the sequencing results to the growing collection of human genome
sequences and germline
polymorphisms. New physical detection technologies are employed, such as oligonucleotide
microarrays to identify chromosomal gains and losses (called comparative genomic hybridization
), and single-nucleotide polymorphism arrays to detect known point mutations. These detection methods simultaneously measure several hundred thousand sites throughout the genome, and when used in high-throughput to measure thousands of samples, generate terabyte
s of data per experiment. Again the massive amounts and new types of data generate new opportunities for bioinformaticians. The data is often found to contain considerable variability, or noise
, and thus Hidden Markov model
and change-point analysis methods are being developed to infer real copy number changes.
Another type of data that requires novel informatics development is the analysis of lesion
s found to be recurrent among many tumors.
(orthology analysis) or other genomic features in different organisms. It is these intergenomic maps that make it possible to trace the evolutionary processes responsible for the divergence of two genomes. A multitude of evolutionary events acting at various organizational levels shape genome evolution. At the lowest level, point mutations affect individual nucleotides. At a higher level, large chromosomal segments undergo duplication, lateral transfer, inversion, transposition, deletion and insertion. Ultimately, whole genomes are involved in processes of hybridization, polyploidization and endosymbiosis, often leading to rapid speciation. The complexity of genome evolution poses many exciting challenges to developers of mathematical models and algorithms, who have recourse to a spectra of algorithmic, statistical and mathematical techniques, ranging from exact, heuristics, fixed parameter and approximation algorithms for problems based on parsimony models to Markov Chain Monte Carlo
algorithms for Bayesian analysis of problems based on probabilistic models.
Many of these studies are based on the homology detection and protein families computation.
s of cellular
subsystems (such as the networks of metabolites
and enzyme
s which comprise metabolism
, signal transduction
pathways and gene regulatory network
s) to both analyze and visualize the complex connections of these cellular processes. Artificial life
or virtual evolution attempts to understand evolutionary processes via the computer simulation of simple (artificial) life forms.
, or speed. A fully developed analysis system may completely replace the observer. Although these systems are not unique to biomedical imagery, biomedical imaging is becoming more important for both diagnostics and research. Some examples are:
sequence of a protein, the so-called primary structure
, can be easily determined from the sequence on the gene that codes for it. In the vast majority of cases, this primary structure uniquely determines a structure in its native environment. (Of course, there are exceptions, such as the bovine spongiform encephalopathy
– a.k.a. Mad Cow Disease – prion
.) Knowledge of this structure is vital in understanding the function of the protein. For lack of better terms, structural information is usually classified as one of secondary
, tertiary
and quaternary
structure. A viable general solution to such predictions remains an open problem. As of now, most efforts have been directed towards heuristics that work most of the time.
One of the key ideas in bioinformatics is the notion of homology
. In the genomic branch of bioinformatics, homology is used to predict the function of a gene: if the sequence of gene A, whose function is known, is homologous to the sequence of gene B, whose function is unknown, one could infer that B may share A's function. In the structural branch of bioinformatics, homology is used to determine which parts of a protein are important in structure formation and interaction with other proteins. In a technique called homology modeling, this information is used to predict the structure of a protein once the structure of a homologous protein is known. This currently remains the only way to predict protein structures reliably.
One example of this is the similar protein homology between hemoglobin in humans and the hemoglobin in legumes (leghemoglobin
). Both serve the same purpose of transporting oxygen in the organism. Though both of these proteins have completely different amino acid sequences, their protein structures are virtually identical, which reflects their near identical purposes.
Other techniques for predicting protein structure include protein threading and de novo (from scratch) physics-based modeling.
See also: structural motif
and structural domain.
Molecular dynamic simulation of movement of atoms about rotatable bonds is the fundamental principle behind computational algorithms, termed docking algorithms for studying molecular interactions.
See also: protein–protein interaction prediction.
In the last two decades, tens of thousands of protein three-dimensional structures have been determined by X-ray crystallography
and Protein nuclear magnetic resonance spectroscopy
(protein NMR). One central question for the biological scientist is whether it is practical to predict possible protein–protein interactions only based on these 3D shapes, without doing protein–protein interaction experiments. A variety of methods have been developed to tackle the Protein–protein docking problem, though it seems that there is still much work to be done in this field.
tools have existed and continued to grow since the 1980s. The combination of a continued need for new algorithm
s for the analysis of emerging types of biological readouts, the potential for innovative in silico
experiments, and freely available open code bases have helped to create opportunities for all research groups to contribute to both bioinformatics and the range of open source software available, regardless of their funding arrangements. The open source tools often act as incubators of ideas, or community-supported plug-ins in commercial applications. They may also provide de facto
standards and shared object models for assisting with the challenge of bioinformation integration.
The range of open source software packages includes titles such as Bioconductor
, BioPerl
, BioJava
, BioRuby
, Bioclipse
, EMBOSS
, Taverna workbench
, and UGENE
. In order to maintain this tradition and create further opportunities, the non-profit Open Bioinformatics Foundation
have supported the annual Bioinformatics Open Source Conference (BOSC) since 2000.
and REST
-based interfaces have been developed for a wide variety of bioinformatics applications allowing an application running on one computer in one part of the world to use algorithms, data and computing resources on servers in other parts of the world. The main advantages derive from the fact that end users do not have to deal with software and database maintenance overheads.
Basic bioinformatics services are classified by the EBI
into three categories: SSS
(Sequence Search Services), MSA
(Multiple Sequence Alignment) and BSA (Biological Sequence Analysis). The availability of these service-oriented
bioinformatics resources demonstrate the applicability of web based bioinformatics solutions, and range from a collection of standalone tools with a common data format under a single, standalone or web-based interface, to integrative, distributed and extensible bioinformatics workflow management systems
.
Computer science
Computer science or computing science is the study of the theoretical foundations of information and computation and of practical techniques for their implementation and application in computer systems...
and information technology
Information technology
Information technology is the acquisition, processing, storage and dissemination of vocal, pictorial, textual and numerical information by a microelectronics-based combination of computing and telecommunications...
to the field of biology
Biology
Biology is a natural science concerned with the study of life and living organisms, including their structure, function, growth, origin, evolution, distribution, and taxonomy. Biology is a vast subject containing many subdivisions, topics, and disciplines...
and medicine
Medicine
Medicine is the science and art of healing. It encompasses a variety of health care practices evolved to maintain and restore health by the prevention and treatment of illness....
. Bioinformatics deals with algorithms, databases and information systems, web technologies, artificial intelligence and soft computing, information and computation theory, software engineering, data mining, image processing, modeling and simulation, signal processing, discrete mathematics, control and system theory, circuit theory, and statistics, for generating new knowledge of biology and medicine, and improving & discovering new models of computation (e.g. DNA computing, neural computing, evolutionary computing, immuno-computing, swarm-computing, cellular-computing).
Commonly used software tools and technologies in this field include Java
Java (programming language)
Java is a programming language originally developed by James Gosling at Sun Microsystems and released in 1995 as a core component of Sun Microsystems' Java platform. The language derives much of its syntax from C and C++ but has a simpler object model and fewer low-level facilities...
, XML
XML
Extensible Markup Language is a set of rules for encoding documents in machine-readable form. It is defined in the XML 1.0 Specification produced by the W3C, and several other related specifications, all gratis open standards....
, Perl
Perl
Perl is a high-level, general-purpose, interpreted, dynamic programming language. Perl was originally developed by Larry Wall in 1987 as a general-purpose Unix scripting language to make report processing easier. Since then, it has undergone many changes and revisions and become widely popular...
, C
C (programming language)
C is a general-purpose computer programming language developed between 1969 and 1973 by Dennis Ritchie at the Bell Telephone Laboratories for use with the Unix operating system....
, C++
C++
C++ is a statically typed, free-form, multi-paradigm, compiled, general-purpose programming language. It is regarded as an intermediate-level language, as it comprises a combination of both high-level and low-level language features. It was developed by Bjarne Stroustrup starting in 1979 at Bell...
, Python
Python (programming language)
Python is a general-purpose, high-level programming language whose design philosophy emphasizes code readability. Python claims to "[combine] remarkable power with very clear syntax", and its standard library is large and comprehensive...
, R
R (programming language)
R is a programming language and software environment for statistical computing and graphics. The R language is widely used among statisticians for developing statistical software, and R is widely used for statistical software development and data analysis....
, SQL
SQL
SQL is a programming language designed for managing data in relational database management systems ....
, CUDA
CUDA
CUDA or Compute Unified Device Architecture is a parallel computing architecture developed by Nvidia. CUDA is the computing engine in Nvidia graphics processing units that is accessible to software developers through variants of industry standard programming languages...
and MATLAB
MATLAB
MATLAB is a numerical computing environment and fourth-generation programming language. Developed by MathWorks, MATLAB allows matrix manipulations, plotting of functions and data, implementation of algorithms, creation of user interfaces, and interfacing with programs written in other languages,...
.
Introduction
Bioinformatics was applied in the creation and maintenance of a database to store biological information at the beginning of the "genomic revolution", such as Nucleotide sequences and Amino acid sequences. Development of this type of database involved not only design issues but the development of complex interfaces whereby researchers could both access existing data as well as submit new or revised data.In order to study how normal cellular activities are altered in different disease states, the biological data must be combined to form a comprehensive picture of these activities. Therefore, the field of bioinformatics has evolved such that the most pressing task now involves the analysis and interpretation of various types of data, including nucleotide and amino acid sequences, protein domains, and protein structures. The actual process of analyzing and interpreting data is referred to as computational biology. Important sub-disciplines within bioinformatics and computational biology include:
- the development and implementation of tools that enable efficient access to, and use and management of, various types of information.
- the development of new algorithms (mathematical formulas) and statistics with which to assess relationships among members of large data sets, such as methods to locate a gene within a sequence, predict protein structure and/or function, and cluster protein sequences into families of related sequences.
The primary goal of bioinformatics is to increase the understanding of biological processes. What sets it apart from other approaches, however, is its focus on developing and applying computationally intensive techniques (e.g., pattern recognition
Pattern recognition
In machine learning, pattern recognition is the assignment of some sort of output value to a given input value , according to some specific algorithm. An example of pattern recognition is classification, which attempts to assign each input value to one of a given set of classes...
, data mining
Data mining
Data mining , a relatively young and interdisciplinary field of computer science is the process of discovering new patterns from large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics and database systems...
, machine learning
Machine learning
Machine learning, a branch of artificial intelligence, is a scientific discipline concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data, such as from sensor data or databases...
algorithms, and visualization
Biological data visualization
Biology Data Visualization is a branch of bioinformatics concerned with the application of computer graphics, scientific visualization, and information visualization to different areas of the life sciences. This includes visualization of sequences, genomes, alignments, phylogenies, macromolecular...
) to achieve this goal. Major research efforts in the field include sequence alignment
Sequence alignment
In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are...
, gene finding, genome assembly, drug design
Drug design
Drug design, also sometimes referred to as rational drug design or structure-based drug design, is the inventive process of finding new medications based on the knowledge of the biological target...
, drug discovery
Drug discovery
In the fields of medicine, biotechnology and pharmacology, drug discovery is the process by which drugs are discovered or designed.In the past most drugs have been discovered either by identifying the active ingredient from traditional remedies or by serendipitous discovery...
, protein structure alignment, protein structure prediction
Protein structure prediction
Protein structure prediction is the prediction of the three-dimensional structure of a protein from its amino acid sequence — that is, the prediction of its secondary, tertiary, and quaternary structure from its primary structure. Structure prediction is fundamentally different from the inverse...
, prediction of gene expression
Gene expression
Gene expression is the process by which information from a gene is used in the synthesis of a functional gene product. These products are often proteins, but in non-protein coding genes such as ribosomal RNA , transfer RNA or small nuclear RNA genes, the product is a functional RNA...
and protein–protein interactions, genome-wide association studies and the modeling of evolution
Evolution
Evolution is any change across successive generations in the heritable characteristics of biological populations. Evolutionary processes give rise to diversity at every level of biological organisation, including species, individual organisms and molecules such as DNA and proteins.Life on Earth...
.
Interestingly, the term bioinformatics was coined before the "genomic revolution". Paulien Hogeweg
Paulien Hogeweg
Paulien Hogeweg is a Dutch theoretical biologist and complex systems researcher studying biological systems as dynamic information processing systems at many interconnected levels. Together with Ben Hesper she coined the term Bioinformatics in 1978 as the study of informatic processes in biotic...
and Ben Hesper introduced the term in 1978 to refer to "the study of information processes in biotic systems"., a definition that placed bioinformatics as a field parallel to biophysics or biochemistry (e.g. biochemistry is the study of chemical processes in biological systems) . However, its primary use since at least the late 1980s has been to describe the application of computer science and information sciences to the analysis of biological data, particularly in those areas of genomics involving large-scale DNA sequencing
DNA sequencing
DNA sequencing includes several methods and technologies that are used for determining the order of the nucleotide bases—adenine, guanine, cytosine, and thymine—in a molecule of DNA....
.
Bioinformatics now entails the creation and advancement of databases, algorithms, computational and statistical techniques and theory to solve formal and practical problems arising from the management and analysis of biological data.
Over the past few decades rapid developments in genomic and other molecular research technologies and developments in information technologies have combined to produce a tremendous amount of information related to molecular biology. It is the name given to these mathematical and computing approaches used to glean understanding of biological processes.
Common activities in bioinformatics include mapping and analyzing DNA
DNA
Deoxyribonucleic acid is a nucleic acid that contains the genetic instructions used in the development and functioning of all known living organisms . The DNA segments that carry this genetic information are called genes, but other DNA sequences have structural purposes, or are involved in...
and protein sequences, aligning different DNA and protein sequences to compare them and creating and viewing 3-D models of protein structures.
There are two fundamental ways of modelling a Biological system (e.g. living cell) both coming under Bioinformatic approaches.
- Static
- Sequences – Proteins, Nucleic acids and Peptides
- Structures – Proteins, Nucleic acids, Ligands (including metabolites and drugs) and Peptides
- Interaction data among the above entities including microarray data and Networks of proteins, metabolites
- Dynamic
- Systems Biology comes under this category including reaction fluxes and variable concentrations of metabolites
- Multi-Agent Based modelling approaches capturing cellular events such as signalling, transcription and reaction dynamics
A broad sub-category under bioinformatics is structural bioinformatics
Structural bioinformatics
Structural bioinformatics is the branch of bioinformatics which is related to the analysis and prediction of the three-dimensional structure of biological macromolecules such as proteins, RNA, and DNA...
.
Sequence analysis
Since the Phage Φ-X174 was sequencedSequencing
In genetics and biochemistry, sequencing means to determine the primary structure of an unbranched biopolymer...
in 1977, the DNA sequence
DNA sequence
The sequence or primary structure of a nucleic acid is the composition of atoms that make up the nucleic acid and the chemical bonds that bond those atoms. Because nucleic acids, such as DNA and RNA, are unbranched polymers, this specification is equivalent to specifying the sequence of...
s of thousands of organisms have been decoded and stored in databases. This sequence information is analyzed to determine genes that encode polypeptides (proteins), RNA genes, regulatory sequences, structural motifs, and repetitive sequences. A comparison of genes within a species
Species
In biology, a species is one of the basic units of biological classification and a taxonomic rank. A species is often defined as a group of organisms capable of interbreeding and producing fertile offspring. While in many cases this definition is adequate, more precise or differing measures are...
or between different species can show similarities between protein functions, or relations between species (the use of molecular systematics to construct phylogenetic tree
Phylogenetic tree
A phylogenetic tree or evolutionary tree is a branching diagram or "tree" showing the inferred evolutionary relationships among various biological species or other entities based upon similarities and differences in their physical and/or genetic characteristics...
s). With the growing amount of data, it long ago became impractical to analyze DNA sequences manually. Today, computer program
Computer program
A computer program is a sequence of instructions written to perform a specified task with a computer. A computer requires programs to function, typically executing the program's instructions in a central processor. The program has an executable form that the computer can use directly to execute...
s such as BLAST
BLAST
In bioinformatics, Basic Local Alignment Search Tool, or BLAST, is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences...
are used daily to search sequences from more than 260 000 organisms, containing over 190 billion nucleotides. These programs can compensate for mutations (exchanged, deleted or inserted bases) in the DNA sequence, to identify sequences that are related, but not identical. A variant of this sequence alignment
Sequence alignment
In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are...
is used in the sequencing process itself. The so-called shotgun sequencing
Shotgun sequencing
In genetics, shotgun sequencing, also known as shotgun cloning, is a method used for sequencing long DNA strands. It is named by analogy with the rapidly-expanding, quasi-random firing pattern of a shotgun....
technique (which was used, for example, by The Institute for Genomic Research
The Institute for Genomic Research
The Institute for Genomic Research was a non-profit genomics research institute founded in 1992 by Craig Venter in Rockville, Maryland, United States. It is now a part of the J. Craig Venter Institute.-History:...
to sequence the first bacterial genome, Haemophilus influenzae
Haemophilus influenzae
Haemophilus influenzae, formerly called Pfeiffer's bacillus or Bacillus influenzae, Gram-negative, rod-shaped bacterium first described in 1892 by Richard Pfeiffer during an influenza pandemic. A member of the Pasteurellaceae family, it is generally aerobic, but can grow as a facultative anaerobe. H...
) does not produce entire chromosomes, but instead generates the sequences of many thousands of small DNA fragments (ranging from 35 to 900 nucleotides long, depending on the sequencing technology). The ends of these fragments overlap and, when aligned properly by a genome assembly program, can be used to reconstruct the complete genome. Shotgun sequencing yields sequence data quickly, but the task of assembling the fragments can be quite complicated for larger genomes. For a genome as large as the human genome
Human genome
The human genome is the genome of Homo sapiens, which is stored on 23 chromosome pairs plus the small mitochondrial DNA. 22 of the 23 chromosomes are autosomal chromosome pairs, while the remaining pair is sex-determining...
, it may take many days of CPU time on large-memory, multiprocessor computers to assemble the fragments, and the resulting assembly will usually contain numerous gaps that have to be filled in later. Shotgun sequencing is the method of choice for virtually all genomes sequenced today, and genome assembly algorithms are a critical area of bioinformatics research.
Another aspect of bioinformatics in sequence analysis is annotation, which involves computational gene finding to search for protein-coding genes, RNA genes, and other functional sequences within a genome. Not all of the nucleotides within a genome are part of genes. Within the genome of higher organisms, large parts of the DNA do not serve any obvious purpose. This so-called junk DNA may, however, contain unrecognized functional elements. Bioinformatics helps to bridge the gap between genome and proteome
Proteome
The proteome is the entire set of proteins expressed by a genome, cell, tissue or organism. More specifically, it is the set of expressed proteins in a given type of cells or an organism at a given time under defined conditions. The term is a portmanteau of proteins and genome.The term has been...
projects — for example, in the use of DNA sequences for protein identification.
Genome annotation
In the context of genomicsGenomics
Genomics is a discipline in genetics concerning the study of the genomes of organisms. The field includes intensive efforts to determine the entire DNA sequence of organisms and fine-scale genetic mapping efforts. The field also includes studies of intragenomic phenomena such as heterosis,...
, annotation is the process of marking the genes and other biological features in a DNA sequence. The first genome annotation software system was designed in 1995 by Dr. Owen White, who was part of the team at The Institute for Genomic Research
The Institute for Genomic Research
The Institute for Genomic Research was a non-profit genomics research institute founded in 1992 by Craig Venter in Rockville, Maryland, United States. It is now a part of the J. Craig Venter Institute.-History:...
that sequenced and analyzed the first genome of a free-living organism to be decoded, the bacterium Haemophilus influenzae
Haemophilus influenzae
Haemophilus influenzae, formerly called Pfeiffer's bacillus or Bacillus influenzae, Gram-negative, rod-shaped bacterium first described in 1892 by Richard Pfeiffer during an influenza pandemic. A member of the Pasteurellaceae family, it is generally aerobic, but can grow as a facultative anaerobe. H...
. Dr. White built a software system to find the genes (places in the DNA sequence that encode a protein), the transfer RNA, and other features, and to make initial assignments of function to those genes. Most current genome annotation systems work similarly, but the programs available for analysis of genomic DNA are constantly changing and improving.
Computational evolutionary biology
Evolutionary biology is the study of the origin and descent of speciesSpecies
In biology, a species is one of the basic units of biological classification and a taxonomic rank. A species is often defined as a group of organisms capable of interbreeding and producing fertile offspring. While in many cases this definition is adequate, more precise or differing measures are...
, as well as their change over time. Informatics
Informatics (academic field)
Informatics is the science of information, the practice of information processing, and the engineering of information systems. Informatics studies the structure, algorithms, behavior, and interactions of natural and artificial systems that store, process, access and communicate information...
has assisted evolutionary biologists in several key ways; it has enabled researchers to:
- trace the evolution of a large number of organisms by measuring changes in their DNADNADeoxyribonucleic acid is a nucleic acid that contains the genetic instructions used in the development and functioning of all known living organisms . The DNA segments that carry this genetic information are called genes, but other DNA sequences have structural purposes, or are involved in...
, rather than through physical taxonomy or physiological observations alone, - more recently, compare entire genomes, which permits the study of more complex evolutionary events, such as gene duplicationGene duplicationGene duplication is any duplication of a region of DNA that contains a gene; it may occur as an error in homologous recombination, a retrotransposition event, or duplication of an entire chromosome.The second copy of the gene is often free from selective pressure — that is, mutations of it have no...
, horizontal gene transferHorizontal gene transferHorizontal gene transfer , also lateral gene transfer , is any process in which an organism incorporates genetic material from another organism without being the offspring of that organism...
, and the prediction of factors important in bacterial speciationSpeciationSpeciation is the evolutionary process by which new biological species arise. The biologist Orator F. Cook seems to have been the first to coin the term 'speciation' for the splitting of lineages or 'cladogenesis,' as opposed to 'anagenesis' or 'phyletic evolution' occurring within lineages...
, - build complex computational models of populations to predict the outcome of the system over time
- track and share information on an increasingly large number of species and organisms
Future work endeavours to reconstruct the now more complex tree of life.
The area of research within computer science
Computer science
Computer science or computing science is the study of the theoretical foundations of information and computation and of practical techniques for their implementation and application in computer systems...
that uses genetic algorithm
Genetic algorithm
A genetic algorithm is a search heuristic that mimics the process of natural evolution. This heuristic is routinely used to generate useful solutions to optimization and search problems...
s is sometimes confused with computational evolutionary biology, but the two areas are not necessarily related.
Literature analysis
The growth in the number of published literature makes it virtually impossible to read every paper, resulting in disjointed subfields of research. Literature analysis aims to employ computational and statistical linguistics to mine this growing library of text resources. For example:- abbreviation recognition - identify the long-form and abbreviation of biological terms,
- named entity recognition - recognizing biological terms such as gene names
- protein-protein interaction - identify which proteins interact with which proteins from text
The area of research draws from statistics
Statistics
Statistics is the study of the collection, organization, analysis, and interpretation of data. It deals with all aspects of this, including the planning of data collection in terms of the design of surveys and experiments....
and computational linguistics
Computational linguistics
Computational linguistics is an interdisciplinary field dealing with the statistical or rule-based modeling of natural language from a computational perspective....
.
Analysis of gene expression
The expressionGene expression
Gene expression is the process by which information from a gene is used in the synthesis of a functional gene product. These products are often proteins, but in non-protein coding genes such as ribosomal RNA , transfer RNA or small nuclear RNA genes, the product is a functional RNA...
of many genes can be determined by measuring mRNA
Messenger RNA
Messenger RNA is a molecule of RNA encoding a chemical "blueprint" for a protein product. mRNA is transcribed from a DNA template, and carries coding information to the sites of protein synthesis: the ribosomes. Here, the nucleic acid polymer is translated into a polymer of amino acids: a protein...
levels with multiple techniques including microarrays
DNA microarray
A DNA microarray is a collection of microscopic DNA spots attached to a solid surface. Scientists use DNA microarrays to measure the expression levels of large numbers of genes simultaneously or to genotype multiple regions of a genome...
, expressed cDNA sequence tag
Expressed sequence tag
An expressed sequence tag or EST is a short sub-sequence of a cDNA sequence. They may be used to identify gene transcripts, and are instrumental in gene discovery and gene sequence determination. The identification of ESTs has proceeded rapidly, with approximately 65.9 million ESTs now available in...
(EST) sequencing, serial analysis of gene expression
Serial Analysis of Gene Expression
Serial analysis of gene expression is a technique used by molecular biologists to produce a snapshot of the messenger RNA population in a sample of interest in the form of small tags that correspond to fragments of those transcripts. The original technique was developed by Dr. Victor Velculescu...
(SAGE) tag sequencing, massively parallel signature sequencing
Massively parallel signature sequencing
Massive parallel signature sequencing is a sequenced based approach that can be used to identify and quantify mRNA transcripts present in a sample similar to serial analysis of gene expression but the biochemical manipulation and sequencing approach differ substantially.MPSS allows mRNA...
(MPSS), RNA-Seq
RNA-Seq
RNA-seq, also called "Whole Transcriptome Shotgun Sequencing" and dubbed "a revolutionary tool for transcriptomics", refers to the use of high-throughput sequencing technologies to sequence cDNA in order to get information about a sample's RNA content, a technique that is quickly becoming...
, also known as "Whole Transcriptome Shotgun Sequencing" (WTSS), or various applications of multiplexed in-situ hybridization. All of these techniques are extremely noise-prone and/or subject to bias in the biological measurement, and a major research area in computational biology involves developing statistical tools to separate signal from noise
Noise
In common use, the word noise means any unwanted sound. In both analog and digital electronics, noise is random unwanted perturbation to a wanted signal; it is called noise as a generalisation of the acoustic noise heard when listening to a weak radio transmission with significant electrical noise...
in high-throughput gene expression studies. Such studies are often used to determine the genes implicated in a disorder: one might compare microarray data from cancerous epithelial cells to data from non-cancerous cells to determine the transcripts that are up-regulated and down-regulated in a particular population of cancer cells.
Analysis of regulation
Regulation is the complex orchestration of events starting with an extracellular signal such as a hormoneHormone
A hormone is a chemical released by a cell or a gland in one part of the body that sends out messages that affect cells in other parts of the organism. Only a small amount of hormone is required to alter cell metabolism. In essence, it is a chemical messenger that transports a signal from one...
and leading to an increase or decrease in the activity of one or more protein
Protein
Proteins are biochemical compounds consisting of one or more polypeptides typically folded into a globular or fibrous form, facilitating a biological function. A polypeptide is a single linear polymer chain of amino acids bonded together by peptide bonds between the carboxyl and amino groups of...
s. Bioinformatics techniques have been applied to explore various steps in this process. For example, promoter analysis involves the identification and study of sequence motif
Sequence motif
In genetics, a sequence motif is a nucleotide or amino-acid sequence pattern that is widespread and has, or is conjectured to have, a biological significance...
s in the DNA surrounding the coding region of a gene. These motifs influence the extent to which that region is transcribed into mRNA. Expression data can be used to infer gene regulation: one might compare microarray
Microarray
A microarray is a multiplex lab-on-a-chip. It is a 2D array on a solid substrate that assays large amounts of biological material using high-throughput screening methods.Types of microarrays include:...
data from a wide variety of states of an organism to form hypotheses about the genes involved in each state. In a single-cell organism, one might compare stages of the cell cycle
Cell cycle
The cell cycle, or cell-division cycle, is the series of events that takes place in a cell leading to its division and duplication . In cells without a nucleus , the cell cycle occurs via a process termed binary fission...
, along with various stress conditions (heat shock, starvation, etc.). One can then apply clustering algorithms to that expression data to determine which genes are co-expressed. For example, the upstream regions (promoters) of co-expressed genes can be searched for over-represented regulatory elements.
Analysis of protein expression
Protein microarrayProtein microarray
A protein microarray, sometimes referred to as a protein binding microarray,provides a multiplex approach to identify protein–protein interactions, to identify the substrates of protein kinases, to identify transcription factor protein-activation, or to identify the targets of biologically active...
s and high throughput (HT) mass spectrometry
Mass spectrometry
Mass spectrometry is an analytical technique that measures the mass-to-charge ratio of charged particles.It is used for determining masses of particles, for determining the elemental composition of a sample or molecule, and for elucidating the chemical structures of molecules, such as peptides and...
(MS) can provide a snapshot of the proteins present in a biological sample. Bioinformatics is very much involved in making sense of protein microarray and HT MS data; the former approach faces similar problems as with microarrays targeted at mRNA, the latter involves the problem of matching large amounts of mass data against predicted masses from protein sequence databases, and the complicated statistical analysis of samples where multiple, but incomplete peptides from each protein are detected.
Analysis of mutations in cancer
In cancer, the genomes of affected cells are rearranged in complex or even unpredictable ways. Massive sequencing efforts are used to identify previously unknown point mutationPoint mutation
A point mutation, or single base substitution, is a type of mutation that causes the replacement of a single base nucleotide with another nucleotide of the genetic material, DNA or RNA. Often the term point mutation also includes insertions or deletions of a single base pair...
s in a variety of gene
Gene
A gene is a molecular unit of heredity of a living organism. It is a name given to some stretches of DNA and RNA that code for a type of protein or for an RNA chain that has a function in the organism. Living beings depend on genes, as they specify all proteins and functional RNA chains...
s in cancer
Cancer
Cancer , known medically as a malignant neoplasm, is a large group of different diseases, all involving unregulated cell growth. In cancer, cells divide and grow uncontrollably, forming malignant tumors, and invade nearby parts of the body. The cancer may also spread to more distant parts of the...
. Bioinformaticians continue to produce specialized automated systems to manage the sheer volume of sequence data produced, and they create new algorithms and software to compare the sequencing results to the growing collection of human genome
Human genome
The human genome is the genome of Homo sapiens, which is stored on 23 chromosome pairs plus the small mitochondrial DNA. 22 of the 23 chromosomes are autosomal chromosome pairs, while the remaining pair is sex-determining...
sequences and germline
Germline
In biology and genetics, the germline of a mature or developing individual is the line of germ cells that have genetic material that may be passed to a child.For example, gametes such as the sperm or the egg, are part of the germline...
polymorphisms. New physical detection technologies are employed, such as oligonucleotide
Oligonucleotide
An oligonucleotide is a short nucleic acid polymer, typically with fifty or fewer bases. Although they can be formed by bond cleavage of longer segments, they are now more commonly synthesized, in a sequence-specific manner, from individual nucleoside phosphoramidites...
microarrays to identify chromosomal gains and losses (called comparative genomic hybridization
Comparative genomic hybridization
Comparative genomic hybridization or Chromosomal Microarray Analysis is a molecular-cytogenetic method for the analysis of copy number changes in the DNA content of a given subject's DNA and often in tumor cells....
), and single-nucleotide polymorphism arrays to detect known point mutations. These detection methods simultaneously measure several hundred thousand sites throughout the genome, and when used in high-throughput to measure thousands of samples, generate terabyte
Terabyte
The terabyte is a multiple of the unit byte for digital information. The prefix tera means 1012 in the International System of Units , and therefore 1 terabyte is , or 1 trillion bytes, or 1000 gigabytes. 1 terabyte in binary prefixes is 0.9095 tebibytes, or 931.32 gibibytes...
s of data per experiment. Again the massive amounts and new types of data generate new opportunities for bioinformaticians. The data is often found to contain considerable variability, or noise
Noise
In common use, the word noise means any unwanted sound. In both analog and digital electronics, noise is random unwanted perturbation to a wanted signal; it is called noise as a generalisation of the acoustic noise heard when listening to a weak radio transmission with significant electrical noise...
, and thus Hidden Markov model
Hidden Markov model
A hidden Markov model is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved states. An HMM can be considered as the simplest dynamic Bayesian network. The mathematics behind the HMM was developed by L. E...
and change-point analysis methods are being developed to infer real copy number changes.
Another type of data that requires novel informatics development is the analysis of lesion
Lesion
A lesion is any abnormality in the tissue of an organism , usually caused by disease or trauma. Lesion is derived from the Latin word laesio which means injury.- Types :...
s found to be recurrent among many tumors.
Comparative genomics
The core of comparative genome analysis is the establishment of the correspondence between genesGênes
Gênes is the name of a département of the First French Empire in present Italy, named after the city of Genoa. It was formed in 1805, when Napoleon Bonaparte occupied the Republic of Genoa. Its capital was Genoa, and it was divided in the arrondissements of Genoa, Bobbio, Novi Ligure, Tortona and...
(orthology analysis) or other genomic features in different organisms. It is these intergenomic maps that make it possible to trace the evolutionary processes responsible for the divergence of two genomes. A multitude of evolutionary events acting at various organizational levels shape genome evolution. At the lowest level, point mutations affect individual nucleotides. At a higher level, large chromosomal segments undergo duplication, lateral transfer, inversion, transposition, deletion and insertion. Ultimately, whole genomes are involved in processes of hybridization, polyploidization and endosymbiosis, often leading to rapid speciation. The complexity of genome evolution poses many exciting challenges to developers of mathematical models and algorithms, who have recourse to a spectra of algorithmic, statistical and mathematical techniques, ranging from exact, heuristics, fixed parameter and approximation algorithms for problems based on parsimony models to Markov Chain Monte Carlo
Markov chain Monte Carlo
Markov chain Monte Carlo methods are a class of algorithms for sampling from probability distributions based on constructing a Markov chain that has the desired distribution as its equilibrium distribution. The state of the chain after a large number of steps is then used as a sample of the...
algorithms for Bayesian analysis of problems based on probabilistic models.
Many of these studies are based on the homology detection and protein families computation.
Modeling biological systems
Systems biology involves the use of computer simulationComputer simulation
A computer simulation, a computer model, or a computational model is a computer program, or network of computers, that attempts to simulate an abstract model of a particular system...
s of cellular
Cell (biology)
The cell is the basic structural and functional unit of all known living organisms. It is the smallest unit of life that is classified as a living thing, and is often called the building block of life. The Alberts text discusses how the "cellular building blocks" move to shape developing embryos....
subsystems (such as the networks of metabolites
Metabolic network
A metabolic network is the complete set of metabolic and physical processes that determine the physiological and biochemical properties of a cell...
and enzyme
Enzyme
Enzymes are proteins that catalyze chemical reactions. In enzymatic reactions, the molecules at the beginning of the process, called substrates, are converted into different molecules, called products. Almost all chemical reactions in a biological cell need enzymes in order to occur at rates...
s which comprise metabolism
Metabolism
Metabolism is the set of chemical reactions that happen in the cells of living organisms to sustain life. These processes allow organisms to grow and reproduce, maintain their structures, and respond to their environments. Metabolism is usually divided into two categories...
, signal transduction
Signal transduction
Signal transduction occurs when an extracellular signaling molecule activates a cell surface receptor. In turn, this receptor alters intracellular molecules creating a response...
pathways and gene regulatory network
Gene regulatory network
A gene regulatory network or genetic regulatory network is a collection of DNA segments in a cell whichinteract with each other indirectly and with other substances in the cell, thereby governing the rates at which genes in the network are transcribed into mRNA.In general, each mRNA molecule goes...
s) to both analyze and visualize the complex connections of these cellular processes. Artificial life
Artificial life
Artificial life is a field of study and an associated art form which examine systems related to life, its processes, and its evolution through simulations using computer models, robotics, and biochemistry. The discipline was named by Christopher Langton, an American computer scientist, in 1986...
or virtual evolution attempts to understand evolutionary processes via the computer simulation of simple (artificial) life forms.
High-throughput image analysis
Computational technologies are used to accelerate or fully automate the processing, quantification and analysis of large amounts of high-information-content biomedical imagery. Modern image analysis systems augment an observer's ability to make measurements from a large or complex set of images, by improving accuracy, objectivityObjectivity (science)
Objectivity in science is a value that informs how science is practiced and how scientific truths are created. It is the idea that scientists, in attempting to uncover truths about the natural world, must aspire to eliminate personal biases, a priori commitments, emotional involvement, etc...
, or speed. A fully developed analysis system may completely replace the observer. Although these systems are not unique to biomedical imagery, biomedical imaging is becoming more important for both diagnostics and research. Some examples are:
- high-throughput and high-fidelity quantification and sub-cellular localization (high-content screening, cytohistopathology, Bioimage informaticsBioimage informaticsBioimage informatics is a subfield of bioinformatics and computational biology. It is most related to using computational techniques to analyze bioimages, especially cellular and molecular images, at large scale and high throughput, with the goal to mine useful knowledge out of complicated and...
) - morphometricsMorphometricsMorphometrics refers to the quantitative analysis of form, a concept that encompasses size and shape. Morphometric analyses are commonly performed on organisms, and are useful in analyzing their fossil record, the impact of mutations on shape, developmental changes in form, covariances between...
- clinical image analysis and visualization
- determining the real-time air-flow patterns in breathing lungs of living animals
- quantifying occlusion size in real-time imagery from the development of and recovery during arterial injury
- making behavioral observations from extended video recordings of laboratory animals
- infrared measurements for metabolic activity determination
- inferring clone overlaps in DNA mappingGene mappingGene mapping, also called genome mapping, is the creation of a genetic map assigning DNA fragments to chromosomes.When a genome is first investigated, this map is nonexistent. The map improves with the scientific progress and is perfect when the genomic DNA sequencing of the species has been...
, e.g. the Sulston scoreSulston scoreThe Sulston Score is an equation used in DNA mapping to numerically assess the likelihood that a given "fingerprint" similarity between two DNA clones is merely a result of chance. Used as such, it is a test of statistical significance...
Prediction of protein structure
Protein structure prediction is another important application of bioinformatics. The amino acidAmino acid
Amino acids are molecules containing an amine group, a carboxylic acid group and a side-chain that varies between different amino acids. The key elements of an amino acid are carbon, hydrogen, oxygen, and nitrogen...
sequence of a protein, the so-called primary structure
Primary structure
The primary structure of peptides and proteins refers to the linear sequence of its amino acid structural units. The term "primary structure" was first coined by Linderstrøm-Lang in 1951...
, can be easily determined from the sequence on the gene that codes for it. In the vast majority of cases, this primary structure uniquely determines a structure in its native environment. (Of course, there are exceptions, such as the bovine spongiform encephalopathy
Bovine spongiform encephalopathy
Bovine spongiform encephalopathy , commonly known as mad-cow disease, is a fatal neurodegenerative disease in cattle that causes a spongy degeneration in the brain and spinal cord. BSE has a long incubation period, about 30 months to 8 years, usually affecting adult cattle at a peak age onset of...
– a.k.a. Mad Cow Disease – prion
Prion
A prion is an infectious agent composed of protein in a misfolded form. This is in contrast to all other known infectious agents which must contain nucleic acids . The word prion, coined in 1982 by Stanley B. Prusiner, is a portmanteau derived from the words protein and infection...
.) Knowledge of this structure is vital in understanding the function of the protein. For lack of better terms, structural information is usually classified as one of secondary
Secondary structure
In biochemistry and structural biology, secondary structure is the general three-dimensional form of local segments of biopolymers such as proteins and nucleic acids...
, tertiary
Tertiary structure
In biochemistry and molecular biology, the tertiary structure of a protein or any other macromolecule is its three-dimensional structure, as defined by the atomic coordinates.-Relationship to primary structure:...
and quaternary
Quaternary structure
In biochemistry, quaternary structure is the arrangement of multiple folded protein or coiling protein molecules in a multi-subunit complex.-Description and examples:...
structure. A viable general solution to such predictions remains an open problem. As of now, most efforts have been directed towards heuristics that work most of the time.
One of the key ideas in bioinformatics is the notion of homology
Homology (biology)
Homology forms the basis of organization for comparative biology. In 1843, Richard Owen defined homology as "the same organ in different animals under every variety of form and function". Organs as different as a bat's wing, a seal's flipper, a cat's paw and a human hand have a common underlying...
. In the genomic branch of bioinformatics, homology is used to predict the function of a gene: if the sequence of gene A, whose function is known, is homologous to the sequence of gene B, whose function is unknown, one could infer that B may share A's function. In the structural branch of bioinformatics, homology is used to determine which parts of a protein are important in structure formation and interaction with other proteins. In a technique called homology modeling, this information is used to predict the structure of a protein once the structure of a homologous protein is known. This currently remains the only way to predict protein structures reliably.
One example of this is the similar protein homology between hemoglobin in humans and the hemoglobin in legumes (leghemoglobin
Leghemoglobin
Leghemoglobin is a nitrogen or oxygen carrier, because naturally occurring oxygen and nitrogen interact similarly with this protein; and a hemoprotein found in the nitrogen-fixing root nodules of leguminous plants. But nitrogen is necessary for the cycle to occur...
). Both serve the same purpose of transporting oxygen in the organism. Though both of these proteins have completely different amino acid sequences, their protein structures are virtually identical, which reflects their near identical purposes.
Other techniques for predicting protein structure include protein threading and de novo (from scratch) physics-based modeling.
See also: structural motif
Structural motif
In a chain-like biological molecule, such as a protein or nucleic acid, a structural motif is a supersecondary structure, which appears also in a variety of other molecules...
and structural domain.
Molecular Interaction
Efficient software is available today for studying interactions among proteins, ligands and peptides. Types of interactions most often encountered in the field include – Protein–ligand (including drug), protein–protein and protein–peptide.Molecular dynamic simulation of movement of atoms about rotatable bonds is the fundamental principle behind computational algorithms, termed docking algorithms for studying molecular interactions.
See also: protein–protein interaction prediction.
Docking algorithms
In the last two decades, tens of thousands of protein three-dimensional structures have been determined by X-ray crystallography
X-ray crystallography
X-ray crystallography is a method of determining the arrangement of atoms within a crystal, in which a beam of X-rays strikes a crystal and causes the beam of light to spread into many specific directions. From the angles and intensities of these diffracted beams, a crystallographer can produce a...
and Protein nuclear magnetic resonance spectroscopy
Protein nuclear magnetic resonance spectroscopy
Nuclear magnetic resonance spectroscopy of proteins is a field of structural biology in which NMR spectroscopy is used to obtain information about the structure and dynamics of proteins. The field was pioneered by Richard R. Ernst and Kurt Wüthrich, among others...
(protein NMR). One central question for the biological scientist is whether it is practical to predict possible protein–protein interactions only based on these 3D shapes, without doing protein–protein interaction experiments. A variety of methods have been developed to tackle the Protein–protein docking problem, though it seems that there is still much work to be done in this field.
Software and tools
Software tools for bioinformatics range from simple command-line tools, to more complex graphical programs and standalone web-services available from various bioinformatics companies or public institutions.Open source bioinformatics software
Many free and open source softwareFree and open source software
Free and open-source software or free/libre/open-source software is software that is liberally licensed to grant users the right to use, study, change, and improve its design through the availability of its source code...
tools have existed and continued to grow since the 1980s. The combination of a continued need for new algorithm
Algorithm
In mathematics and computer science, an algorithm is an effective method expressed as a finite list of well-defined instructions for calculating a function. Algorithms are used for calculation, data processing, and automated reasoning...
s for the analysis of emerging types of biological readouts, the potential for innovative in silico
In silico
In silico is an expression used to mean "performed on computer or via computer simulation." The phrase was coined in 1989 as an analogy to the Latin phrases in vivo and in vitro which are commonly used in biology and refer to experiments done in living organisms and outside of living organisms,...
experiments, and freely available open code bases have helped to create opportunities for all research groups to contribute to both bioinformatics and the range of open source software available, regardless of their funding arrangements. The open source tools often act as incubators of ideas, or community-supported plug-ins in commercial applications. They may also provide de facto
De facto
De facto is a Latin expression that means "concerning fact." In law, it often means "in practice but not necessarily ordained by law" or "in practice or actuality, but not officially established." It is commonly used in contrast to de jure when referring to matters of law, governance, or...
standards and shared object models for assisting with the challenge of bioinformation integration.
The range of open source software packages includes titles such as Bioconductor
Bioconductor
Bioconductor is a free, open source and open development software project for the analysis and comprehension of genomic data generated by wet lab experiments in molecular biology....
, BioPerl
BioPerl
BioPerl is a collection of Perl modules that facilitate the development of Perl scripts for bioinformatics applications. It has played an integral role in the Human Genome Project....
, BioJava
BioJava
The BioJava Project is an open source project dedicated to providing Java tools for processing biological data. This includes include objects for manipulating sequences, protein structures, file parsers, CORBA interoperability, DAS, access to AceDB, dynamic programming, and simple statistical...
, BioRuby
BioRuby
BioRuby is a package of Open Source Ruby code, with classes for DNA and protein sequence analysis, alignment, database parsing, and other Bioinformatics tools. Recently, tools for structural biology have been added.-External links:* ,*...
, Bioclipse
Bioclipse
The Bioclipse project is a Java-based, open source, visual platform for chemo- and bioinformatics based on the Eclipse Rich Client Platform...
, EMBOSS
EMBOSS
EMBOSS is an acronym for European Molecular Biology Open Software Suite. EMBOSS is a free Open Source software analysis package specially developed for the needs of the molecular biology and bioinformatics user community...
, Taverna workbench
Taverna workbench
Taverna Workbench is an open source software tool for designing and executing workflows, created by the myGrid project and funded through the OMII-UK...
, and UGENE
UGENE
UGENE is free open-source cross-platform bioinformatics software.It integrates dozens of well-known biological tools and algorithms, providing both graphical user and command line interfaces...
. In order to maintain this tradition and create further opportunities, the non-profit Open Bioinformatics Foundation
Open Bioinformatics Foundation
The Open Bioinformatics Foundation is a non profit, volunteer run organization focused on supporting open source programming in bioinformatics....
have supported the annual Bioinformatics Open Source Conference (BOSC) since 2000.
Web services in bioinformatics
SOAPSOAP
SOAP, originally defined as Simple Object Access Protocol, is a protocol specification for exchanging structured information in the implementation of Web Services in computer networks...
and REST
Rest
Rest may refer to:* Leisure* Human relaxation* SleepRest may also refer to:* Rest , a pause in a piece of music* Rest , the relation between two observers* Rest , a 2008 album by Gregor Samsa...
-based interfaces have been developed for a wide variety of bioinformatics applications allowing an application running on one computer in one part of the world to use algorithms, data and computing resources on servers in other parts of the world. The main advantages derive from the fact that end users do not have to deal with software and database maintenance overheads.
Basic bioinformatics services are classified by the EBI
European Bioinformatics Institute
The European Bioinformatics Institute is a centre for research and services in bioinformatics, and is part of European Molecular Biology Laboratory...
into three categories: SSS
Sequence alignment software
This list of sequence alignment software is a compilation of software tools and web portals used in pairwise sequence alignment and multiple sequence alignment...
(Sequence Search Services), MSA
Multiple sequence alignment
A multiple sequence alignment is a sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences are assumed to have an evolutionary relationship by which they share a lineage and are descended from a common ancestor...
(Multiple Sequence Alignment) and BSA (Biological Sequence Analysis). The availability of these service-oriented
Service-orientation
Service-orientation is a design paradigm to build computer software in the form of services. Like other design paradigms , service-orientation provides a governing approach to automate business logic as distributed systems...
bioinformatics resources demonstrate the applicability of web based bioinformatics solutions, and range from a collection of standalone tools with a common data format under a single, standalone or web-based interface, to integrative, distributed and extensible bioinformatics workflow management systems
Bioinformatics workflow management systems
A bioinformatics workflow management system is a specialized form of workflow management system designed specifically to compose and execute a series of computational or data manipulation steps, or a workflow, in a specific domain of science, bioinformatics....
.
See also
Further reading
- Achuthsankar S Nair Computational Biology & Bioinformatics – A gentle Overview, Communications of Computer Society of India, January 2007
- Aluru, Srinivas, ed. Handbook of Computational Molecular Biology. Chapman & Hall/Crc, 2006. ISBN 1584884061 (Chapman & Hall/Crc Computer and Information Science Series)
- Baldi, P and Brunak, S, Bioinformatics: The Machine Learning Approach, 2nd edition. MIT Press, 2001. ISBN 0-262-02506-X
- Barnes, M.R. and Gray, I.C., eds., Bioinformatics for Geneticists, first edition. Wiley, 2003. ISBN 0-470-84394-2
- Baxevanis, A.D. and Ouellette, B.F.F., eds., Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, third edition. Wiley, 2005. ISBN 0-471-47878-4
- Baxevanis, A.D., Petsko, G.A., Stein, L.D., and Stormo, G.D., eds., Current ProtocolsCurrent ProtocolsCurrent Protocols is a series of laboratory manuals for life scientists. The flagship title, Current Protocols in Molecular Biology, was first published in 1987. It was conceived by a group of researchers at Harvard Medical School and Massachusetts General Hospital...
in Bioinformatics. Wiley, 2007. ISBN 0-471-25093-7 - Claverie, J.M. and C. Notredame, Bioinformatics for Dummies. Wiley, 2003. ISBN 0-7645-1696-5
- Cristianini, N. and Hahn, M. Introduction to Computational Genomics, Cambridge University Press, 2006. (ISBN 9780521671910 | ISBN 0521671914)
- Durbin, R., S. Eddy, A. Krogh and G. Mitchison, Biological sequence analysis. Cambridge University Press, 1998. ISBN 0-521-62971-3
- Gilbert, D. Bioinformatics software resources. Briefings in Bioinformatics, Briefings in Bioinformatics, 2004 5(3):300–304.
- Keedwell, E., Intelligent Bioinformatics: The Application of Artificial Intelligence Techniques to Bioinformatics Problems. Wiley, 2005. ISBN 0-470-02175-6
- Kohane, et al. Microarrays for an Integrative Genomics. The MIT Press, 2002. ISBN 0-262-11271-X
- Lund, O. et al. Immunological Bioinformatics. The MIT Press, 2005. ISBN 0-262-12280-4
- Michael S. Waterman, Introduction to Computational Biology: Sequences, Maps and Genomes. CRC Press, 1995. ISBN 0-412-99391-0
- Mount, David W. Bioinformatics: Sequence and Genome Analysis Spring Harbor Press, May 2002. ISBN 0-87969-608-7
- Pachter, Lior and Sturmfels, BerndBernd SturmfelsBernd Sturmfels is a Professor of Mathematics and Computer Science at the University of California, Berkeley.He received his PhD in 1987 from the University of Washington and the Technische Universität Darmstadt...
. "Algebraic Statistics for Computational Biology" Cambridge University Press, 2005. ISBN 0-521-85700-7 - Pevzner, Pavel A. Computational Molecular Biology: An Algorithmic Approach The MIT Press, 2000. ISBN 0-262-16197-4
- Soinov, L. Bioinformatics and Pattern Recognition Come Together Journal of Pattern Recognition Research (JPRR), Vol 1 (1) 2006 p. 37–41
- Tisdall, James. "Beginning Perl for Bioinformatics" O'Reilly, 2001. ISBN 0-596-00080-4
- Dedicated issue of Philosophical Transactions B on Bioinformatics freely available
- Catalyzing Inquiry at the Interface of Computing and Biology (2005) CSTB report
- Calculating the Secrets of Life: Contributions of the Mathematical Sciences and computing to Molecular Biology (1995)
- Foundations of Computational and Systems Biology MIT Course
- Computational Biology: Genomes, Networks, Evolution Free MIT Course
- Algorithms for Computational Biology Free MIT Course
- Zhang, Z., Cheung, K.H. and Townsend, J.P. Bringing Web 2.0 to bioinformatics, Briefing in Bioinformatics. In press