Sequence database
In the field of bioinformatics, a sequence database is a large collection of computerized ("digital") nucleic acid sequences, protein sequences, or other sequences stored on a computer. A database can include sequences from only one organism (e.g., a database for all proteins in Saccharomyces cerevisiae), or it can include sequences from all organisms whose DNA has been sequenced.
Search issues
Sequence databases can be searched using a variety of methods. The most common is probably a similarity search: looking for sequences similar to a target protein or gene whose sequence is already known to the user. The BLAST (Basic Local Alignment Search Tool) program is a method of this type.
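As a rough illustration of similarity searching, the sketch below ranks a toy in-memory database by Smith-Waterman local alignment score against a query. This is not BLAST itself, which relies on a faster heuristic; the function names, scoring parameters, record names, and sequences are all assumptions made for the example.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between strings a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def search(query, database):
    """Return (name, score) pairs sorted from most to least similar."""
    hits = [(name, local_alignment_score(query, seq)) for name, seq in database.items()]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical toy database; names and sequences are made up for illustration.
toy_db = {
    "seq_a": "ACGTACGTGA",
    "seq_b": "TTTTGGGGCC",
    "seq_c": "GGACTTACCA",
}
results = search("ACGTAC", toy_db)
```

Here `results` lists `seq_a` first because it contains the query exactly. An exhaustive comparison like this scales poorly to large databases, which is one reason heuristic tools are used in practice.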
Many inputs create inconsistencies
A major problem with all the large genetic sequence databases is that records are deposited in them from a wide range of sources, from individual researchers to large genome sequencing centers. As a result, the sequences themselves, and especially the biological annotations attached to them, vary tremendously in quality. There is also much redundancy, as multiple labs often submit sequences that are identical, or nearly identical, to others already in the databases.
Many annotations are based not on laboratory experiments but on the results of sequence similarity searches against previously annotated sequences. Once a sequence has been annotated by similarity to others and deposited in the database, it can itself become the basis for future annotations. This leads to the transitive annotation problem: several such annotation transfers by sequence similarity may lie between a particular database record and actual wet-lab experimental information. Therefore, the biological annotations in major sequence databases should be regarded with considerable skepticism unless they can be verified by reference to published papers describing high-quality experimental data, or at least by reference to a human-curated sequence database.
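The two issues above, redundancy and transitive annotation, can be sketched on a toy record set. The record layout here (a `sequence` field and a `source` field naming either `"experiment"` or the record the annotation was copied from) is a hypothetical schema invented for illustration, not the format of any real database.

```python
from collections import defaultdict

# Hypothetical records: field names and values are assumptions for this sketch.
records = {
    "rec1": {"sequence": "MKTAYIAK", "source": "experiment"},
    "rec2": {"sequence": "MKTAYIAR", "source": "rec1"},  # annotated by similarity to rec1
    "rec3": {"sequence": "MKTAYLAR", "source": "rec2"},  # annotated by similarity to rec2
    "rec4": {"sequence": "MKTAYIAK", "source": "experiment"},  # same sequence as rec1
}


def group_identical(records):
    """Map each distinct sequence to the sorted list of record IDs that carry it."""
    groups = defaultdict(list)
    for rec_id, rec in records.items():
        groups[rec["sequence"]].append(rec_id)
    return {seq: sorted(ids) for seq, ids in groups.items()}


def transfers_to_experiment(rec_id, records):
    """Count similarity-based transfers between a record and experimental evidence.

    Returns None if the chain points at a missing record or loops back on itself.
    """
    hops, seen = 0, set()
    while rec_id in records and rec_id not in seen:
        seen.add(rec_id)
        source = records[rec_id]["source"]
        if source == "experiment":
            return hops
        rec_id = source
        hops += 1
    return None


redundant = {seq: ids for seq, ids in group_identical(records).items() if len(ids) > 1}
chain_length = transfers_to_experiment("rec3", records)
```

In this toy data, `rec1` and `rec4` are flagged as redundant, and `rec3` sits two transfers away from experimental evidence. A longer chain gives more opportunities for an early annotation error to propagate.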
See also
Database formats
- FASTA format
Distributed Computing
- SIMAP
- UniProt, the universal protein database, a central repository of protein data (Swiss-Prot & TrEMBL & PIR).
Major bioinformatics databases
- European Bioinformatics Institute databases
- NCBI completely sequenced genomes
- Stanford Saccharomyces Genome Database
- Protein, the NIH protein database, a collection of sequences from several sources, including translations from annotated coding regions in GenBank, RefSeq and TPA, as well as records from SwissProt, PIR, PRF, and PDB.