Sequence database - AbsoluteAstronomy.com

In the field of bioinformatics

Bioinformatics

Bioinformatics is the application of computer science and information technology to the field of biology and medicine. Bioinformatics deals with algorithms, databases and information systems, web technologies, artificial intelligence and soft computing, information and computation theory, software...

, a sequence database is a large collection of computerized ("digital

Digital

A digital system is a data technology that uses discrete values. By contrast, non-digital systems use a continuous range of values to represent information...

") nucleic acid sequences, protein sequences, or other sequences stored on a computer. A database can include sequences from only one organism (e.g., a database for all proteins in Saccharomyces cerevisiae

Saccharomyces cerevisiae

Saccharomyces cerevisiae is a species of yeast. It is perhaps the most useful yeast, having been instrumental to baking and brewing since ancient times. It is believed that it was originally isolated from the skin of grapes...

), or it can include sequences from all organisms whose DNA

DNA

Deoxyribonucleic acid is a nucleic acid that contains the genetic instructions used in the development and functioning of all known living organisms . The DNA segments that carry this genetic information are called genes, but other DNA sequences have structural purposes, or are involved in...

has been sequenced.

Search issues

Sequence databases can be searched using a variety of methods. The most common is probably searching for a sequence similar to a certain target protein or gene whose sequence is already known to the user. The BLAST

BLAST

In bioinformatics, Basic Local Alignment Search Tool, or BLAST, is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences...

program is a method of this type.

Many inputs create inconsistencies

A major problem with all the large genetic sequence databases is that records are deposited in them from a wide range of sources, from individual researchers to large genome sequencing centers. As a result, the sequences themselves, and especially the biological annotations attached to these sequences, vary tremendously in quality. Also there is much redundancy, as multiple labs often submit numerous sequences that are identical, or nearly identical, to others in the databases.

Many annotations are based not on laboratory experiments, but on the results of sequence similarity searches for previously-annotated sequences. Of course, once a sequence has been annotated based on similarity to others, and itself deposited in the database, it can also become the basis for future annotations. This leads to the transitive annotation problem because there may be several such annotation transfers by sequence similarity between a particular database record and actual wet lab experimental information. Therefore, one must always regard the biological annotations in major sequence databases with a considerable degree of skepticism, unless they can be verified by reference to published papers describing high-quality experimental data, or at least by reference to a human-curated sequence database.

Search issues

Many inputs create inconsistencies

See also

Major bioinformatics databases