dbSNP
The Single Nucleotide Polymorphism Database (dbSNP) is a free public archive for genetic variation within and across different species, developed and hosted by the National Center for Biotechnology Information (NCBI) in collaboration with the National Human Genome Research Institute (NHGRI). Although the name of the database implies a collection of one class of polymorphisms only (i.e., single nucleotide polymorphisms (SNPs)), it in fact contains a range of molecular variation: (1) SNPs, (2) short deletion and insertion polymorphisms (indels/DIPs), (3) microsatellite markers or short tandem repeats (STRs), (4) multinucleotide polymorphisms (MNPs), (5) heterozygous sequences, and (6) named variants. The dbSNP accepts apparently neutral polymorphisms, polymorphisms corresponding to known phenotypes, and regions of no variation. It was created in September 1998 to supplement GenBank, NCBI’s collection of publicly available nucleic acid and protein sequences.

As of build 131 (available February 2010), dbSNP had amassed over 184 million submissions representing more than 64 million distinct variants for 55 organisms, including Homo sapiens, Mus musculus, Oryza sativa, and many other species. A full list of organisms and the number of submissions for each can be found at: http://www.ncbi.nlm.nih.gov/SNP/snp_summary.cgi

Purpose

dbSNP is an online resource implemented to aid biology researchers. Its goal is to act as a single database that contains all identified genetic variation, which can be used to investigate a wide variety of genetically based natural phenomena. Specifically, access to the molecular variation cataloged within dbSNP aids basic research such as physical mapping, population genetics, and investigations into evolutionary relationships, and makes it possible to quickly and easily quantify the amount of variation at a given site of interest. In addition, dbSNP guides applied research in pharmacogenomics and the association of genetic variation with phenotypic traits. According to the NCBI website, “The long-term investment in such novel and exciting research [dbSNP] promises not only to advance human biology but to revolutionise the practice of modern medicine.”

1. Source

dbSNP accepts submissions for any organism from a wide variety of sources, including individual research laboratories, collaborative polymorphism discovery efforts, large-scale genome sequencing centers, other SNP databases (e.g. The SNP Consortium, HapMap), and private businesses.

2. Types of records

Every submitted variation receives a submitted SNP ID number (“ss#”). This accession number is a stable and unique identifier for that submission. Unique submitted SNP records also receive a reference SNP ID number (“rs#”; "refSNP cluster"). However, more than one record of a variation will likely be submitted to dbSNP, especially for clinically relevant variations. To accommodate this, dbSNP routinely assembles identical submitted SNP records into a single reference SNP record, which is also a unique and stable identifier (see below).

3. How to submit

To submit variations to dbSNP, one must first acquire a submitter handle, which identifies the laboratory responsible for the submission. Next, the author is required to complete a submission file containing the relevant information and data. Submitted records must contain the ten essential pieces of information listed in the following table. Other information required for submissions includes contact information, publication information (title, journal, authors, year), molecule type (genomic DNA, cDNA, mitochondrial DNA, chloroplast DNA), and organism. A sample submission sheet can be found at: http://www.ncbi.nlm.nih.gov/SNP/get_html.cgi?whichHtml=how_to_submit#SECTION_TYPES
Element: Explanation
Flanking DNA: Variations from assays must have 25 bp of flanking sequence on either side of the polymorphism and must be 100 bp overall.
Alleles: Alleles must be defined using A, G, C, or T nomenclature; IUPAC nomenclature will only be accepted in flanking regions.
Method: A description of how the variation was detected (e.g. DNA sequencing) or how the allele frequencies were calculated. A table of method classes is provided.
Population: A description of the initial group from which the variation was found or from which the allele frequency was calculated. A table of population classes is provided.
Sample size: The number of chromosomes used to find the variation and the number of chromosomes used to calculate allele frequencies.
Population-specific allele frequency: The allele frequency of the surveyed population.
Population-specific genotype frequency: The genotype frequency of the surveyed population.
Population-specific heterozygosity: The proportion of individuals who are heterozygous for the variation.
Individual genotypes: The genotype of individuals from the study.
Validation information: The validation status lists the categories of evidence supporting the variation.
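
The flanking-sequence and allele rules in the table lend themselves to a simple pre-submission check. The following Python sketch is illustrative only and is not dbSNP's own validation code: the function name and messages are hypothetical, both length figures are read as minimums (an assumption), and only substitution alleles are handled.

```python
import re

MIN_FLANK_BP = 25     # flanking sequence required on each side (from the table)
MIN_TOTAL_BP = 100    # overall length, read here as a minimum (assumption)

# IUPAC nucleotide codes are allowed in flanks; alleles must be plain A, C, G, T.
IUPAC_FLANK = re.compile(r"[ACGTRYSWKMBDHVN]*")
PLAIN_ALLELE = re.compile(r"[ACGT]+")


def check_submission(left_flank, alleles, right_flank):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    for name, flank in (("left", left_flank), ("right", right_flank)):
        if len(flank) < MIN_FLANK_BP:
            problems.append(f"{name} flank shorter than {MIN_FLANK_BP} bp")
        if not IUPAC_FLANK.fullmatch(flank.upper()):
            problems.append(f"{name} flank contains non-IUPAC characters")
    for allele in alleles:
        if not PLAIN_ALLELE.fullmatch(allele.upper()):
            problems.append(f"allele {allele!r} is not written in A/C/G/T")
    # Total length is measured with the longest allele in place.
    longest = max((len(a) for a in alleles), default=0)
    if len(left_flank) + longest + len(right_flank) < MIN_TOTAL_BP:
        problems.append(f"total length under {MIN_TOTAL_BP} bp")
    return problems
```

A record with two 50 bp flanks and the alleles A and G passes every check, while a 10 bp flank or an allele written with an IUPAC ambiguity code such as R is reported.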

Release

New information obtained by dbSNP becomes available to the public periodically in a series of “builds” (i.e. revisions and releases of data). There is no fixed schedule for releasing new builds; instead, a build is usually released when a new genome build becomes available, provided that the genome has some cataloged variation associated with it. This occurs approximately every 1–2 months. Because genome sequences often contain errors, reference SNPs (“refSNPs”) from previous builds, as well as newly submitted SNPs, are re-mapped to the newly available genome sequence through multiple cycles of BLAST and MegaBLAST. Multiple submitted SNPs that map to the same location are clustered into one refSNP cluster and assigned a reference SNP ID number. If two refSNP cluster records are found to map to the same location (i.e. are identical), dbSNP merges those records as well. In this case, the smallest refSNP number ID (i.e. the earliest record) represents both records, and the larger refSNP number IDs become obsolete. These obsolete refSNP number IDs are not used again for new records. When a merger of two refSNP records occurs, the change is tracked, and the former refSNP number IDs can still be used as a search query. This process of merging identical records reduces redundancy within dbSNP.

There are two exceptions to the above merging criteria. First, if two classes of variation exist at one site (e.g. a SNP and a DIP), the two refSNP number IDs are not merged. Second, clinically important refSNPs that have been cited in the literature are termed “precious” and are never merged, so as to prevent later confusion.
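
The merging rule and its two exceptions can be sketched in a few lines of Python. This is a toy model and not dbSNP's actual pipeline: the `RefSnp` record, its fields, and the rs numbers in the usage note are all hypothetical, and treating a "precious" record as neither absorbing nor being absorbed is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RefSnp:
    rs_id: int            # numeric part of the rs# identifier
    position: tuple       # (chromosome, base position) after re-mapping
    var_class: str        # e.g. "snp" or "dip"
    precious: bool = False


def merge_refsnps(records):
    """Group records by site and class; return (survivors, merge_history).

    merge_history maps each retired rs_id to the rs_id that now represents it.
    """
    groups = {}
    survivors = []
    for rec in records:
        if rec.precious:
            survivors.append(rec)          # "precious" records are never merged
        else:
            # Keying on class keeps a SNP and a DIP at one site separate.
            groups.setdefault((rec.position, rec.var_class), []).append(rec)

    history = {}
    for members in groups.values():
        keeper = min(members, key=lambda r: r.rs_id)   # smallest rs# survives
        survivors.append(keeper)
        for rec in members:
            if rec is not keeper:
                history[rec.rs_id] = keeper.rs_id      # old ID stays searchable
    return sorted(survivors, key=lambda r: r.rs_id), history
```

Given two SNP records numbered 120 and 300 at the same site, the function keeps 120 and records 300 as retired in its favour; a DIP or a precious record at that site is left untouched.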

1. How to search

The dbSNP can be searched using the Entrez SNP search tool (found at http://www.ncbi.nlm.nih.gov/projects/SNP/). A variety of queries can be used for searching: an ss number ID, a refSNP number ID, a gene name, an experimental method, a population class, a population detail, a publication, a marker, an allele, a chromosome, a base position, a heterozygosity range, a build number, or a strain. In addition, many results can be retrieved simultaneously using batch queries. Searches return refSNP number IDs that match the query term and a summary of the available information for that refSNP cluster.
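
Searches can also be issued programmatically. The sketch below builds, but does not send, a request to NCBI's E-utilities ESearch service with the database set to `snp`. E-utilities is not mentioned in the text above and is introduced here as an assumption about how one might script a query; the helper name and the example term are hypothetical, and field-specific query syntax is deliberately left out.

```python
from urllib.parse import urlencode

# Public E-utilities search endpoint; not mentioned in the text above.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_snp_search_url(term, max_results=20):
    """Build (but do not send) an ESearch request against the SNP database."""
    params = {"db": "snp", "term": term, "retmax": max_results, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Example: a free-text search on a gene name.
print(build_snp_search_url("BRCA1"))
```

Keeping URL construction separate from the network call makes the query easy to inspect before it is sent.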

2. Tools/Data

The information available for a refSNP cluster includes the basic information from each of the individual submissions (see “How to submit”) as well as information obtained by combining the data from multiple submissions (e.g. heterozygosity, genotype frequencies). Many tools are available to examine a refSNP cluster in greater depth. Map view shows the position of the variation in the genome and other nearby variations. Gene view reports the location of the variation within a gene (if it is in a gene), the old and new codons, the amino acids encoded by both, and whether the change is synonymous or non-synonymous. Sequence viewer shows the position of the variant in relation to introns, exons, and other nearby and distant variants. 3D structure mapping, which shows 3D images of the encoded protein, is also available.
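
The synonymous versus non-synonymous distinction reported by gene view can be reproduced with the standard genetic code. The following Python sketch is an illustration and not dbSNP's implementation: it assumes the standard (nuclear) code, DNA codons written with T, and a hypothetical function name.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, amino acids listed in TCAG order for each codon position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(old_codon, new_codon):
    """Return (old_aa, new_aa, label) for a substitution within one codon."""
    old_aa = CODON_TABLE[old_codon.upper()]
    new_aa = CODON_TABLE[new_codon.upper()]
    label = "synonymous" if old_aa == new_aa else "non-synonymous"
    return old_aa, new_aa, label
```

For example, GAA to GAG leaves glutamate (E) unchanged and is synonymous, whereas GAG to GTG replaces glutamate with valine (V) and is non-synonymous. A change to a stop codon (`*`) is also labelled non-synonymous in this simple scheme.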

The dbSNP is also linked to many other NCBI resources, including the nucleotide, protein, gene, taxonomy and structure databases, as well as PubMed, UniSTS, PMC, OMIM, and UniGene.

3. Validation status

The validation status lists the categories of evidence that support a variant. These include: (1) multiple independent submissions; (2) frequency or genotype data; (3) submitter confirmation; (4) observation of all alleles in at least two chromosomes; (5) genotyped by HapMap; and (6) sequenced in the 1000 Genomes Project.

Problems

The quality of the data found on dbSNP has been questioned by many research groups, which suspect high false positive rates due to genotyping and base-calling errors. These mistakes can easily be entered into dbSNP if the submitter uses (1) uncritical bioinformatic alignments of highly similar but distinct DNA sequences, and/or (2) PCRs with primers that cannot discriminate between similar but distinct DNA sequences. Mitchell et al. (2004) reviewed four studies and concluded that dbSNP has a false positive rate of 15–17% for SNPs, and that the minor allele frequency is greater than 10% for approximately 80% of the SNPs that are not false positives. Similarly, Musumeci et al. (2010) state that as many as 8.32% of the biallelic coding SNPs in dbSNP are artifacts of highly similar DNA sequences (i.e. paralogous genes) and refer to these entries as single nucleotide differences (SNDs). The high error rates in dbSNP may not be surprising: of the 23.7 million refSNP entries for humans, only 14.5 million have been validated, leaving the remaining 9.2 million as candidate SNPs. However, according to Musumeci et al. (2010), even the validation code provided in the refSNP record is only partially useful: only HapMap validation reduced the proportion of SNDs (3% vs. 8%), but accepting only HapMap-validated entries removes more than half of the real SNPs in dbSNP. These authors also note that submissions from one source, the Lee group, are plagued with errors: 20% of these submissions are SNDs (vs. 8% for submissions overall). However, as the authors note, ignoring all of these submissions would remove many real SNPs.

Errors in dbSNP can hamper candidate gene association studies and haplotype-based investigations. Errors may also increase false conclusions in association studies: testing false SNPs alongside real ones increases the number of hypothesis tests. Because these false SNPs cannot actually be associated with traits, the multiple-testing correction lowers the alpha level further than a rigorous test of the true SNPs alone would require, and the false negative rate increases. Musumeci et al. (2010) suggested that authors of negative association studies inspect their previous studies for false SNPs (SNDs), which could be removed from analysis.
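
The effect on the alpha level can be illustrated with a small calculation. The sketch below assumes a Bonferroni correction, which the text does not name and which is used here only as a familiar example of a multiple-testing adjustment; the SNP counts are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


true_snps = 850     # hypothetical count of real SNPs tested
false_snps = 150    # hypothetical count of database artifacts tested alongside them

clean = bonferroni_threshold(0.05, true_snps)
padded = bonferroni_threshold(0.05, true_snps + false_snps)
print(f"true SNPs only: {clean:.2e}")
print(f"with artifacts: {padded:.2e}")
```

The threshold computed with the artifacts included is stricter than the one for the true SNPs alone, so a real association has to clear a higher bar than necessary.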

How to cite data from dbSNP

Individual sequences can be referred to by their refSNP cluster ID numbers (e.g. rs206437). dbSNP should be referenced using the 2001 Sherry et al. paper: Sherry, S.T., Ward, M.H., Kholodov, M., Baker, J., Phan, L., Smigielski, E.M., Sirotkin, K. (2001). dbSNP: the NCBI database of genetic variation. Nucleic Acids Research, 29: 308-311.

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 