DNA binding site
DNA binding sites are a type of binding site found in DNA where other molecules may bind. DNA binding sites are distinct from other binding sites in that (1) they are part of a DNA sequence (e.g. a genome) and (2) they are bound by DNA-binding proteins. DNA binding sites are often associated with specialized proteins known as transcription factors, and are thus linked to transcriptional regulation. The sum of DNA binding sites of a specific transcription factor is referred to as its cistrome. DNA binding sites also encompass the targets of other proteins, such as restriction enzymes, site-specific recombinases (see site-specific recombination) and methyltransferases.

DNA binding sites can thus be defined as short DNA sequences (typically 4 to 30 base pairs long, but up to 200 bp for recombination sites) that are specifically bound by one or more DNA-binding proteins or protein complexes.

Types of DNA binding sites

DNA binding sites can be categorized according to their biological function. Thus, we can distinguish between transcription factor-binding sites, restriction sites and recombination sites. Some authors have proposed that binding sites could also be classified according to their most convenient mode of representation. On the one hand, restriction sites can generally be represented by consensus sequences, because restriction enzymes target mostly identical sequences and restriction efficiency decreases abruptly for less similar sequences. On the other hand, the DNA binding sites for a given transcription factor usually all differ, and the transcription factor binds them with varying degrees of affinity. This makes it difficult to represent transcription factor binding sites accurately with consensus sequences, so they are typically represented using position specific frequency matrices (PSFM), which are often depicted graphically as sequence logos. This distinction, however, is partly arbitrary. Restriction enzymes, like transcription factors, show a graded, though steep, range of affinities for different sites and are thus also best represented by PSFM. Likewise, site-specific recombinases show a varied range of affinities for different target sites.
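As a rough illustration of the consensus representation discussed above, the following Python sketch translates a consensus written with IUPAC degenerate nucleotide codes into a regular expression and scans a sequence for exact matches. The function names (consensus_to_regex, find_sites) and both test sequences are made up for this example, and only the given strand is searched; it is a minimal sketch, not a description of how any particular tool works.

```python
import re

# IUPAC nucleotide codes mapped to the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

def consensus_to_regex(consensus):
    """Translate an IUPAC consensus string into a regular expression string."""
    parts = []
    for symbol in consensus.upper():
        bases = IUPAC[symbol]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)

def find_sites(consensus, sequence):
    """Return 0-based start positions of all matches on the given strand."""
    # A zero-width lookahead lets overlapping occurrences be reported too.
    pattern = re.compile("(?=" + consensus_to_regex(consensus) + ")")
    return [match.start() for match in pattern.finditer(sequence.upper())]

# EcoRI recognizes GAATTC; the sequence is invented for illustration.
toy_sequence = "TTGAATTCAGGCCGAATTCTT"
ecori_hits = find_sites("GAATTC", toy_sequence)

# HincII recognizes the degenerate site GTYRAC (Y = C or T, R = A or G).
hincii_hits = find_sites("GTYRAC", "AAGTTAACCGTCGACTT")

print(ecori_hits, hincii_hits)
```

A consensus match is all-or-nothing: a site that differs at a single position is simply not reported, which is the information loss that motivates matrix-based representations.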

History and main experimental techniques

The existence of something akin to DNA binding sites was suspected from experiments on the biology of the bacteriophage lambda and the regulation of the Escherichia coli lac operon. DNA binding sites were finally confirmed in both systems with the advent of DNA sequencing techniques. From then on, DNA binding sites for many transcription factors, restriction enzymes and site-specific recombinases have been discovered using a profusion of experimental methods. Historically, the experimental techniques of choice to discover and analyze DNA binding sites have been the DNase footprinting assay and the electrophoretic mobility shift assay (EMSA). However, the development of DNA microarrays and fast sequencing techniques has led to new, massively parallel methods for in vivo identification of binding sites, such as ChIP-chip and ChIP-Seq. The binding affinity of proteins and other molecules for specific DNA binding sites can be quantified using the biophysical method microscale thermophoresis.

Databases

Due to the diverse nature of the experimental techniques used in determining binding sites and to the patchy coverage of most organisms and transcription factors, there is no central database (akin to GenBank at the National Center for Biotechnology Information) for DNA binding sites. Even though NCBI provides for DNA binding site annotation in its reference sequences (RefSeq), most submissions omit this information. Moreover, because bioinformatics has had limited success in producing efficient DNA binding site prediction tools (in silico motif discovery and site search methods often suffer from large false positive rates), there has been no systematic effort to computationally annotate these features in sequenced genomes.

There are, however, several private and public databases devoted to compilation of experimentally reported, and sometimes computationally predicted, binding sites for different transcription factors in different organisms. Below is a non-exhaustive table of available databases:
Name          Organisms           Source                                 Access   URL
RegTransBase  Prokaryotes         Expert/literature curation             Public   http://regtransbase.lbl.gov/cgi-bin/regtransbase?page=main
RegulonDB     Escherichia coli    Expert curation                        Public   http://regulondb.ccg.unam.mx/
PRODORIC      Prokaryotes         Expert curation                        Public   http://prodoric.tu-bs.de/
TRANSFAC      Mammals             Expert/literature curation             Private  http://www.biobase-international.com/pages/index.php?id=transfac
TRED          Human, Mouse, Rat   Computer predictions, manual curation  Public   http://rulai.cshl.edu/cgi-bin/TRED/tred.cgi?process=home
DBSD          Drosophila species  Literature/Expert curation             Public   http://rulai.cshl.org/dbsd/index.html

Representation of DNA binding sites

A collection of DNA binding sites, typically referred to as a DNA binding motif, can be represented by a consensus sequence. This representation has the advantage of being compact, but at the expense of disregarding a substantial amount of information. A more accurate way of representing binding sites is through Position Specific Frequency Matrices (PSFM). These matrices give the frequency of each base at each position of the DNA binding motif. PSFM are usually built on the implicit assumption of positional independence (different positions of the DNA binding site contribute independently to the site's function), although this assumption has been disputed for some DNA binding sites. Frequency information in a PSFM can be formally interpreted under the framework of information theory, leading to its graphical representation as a sequence logo.
Position   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
A          1   0   1   5  32   5  35  23  34  14  43  13  34   4  52   3
C         50   1   0   1   5   6   0   4   4  13   3   8  17  51   2   0
G          0   0  54  15   5   5  12   2   7   1   1   3   1   0   1  52
T          5  55   1  35  14  40   9  27  11  28   9  32   4   1   1   1
Sum       56  56  56  56  56  56  56  56  56  56  56  56  56  56  56  56

PSFM for the transcriptional repressor LexA, as derived from 56 LexA-binding sites stored in Prodoric. Relative frequencies are obtained by dividing the counts in each cell by the total count (56).
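The operations described for the LexA matrix can be sketched in a few lines of Python. The counts are copied from the table above; the function names are illustrative, and the information-content calculation assumes a uniform background (2 bits maximum per position) with no small-sample correction, which is a simplification.

```python
import math

# Counts copied from the LexA table (56 sites, 16 positions).
COUNTS = {
    "A": [1, 0, 1, 5, 32, 5, 35, 23, 34, 14, 43, 13, 34, 4, 52, 3],
    "C": [50, 1, 0, 1, 5, 6, 0, 4, 4, 13, 3, 8, 17, 51, 2, 0],
    "G": [0, 0, 54, 15, 5, 5, 12, 2, 7, 1, 1, 3, 1, 0, 1, 52],
    "T": [5, 55, 1, 35, 14, 40, 9, 27, 11, 28, 9, 32, 4, 1, 1, 1],
}
BASES = "ACGT"
LENGTH = len(COUNTS["A"])

def relative_frequencies(counts):
    """Divide each count by its column total (56 here) to get frequencies."""
    totals = [sum(counts[b][i] for b in BASES) for i in range(LENGTH)]
    return {b: [counts[b][i] / totals[i] for i in range(LENGTH)] for b in BASES}

def consensus(counts):
    """Pick the most frequent base at each position."""
    return "".join(max(BASES, key=lambda b: counts[b][i]) for i in range(LENGTH))

def information_content(freqs):
    """Per-position information in bits: 2 minus the Shannon entropy.

    Assumes equiprobable background bases; zero frequencies contribute 0.
    """
    info = []
    for i in range(LENGTH):
        entropy = 0.0
        for b in BASES:
            p = freqs[b][i]
            if p > 0:
                entropy -= p * math.log2(p)
        info.append(2.0 - entropy)
    return info

freqs = relative_frequencies(COUNTS)
info = information_content(freqs)
print(consensus(COUNTS))
print([round(x, 2) for x in info])
```

The per-position values in info are the stack heights a sequence logo would draw, while the consensus string keeps only the top letter of each stack.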

Computational search and discovery of binding sites

In bioinformatics, one can distinguish between two separate problems regarding DNA binding sites: searching for additional members of a known DNA binding motif (the site search problem) and discovering novel DNA binding motifs in collections of functionally related sequences (the sequence motif discovery problem). Many different methods have been proposed to search for binding sites. Most of them rely on the principles of information theory and are available through web servers (Yellaboina)(Munch), while other authors have resorted to machine learning methods, such as artificial neural networks. A plethora of algorithms is also available for sequence motif discovery. These methods rely on the hypothesis that a set of sequences share a binding motif for functional reasons. Binding motif discovery methods can be divided roughly into enumerative, deterministic and stochastic. MEME and Consensus are classical examples of deterministic optimization, while the Gibbs sampler is the conventional implementation of a purely stochastic method for DNA binding motif discovery. While enumerative methods often resort to a regular expression representation of binding sites, PSFM and their formal treatment under information theory are the representation of choice for both deterministic and stochastic methods. Recent advances in sequencing have led to the introduction of comparative genomics approaches to DNA binding motif discovery, as exemplified by PhyloGibbs.
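To make the site search problem concrete, the sketch below scores every window of a sequence, on both strands, against a log-odds matrix built from the LexA counts given earlier in this article. The pseudocount of 1, the uniform background of 0.25, the threshold of 15 bits and the test sequence are all assumptions chosen for illustration; this is one simple scoring scheme under the positional independence assumption, not the method of any specific published tool.

```python
import math

# LexA counts from the PSFM table (56 sites, 16 positions).
COUNTS = {
    "A": [1, 0, 1, 5, 32, 5, 35, 23, 34, 14, 43, 13, 34, 4, 52, 3],
    "C": [50, 1, 0, 1, 5, 6, 0, 4, 4, 13, 3, 8, 17, 51, 2, 0],
    "G": [0, 0, 54, 15, 5, 5, 12, 2, 7, 1, 1, 3, 1, 0, 1, 52],
    "T": [5, 55, 1, 35, 14, 40, 9, 27, 11, 28, 9, 32, 4, 1, 1, 1],
}
BASES = "ACGT"
LENGTH = len(COUNTS["A"])
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """One dict per position mapping base -> log2(frequency / background).

    The pseudocount (an assumed value) avoids log(0) for unseen bases.
    """
    matrix = []
    for i in range(LENGTH):
        total = sum(counts[b][i] for b in BASES) + 4 * pseudocount
        matrix.append({
            b: math.log2((counts[b][i] + pseudocount) / total / background)
            for b in BASES
        })
    return matrix

def score_window(matrix, window):
    """Sum per-position scores; this is the positional independence assumption."""
    return sum(column[base] for column, base in zip(matrix, window))

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def search_sites(matrix, sequence, threshold):
    """Return (start, strand, score) for windows scoring at or above threshold."""
    hits = []
    sequence = sequence.upper()
    for start in range(len(sequence) - LENGTH + 1):
        window = sequence[start:start + LENGTH]
        if set(window) - set(BASES):
            continue  # skip windows containing ambiguous symbols such as N
        for strand, site in (("+", window), ("-", reverse_complement(window))):
            score = score_window(matrix, site)
            if score >= threshold:
                hits.append((start, strand, score))
    return hits

matrix = log_odds_matrix(COUNTS)
# Invented test sequence with the matrix consensus embedded at position 5.
toy_sequence = "GGGCC" + "CTGTATATATATACAG" + "GGCCG"
hits = search_sites(matrix, toy_sequence, threshold=15.0)
for start, strand, score in hits:
    print(start, strand, round(score, 2))
```

The choice of threshold drives the trade-off between missed sites and false positives; the value used here is arbitrary and would need calibration on real data.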

More complex methods for binding site search and motif discovery rely on base stacking and other interactions between DNA bases, but because of the small sample sizes typically available for binding sites in DNA and the need to set parameters, their efficiency is still questioned. An example of such a tool is ULPB.

Further reading

  • Erill, I., "A gentle introduction to information content in transcription factor binding sites", Eprint

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 