Bayesian tool for methylation analysis
Bayesian tool for methylation analysis, also known as BATMAN, is a statistical tool for analyzing methylated DNA immunoprecipitation (MeDIP) profiles. It can be applied to large datasets generated using either oligonucleotide arrays (MeDIP-chip) or next-generation sequencing (MeDIP-seq), providing a quantitative estimation of absolute methylation state in a region of interest.

Theory

MeDIP (methylated DNA immunoprecipitation) is an experimental technique used to assess DNA methylation levels by using an antibody to isolate methylated DNA sequences. The isolated fragments of DNA are either hybridized to a microarray chip (MeDIP-chip) or sequenced by next-generation sequencing (MeDIP-seq). While this shows which areas of the genome are methylated, it does not give absolute methylation levels. Imagine two different genomic regions, A and B. Region A has six CpGs (DNA methylation in mammalian somatic cells generally occurs at CpG dinucleotides), three of which are methylated. Region B has three CpGs, all of which are methylated. As the antibody simply recognizes methylated DNA, it will bind both these regions equally, and subsequent steps will therefore show equal signals for the two regions. This does not give the full picture of methylation in these two regions (in region A only half the CpGs are methylated, whereas in region B all the CpGs are methylated). Therefore, to get the full picture of methylation for a given region, the signal from the MeDIP experiment has to be normalized to the number of CpGs in the region, and this is what the Batman algorithm does. Analysing the MeDIP signal of the above example would give Batman scores of 0.5 for region A (i.e. the region is 50% methylated) and 1 for region B (i.e. the region is 100% methylated). In this way Batman converts the signals from MeDIP experiments to absolute methylation levels.
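The arithmetic behind the two-region example can be written out as a short Python sketch. It only reproduces the target quantity (methylated CpGs divided by total CpGs) from known counts. Batman itself has to infer this fraction from the MeDIP signal, so the function name and inputs here are illustrative assumptions.

```python
def absolute_methylation(methylated_cpgs, total_cpgs):
    """Fraction of the CpGs in a region that are methylated."""
    if total_cpgs <= 0:
        raise ValueError("a region must contain at least one CpG")
    return methylated_cpgs / total_cpgs


# Regions from the example: (methylated CpGs, total CpGs).
regions = {"A": (3, 6), "B": (3, 3)}

# Both regions carry three methylated CpGs, so the raw MeDIP signal is similar,
# but normalizing by CpG count separates them.
scores = {name: absolute_methylation(m, n) for name, (m, n) in regions.items()}
print(scores)  # {'A': 0.5, 'B': 1.0}
```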

Development of Batman

The core principle of the Batman algorithm is to model the effects of varying density of CpG dinucleotides, and the effect this has on MeDIP enrichment of DNA fragments.
The basic assumptions of Batman:
  1. Almost all DNA methylation in mammals happens at CpG dinucleotides.
  2. Most CpG-poor regions are constitutively methylated, while most CpG-rich regions (CpG islands) are constitutively unmethylated.
  3. There are no fragment biases in the MeDIP experiment (the approximate range of DNA fragment sizes is 400–700 bp).
  4. The errors on the microarray are normally distributed with a given precision (the inverse of the variance).
  5. Only methylated CpGs contribute to the observed signal.
  6. CpG methylation state is generally highly correlated over hundreds of bases, so CpGs grouped together in 50- or 100-bp windows would have the same methylation state.


Basic parameters in Batman:
  1. C_cp: the coupling factor between probe p and CpG dinucleotide c, defined as the fraction of DNA molecules hybridizing to probe p that contain CpG c.
  2. C_tot: the total CpG influence parameter, defined as the sum of coupling factors for any given probe, which provides a measure of local CpG density.
  3. m_c: the methylation status at position c, which represents the fraction of chromosomes in the sample on which it is methylated. m_c is treated as a continuous variable, since the majority of samples used in MeDIP studies contain multiple cell types.
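One way to picture the coupling factor and the total CpG influence is the following Python sketch. It is not Batman's implementation. It assumes that fragment lengths are uniformly distributed over the 400–700 bp range quoted above, that the probe can be treated as a single point, and that a fragment covering the probe is placed uniformly at random around it. The function names are hypothetical.

```python
def coupling_factor(distance_bp, min_len=400, max_len=700):
    """Estimate the coupling between a probe and a CpG `distance_bp` away.

    Under the stated assumptions, a fragment of length L that covers the probe
    also covers the CpG with probability max(0, 1 - d / L). The estimate is the
    average of that probability over all fragment lengths.
    """
    d = abs(distance_bp)
    lengths = range(min_len, max_len + 1)
    return sum(max(0.0, 1.0 - d / length) for length in lengths) / len(lengths)


def total_cpg_influence(probe_pos, cpg_positions):
    """Sum of coupling factors over the CpGs near a probe (local CpG density)."""
    return sum(coupling_factor(c - probe_pos) for c in cpg_positions)


# A CpG at the probe position is always on the same fragment; one further away
# than the longest fragment never is.
print(coupling_factor(0), coupling_factor(700))
print(total_cpg_influence(1000, [900, 1000, 1250]))
```

Under this model the coupling falls off smoothly with distance, so a probe in a CpG island accumulates a much larger total influence than a probe in a CpG-poor region.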

Based on these assumptions, the signal from the MeDIP channel of the MeDIP-chip or MeDIP-seq experiment at a given probe depends on the degree of enrichment of DNA fragments overlapping that probe, which in turn depends on the amount of antibody binding, and thus on the number of methylated CpGs on those fragments. In the Batman model, the complete dataset from a MeDIP-chip experiment, A, can be represented by a statistical model in the form of the following probability distribution:

$$f(A \mid m) = \prod_{p} G\left(A_p \mid A_{base} + r \sum_{c} C_{cp}\, m_c,\ \nu^{-1}\right)$$

where G(x | μ, σ²) is a Gaussian probability density function, A_p is the signal at probe p, A_base is the baseline signal, r is the extra signal produced per methylated CpG, and ν is the precision of the array errors. Standard Bayesian techniques can be used to infer f(m|A), that is, the distribution of likely methylation states given one or more sets of MeDIP-chip/MeDIP-seq outputs. To solve this inference problem, Batman uses nested sampling (http://www.inference.phy.cam.ac.uk/bayesys/) to generate 100 independent samples from f(m|A) for each tiled region of the genome, then summarizes the most likely methylation state in 100-bp windows by fitting beta distributions to these samples. The modes of the most likely beta distributions are used as the final methylation calls.
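The final summarization step (fit a beta distribution to the samples for a window, then report its mode) can be sketched in Python as follows. The text does not say how Batman fits the beta distribution, so this sketch substitutes a method-of-moments fit. The function name and the fallback behaviour are assumptions.

```python
from statistics import mean, variance


def beta_mode_from_samples(samples):
    """Fit Beta(a, b) by the method of moments and return its mode."""
    m = mean(samples)
    v = variance(samples)
    # The moment estimates are only defined for 0 < m < 1 and 0 < v < m(1 - m).
    if not (0.0 < m < 1.0) or not (0.0 < v < m * (1.0 - m)):
        return m  # fallback (assumption): report the sample mean
    k = m * (1.0 - m) / v - 1.0
    a, b = m * k, (1.0 - m) * k
    if a > 1.0 and b > 1.0:
        return (a - 1.0) / (a + b - 2.0)  # interior mode of the beta density
    if a <= 1.0 < b:
        return 0.0  # density piles up at 0
    if b <= 1.0 < a:
        return 1.0  # density piles up at 1
    return m  # U-shaped or flat fit has no single mode; fall back to the mean


# Hypothetical posterior samples for one 100-bp window.
window_samples = [0.4, 0.5, 0.6]
print(beta_mode_from_samples(window_samples))  # close to 0.5
```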

Work flow of Batman

Batman prerequisites:
  1. Installation: install Batman (freely available from http://td-blade.gurdon.cam.ac.uk/software/batman/ under the GNU Lesser General Public License), Apache Ant, a MySQL database server, and the MySQL database connector.
  2. Prepare dataset: break your dataset into small blocks, namely regions of interest (ROIs), each represented by a small number (typically about 100) of probes on a microarray.
  3. Identify the database server: both the MySQL administration tool and many of the Batman programs need to connect to a MySQL database server.
  4. Initialize the Batman database: create a database on your database server.
  5. Register the experiments to be analysed.
  6. Register the array design: The array design (i.e. complete list of probes, with their genomic locations) should be provided as a GFF file.
  7. Load the array data.
  8. Load the genome sequence.
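The dataset preparation step, splitting probes into regions of interest of about 100 probes each, can be illustrated with a short Python sketch. How Batman's own tools delimit ROIs is not described here, so the fixed-size blocking, the function name, and the input tuple layout are all assumptions made for illustration.

```python
from collections import defaultdict


def split_into_rois(probes, max_probes=100):
    """Group (chromosome, position, probe_id) tuples into blocks of probes.

    Probes are sorted by position within each chromosome and cut into
    consecutive blocks of at most `max_probes` probes.
    """
    by_chrom = defaultdict(list)
    for chrom, pos, probe_id in probes:
        by_chrom[chrom].append((pos, probe_id))

    rois = []
    for chrom in sorted(by_chrom):
        ordered = sorted(by_chrom[chrom])
        for i in range(0, len(ordered), max_probes):
            block = ordered[i:i + max_probes]
            rois.append({
                "chrom": chrom,
                "start": block[0][0],
                "end": block[-1][0],
                "probes": [probe_id for _, probe_id in block],
            })
    return rois


# Hypothetical array design: 250 probes tiled every 100 bp on one chromosome.
design = [("chr1", i * 100, f"probe{i}") for i in range(250)]
rois = split_into_rois(design)
print([len(roi["probes"]) for roi in rois])  # [100, 100, 50]
```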


Run Batman:
  1. Calibrate the Batman model: Before any data can be analysed, it is necessary to calibrate each array by estimating how much extra array signal is produced by each methylated CpG. This step can give you a quick idea whether each of your arrays is giving sensible results.
  2. Sample methylation states from the Batman model: You’ll often have multiple arrays from the same experiment, and these should normally be analysed together to improve the confidence of the final calls. Each chromosome can take several days to process; therefore, if possible, run several in parallel.
  3. Summarize methylation states to generate the final calls: The “sample” files generated by Batman contain a large set of plausible methylation states for each region. For most purposes, you’ll actually want a single estimate of the likely methylation state at that position, and perhaps an estimate of how confident you can be that this is actually the correct value.
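The idea of reducing a set of sampled methylation states to one estimate plus a confidence measure can be sketched as below. This is a simplified stand-in that uses the median as the point estimate and the interquartile range as the spread, not Batman's own summarization program. The function name is hypothetical.

```python
from statistics import median, quantiles


def summarize_samples(samples):
    """Return (point estimate, interquartile range) for sampled methylation states."""
    q1, _q2, q3 = quantiles(samples, n=4)  # the three quartile cut points
    return median(samples), q3 - q1


# Hypothetical samples for one position, spread evenly between 0.01 and 0.99.
samples = [i / 100 for i in range(1, 100)]
estimate, iqr = summarize_samples(samples)
print(estimate, iqr)
```

A narrow interquartile range means the sampled states agree closely, so the call can be trusted more than one with a wide range.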


Visualization of Batman Data:
  1. The output is in GFF format. For each window, a score (range: 0–1) is given which represents a likely fraction of methylation and the interquartile range is given as an estimate of confidence.
  2. Several genome browsers are available, such as the Ensembl genome browser, which uses a colour gradient from 20 (bright yellow) to 80 (dark blue) to show the Batman methylation score for each probe in the ROI.
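Because the output is GFF, the per-window scores can be pulled out with a few lines of Python before loading them elsewhere. The sketch below assumes the standard tab-separated GFF column order (sequence, source, feature, start, end, score, strand, frame, attributes). The sample lines, including the source and feature names, are hypothetical, and the sketch does not read the interquartile range because its location in the file is not specified here.

```python
def parse_gff_scores(lines):
    """Extract (sequence, start, end, score) from GFF lines."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/pragma lines
        fields = line.split("\t")
        if len(fields) < 8:
            continue  # not a complete GFF record
        seqname, _source, _feature, start, end, score = fields[:6]
        if score == ".":
            continue  # GFF uses "." for a missing score
        records.append((seqname, int(start), int(end), float(score)))
    return records


# Hypothetical output lines for two 100-bp windows.
example = [
    "##gff-version 2",
    "chr1\tBatman\tmethylation\t1001\t1100\t0.82\t.\t.\t",
    "chr1\tBatman\tmethylation\t1101\t1200\t0.15\t.\t.\t",
]
records = parse_gff_scores(example)
print(records)
```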

More details on the Batman procedure can be found in the Batman manual, freely available online at http://td-blade.gurdon.cam.ac.uk/software/batman/batmanual-alpha-0.2.3.pdf

Limitations

It may be useful to take the following points into account when considering using Batman:
  1. Batman is not a piece of software; it is an algorithm run from the command prompt. As such, it is not especially user-friendly and is computationally quite technical.
  2. Because it is non-commercial, there is very little support when using Batman beyond what is in the manual.
  3. It is quite time consuming (it can take several days to analyse one chromosome).
  4. Copy number variation (CNV) has to be accounted for. For example, the score for a region with a CNV value of 1.6 in a cancer (a loss of 0.4 compared to normal) would have to be multiplied by 1.25 (=2/1.6) to compensate for the loss.
  5. One of the basic assumptions of Batman is that almost all DNA methylation occurs at CpG dinucleotides. While this is generally the case for vertebrate somatic cells, there are situations where there is widespread non-CpG methylation, such as in plant cells and embryonic stem cells.
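The copy-number correction described in point 4 is a single scaling step, sketched here in Python. The function name and the default normal copy number of 2 (a diploid genome, as in the 2/1.6 example) are assumptions.

```python
def cnv_corrected_score(score, copy_number, normal_copy_number=2.0):
    """Scale a methylation score to compensate for a copy-number change."""
    if copy_number <= 0:
        raise ValueError("copy number must be positive")
    return score * (normal_copy_number / copy_number)


# Region with a CNV value of 1.6: the correction factor is 2 / 1.6 = 1.25.
print(cnv_corrected_score(0.4, 1.6))  # about 0.5
```

The text does not say what to do when a corrected value exceeds 1, so the sketch leaves the result uncapped.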
The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 