Bayesian inference in phylogeny

Bayesian inference in phylogeny generates a posterior distribution for a parameter, composed of a phylogenetic tree and a model of evolution, based on the prior for that parameter and the likelihood of the data, generated by a multiple alignment. The Bayesian approach has become more popular due to advances in computational machinery, especially Markov chain Monte Carlo (MCMC) algorithms. Bayesian inference has a number of applications in molecular phylogenetics, for example, estimation of species phylogeny and species divergence times.

Basic Bayesian theory

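Recall that for Bayesian inference:

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}$$

The denominator $p(D)$ is the marginal probability of the data, averaged over all possible parameter values weighted by their prior distribution. Formally,

$$p(D) = \int_{\Theta} p(D \mid \theta)\, p(\theta)\, d\theta$$

where $\Theta$ is the parameter space for $\theta$.

In the original Metropolis algorithm, given a current $\theta$-value $x$ and a new $\theta$-value $y$, the new value is accepted with probability:

$$\min\left\{1,\ \frac{h(y)}{h(x)}\right\} = \min\left\{1,\ \frac{p(D \mid y)\, p(y)}{p(D \mid x)\, p(x)}\right\}$$

where $h(\theta) = p(D \mid \theta)\, p(\theta)$ is the unnormalized posterior density. The marginal probability $p(D)$ cancels in the ratio, so it never has to be computed.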
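The accept/reject rule of the Metropolis algorithm can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: it assumes a symmetric proposal, works on the log scale to avoid numerical underflow, and uses a toy standard-normal target in place of a real phylogenetic posterior. The names `metropolis_step`, `run_chain`, `toy_log_h`, and `toy_proposal` are hypothetical and introduced only for illustration.

```python
import math
import random


def metropolis_step(x, log_h, propose, rng):
    """Perform one Metropolis update from state x.

    log_h   : log of the unnormalized posterior (log-likelihood + log-prior)
    propose : symmetric proposal, called as propose(x, rng)
    Returns (new_state, accepted_flag).
    """
    y = propose(x, rng)
    log_ratio = log_h(y) - log_h(x)  # log of h(y) / h(x)
    # Accept with probability min(1, h(y) / h(x)).
    if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
        return y, True
    return x, False


def run_chain(x0, log_h, propose, n_iter, seed=0):
    """Run n_iter Metropolis updates and return the list of visited states."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n_iter):
        x, _accepted = metropolis_step(x, log_h, propose, rng)
        samples.append(x)
    return samples


def toy_log_h(x):
    # Toy target (an assumption for illustration): standard normal, up to a constant.
    return -0.5 * x * x


def toy_proposal(x, rng):
    # Symmetric uniform window of half-width 1 around the current value.
    return x + rng.uniform(-1.0, 1.0)


if __name__ == "__main__":
    draws = run_chain(0.0, toy_log_h, toy_proposal, n_iter=5000)
    print("sample mean:", sum(draws) / len(draws))
```

Because only the ratio of unnormalized densities enters the rule, `log_h` may omit any additive constant, including the log of the marginal probability of the data.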

The LOCAL algorithm of Larget and Simon

The LOCAL algorithm begins by selecting an internal branch of the tree at random. The nodes at the ends of this branch are each connected to two other branches. One of each pair is chosen at random. Imagine taking these three selected edges and stringing them like a clothesline from left to right, where the direction (left/right) is also selected at random. The two endpoints of the first branch selected will have a sub-tree hanging like a piece of clothing strung to the line. The algorithm proceeds by multiplying the three selected branches by a common random amount, akin to stretching or shrinking the clothesline. Finally the leftmost of the two hanging sub-trees is disconnected and reattached to the clothesline at a location selected uniformly at random. This is the candidate tree.

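Suppose we began by selecting the internal branch with length $t_8$ (in Figure (a), to be added) that separates taxa $A$ and $B$ from the rest. Suppose also that we have (randomly) selected branches with lengths $t_1$ and $t_9$ from each side, and that we oriented these branches as shown in Figure (b). Let $m = t_1 + t_8 + t_9$ be the current length of the clothesline. We select the new length to be $m^{\star} = m \exp(\lambda (U_1 - 0.5))$, where $U_1$ is a uniform random variable on $(0, 1)$ and $\lambda$ is a tuning parameter. Then for the LOCAL algorithm, the acceptance probability can be computed to be:

$$\min\left\{1,\ \frac{h(y)}{h(x)} \times \frac{(m^{\star})^3}{m^3}\right\}$$

where $x$ is the current tree, $y$ is the candidate tree, and $h$ is the unnormalized posterior density (likelihood times prior).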
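The stretching step of LOCAL and its acceptance probability can be illustrated with a short Python sketch. It covers only the rescaling of the three selected branches, not the detachment and reattachment of the sub-tree, and it assumes a proposal-ratio factor of $(m^{\star}/m)^3$. The function names, the argument names `t_left`, `t_mid`, `t_right`, and the tuning value passed as `lam` are hypothetical choices made for illustration.

```python
import math
import random


def stretch_clothesline(t_left, t_mid, t_right, lam, rng):
    """Rescale the three selected branches by a common random factor.

    Returns (new_lengths, m, m_star), where m is the current clothesline
    length and m_star = m * exp(lam * (U - 0.5)) with U uniform on (0, 1).
    """
    m = t_left + t_mid + t_right
    u = rng.random()
    scale = math.exp(lam * (u - 0.5))
    new_lengths = (t_left * scale, t_mid * scale, t_right * scale)
    return new_lengths, m, m * scale


def local_acceptance(log_h_x, log_h_y, m, m_star):
    """Acceptance probability min(1, h(y)/h(x) * (m_star/m)**3).

    log_h_x and log_h_y are log unnormalized posterior values of the
    current and candidate trees; the cubic factor is an assumption of
    this sketch.
    """
    log_r = (log_h_y - log_h_x) + 3.0 * math.log(m_star / m)
    return 1.0 if log_r >= 0.0 else math.exp(log_r)


if __name__ == "__main__":
    rng = random.Random(0)
    lengths, m, m_star = stretch_clothesline(0.1, 0.2, 0.3, lam=0.5, rng=rng)
    print(lengths, m, m_star, local_acceptance(0.0, 0.0, m, m_star))
```

Working with log posterior values keeps the computation stable when the likelihoods of large alignments are extremely small.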

Assessing convergence

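Suppose we want to estimate a branch length $t$ of a 2-taxon tree under the Jukes-Cantor (JC) model, in which $n_1$ sites are unvaried and $n_2$ are variable. Assume an exponential prior distribution with rate $\lambda$. The density is $p(t) = \lambda e^{-\lambda t}$. The probabilities of the possible site patterns are:

$$\frac{1}{4}\left(\frac{1}{4} + \frac{3}{4} e^{-4t/3}\right)$$

for unvaried sites, and

$$\frac{1}{4}\left(\frac{1}{4} - \frac{1}{4} e^{-4t/3}\right)$$

for variable sites. Thus the unnormalized posterior distribution is:

$$h(t) = \left(\frac{1}{4}\right)^{n_1+n_2} \left(\frac{1}{4} + \frac{3}{4} e^{-4t/3}\right)^{n_1} \left(\frac{1}{4} - \frac{1}{4} e^{-4t/3}\right)^{n_2} \lambda e^{-\lambda t}$$

or, alternately,

$$h(t) = \lambda \left(\frac{1}{16}\right)^{n_1+n_2} \left(1 + 3 e^{-4t/3}\right)^{n_1} \left(1 - e^{-4t/3}\right)^{n_2} e^{-\lambda t}$$

Update the branch length by choosing a new value uniformly at random from a window of half-width $w$ centered at the current value:

$$t^{\star} = |t + U|$$

where $U$ is uniformly distributed between $-w$ and $w$; the absolute value reflects negative proposals back to positive lengths. The acceptance probability is:

$$\min\left\{1,\ \frac{h(t^{\star})}{h(t)}\right\}$$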

Example: we will compare results for two values of the window half-width. In each case, we will begin with the same initial length and update the length the same number of times. (See Figure 3.2 (to be added) for results.)
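The two-taxon example can be reproduced in spirit with the Python sketch below, which samples the branch length under the Jukes-Cantor model with an exponential prior and a reflected sliding-window proposal. All numerical settings in the usage section (70 unvaried and 30 variable sites, prior rate 10, half-widths 0.1 and 0.5, initial length 5, 2000 updates) are assumptions chosen for illustration, not values taken from the text, and the function names are hypothetical.

```python
import math
import random


def log_h(t, n1, n2, lam):
    """Log unnormalized posterior of branch length t (constants dropped).

    n1, n2 : numbers of unvaried and variable sites
    lam    : rate of the exponential prior on t
    """
    if t <= 0.0:
        return float("-inf")
    e = math.exp(-4.0 * t / 3.0)
    one_minus_e = -math.expm1(-4.0 * t / 3.0)  # 1 - e, accurate for small t
    return (n1 * math.log(0.25 + 0.75 * e)
            + n2 * math.log(0.25 * one_minus_e)
            - lam * t)


def run_chain(n1, n2, lam, w, t0, n_iter, seed=0):
    """Sliding-window Metropolis sampler for the branch length.

    Returns (samples, acceptance_rate). Proposals are reflected at zero
    by taking the absolute value, which keeps the proposal symmetric.
    """
    rng = random.Random(seed)
    t, cur = t0, log_h(t0, n1, n2, lam)
    samples, accepted = [], 0
    for _ in range(n_iter):
        t_new = abs(t + rng.uniform(-w, w))
        new = log_h(t_new, n1, n2, lam)
        # Accept with probability min(1, h(t_new) / h(t)).
        if new >= cur or rng.random() < math.exp(new - cur):
            t, cur = t_new, new
            accepted += 1
        samples.append(t)
    return samples, accepted / n_iter


if __name__ == "__main__":
    # Illustrative settings only; none of these values come from the text.
    for w in (0.1, 0.5):
        samples, rate = run_chain(n1=70, n2=30, lam=10.0, w=w, t0=5.0, n_iter=2000)
        tail = samples[len(samples) // 2:]
        print(f"w={w}: acceptance rate={rate:.2f}, "
              f"mean of second half={sum(tail) / len(tail):.3f}")
```

Plotting the sampled lengths against the iteration number for each half-width gives a trace plot, a common informal way to judge how quickly a chain leaves its starting value and how well it mixes afterwards.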

Metropolis-coupled MCMC (Geyer)

If the target distribution has multiple peaks, separated by low valleys, the Markov chain may have difficulty in moving from one peak to another. As a result, the chain may get stuck on one peak and the resulting samples will not approximate the posterior density correctly. This is a serious practical concern for phylogeny reconstruction, as multiple local peaks are known to exist in the tree space during heuristic tree search under maximum parsimony (MP), maximum likelihood (ML), and minimum evolution (ME) criteria, and the same can be expected for stochastic tree search using MCMC. Many strategies have been proposed to improve mixing of Markov chains in the presence of multiple local peaks in the posterior density. One of the most successful algorithms is the Metropolis-coupled MCMC (or MC³).

In this algorithm, $m$ chains are run in parallel, with different stationary distributions $\pi_j(\cdot)$, $j = 1, 2, \ldots, m$, where the first one, $\pi_1 = \pi$, is the target density, while $\pi_j$, $j = 2, 3, \ldots, m$, are chosen to improve mixing. For example, one can choose incremental heating of the form:

$$\pi_j(\theta) = \pi(\theta)^{1/[1 + \lambda (j-1)]}, \quad \lambda > 0,$$

so that the first chain is the cold chain with the correct target density, while chains $2, 3, \ldots, m$ are heated chains. Note that raising the density $\pi(\cdot)$ to the power $1/T$ with $T > 1$ has the effect of flattening out the distribution, similar to heating a metal. In such a distribution, it is easier to traverse between peaks (separated by valleys) than in the original distribution. After each iteration, a swap of states between two randomly chosen chains is proposed through a Metropolis-type step. Let $\theta^{(j)}$ be the current state in chain $j$, $j = 1, 2, \ldots, m$. A swap between the states of chains $i$ and $j$ is accepted with probability:

$$\alpha = \min\left\{1,\ \frac{\pi_i(\theta^{(j)})\, \pi_j(\theta^{(i)})}{\pi_i(\theta^{(i)})\, \pi_j(\theta^{(j)})}\right\}$$

At the end of the run, output from only the cold chain is used, while output from the hot chains is discarded. Heuristically, the hot chains will visit the local peaks rather easily, and swapping states between chains will let the cold chain occasionally jump valleys, leading to better mixing. However, if the ratio $\pi_i(\theta)/\pi_j(\theta)$ is unstable, proposed swaps will seldom be accepted. This is the reason for using several chains which differ only incrementally. (See Figure 3.3 (to be added).)
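The interplay of heated chains and state swaps can be sketched in Python on a one-dimensional toy problem. This is a minimal sketch under stated assumptions: the target is a made-up bimodal mixture rather than a posterior over trees, the heating scheme raises the target to the power $1/(1 + \lambda(j-1))$ for chain $j$, and the names `log_target`, `incremental_heats`, `swap_log_ratio`, and `mc3` are hypothetical.

```python
import math
import random


def log_target(x):
    """Toy bimodal target (an assumption): equal mix of N(-4, 0.5^2) and N(4, 0.5^2)."""
    a = -0.5 * ((x + 4.0) / 0.5) ** 2
    b = -0.5 * ((x - 4.0) / 0.5) ** 2
    hi = max(a, b)
    return hi + math.log(math.exp(a - hi) + math.exp(b - hi))


def incremental_heats(n_chains, heat):
    """Powers 1 / (1 + heat * k) for k = 0, 1, ...; index 0 is the cold chain."""
    return [1.0 / (1.0 + heat * k) for k in range(n_chains)]


def swap_log_ratio(log_pi, beta_i, beta_j, x_i, x_j):
    """Log of pi_i(x_j) pi_j(x_i) / (pi_i(x_i) pi_j(x_j)) with pi_k = pi ** beta_k."""
    return (beta_i - beta_j) * (log_pi(x_j) - log_pi(x_i))


def mc3(log_pi, n_chains, heat, n_iter, step, x0, seed=0):
    """Metropolis-coupled MCMC; returns the states visited by the cold chain."""
    rng = random.Random(seed)
    betas = incremental_heats(n_chains, heat)
    states = [x0] * n_chains
    cold = []
    for _ in range(n_iter):
        # Within-chain Metropolis update against each heated density.
        for k in range(n_chains):
            y = states[k] + rng.uniform(-step, step)
            log_r = betas[k] * (log_pi(y) - log_pi(states[k]))
            if log_r >= 0.0 or rng.random() < math.exp(log_r):
                states[k] = y
        # Propose swapping the states of two randomly chosen chains.
        i, j = rng.sample(range(n_chains), 2)
        log_r = swap_log_ratio(log_pi, betas[i], betas[j], states[i], states[j])
        if log_r >= 0.0 or rng.random() < math.exp(log_r):
            states[i], states[j] = states[j], states[i]
        cold.append(states[0])  # only the cold chain is kept for inference
    return cold


if __name__ == "__main__":
    draws = mc3(log_target, n_chains=4, heat=1.0, n_iter=20000, step=1.0, x0=-4.0)
    right = sum(1 for d in draws if d > 0.0) / len(draws)
    print("fraction of cold-chain samples in the right-hand peak:", right)
```

For the symmetric toy target, a well-mixing cold chain should spend roughly half of its time in each peak, so the printed fraction gives a rough check on whether swaps are helping the chain cross the valley.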

An obvious disadvantage of the algorithm is that $m$ chains are run and only one chain is used for inference. For this reason, MC³ is ideally suited for implementation on parallel machines, since each chain will in general require the same amount of computation per iteration.

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.