Expectation-maximization algorithm
In statistics, an expectation–maximization (EM) algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which computes the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. These parameter estimates are then used to determine the distribution of the latent variables in the next E step.

History

The EM algorithm was explained and given its name in a classic 1977 paper by Arthur Dempster, Nan Laird, and Donald Rubin. They pointed out that the method had been "proposed many times in special circumstances" by earlier authors. In particular, a very detailed treatment of the EM method for exponential families was published by Rolf Sundberg in his thesis and several papers, following his collaboration with Per Martin-Löf and Anders Martin-Löf.
The 1977 Dempster–Laird–Rubin paper generalized the method and sketched a convergence analysis for a wider class of problems. Whatever the earlier history, the paper in the Journal of the Royal Statistical Society received an enthusiastic discussion at the Royal Statistical Society meeting, with Sundberg calling it "brilliant", and it established the EM method as an important tool of statistical analysis.

The convergence analysis of the Dempster–Laird–Rubin paper was flawed, and a correct convergence analysis was published by C. F. Jeff Wu in 1983. Wu's proof established the EM method's convergence outside of the exponential family, as claimed by Dempster–Laird–Rubin.

Description

Given a statistical model consisting of a set X of observed data, a set of unobserved latent data or missing values Z, and a vector of unknown parameters θ, along with a likelihood function L(θ; X, Z) = p(X, Z | θ), the maximum likelihood estimate (MLE) of the unknown parameters is determined by the marginal likelihood of the observed data

L(θ; X) = p(X | θ) = ∑_Z p(X, Z | θ).

However, this quantity is often intractable.

The EM algorithm seeks to find the MLE of the marginal likelihood by iteratively applying the following two steps:
Expectation step (E step): Calculate the expected value of the log likelihood function, with respect to the conditional distribution of Z given X under the current estimate of the parameters θ(t):

Q(θ | θ(t)) = E_{Z|X,θ(t)}[ log L(θ; X, Z) ]

Maximization step (M step): Find the parameter that maximizes this quantity:

θ(t+1) = argmax_θ Q(θ | θ(t))
Note that in typical models to which EM is applied:
  1. The observed data points may be discrete (taking values in a finite or countably infinite set) or continuous (taking values in an uncountably infinite set). There may in fact be a vector of observations associated with each data point.
  2. The missing values (aka latent variables) are discrete, drawn from a fixed number of values, and there is one latent variable per observed data point.
  3. The parameters are continuous, and are of two kinds: parameters that are associated with all data points, and parameters associated with a particular value of a latent variable (i.e. associated with all data points whose corresponding latent variable has that value).

However, it is possible to apply EM to other sorts of models.
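As a minimal, model-agnostic sketch of the two-step iteration above, the following Python skeleton alternates an E step and an M step until the observed-data log-likelihood stops improving. The function names (e_step, m_step, log_likelihood) and the convergence test are assumptions of this illustration, not part of any standard library; concrete models supply the two step functions, as the Gaussian mixture example below does explicitly.

```python
# Minimal, model-agnostic EM skeleton (illustrative sketch).
# `e_step` and `m_step` are hypothetical callables supplied by the user:
#   e_step(data, theta)    -> expected sufficient statistics / responsibilities
#   m_step(data, e_stats)  -> updated parameter estimate
# Stopping when the observed-data log-likelihood stalls is one common choice.

def em(data, theta0, e_step, m_step, log_likelihood, tol=1e-6, max_iter=500):
    theta = theta0
    prev_ll = log_likelihood(data, theta)
    for _ in range(max_iter):
        e_stats = e_step(data, theta)      # E step: use current theta
        theta = m_step(data, e_stats)      # M step: maximize expected log-likelihood
        ll = log_likelihood(data, theta)
        if abs(ll - prev_ll) < tol:        # stop once the likelihood stops improving
            break
        prev_ll = ll
    return theta
```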

The motivation is as follows. If we know the value of the parameters θ, we can usually find the value of the latent variables Z by maximizing the log-likelihood over all possible values of Z, either simply by iterating over Z or through an algorithm such as the Viterbi algorithm for hidden Markov models. Conversely, if we know the value of the latent variables Z, we can find an estimate of the parameters θ fairly easily, typically by grouping the observed data points according to the value of the associated latent variable and averaging the values, or some function of the values, of the points in each group. This suggests an iterative algorithm, in the case where both θ and Z are unknown:
  1. First, initialize the parameters θ to some random values.
  2. Compute the best value for Z given these parameter values.
  3. Then, use the just-computed values of Z to compute a better estimate for the parameters θ. Parameters associated with a particular value of Z will use only those data points whose associated latent variable has that value.
  4. Iterate steps 2 and 3 until convergence.

The algorithm as just described monotonically approaches a local minimum of the cost function, and is commonly called hard EM. The k-means algorithm is an example of this class of algorithms.
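A minimal sketch of such a hard-EM procedure, in the k-means style, might look as follows in Python with NumPy; the initialization scheme and the handling of empty clusters are simplifying assumptions for illustration.

```python
import numpy as np

def hard_em_kmeans(x, k, n_iter=100, seed=0):
    """Bare-bones k-means viewed as hard EM; x has shape (n, d)."""
    rng = np.random.default_rng(seed)
    means = x[rng.choice(len(x), size=k, replace=False)]      # step 1: random initialization
    for _ in range(n_iter):
        # step 2: hard E step -- assign each point to its nearest mean
        dists = ((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        z = dists.argmin(axis=1)
        # step 3: hard M step -- each mean uses only the points assigned to it
        new_means = np.array([x[z == j].mean(axis=0) if np.any(z == j) else means[j]
                              for j in range(k)])
        if np.allclose(new_means, means):                     # step 4: stop at convergence
            break
        means = new_means
    return means, z
```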

However, we can do somewhat better. Rather than making a hard choice for Z given the current parameter values and averaging only over the set of data points associated with a particular value of Z, we can instead determine the probability of each possible value of Z for each data point, and then use the probabilities associated with a particular value of Z to compute a weighted average over the entire set of data points. The resulting algorithm is commonly called soft EM, and is the type of algorithm normally associated with EM. The counts used to compute these weighted averages are called soft counts (as opposed to the hard counts used in a hard-EM-type algorithm such as k-means). The probabilities computed for Z are posterior probabilities and are what is computed in the E step. The soft counts used to compute new parameter values are what is computed in the M step.
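The contrast with hard assignment can be made concrete with a small one-dimensional sketch (the data and the current parameter estimates are arbitrary values assumed for illustration): each point contributes fractionally to both components, and the column sums of the responsibility matrix are the soft counts.

```python
import numpy as np

x = np.array([0.1, 0.3, 2.8, 3.1, 3.0])            # toy 1-D data (illustrative)
mu = np.array([0.0, 3.0])                           # assumed current component means
sigma, tau = 1.0, np.array([0.5, 0.5])              # assumed common std. dev. and mixing weights

# E step: posterior probability of each component for each point (soft assignment)
dens = np.exp(-0.5 * ((x[:, None] - mu[None, :]) / sigma) ** 2) * tau[None, :]
resp = dens / dens.sum(axis=1, keepdims=True)       # responsibilities; each row sums to 1

soft_counts = resp.sum(axis=0)                      # soft counts per component
weighted_means = (resp * x[:, None]).sum(axis=0) / soft_counts  # weighted averages (M step)
```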

Properties

Speaking of an expectation (E) step is a bit of a misnomer. What is calculated in the first step is the set of fixed, data-dependent parameters of the function Q. Once the parameters of Q are known, it is fully determined and is maximized in the second (M) step of an EM algorithm.

Although an EM iteration does increase the observed data (i.e. marginal) likelihood function, there is no guarantee that the sequence converges to a maximum likelihood estimator. For multimodal distributions, this means that an EM algorithm may converge to a local maximum of the observed data likelihood function, depending on starting values. There are a variety of heuristic or metaheuristic approaches for escaping a local maximum, such as random restart (starting from several different random initial estimates θ(t)), or applying simulated annealing methods.
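A random-restart strategy is simple to sketch: run EM from several random initializations and keep the estimate with the highest observed-data log-likelihood. The run_em, random_init and log_likelihood callables below are hypothetical placeholders standing in for a concrete model's routines.

```python
def em_with_restarts(data, n_restarts, run_em, random_init, log_likelihood):
    """Keep the EM solution with the best observed-data log-likelihood (sketch)."""
    best_theta, best_ll = None, float("-inf")
    for r in range(n_restarts):
        theta = run_em(data, random_init(seed=r))   # one full EM run from a random start
        ll = log_likelihood(data, theta)
        if ll > best_ll:
            best_theta, best_ll = theta, ll
    return best_theta
```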

EM is particularly useful when the likelihood is an exponential family: the E step becomes the sum of expectations of sufficient statistics, and the M step involves maximizing a linear function. In such a case, it is usually possible to derive closed-form updates for each step, using the Sundberg formula (published by Rolf Sundberg using unpublished results of Per Martin-Löf and Anders Martin-Löf).

The EM method was modified to compute maximum a posteriori (MAP) estimates for Bayesian inference in the original paper by Dempster, Laird, and Rubin.

There are other methods for finding maximum likelihood estimates, such as gradient descent, conjugate gradient, or variations of the Gauss–Newton method. Unlike EM, such methods typically require the evaluation of first and/or second derivatives of the likelihood function.

Alternative description

Under some circumstances, it is convenient to view the EM algorithm as two alternating maximization steps. Consider the function:

F(q, θ) = Eq[ log L(θ; x, Z) ] + H(q) = log L(θ; x) − DKL( q ‖ pZ|X(· | x; θ) ),

where q is an arbitrary probability distribution over the unobserved data z, pZ|X(· | x; θ) is the conditional distribution of the unobserved data given the observed data x, H is the entropy and DKL is the Kullback–Leibler divergence.

Then the steps in the EM algorithm may be viewed as:
Expectation step: Choose q to maximize F:  q(t) = argmax_q F(q, θ(t))
Maximization step: Choose θ to maximize F:  θ(t+1) = argmax_θ F(q(t), θ)
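The equivalence with the usual E and M steps follows from the standard decomposition of F (a short derivation in the notation above):

```latex
% F decomposes into the observed-data log-likelihood minus a KL term.
\begin{aligned}
F(q,\theta) &= \operatorname{E}_{q}\left[\log L(\theta; x, Z)\right] + H(q) \\
            &= \log L(\theta; x) - D_{\mathrm{KL}}\!\left(q \,\middle\|\, p_{Z\mid X}(\cdot \mid x; \theta)\right).
\end{aligned}
```

For fixed θ, F is maximized over q by taking q = pZ|X(· | x; θ), since the KL divergence is nonnegative and vanishes only at that choice; this is the E step. With that q fixed, the entropy term does not depend on θ, so maximizing F over θ amounts to maximizing Eq[ log L(θ; x, Z) ], i.e. the function Q(θ | θ(t)) of the M step.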

Applications

EM is frequently used for data clustering in machine learning and computer vision. In natural language processing, two prominent instances of the algorithm are the Baum–Welch algorithm (also known as forward–backward) and the inside–outside algorithm for unsupervised induction of probabilistic context-free grammars.

In psychometrics, EM is almost indispensable for estimating item parameters and latent abilities of item response theory models.

Because it can handle missing data and unobserved variables, EM is also becoming a useful tool for pricing and managing the risk of a portfolio.

The EM algorithm (and its faster variant, ordered subset expectation maximization) is also widely used in medical image reconstruction, especially in positron emission tomography and single photon emission computed tomography. See below for other faster variants of EM.

Variants

A number of methods have been proposed to accelerate the sometimes slow convergence of the EM algorithm, such as those using conjugate gradient and modified Newton–Raphson techniques. Additionally, EM can be used with constrained estimation techniques.

Expectation conditional maximization (ECM) replaces each M step with a sequence of conditional maximization (CM) steps in which each parameter θi is maximized individually, conditionally on the other parameters remaining fixed.

This idea is further extended in the generalized expectation maximization (GEM) algorithm, in which one seeks only an increase in the objective function F for both the E step and the M step, as in the alternative description above.

It is also possible to consider the EM algorithm as a subclass of the MM (Majorize/Minimize or Minorize/Maximize, depending on context) algorithm, and therefore use any machinery developed in the more general case.

Relation to variational Bayes methods

EM is a partially non-Bayesian, maximum likelihood method. Its final result gives a probability distribution over the latent variables (in the Bayesian style) together with a point estimate for θ (either a maximum likelihood estimate or a posterior mode). We may want a fully Bayesian version of this, giving a probability distribution over θ as well as the latent variables. In fact, the Bayesian approach to inference is simply to treat θ as another latent variable. In this paradigm, the distinction between the E and M steps disappears. If we use the factorized Q approximation as described above (variational Bayes), we may iterate over each latent variable (now including θ) and optimize them one at a time. There are now k steps per iteration, where k is the number of latent variables. For graphical models this is easy to do, as each variable's new Q depends only on its Markov blanket, so local message passing can be used for efficient inference.

Geometric interpretation

In information geometry, the E step and the M step are interpreted as projections under dual affine connections, called the e-connection and the m-connection; the Kullback–Leibler divergence can also be understood in these terms.

Example: Gaussian mixture


Let x = (x1, x2, …, xn) be a sample of n independent observations from a mixture of two multivariate normal distributions of dimension d, and let z = (z1, z2, …, zn) be the latent variables that determine the component from which each observation originates:

Xi | (Zi = 1) ~ Nd(μ1, Σ1)  and  Xi | (Zi = 2) ~ Nd(μ2, Σ2),

where P(Zi = 1) = τ1 and P(Zi = 2) = τ2 = 1 − τ1.

The aim is to estimate the unknown parameters representing the "mixing" value between the Gaussians and the means and covariances of each:

θ = (τ, μ1, μ2, Σ1, Σ2),

where the complete-data likelihood function is

L(θ; x, z) = P(x, z | θ) = ∏_{i=1}^n ∏_{j=1}^2 [ f(xi; μj, Σj) τj ]^{I(zi = j)},

where I is an indicator function and f is the probability density function of a multivariate normal. This may be rewritten in exponential family form:

L(θ; x, z) = exp{ ∑_{i=1}^n ∑_{j=1}^2 I(zi = j) [ log τj − ½ log|Σj| − ½ (xi − μj)ᵀ Σj⁻¹ (xi − μj) − (d/2) log(2π) ] }.
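To make the setup concrete, the following Python/NumPy sketch draws a sample from this two-component model; the particular values of τ, μ1, μ2, Σ1 and Σ2 are arbitrary assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 2                                       # sample size and dimension (illustrative)
tau = np.array([0.4, 0.6])                          # assumed mixing proportions
mu = [np.array([0.0, 0.0]), np.array([4.0, 3.0])]   # assumed component means
Sigma = [np.eye(d), np.array([[2.0, 0.5],
                              [0.5, 1.0]])]         # assumed component covariances

z = rng.choice(2, size=n, p=tau)                    # latent component labels z_i
x = np.stack([rng.multivariate_normal(mu[k], Sigma[k]) for k in z])  # observations x_i
```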

E step

Given our current estimate of the parameters θ(t), the conditional distribution of the Zi is determined by Bayes' theorem to be the proportional height of the normal density weighted by τ:

T_{j,i}(t) := P(Zi = j | Xi = xi; θ(t)) = τj(t) f(xi; μj(t), Σj(t)) / [ τ1(t) f(xi; μ1(t), Σ1(t)) + τ2(t) f(xi; μ2(t), Σ2(t)) ].

Thus, the E step results in the function:

Q(θ | θ(t)) = E_{Z|x,θ(t)}[ log L(θ; x, Z) ] = ∑_{i=1}^n ∑_{j=1}^2 T_{j,i}(t) [ log τj − ½ log|Σj| − ½ (xi − μj)ᵀ Σj⁻¹ (xi − μj) − (d/2) log(2π) ].
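In code, the E step amounts to computing this responsibility matrix T. A sketch using SciPy's multivariate normal density follows; the parameter layout (tau as an array, mu and Sigma as lists over the two components) is an assumption of this example, not a fixed convention.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(x, tau, mu, Sigma):
    """Return T[j, i] = P(Z_i = j | X_i = x_i; theta) for a 2-component mixture."""
    # Proportional height of each normal density, weighted by tau_j ...
    weighted = np.stack([tau[j] * multivariate_normal.pdf(x, mean=mu[j], cov=Sigma[j])
                         for j in range(2)])            # shape (2, n)
    # ... normalized over the two components (Bayes' theorem).
    return weighted / weighted.sum(axis=0, keepdims=True)
```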

M step

The quadratic form of Q(θ | θ(t)) means that determining the maximizing values of θ is relatively straightforward. Note first that τ, (μ1, Σ1) and (μ2, Σ2) may all be maximized independently of each other, since they appear in separate linear terms.

First, consider τ, which has the constraint τ1 + τ2 = 1:

τ(t+1) = argmax_τ ∑_{i=1}^n [ T_{1,i}(t) log τ1 + T_{2,i}(t) log τ2 ]  subject to τ1 + τ2 = 1.

This has the same form as the MLE for the binomial distribution, so

τj(t+1) = ∑_{i=1}^n T_{j,i}(t) / n.

For the next estimates of (μ1, Σ1):

(μ1(t+1), Σ1(t+1)) = argmax_{μ1, Σ1} ∑_{i=1}^n T_{1,i}(t) [ −½ log|Σ1| − ½ (xi − μ1)ᵀ Σ1⁻¹ (xi − μ1) ].

This has the same form as a weighted MLE for a normal distribution, so

μ1(t+1) = ∑_{i=1}^n T_{1,i}(t) xi / ∑_{i=1}^n T_{1,i}(t)  and  Σ1(t+1) = ∑_{i=1}^n T_{1,i}(t) (xi − μ1(t+1))(xi − μ1(t+1))ᵀ / ∑_{i=1}^n T_{1,i}(t),

and, by symmetry, the analogous updates hold for μ2(t+1) and Σ2(t+1).
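The M-step updates, together with a small driver loop alternating the two steps, might be sketched as follows; this reuses the e_step function from the sketch above and is an illustrative implementation of the formulas just derived, not a reference routine. Applied to data like the simulated sample earlier, it should approximately recover the generating parameters, though in practice one would also track the observed-data log-likelihood to monitor convergence.

```python
import numpy as np

def m_step(x, T):
    """Update (tau, mu, Sigma) from responsibilities T of shape (2, n)."""
    n = x.shape[0]
    counts = T.sum(axis=1)                    # soft counts per component
    tau = counts / n                          # binomial-type MLE for the mixing weights
    mu, Sigma = [], []
    for j in range(2):
        mu_j = (T[j][:, None] * x).sum(axis=0) / counts[j]          # weighted mean
        diff = x - mu_j
        Sigma_j = (T[j][:, None, None] *
                   (diff[:, :, None] * diff[:, None, :])).sum(axis=0) / counts[j]  # weighted covariance
        mu.append(mu_j)
        Sigma.append(Sigma_j)
    return tau, mu, Sigma

def fit_gmm(x, tau, mu, Sigma, n_iter=100):
    """Alternate the E and M steps; e_step is the function from the E-step sketch above."""
    for _ in range(n_iter):
        T = e_step(x, tau, mu, Sigma)   # E step: responsibilities under current parameters
        tau, mu, Sigma = m_step(x, T)   # M step: closed-form weighted-MLE updates
    return tau, mu, Sigma
```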

External links

  • Various 1D, 2D and 3D demonstrations of EM together with mixture modeling are provided as part of the paired SOCR activities and applets. These applets and activities show empirically the properties of the EM algorithm for parameter estimation in diverse settings.
  • Class hierarchy in C++ (GPL) including Gaussian Mixtures