Bayes estimator
In estimation theory and decision theory, a Bayes estimator or a Bayes action is an estimator or decision rule that minimizes the posterior expected value of a loss function (i.e., the posterior expected loss). Equivalently, it maximizes the posterior expectation of a utility function. An alternative way of formulating an estimator within Bayesian statistics is maximum a posteriori (MAP) estimation.
Definition
Suppose an unknown parameter $\theta$ is known to have a prior distribution $\pi$. Let $\hat{\theta} = \hat{\theta}(x)$ be an estimator of $\theta$ (based on some measurements $x$), and let $L(\theta, \hat{\theta})$ be a loss function, such as squared error. The Bayes risk of $\hat{\theta}$ is defined as $E_\pi\left(L(\theta, \hat{\theta})\right)$, where the expectation is taken over the probability distribution of $\theta$: this defines the risk function as a function of $\hat{\theta}$. An estimator $\hat{\theta}$ is said to be a Bayes estimator if it minimizes the Bayes risk among all estimators. Equivalently, the estimator which minimizes the posterior expected loss $E\left(L(\theta, \hat{\theta}) \mid x\right)$ for each $x$ also minimizes the Bayes risk and therefore is a Bayes estimator.
If the prior is improper then an estimator which minimizes the posterior expected loss for each x is called a generalized Bayes estimator.
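As an illustrative sketch of this definition, a Bayes action can be approximated numerically by discretizing $\theta$ on a grid, forming the posterior, and picking the action with the smallest posterior expected loss. The normal likelihood, flat prior, grid resolution, loss, and numeric values below are assumptions chosen purely for illustration.

```python
import numpy as np

def bayes_action(x, grid, prior, likelihood, loss):
    """Return the action on `grid` minimizing the posterior expected loss for observation x."""
    post = prior * likelihood(x, grid)                    # unnormalized posterior on the grid
    post /= post.sum()                                    # normalize
    exp_loss = np.array([np.sum(loss(grid, a) * post) for a in grid])
    return grid[np.argmin(exp_loss)]                      # Bayes action for this x

# Illustrative choices: x | theta ~ N(theta, 1), flat prior on the grid, squared-error loss.
grid = np.linspace(-5, 5, 2001)
prior = np.ones_like(grid) / grid.size
likelihood = lambda x, th: np.exp(-0.5 * (x - th) ** 2)
sq_loss = lambda th, a: (th - a) ** 2

print(bayes_action(x=1.3, grid=grid, prior=prior, likelihood=likelihood, loss=sq_loss))
# Under squared error the minimizer is the posterior mean (about 1.3 here, since the prior is flat).
```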
Minimum mean square error estimation
The most common risk function used for Bayesian estimation is the mean square error (MSE), also called squared error risk. The MSE is defined by $\mathrm{MSE} = E\left[(\hat{\theta}(x) - \theta)^2\right]$, where the expectation is taken over the joint distribution of $\theta$ and $x$.
Using the MSE as risk, the Bayes estimate of the unknown parameter is simply the mean of the posterior distribution, $\hat{\theta}(x) = E[\theta \mid x] = \int \theta \, p(\theta \mid x)\, d\theta.$
This is known as the minimum mean square error (MMSE) estimator. The Bayes risk, in this case, is the posterior variance.
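As a small simulation sketch (the normal prior and likelihood and all numeric values below are assumptions chosen for illustration), one can check that the posterior mean attains a lower MSE than using the raw observation:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, tau = 0.0, 1.0            # assumed prior: theta ~ N(mu, tau^2)
sigma = 2.0                   # assumed likelihood: x | theta ~ N(theta, sigma^2)

theta = rng.normal(mu, tau, size=100_000)
x = rng.normal(theta, sigma)

# Posterior mean for this conjugate model shrinks x toward the prior mean.
w = tau**2 / (tau**2 + sigma**2)
theta_mmse = w * x + (1 - w) * mu

print("MSE of x alone        :", np.mean((x - theta) ** 2))           # about sigma^2 = 4
print("MSE of posterior mean :", np.mean((theta_mmse - theta) ** 2))  # about sigma^2*tau^2/(sigma^2+tau^2) = 0.8
```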
Bayes estimators for conjugate priors
If there is no inherent reason to prefer one prior probability distribution over another, a conjugate prior is sometimes chosen for simplicity. A conjugate prior is defined as a prior distribution belonging to some parametric family, for which the resulting posterior distribution also belongs to the same family. This is an important property, since the Bayes estimator, as well as its statistical properties (variance, confidence interval, etc.), can all be derived from the posterior distribution.
Conjugate priors are especially useful for sequential estimation, where the posterior of the current measurement is used as the prior in the next measurement. In sequential estimation, unless a conjugate prior is used, the posterior distribution typically becomes more complex with each added measurement, and the Bayes estimator cannot usually be calculated without resorting to numerical methods.
Following are some examples of conjugate priors.
- If $x \mid \theta$ is normal, $x \mid \theta \sim N(\theta, \sigma^2)$, and the prior is normal, $\theta \sim N(\mu, \tau^2)$, then the posterior is also normal and the Bayes estimator under MSE is given by $\hat{\theta}(x) = \frac{\sigma^2}{\sigma^2 + \tau^2}\,\mu + \frac{\tau^2}{\sigma^2 + \tau^2}\,x.$
- If $x_1, \ldots, x_n$ are iid Poisson random variables, $x_i \mid \theta \sim P(\theta)$, and if the prior is Gamma distributed, $\theta \sim G(a, b)$ (shape $a$, scale $b$), then the posterior is also Gamma distributed, and the Bayes estimator under MSE is given by $\hat{\theta}(x) = \frac{a + \sum_{i=1}^{n} x_i}{n + 1/b}.$
- If $x_1, \ldots, x_n$ are iid uniformly distributed, $x_i \mid \theta \sim U(0, \theta)$, and if the prior is Pareto distributed, $\theta \sim Pa(\theta_0, a)$, then the posterior is also Pareto distributed, and the Bayes estimator under MSE is given by $\hat{\theta}(x) = \frac{(a + n)\,\max(\theta_0, x_1, \ldots, x_n)}{a + n - 1}.$
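As a small sketch of why conjugacy is convenient for sequential estimation (the starting prior, noise variance, and measurements below are assumed values chosen for illustration), the normal-mean case can be updated in closed form after every measurement, with the posterior of one step serving as the prior of the next:

```python
def normal_update(mu, tau2, x, sigma2):
    """Conjugate update for x | theta ~ N(theta, sigma2) with prior theta ~ N(mu, tau2).
    The posterior is again normal; its mean is the Bayes estimate under MSE."""
    post_tau2 = 1.0 / (1.0 / tau2 + 1.0 / sigma2)
    post_mu = post_tau2 * (mu / tau2 + x / sigma2)
    return post_mu, post_tau2

mu, tau2, sigma2 = 0.0, 10.0, 1.0          # assumed starting prior and known noise variance
for x in [1.2, 0.7, 1.5, 0.9]:             # assumed measurements
    mu, tau2 = normal_update(mu, tau2, x, sigma2)
    print(f"posterior mean = {mu:.3f}, posterior variance = {tau2:.3f}")
```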
Alternative risk functions
Risk functions are chosen depending on how one measures the distance between the estimate and the unknown parameter. The MSE is the most common risk function in use, primarily due to its simplicity. However, alternative risk functions are also occasionally used. The following are several examples of such alternatives. We denote the posterior generalized distribution function by $F$.
- A "linear" loss function, with $a > 0$, which yields the posterior median as the Bayes estimate: $L(\theta, \hat{\theta}) = a\,|\theta - \hat{\theta}|$, so that $F(\hat{\theta}(x) \mid X) = \tfrac{1}{2}$.
- Another "linear" loss function, which assigns different "weights" $a, b > 0$ to overestimation and underestimation. It yields a quantile from the posterior distribution, and is a generalization of the previous loss function: $L(\theta, \hat{\theta}) = a\,|\theta - \hat{\theta}|$ for $\theta - \hat{\theta} \ge 0$ and $b\,|\theta - \hat{\theta}|$ for $\theta - \hat{\theta} < 0$, so that $F(\hat{\theta}(x) \mid X) = \tfrac{a}{a+b}$.
- The following loss function is trickier: it yields either the posterior mode, or a point close to it, depending on the curvature and properties of the posterior distribution. Small values of the parameter $K > 0$ are recommended, in order to use the mode as an approximation (with $L > 0$): $L(\theta, \hat{\theta}) = 0$ if $|\theta - \hat{\theta}| < K$ and $L(\theta, \hat{\theta}) = L$ if $|\theta - \hat{\theta}| \ge K$.
Other loss functions can be conceived, although the mean squared error is the most widely used and validated.
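As a hedged numerical illustration of the loss functions above (the Gamma posterior used here is an arbitrary assumption), posterior draws make the corresponding Bayes estimates easy to read off: absolute error gives the posterior median, and the asymmetric linear loss gives the $a/(a+b)$ posterior quantile.

```python
import numpy as np

rng = np.random.default_rng(1)
post_draws = rng.gamma(shape=3.0, scale=2.0, size=200_000)   # assumed posterior samples

# Absolute-error ("linear") loss -> posterior median.
print("posterior median:", np.median(post_draws))

# Asymmetric linear loss, weight a on underestimation and b on overestimation
# -> the a/(a+b) posterior quantile.
a, b = 3.0, 1.0
print("a/(a+b) quantile:", np.quantile(post_draws, a / (a + b)))
```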
Generalized Bayes estimators
The prior distribution has thus far been assumed to be a true probability distribution, in that $\int \pi(\theta)\, d\theta = 1.$ However, occasionally this can be a restrictive requirement. For example, there is no distribution (covering the set $\mathbb{R}$ of all real numbers) for which every real number is equally likely. Yet, in some sense, such a "distribution" seems like a natural choice for a non-informative prior, i.e., a prior distribution which does not imply a preference for any particular value of the unknown parameter. One can still define a function $\pi(\theta) = 1$ for all $\theta$, but this would not be a proper probability distribution since it has infinite mass, $\int \pi(\theta)\, d\theta = \infty.$
Such measures, which are not probability distributions, are referred to as improper priors.
The use of an improper prior means that the Bayes risk is undefined (since the prior is not a probability distribution and we cannot take an expectation under it). As a consequence, it is no longer meaningful to speak of a Bayes estimator that minimizes the Bayes risk. Nevertheless, in many cases, one can define the posterior distribution $\pi(\theta \mid x) = \frac{p(x \mid \theta)\, \pi(\theta)}{\int p(x \mid \theta)\, \pi(\theta)\, d\theta}.$
This is a definition, and not an application of Bayes' theorem, since Bayes' theorem can only be applied when all distributions are proper. However, it is not uncommon for the resulting "posterior" to be a valid probability distribution. In this case, the posterior expected loss $\int L(\theta, a)\, \pi(\theta \mid x)\, d\theta$
is typically well-defined and finite. Recall that, for a proper prior, the Bayes estimator minimizes the posterior expected loss. When the prior is improper, an estimator which minimizes the posterior expected loss is referred to as a generalized Bayes estimator.
Example
A typical example concerns the estimation of a location parameter with a loss function of the type $L(a - \theta)$. Here $\theta$ is a location parameter, i.e., $p(x \mid \theta) = f(x - \theta)$.
It is common to use the improper prior $\pi(\theta) = 1$ in this case, especially when no other more subjective information is available. This yields the posterior $\pi(\theta \mid x) = f(x - \theta),$
so the posterior expected loss equals $E[L(a - \theta) \mid x] = \int L(a - \theta)\, f(x - \theta)\, d\theta.$
The generalized Bayes estimator is the value $a(x)$ which minimizes this expression for each $x$. This is equivalent to minimizing $\int L(a - \theta)\, f(x - \theta)\, d\theta$ for each $x$.  (1)
It can be shown that, in this case, the generalized Bayes estimator has the form $x + a_0$, for some constant $a_0$. To see this, let $a_0$ be the value minimizing (1) when $x = 0$. Then, given a different value $x_1$, we must minimize $\int L(a - \theta)\, f(x_1 - \theta)\, d\theta = \int L(a - x_1 - \theta')\, f(-\theta')\, d\theta'.$  (2)
This is identical to (1), except that $a$ has been replaced by $a - x_1$. Thus, the minimizing value satisfies $a - x_1 = a_0$, so that the optimal estimator has the form $a(x) = a_0 + x.$
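The shift-invariance argument above can also be checked numerically. In the sketch below, the density, loss function, and grids are assumptions chosen for illustration; the posterior expected loss is minimized on a grid for two different observations, and the minimizers differ exactly by the difference in $x$, confirming the form $a(x) = a_0 + x$.

```python
import numpy as np

theta = np.linspace(-10, 10, 4001)
loss = lambda u: np.where(u >= 0, 2.0 * u, -u)        # assumed asymmetric linear loss L(a - theta)

def gen_bayes(x):
    """Grid-minimize the posterior expected loss under the flat (improper) prior."""
    post = np.exp(-0.5 * (x - theta) ** 2)            # assumed f: standard normal kernel, posterior f(x - theta)
    actions = np.linspace(x - 5, x + 5, 2001)
    exp_loss = [np.sum(loss(a - theta) * post) for a in actions]
    return actions[np.argmin(exp_loss)]

a0 = gen_bayes(0.0)
print(a0, gen_bayes(3.0) - 3.0)   # the two printed values agree: the estimator is x + a0
```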
Empirical Bayes estimators
A Bayes estimator derived through the empirical Bayes method is called an empirical Bayes estimator. Empirical Bayes methods enable the use of auxiliary empirical data, from observations of related parameters, in the development of a Bayes estimator. This is done under the assumption that the estimated parameters are obtained from a common prior. For example, if independent observations of different parameters are performed, then the estimation performance of a particular parameter can sometimes be improved by using data from other observations.
There are parametric and non-parametric approaches to empirical Bayes estimation. Parametric empirical Bayes is usually preferable since it is more applicable and more accurate on small amounts of data.
Example
The following is a simple example of parametric empirical Bayes estimation. Given past observations $x_1, \ldots, x_n$ having conditional distribution $f(x_i \mid \theta_i)$, one is interested in estimating $\theta_{n+1}$ based on $x_{n+1}$. Assume that the $\theta_i$'s have a common prior $\pi$ which depends on unknown parameters. For example, suppose that $\pi$ is normal with unknown mean $\mu_\pi$ and variance $\sigma_\pi^2$. We can then use the past observations to determine the mean and variance of $\pi$ in the following way.
First, we estimate the mean $\mu_m$ and variance $\sigma_m^2$ of the marginal distribution of $x_1, \ldots, x_n$ using the maximum likelihood approach: $\hat{\mu}_m = \frac{1}{n} \sum_i x_i, \qquad \hat{\sigma}_m^2 = \frac{1}{n} \sum_i (x_i - \hat{\mu}_m)^2.$
Next, we use the relations $\mu_m = E_\pi[\mu_f(\theta)], \qquad \sigma_m^2 = E_\pi[\sigma_f^2(\theta)] + E_\pi[(\mu_f(\theta) - \mu_m)^2],$
where $\mu_f(\theta)$ and $\sigma_f(\theta)$ are the mean and standard deviation of the conditional distribution $f(x_i \mid \theta_i)$, which are assumed to be known. In particular, suppose that $\mu_f(\theta) = \theta$ and that $\sigma_f^2(\theta) = K$; we then have $\mu_\pi = \mu_m, \qquad \sigma_\pi^2 = \sigma_m^2 - K.$
Finally, we obtain the estimated moments of the prior, $\hat{\mu}_\pi = \hat{\mu}_m, \qquad \hat{\sigma}_\pi^2 = \hat{\sigma}_m^2 - K.$
For example, if $x_i \mid \theta_i \sim N(\theta_i, 1)$, and if we assume a normal prior (which is a conjugate prior in this case), we conclude that $\theta_{n+1} \sim N(\hat{\mu}_\pi, \hat{\sigma}_\pi^2)$, from which the Bayes estimator of $\theta_{n+1}$ based on $x_{n+1}$ can be calculated.
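A minimal simulation sketch of this procedure follows; the true prior parameters, the choice $K = 1$, the sample size, and the new observation are assumed values used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed setup mirroring the example: x_i | theta_i ~ N(theta_i, K) with K = 1 known,
# and theta_i drawn from an unknown common prior N(mu_pi, sigma_pi^2).
mu_pi_true, sigma_pi_true, K = 2.0, 1.5, 1.0
n = 5000
theta = rng.normal(mu_pi_true, sigma_pi_true, size=n)
x = rng.normal(theta, np.sqrt(K))

# Step 1: ML estimates of the marginal mean and variance of the x_i.
mu_m = x.mean()
var_m = x.var()                        # marginal variance is sigma_pi^2 + K

# Step 2: moment relations give estimates of the prior parameters.
mu_pi_hat = mu_m
var_pi_hat = max(var_m - K, 0.0)

# Step 3: empirical Bayes (posterior mean) estimate of a new theta from a new observation.
x_new = 4.0
w = var_pi_hat / (var_pi_hat + K)
theta_hat = w * x_new + (1 - w) * mu_pi_hat
print(mu_pi_hat, var_pi_hat, theta_hat)
```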
Admissibility
Bayes rules having finite Bayes risk are typically admissible. The following are some specific examples of admissibility theorems.
- If a Bayes rule is unique then it is admissible. For example, as stated above, under mean squared error (MSE) the Bayes rule is unique and therefore admissible.
- If θ belongs to a discrete set, then all Bayes rules are admissible.
- If θ belongs to a continuous (non-discrete) set, and if the risk function R(θ,δ) is continuous in θ for every δ, then all Bayes rules are admissible.
By contrast, generalized Bayes rules often have undefined Bayes risk in the case of improper priors. These rules are often inadmissible and the verification of their admissibility can be difficult. For example, the generalized Bayes estimator of a location parameter θ based on Gaussian samples (described in the "Generalized Bayes estimators" section above) is inadmissible for dimension $p > 2$; this is known as Stein's phenomenon.
Asymptotic efficiency
Let θ be an unknown random variable, and suppose that $x_1, x_2, \ldots$ are iid samples with density $f(x_i \mid \theta)$. Let $\delta_n = \delta_n(x_1, \ldots, x_n)$ be a sequence of Bayes estimators of θ based on an increasing number of measurements. We are interested in analyzing the asymptotic performance of this sequence of estimators, i.e., the performance of $\delta_n$ for large n.
To this end, it is customary to regard θ as a deterministic parameter whose true value is $\theta_0$. Under specific conditions, for large samples (large values of n), the posterior density of θ is approximately normal. In other words, for large n, the effect of the prior probability on the posterior is negligible. Moreover, if $\delta_n$ is the Bayes estimator under MSE risk, then it is asymptotically unbiased and it converges in distribution to the normal distribution: $\sqrt{n}\,(\delta_n - \theta_0) \to N\!\left(0, \tfrac{1}{I(\theta_0)}\right),$
where $I(\theta_0)$ is the Fisher information of $\theta_0$.
It follows that the Bayes estimator δn under MSE is asymptotically efficient.
Another estimator which is asymptotically normal and efficient is the maximum likelihood estimator (MLE). The relations between the maximum likelihood and Bayes estimators can be shown in the following simple example.
Consider the estimator of θ based on a binomial sample x ~ b(θ,n), where θ denotes the probability of success. Assuming θ is distributed according to the conjugate prior, which in this case is the Beta distribution B(a,b), the posterior distribution is known to be B(a+x, b+n-x). Thus, the Bayes estimator under MSE is $\delta_n(x) = E[\theta \mid x] = \frac{a + x}{a + b + n}.$
The MLE in this case is $\delta_{\mathrm{MLE}} = x/n$, and so we get $\delta_n(x) = \frac{a + b}{a + b + n}\, E[\theta] + \frac{n}{a + b + n}\, \delta_{\mathrm{MLE}},$ where $E[\theta] = \frac{a}{a+b}$ is the mean of the prior distribution.
The last equation implies that, for n → ∞, the Bayes estimator (in the described problem) is close to the MLE.
On the other hand, when n is small, the prior information is still relevant to the decision problem and affects the estimate. To see the relative weight of the prior information, assume that a=b; in this case each measurement brings in 1 new bit of information; the formula above shows that the prior information has the same weight as a+b bits of the new information. In applications, one often knows very little about fine details of the prior distribution; in particular, there is no reason to assume that it coincides with B(a,b) exactly. In such a case, one possible interpretation of this calculation is: "there is a non-pathological prior distribution with the mean value 0.5 and the standard deviation d which gives the weight of prior information equal to 1/(4d2)-1 bits of new information."
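The convergence of the Bayes estimator toward the MLE in this binomial example can be observed directly; in the sketch below, the true success probability and the Beta(a, b) prior parameters are assumed values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
theta_true, a, b = 0.3, 4.0, 4.0          # assumed truth and prior

for n in [10, 100, 10_000]:
    x = rng.binomial(n, theta_true)
    mle = x / n
    bayes = (a + x) / (a + b + n)         # posterior mean of Beta(a + x, b + n - x)
    print(f"n={n:6d}  MLE={mle:.4f}  Bayes={bayes:.4f}")
# For large n the two estimates coincide; for small n the prior pulls the Bayes
# estimate toward the prior mean a/(a+b) = 0.5.
```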
Practical example of misapplication of Bayes estimators
For many years, the Internet Movie Database has used a formula for calculating the Top Rated 250 Titles which is claimed to give "a true Bayesian estimate": $W = \frac{Rv + Cm}{v + m},$
where:
- $W$ = weighted rating
- $R$ = average rating for the movie as a number from 0 to 10 (mean) = (Rating)
- $v$ = number of votes for the movie = (votes)
- $m$ = minimum votes required to be listed in the Top 250 (currently 3000)
- $C$ = the mean vote across the whole report (currently 6.9)
For the Top 250, only votes from regular voters are considered.
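A direct transcription of the formula (the ratings and vote counts in the usage example are made-up values; $m$ and $C$ are the figures quoted above):

```python
def imdb_weighted_rating(R, v, m=3000, C=6.9):
    """W = (v/(v+m))*R + (m/(v+m))*C: the film's mean vote shrunk toward the report-wide
    mean C, with m controlling the weight given to the prior information."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# A film rated 9.2 by 4000 voters versus one rated 9.6 by only 500 voters (made-up numbers):
print(imdb_weighted_rating(R=9.2, v=4000))   # ~8.21: many votes, stays close to its own mean
print(imdb_weighted_rating(R=9.6, v=500))    # ~7.29: few votes, pulled strongly toward C
```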
Comparing this formula with the one in the preceding section, one can see that m must have been related to the relative weight of the prior information in units of the new information given by one vote. Hence C must be the mean vote across the movies with more than 3000 votes, and m should be related to the deviation of votes in this pool.
For example, assume that a new vote brings in about 2 bits of information (one bit for whether it is above or below the average, and one bit for how far from the average it is; this assumes that votes are close to the average, but not very close). Then having m = 3000 corresponds to prior information weighting 6000 bits. To illustrate what kind of prior distribution such a giant weight might correspond to, consider again the distribution of the preceding section (it describes a very different measurement process, but the order of magnitude should be close); 6000 bits then correspond to a standard deviation d of about 1/155 of the possible range (1 to 10), i.e., about 1/17 of one vote.
Needless to say, such a small deviation is absurd: the minimal possible deviation, given integer votes and an average of 6.9, is 0.3. For example, if the actual deviation is about 0.7, this corresponds to prior information weighting close to 40 bits, and m being about 20. Of course, with such a small value of m, the formula above becomes practically indistinguishable from the common-sense formula W = R that one would expect anyway with such a high entrance threshold as 3000 votes.
As used, all that the formula does is give a major boost to films with significantly more than 3000 votes.

See also
- Admissible decision rule
- Recursive Bayesian estimation
- Empirical Bayes method
- Conjugate prior
- Generalized expected utility