Bayes estimator
In estimation theory and decision theory, a Bayes estimator or a Bayes action is an estimator or decision rule that minimizes the posterior expected value of a loss function (i.e., the posterior expected loss). Equivalently, it maximizes the posterior expectation of a utility function. An alternative way of formulating an estimator within Bayesian statistics is maximum a posteriori estimation.

Definition

Suppose an unknown parameter θ is known to have a prior distribution π. Let θ̂ = θ̂(x) be an estimator of θ (based on some measurements x), and let L(θ, θ̂) be a loss function, such as squared error. The Bayes risk of θ̂ is defined as E_π[L(θ, θ̂)], where the expectation is taken over the probability distribution of θ: this defines the risk function as a function of θ̂. An estimator θ̂ is said to be a Bayes estimator if it minimizes the Bayes risk among all estimators. Equivalently, the estimator which minimizes the posterior expected loss E[L(θ, θ̂) | x] for each x also minimizes the Bayes risk and therefore is a Bayes estimator.
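The definition can be made concrete with a small simulation. The following is a minimal sketch: a Beta(2, 2) prior and a Binomial likelihood are assumed purely for illustration (they are not prescribed by the definition), and the Bayes risk of the posterior mean is compared by Monte Carlo with that of the maximum-likelihood estimator under squared-error loss.

# Minimal sketch: compare the Bayes risk of two estimators under squared-error loss.
# The Beta(2, 2) prior and Binomial(n, theta) likelihood are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 10
a, b = 2.0, 2.0                       # assumed Beta prior parameters

def bayes_risk(estimator, trials=200_000):
    """Monte Carlo estimate of E_pi[ E_{x|theta}[ (estimator(x) - theta)^2 ] ]."""
    theta = rng.beta(a, b, size=trials)           # draw theta from the prior
    x = rng.binomial(n, theta)                    # draw data given theta
    return np.mean((estimator(x) - theta) ** 2)   # squared-error loss

posterior_mean = lambda x: (a + x) / (a + b + n)  # minimizes posterior expected loss for each x
mle            = lambda x: x / n                  # ignores the prior

print("Bayes risk, posterior mean:", bayes_risk(posterior_mean))
print("Bayes risk, MLE           :", bayes_risk(mle))
# The posterior mean attains the smaller Bayes risk, as the definition predicts.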

If the prior is improper then an estimator which minimizes the posterior expected loss for each x is called a generalized Bayes estimator.

Minimum mean square error estimation

The most common risk function used for Bayesian estimation is the mean square error (MSE), also called squared error risk. The MSE is defined by

  MSE = E[(θ̂(x) − θ)²],

where the expectation is taken over the joint distribution of θ and x.

Using the MSE as risk, the Bayes estimate of the unknown parameter is simply the mean of the posterior distribution,

  θ̂(x) = E[θ | x].

This is known as the minimum mean square error (MMSE) estimator. The Bayes risk, in this case, is the posterior variance.
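As a minimal numerical illustration (the Normal likelihood, Uniform(0, 1) prior, and observed value below are assumptions made only for the example), the posterior mean and posterior variance can be computed by integration on a grid.

# Minimal sketch: posterior mean (MMSE estimate) and posterior variance by numerical
# integration, for an assumed Normal likelihood and Uniform(0, 1) prior on theta.
import numpy as np

x_obs, sigma = 0.3, 0.5                       # assumed observation and noise level
theta = np.linspace(0.0, 1.0, 10_001)         # grid over the support of the prior

prior = np.ones_like(theta)                                  # Uniform(0, 1) prior
likelihood = np.exp(-0.5 * ((x_obs - theta) / sigma) ** 2)   # Normal(theta, sigma^2) density (unnormalized)
posterior = prior * likelihood
posterior /= np.trapz(posterior, theta)       # normalize p(theta | x)

post_mean = np.trapz(theta * posterior, theta)                    # MMSE estimate E[theta | x]
post_var = np.trapz((theta - post_mean) ** 2 * posterior, theta)  # posterior variance

print("MMSE estimate:", post_mean)
print("posterior variance (= attained posterior expected squared loss):", post_var)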

Bayes estimators for conjugate priors

If there is no inherent reason to prefer one prior probability distribution over another, a conjugate prior is sometimes chosen for simplicity. A conjugate prior is defined as a prior distribution belonging to some parametric family, for which the resulting posterior distribution also belongs to the same family. This is an important property, since the Bayes estimator, as well as its statistical properties (variance, confidence interval, etc.), can all be derived from the posterior distribution.

Conjugate priors are especially useful for sequential estimation, where the posterior of the current measurement is used as the prior in the next measurement. In sequential estimation, unless a conjugate prior is used, the posterior distribution typically becomes more complex with each added measurement, and the Bayes estimator cannot usually be calculated without resorting to numerical methods.

Following are some examples of conjugate priors.
  • If x|θ is normal, x|θ ~ N(θ,σ²), and the prior is normal, θ ~ N(μ,τ²), then the posterior is also normal and the Bayes estimator under MSE is given by θ̂(x) = [σ²/(σ² + τ²)] μ + [τ²/(σ² + τ²)] x (a numerical check of this update appears after the list).
  • If x1,...,xn are iid Poisson random variables, xi|θ ~ P(θ), and if the prior is Gamma distributed, θ ~ G(a,b), then the posterior is also Gamma distributed, and the Bayes estimator under MSE is given by θ̂(X) = (n x̄ + a) / (n + b).
  • If x1,...,xn are iid uniformly distributed, xi|θ ~ U(0,θ), and if the prior is Pareto distributed, θ ~ Pa(θ0,a), then the posterior is also Pareto distributed, and the Bayes estimator under MSE is given by θ̂(X) = (a + n) max(θ0, x1,...,xn) / (a + n − 1).
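The first of these updates can be checked numerically. In the sketch below, the prior parameters, noise level, and observed value are assumptions chosen only for the check: Monte Carlo draws are conditioned on the observation and the result is compared with the closed-form Bayes estimate.

# Minimal sketch: Monte Carlo check of the normal-normal conjugate update above.
import numpy as np

rng = np.random.default_rng(1)
mu, tau, sigma = 1.0, 2.0, 1.5       # assumed prior mean/std and noise std
x0 = 3.0                             # observed value we condition on

theta = rng.normal(mu, tau, size=2_000_000)   # theta ~ N(mu, tau^2)
x = rng.normal(theta, sigma)                  # x | theta ~ N(theta, sigma^2)
near = np.abs(x - x0) < 0.05                  # crude conditioning on x close to x0
mc_estimate = theta[near].mean()

closed_form = (sigma**2 / (sigma**2 + tau**2)) * mu + (tau**2 / (sigma**2 + tau**2)) * x0
print("Monte Carlo E[theta | x]  :", mc_estimate)
print("Closed-form Bayes estimate:", closed_form)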

Alternative risk functions

Risk functions are chosen depending on how one measures the distance between the estimate and the unknown parameter. The MSE is the most common risk function in use, primarily due to its simplicity. However, alternative risk functions are also occasionally used. The following are several examples of such alternatives; a small numerical illustration follows the list. We denote the posterior generalized distribution function by F.
  • A "linear" loss function, with a > 0, which yields the posterior median as the Bayes' estimate: L(θ, θ̂) = a|θ − θ̂|, and the estimate satisfies F(θ̂(x) | x) = 1/2.

  • Another "linear" loss function, which assigns different "weights" to over or sub estimation. It yields a quantile
    Quantile
    Quantiles are points taken at regular intervals from the cumulative distribution function of a random variable. Dividing ordered data into q essentially equal-sized data subsets is the motivation for q-quantiles; the quantiles are the data values marking the boundaries between consecutive subsets...

     from the posterior distribution, and is a generalization of the previous loss function:

  • The following loss function is trickier: it yields either the posterior mode, or a point close to it, depending on the curvature and properties of the posterior distribution. Small values of the parameter K > 0 are recommended, in order to use the mode as an approximation (L > 0): L(θ, θ̂) = 0 if |θ − θ̂| < K, and L(θ, θ̂) = L if |θ − θ̂| ≥ K.


  • Other loss functions can be conceived, although the mean squared error is the most widely used and validated.
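As a minimal illustration of the first two losses (the skewed Gamma "posterior" and the weights a and b below are assumptions for the example, not taken from any particular model), the corresponding Bayes estimates are simply quantiles of the posterior.

# Minimal sketch: Bayes estimates under the "linear" losses above, from posterior samples.
import numpy as np

rng = np.random.default_rng(2)
posterior_samples = rng.gamma(shape=2.0, scale=1.0, size=100_000)  # assumed posterior

a, b = 1.0, 3.0   # weights for the asymmetric linear loss

median_estimate = np.quantile(posterior_samples, 0.5)              # loss a|theta - est|
quantile_estimate = np.quantile(posterior_samples, a / (a + b))    # asymmetric linear loss
mean_estimate = posterior_samples.mean()                           # squared-error loss (MMSE)

print("posterior median (linear loss)  :", median_estimate)
print("a/(a+b) quantile (weighted loss):", quantile_estimate)
print("posterior mean (squared error)  :", mean_estimate)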

Generalized Bayes estimators

The prior distribution has thus far been assumed to be a true probability distribution, in that

  ∫ π(θ) dθ = 1.

However, occasionally this can be a restrictive requirement. For example, there is no distribution (covering the set, R, of all real numbers) for which every real number is equally likely. Yet, in some sense, such a "distribution" seems like a natural choice for a non-informative prior, i.e., a prior distribution which does not imply a preference for any particular value of the unknown parameter. One can still define a function π(θ) = 1, but this would not be a proper probability distribution since it has infinite mass,

  ∫ π(θ) dθ = ∞.

Such measures π(θ), which are not probability distributions, are referred to as improper priors.

The use of an improper prior means that the Bayes risk is undefined (since the prior is not a probability distribution and we cannot take an expectation under it). As a consequence, it is no longer meaningful to speak of a Bayes estimator that minimizes the Bayes risk. Nevertheless, in many cases, one can define the posterior distribution

  p(θ | x) = p(x | θ) π(θ) / ∫ p(x | θ) π(θ) dθ.

This is a definition, and not an application of Bayes' theorem, since Bayes' theorem can only be applied when all distributions are proper. However, it is not uncommon for the resulting "posterior" to be a valid probability distribution. In this case, the posterior expected loss

  ∫ L(θ, a) p(θ | x) dθ

is typically well-defined and finite. Recall that, for a proper prior, the Bayes estimator minimizes the posterior expected loss. When the prior is improper, an estimator which minimizes the posterior expected loss is referred to as a generalized Bayes estimator.

Example

A typical example concerns the estimation of a location parameter with a loss function of the type L(a − θ). Here θ is a location parameter, i.e., p(x | θ) = f(x − θ).

It is common to use the improper prior π(θ) = 1 in this case, especially when no other more subjective information is available. This yields

  π(θ | x) = p(x | θ) π(θ) / p(x) = f(x − θ) / p(x),

so the posterior expected loss equals

  E[L(a − θ) | x] = ∫ L(a − θ) π(θ | x) dθ = (1 / p(x)) ∫ L(a − θ) f(x − θ) dθ.

The generalized Bayes estimator is the value a(x) which minimizes this expression for each x. This is equivalent to minimizing

  ∫ L(a − θ) f(x − θ) dθ for each x.        (1)

It can be shown that, in this case, the generalized Bayes estimator has the form x + a0, for some constant a0. To see this, let a0 be the value minimizing (1) when x = 0. Then, given a different value x1, we must minimize

  ∫ L(a − θ) f(x1 − θ) dθ = ∫ L(a − x1 − θ′) f(−θ′) dθ′.        (2)

This is identical to (1), except that a has been replaced by a − x1. Thus, the expression is minimized when a − x1 = a0, so that the optimal estimator has the form

  a(x) = a0 + x.
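The translation-invariance argument can be checked numerically. In the sketch below the asymmetric loss, the standard normal density f, and the grid search are all assumptions chosen for the illustration; the point is only that the minimizer of the posterior expected loss differs from the observation x by the same constant a0 regardless of x.

# Minimal sketch: with a flat improper prior and a location family f(x - theta),
# the generalized Bayes estimator is x + a0 for a fixed offset a0.
import numpy as np

def loss(d, a=1.0, b=3.0):
    """Asymmetric linear loss L(a_hat - theta), assumed for illustration."""
    return np.where(d >= 0, a * d, -b * d)

def generalized_bayes_estimate(x, grid_width=10.0, n=2001):
    theta = np.linspace(x - grid_width, x + grid_width, n)
    post = np.exp(-0.5 * (x - theta) ** 2)        # f(x - theta) with flat prior
    post /= np.trapz(post, theta)
    risks = [np.trapz(loss(a_hat - theta) * post, theta) for a_hat in theta]
    return theta[int(np.argmin(risks))]

for x_obs in (0.0, 5.0):
    est = generalized_bayes_estimate(x_obs)
    print(f"x = {x_obs:4.1f}  estimate = {est:7.3f}  offset a0 = {est - x_obs:+.3f}")
# The offset (estimate - x) is the same constant for both observations.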

Empirical Bayes estimators

A Bayes estimator derived through the empirical Bayes method is called an empirical Bayes estimator. Empirical Bayes methods enable the use of auxiliary empirical data, from observations of related parameters, in the development of a Bayes estimator. This is done under the assumption that the estimated parameters are obtained from a common prior. For example, if independent observations of different parameters are performed, then the estimation performance of a particular parameter can sometimes be improved by using data from other observations.

There are parametric and non-parametric approaches to empirical Bayes estimation. Parametric empirical Bayes is usually preferable since it is more applicable and more accurate on small amounts of data.

Example

The following is a simple example of parametric empirical Bayes estimation. Given past observations x1,...,xn having conditional distribution f(xi|θi), one is interested in estimating θn+1 based on xn+1. Assume that the θi's have a common prior π which depends on unknown parameters. For example, suppose that π is normal with unknown mean μπ and variance σπ². We can then use the past observations to determine the mean and variance of π in the following way.

First, we estimate the mean μm and variance σm² of the marginal distribution of x1,...,xn using the maximum likelihood approach:

  μ̂m = (1/n) Σ xi,
  σ̂m² = (1/n) Σ (xi − μ̂m)².

Next, we use the relation

  μm = Eπ[μf(θ)],
  σm² = Eπ[σf²(θ)] + Eπ[(μf(θ) − μm)²],

where μf(θ) and σf(θ) are the moments of the conditional distribution f(xi|θi), which are assumed to be known. In particular, suppose that μf(θ) = θ and that σf²(θ) = K; we then have

  μπ = μm,
  σπ² = σm² − σf² = σm² − K.

Finally, we obtain the estimated moments of the prior,

  μ̂π = μ̂m,
  σ̂π² = σ̂m² − K.

For example, if xi|θi ~ N(θi, 1), and if we assume a normal prior (which is a conjugate prior in this case), we conclude that θn+1 ~ N(μ̂π, σ̂π²), from which the Bayes estimator of θn+1 based on xn+1 can be calculated.
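The recipe above can be exercised numerically. The true prior parameters, the sample size, and the new observation in the sketch below are assumptions made only for the example: the prior's moments are recovered from the marginal moments of the xi and plugged into the normal-normal Bayes estimator for a new observation.

# Minimal sketch of the parametric empirical Bayes recipe above.
import numpy as np

rng = np.random.default_rng(3)
true_mu_pi, true_sigma_pi = 2.0, 1.5     # assumed prior (unknown to the estimator)
K = 1.0                                  # known conditional variance: x_i ~ N(theta_i, K)
n = 5_000

theta = rng.normal(true_mu_pi, true_sigma_pi, size=n)
x = rng.normal(theta, np.sqrt(K))

# Step 1: marginal moments by maximum likelihood (numpy's var uses 1/n by default).
mu_m = x.mean()
sigma2_m = x.var()

# Step 2: moments of the prior, using sigma_m^2 = sigma_pi^2 + K.
mu_pi_hat = mu_m
sigma2_pi_hat = max(sigma2_m - K, 0.0)   # guard against a negative estimate

# Step 3: Bayes estimate of a new theta from a new observation x_new.
x_new = 4.0                              # assumed new measurement
shrink = sigma2_pi_hat / (sigma2_pi_hat + K)
theta_new_hat = shrink * x_new + (1 - shrink) * mu_pi_hat

print("estimated prior mean and variance:", mu_pi_hat, sigma2_pi_hat)
print("empirical Bayes estimate of theta_{n+1}:", theta_new_hat)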

Admissibility

Bayes rules having finite Bayes risk are typically admissible. The following are some specific examples of admissibility theorems.
  • If a Bayes rule is unique then it is admissible. For example, as stated above, under mean squared error (MSE) the Bayes rule is unique and therefore admissible.
  • If θ belongs to a discrete set, then all Bayes rules are admissible.
  • If θ belongs to a continuous (non-discrete) set, and if the risk function R(θ,δ) is continuous in θ for every δ, then all Bayes rules are admissible.


By contrast, generalized Bayes rules often have undefined Bayes risk in the case of improper priors. These rules are often inadmissible and the verification of their admissibility can be difficult. For example, the generalized Bayes estimator of a location parameter θ based on Gaussian samples (described in the "Generalized Bayes estimators" section above) is inadmissible when the dimension of θ is greater than 2; this is known as Stein's phenomenon.

Asymptotic efficiency

Let θ be an unknown random variable, and suppose that x1, x2, ... are iid samples with density f(xi|θ). Let δn = δn(x1,...,xn) be a sequence of Bayes estimators of θ based on an increasing number of measurements. We are interested in analyzing the asymptotic performance of this sequence of estimators, i.e., the performance of δn for large n.

To this end, it is customary to regard θ as a deterministic parameter whose true value is θ0. Under specific conditions, for large samples (large values of n), the posterior density of θ is approximately normal. In other words, for large n, the effect of the prior probability on the posterior is negligible. Moreover, if δn is the Bayes estimator under MSE risk, then it is asymptotically unbiased and it converges in distribution to the normal distribution:

  √n (δn − θ0) → N(0, 1/I(θ0)),

where I(θ0) is the Fisher information of θ0.
It follows that the Bayes estimator δn under MSE is asymptotically efficient.

Another estimator which is asymptotically normal and efficient is the maximum likelihood estimator (MLE). The relations between the maximum likelihood and Bayes estimators can be shown in the following simple example.

Consider the estimator of θ based on a binomial sample x ~ b(θ,n), where θ denotes the probability of success. Assuming θ is distributed according to the conjugate prior, which in this case is the Beta distribution B(a,b), the posterior distribution is known to be B(a+x, b+n−x). Thus, the Bayes estimator under MSE is

  δn(x) = E[θ | x] = (a + x) / (a + b + n).

The MLE in this case is x/n, and so we get

  δn(x) = [(a + b) / (a + b + n)] E[θ] + [n / (a + b + n)] δMLE,

where E[θ] = a/(a + b) is the prior mean. The last equation implies that, for n → ∞, the Bayes estimator (in the described problem) is close to the MLE.
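The convergence can be seen in a few lines of simulation. The true success probability and the prior parameters below are assumptions chosen for the illustration.

# Minimal sketch: the beta-binomial Bayes estimator (a + x) / (a + b + n)
# approaches the MLE x / n as n grows.
import numpy as np

rng = np.random.default_rng(4)
theta_true = 0.3
a, b = 4.0, 4.0          # Beta(a, b) prior with mean 0.5

for n in (10, 100, 1_000, 10_000):
    x = rng.binomial(n, theta_true)
    bayes = (a + x) / (a + b + n)
    mle = x / n
    print(f"n = {n:6d}  Bayes = {bayes:.4f}  MLE = {mle:.4f}  difference = {bayes - mle:+.4f}")
# The gap between the two estimators shrinks roughly like (a + b) / n.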

On the other hand, when n is small, the prior information is still relevant to the decision problem and affects the estimate. To see the relative weight of the prior information, assume that a = b; in this case each measurement brings in 1 new bit of information; the formula above shows that the prior information has the same weight as a+b bits of the new information. In applications, one often knows very little about fine details of the prior distribution; in particular, there is no reason to assume that it coincides with B(a,b) exactly. In such a case, one possible interpretation of this calculation is: "there is a non-pathological prior distribution with the mean value 0.5 and the standard deviation d which gives the weight of prior information equal to 1/(4d²) − 1 bits of new information."

Practical example of misapplication of Bayes estimators

For many years, the Internet Movie Database has used a formula for calculating the Top Rated 250 Titles which is claimed to give "a true Bayesian estimate":

  W = (v / (v + m)) R + (m / (v + m)) C,

where:
  W = weighted rating
  R = average rating for the movie as a number from 0 to 10 (mean) = (Rating)
  v = number of votes for the movie = (votes)
  m = minimum votes required to be listed in the Top 250 (currently 3000)
  C = the mean vote across the whole report (currently 6.9)

For the Top 250, only votes from regular voters are considered.
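The formula is straightforward to evaluate; the example movies and vote counts below are made up for illustration.

# Minimal sketch of the weighted-rating formula quoted above.
def weighted_rating(R, v, m=3000, C=6.9):
    """IMDb-style weighted rating: shrink the movie's mean vote R toward C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# A film barely over the vote threshold is pulled strongly toward the global mean,
# while a film with many votes keeps (almost) its own average.
print(weighted_rating(R=9.0, v=3_500))    # ~ 8.03
print(weighted_rating(R=9.0, v=500_000))  # ~ 8.99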

Comparing this formula with the one in the preceding section, one can see that m must be related to the relative weight of the prior information in units of the new information given by one vote. Hence C must be the mean vote across the movies with more than 3000 votes, and m should be related to the deviation of votes in this pool.

For example, assume that a new vote brings in about 2 bits of information (one bit for above/below average, and 1 bit for "how far from average is the vote" - so this assumes that votes are close to the average, but not very close). Then having m = 3000 corresponds to prior information weighting 6000 bits. To illustrate what kind of prior distribution such a giant weight might correspond to, consider again the distribution of the preceding section (it is related to a very different process of measurement, but the order of magnitude should be close); then 6000 bits correspond to d of about 1/155 of the possible range (1 to 10), or about 1/17 of one vote point.

Needless to say, such a small deviation is absurd; the minimal possible deviation given integer values and the average of 6.9 is 0.3. For example, if the actual deviation is about 0.7, this corresponds to prior information weighting close to 40 bits, and m being about 20. Of course, with such a small value for m, the formula above becomes practically indistinguishable from the common-sense formula W = R, which one would expect with such a high entrance threshold as having 3000 votes.

As used, all that the formula does is give a major boost to films with significantly more than 3000 votes.

      See also

      • Admissible decision rule
        Admissible decision rule
        In statistical decision theory, an admissible decision rule is a rule for making a decision such that there isn't any other rule that is always "better" than it, in a specific sense defined below....

      • Recursive Bayesian estimation
        Recursive Bayesian estimation
        Recursive Bayesian estimation, also known as a Bayes filter, is a general probabilistic approach for estimating an unknown probability density function recursively over time using incoming measurements and a mathematical process model.-In robotics:...

      • Empirical Bayes method
        Empirical Bayes method
        Empirical Bayes methods are procedures for statistical inference in which the prior distribution is estimated from the data. This approach stands in contrast to standardBayesian methods, for which the prior distribution is fixed before any data are observed...

      • Conjugate prior
        Conjugate prior
        In Bayesian probability theory, if the posterior distributions p are in the same family as the prior probability distribution p, the prior and posterior are then called conjugate distributions, and the prior is called a conjugate prior for the likelihood...

      • Generalized expected utility
        Generalized expected utility
        The expected utility model developed by John von Neumann and Oskar Morgenstern dominated decision theory from its formulation in 1944 until the late 1970s, not only as a prescriptive, but also as a descriptive model, despite powerful criticism from Maurice Allais and Daniel Ellsberg who showed...

