Stein's example
Stein's example, in decision theory and estimation theory, is the phenomenon that when three or more parameters are estimated simultaneously, there exist combined estimators that are more accurate on average (that is, having lower expected mean squared error) than any method that handles the parameters separately. This is surprising since the parameters and the measurements might be totally unrelated.
An intuitive explanation is that optimizing for the mean-squared error of a combined estimator is not the same as optimizing for the errors of separate estimators of the individual parameters. In practical terms, if the combined error is in fact of interest, then a combined estimator should be used, even if the underlying parameters are independent; this occurs in channel estimation in telecommunications, for instance (different factors affect overall channel performance). On the other hand, if one is instead interested in estimating an individual parameter, then using a combined estimator does not help and is in fact worse – for example, jointly estimating the speed of light, annual tea consumption in Taiwan, and hog weight in Montana does not improve the estimate of the speed of light, and indeed makes it worse.
The phenomenon is named after its discoverer, Charles Stein.
Formal statement
The following is perhaps the simplest form of the paradox. Let θ be a vector consisting of n ≥ 3 unknown parameters. To estimate these parameters, a single measurement Xi is performed for each parameter θi, resulting in a vector X of length n. Suppose the measurements are independent Gaussian random variables with mean θ and variance 1, i.e.,

Xi ~ N(θi, 1), independently for i = 1, …, n.
Thus, each parameter is estimated using a single noisy measurement, and each measurement is equally inaccurate.
Under such conditions, it is most intuitive (and most common) to use each measurement as an estimate of its corresponding parameter. This so-called "ordinary" decision rule can be written as

θ̂ = X, that is, θ̂i = Xi for each i.
The quality of such an estimator is measured by its risk function. A commonly used risk function is the mean squared error, defined as

R(θ, θ̂) = E‖θ̂ − θ‖²,

where the expectation is taken over the distribution of X for the given θ.
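As an illustration, here is a minimal sketch (hypothetical code, not part of the original article; the parameter values are arbitrary) that simulates the measurement model and estimates the risk of the ordinary estimator by Monte Carlo. Since θ̂ = X, its risk equals n for every θ.

```python
import numpy as np

def ordinary_risk(theta, trials=50_000, seed=0):
    """Monte Carlo estimate of E||X - theta||^2 for the ordinary estimator X."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta, dtype=float)
    # Each row is one measurement vector X ~ N(theta, I).
    X = theta + rng.standard_normal((trials, theta.size))
    # The ordinary estimator is X itself, so the error vector is X - theta.
    return np.sum((X - theta) ** 2, axis=1).mean()

# The estimated risk is about n = 3 regardless of the value of theta.
print(ordinary_risk([1.0, -2.0, 5.0]))   # ~ 3.0
print(ordinary_risk([0.0, 0.0, 100.0]))  # ~ 3.0
```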
Surprisingly, it turns out that the "ordinary" estimator proposed above is suboptimal in terms of mean squared error. In other words, in the setting discussed here, there exist alternative estimators which always achieve lower mean squared error, no matter what the value of θ is.
For a given θ one could obviously define a perfect "estimator" which is always just θ, but this estimator would be bad for other values of θ. The estimators of Stein's paradox are, for a given θ, better than X for some values of X but necessarily worse for others (except perhaps for one particular θ vector, for which the new estimate is always better than X). It is only on average that they are better.
More accurately, an estimator θ̂1 is said to dominate another estimator θ̂2 if, for all values of θ, the risk of θ̂1 is lower than, or equal to, the risk of θ̂2, and if the inequality is strict for some θ. An estimator is said to be admissible if no other estimator dominates it; otherwise it is inadmissible. Thus, Stein's example can be stated simply as follows: the ordinary decision rule for estimating the mean of a multivariate Gaussian distribution is inadmissible under mean squared error risk.
Many simple, practical estimators achieve better performance than the ordinary estimator. The best-known example is the James–Stein estimator, which works by starting at X and moving towards a particular point (such as the origin) by an amount inversely proportional to the distance of X from that point.
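To make this concrete, here is a minimal sketch (hypothetical code, not from the original article; the parameter values are arbitrary) of the standard James–Stein estimator that shrinks X toward the origin, assuming the unit variance above is known, together with a Monte Carlo comparison of its risk against the ordinary estimator.

```python
import numpy as np

def james_stein(x):
    """James-Stein estimator: shrink x toward the origin (known unit variance)."""
    x = np.asarray(x, dtype=float)
    n = x.size  # requires n >= 3
    # The vector moved toward the origin has length (n - 2) / ||x||,
    # i.e. inversely proportional to the distance of x from the origin.
    return (1.0 - (n - 2) / np.sum(x ** 2)) * x

def risk(estimator, theta, trials=50_000, seed=0):
    """Monte Carlo estimate of E||estimator(X) - theta||^2 with X ~ N(theta, I)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta, dtype=float)
    total = 0.0
    for _ in range(trials):
        x = theta + rng.standard_normal(theta.size)
        total += np.sum((estimator(x) - theta) ** 2)
    return total / trials

for theta in ([0.5, -0.3, 0.2], [1.0, -2.0, 5.0]):
    print(risk(lambda x: x, theta),   # ordinary estimator: risk ~ 3.0
          risk(james_stein, theta))   # James-Stein: risk below 3.0 in both cases
```

The improvement is large when θ happens to lie near the shrinkage point and becomes small as ‖θ‖ grows, but for n ≥ 3 the James–Stein risk never exceeds that of the ordinary estimator.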
For a sketch of the proof of this result, see Proof of Stein's example.
Implications
Stein's example is surprising, since the "ordinary" decision rule is intuitive and commonly used. In fact, numerous methods for estimator construction, including maximum likelihood estimation, best linear unbiased estimation, least squares estimation and optimal equivariant estimation, all result in the "ordinary" estimator. Yet, as discussed above, this estimator is suboptimal.
To demonstrate the unintuitive nature of Stein's example, consider the following real-world example. Suppose we are to estimate three unrelated parameters, such as the US wheat yield for 1993, the number of spectators at the Wimbledon tennis tournament in 2001, and the weight of a randomly chosen candy bar from the supermarket. Suppose we have independent Gaussian measurements of each of these quantities. Stein's example now tells us that we can get a better estimate (on average) for the vector of three parameters by simultaneously using the three unrelated measurements.
At first sight it appears that somehow we get a better estimator for US wheat yield by measuring some other unrelated statistics such as the number of spectators at Wimbledon and the weight of a candy bar. This is of course absurd; we have not obtained a better estimator for US wheat yield by itself, but we have produced an estimator for the vector of the means of all three random variables, which has a reduced total risk. This occurs because the cost of a bad estimate in one component of the vector is compensated by a better estimate in another component. Also, a specific set of the three estimated mean values obtained with the new estimator will not necessarily be better than the ordinary set (the measured values). It is only on average that the new estimator is better.