Ancillary statistic
In statistics, an ancillary statistic is a statistic whose sampling distribution does not depend on which of the probability distributions among those being considered is the distribution of the statistical population from which the data were taken. An ancillary statistic is a pivotal quantity (a function of the observations whose distribution does not depend on the parameters) that is also a statistic (computed in terms of the observations, not depending on any unobserved quantities). Ancillary statistics can be used to construct prediction intervals.
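To illustrate the prediction-interval remark, here is a minimal Python sketch (not from the original article; the model Xi ~ N(μ, 1) with known variance and the sample size n = 20 are assumptions). Since X_new − X̄ is distributed N(0, 1 + 1/n) no matter what μ is, the interval X̄ ± 1.96·√(1 + 1/n) covers a future observation about 95% of the time, whatever the unknown mean:

```python
# Sketch: a prediction interval built from a quantity whose distribution
# is free of the unknown mean mu (assumed model: X_i ~ N(mu, 1)).
import random
import statistics

random.seed(7)
n = 20
z95 = 1.96  # standard normal 97.5% quantile
half_width = z95 * (1 + 1 / n) ** 0.5  # sd of X_new - Xbar is sqrt(1 + 1/n)

mu = 5.0  # the coverage below does not depend on this choice
trials = 10000
hits = 0
for _ in range(trials):
    xs = [random.gauss(mu, 1) for _ in range(n)]
    xbar = statistics.mean(xs)
    x_new = random.gauss(mu, 1)  # a future observation
    if xbar - half_width <= x_new <= xbar + half_width:
        hits += 1
print(hits / trials)  # close to 0.95
```

Re-running with a different `mu` leaves the coverage essentially unchanged, which is exactly why the interval is usable without knowing μ.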
This concept was introduced by the statistical geneticist Sir Ronald Fisher.
Example
Suppose X1, ..., Xn are independent and identically distributed, and are normally distributed with unknown expected value μ and known variance 1. Let
X̄ = (X1 + ... + Xn)/n
be the sample mean.
The following statistical measures of dispersion of the sample
- Range: max(X1, ..., Xn) − min(X1, ..., Xn)
- Interquartile range: Q3 − Q1
- Sample variance: s² = Σ (Xi − X̄)² / (n − 1)
are all ancillary statistics, because their sampling distributions do not change as μ changes. Computationally, this is because in the formulas, the μ terms cancel – adding a constant number to a distribution (and all samples) changes its sample maximum and minimum by the same amount, so it does not change their difference, and likewise for others: these measures of dispersion do not depend on location.
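As a quick numerical check, this Python sketch (not part of the original article; the sample size n = 10 and the Monte Carlo setup are assumptions) shows that shifting μ leaves the sampling distribution of the range unchanged:

```python
# Sketch: the range of a sample from N(mu, 1) is ancillary for mu.
# Shifting mu shifts every draw by the same amount, so max - min is unchanged.
import random
import statistics

random.seed(42)

def sample_range(mu, n=10):
    xs = [random.gauss(mu, 1) for _ in range(n)]
    return max(xs) - min(xs)

trials = 20000
ranges_0 = [sample_range(0) for _ in range(trials)]
ranges_100 = [sample_range(100) for _ in range(trials)]
# Monte Carlo means of the two sampling distributions are nearly equal,
# even though the two populations are centered 100 units apart.
print(statistics.mean(ranges_0), statistics.mean(ranges_100))
```

The same experiment with the interquartile range or the sample variance in place of the range gives the same invariance.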
Conversely, given i.i.d. normal variables with known mean 1 and unknown variance σ², the sample mean is not an ancillary statistic of the variance, as the sampling distribution of the sample mean is N(1, σ²/n), which does depend on σ² – this measure of location (specifically, its standard error) depends on dispersion.
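The same kind of simulation makes the converse concrete (again a Python sketch with assumed parameters, here σ = 1 versus σ = 3 and n = 10): the spread of the sample mean's sampling distribution scales with σ, so the sample mean is not ancillary for the variance.

```python
# Sketch: with known mean 1 and unknown sigma, the sample mean has sampling
# distribution N(1, sigma^2/n), whose spread grows with sigma.
import random
import statistics

random.seed(0)

def sample_mean(sigma, n=10):
    return statistics.mean(random.gauss(1, sigma) for _ in range(n))

trials = 20000
means_small = [sample_mean(1.0) for _ in range(trials)]
means_big = [sample_mean(3.0) for _ in range(trials)]
# The standard error sigma/sqrt(n) roughly triples when sigma triples.
print(statistics.stdev(means_small), statistics.stdev(means_big))
```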
Ancillary complement
Given a statistic T that is not sufficient, an ancillary complement is a statistic U that is ancillary to T and such that (T, U) is sufficient. Intuitively, an ancillary complement "adds the missing information" (without duplicating any).
The statistic is particularly useful if one takes T to be a maximum likelihood estimator, which in general will not be sufficient; then one can ask for an ancillary complement. In this case, Fisher argues that one must condition on an ancillary complement to determine information content: one should consider the Fisher information content of T to be not the marginal of T, but the conditional distribution of T given U: how much information does T add? This is not possible in general, as no ancillary complement need exist; if one exists, it need not be unique, nor need a maximum ancillary complement exist.
Example
In baseball, suppose a scout observes a batter in N at-bats. Suppose (unrealistically) that the number N is chosen by some random process that is independent of the batter's ability – say a coin is tossed after each at-bat and the result determines whether the scout will stay to watch the batter's next at-bat. The eventual data are the number N of at-bats and the number X of hits: the data (X, N) are a sufficient statistic. The observed batting average X/N fails to convey all of the information available in the data because it fails to report the number N of at-bats (e.g., a batting average of .400, which is very high, based on only five at-bats does not inspire anywhere near as much confidence in the player's ability as a .400 average based on 100 at-bats). The number N of at-bats is an ancillary statistic because
- It is a part of the observable data (it is a statistic), and
- Its probability distribution does not depend on the batter's ability, since it was chosen by a random process independent of the batter's ability.
This ancillary statistic is an ancillary complement to the observed batting average X/N, i.e., the batting average X/N is not a sufficient statistic, in that it conveys less than all of the relevant information in the data, but conjoined with N, it becomes sufficient.
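The coin-toss mechanism above can be simulated directly. In this Python sketch (the hit probabilities 0.2 and 0.4 are made-up illustrative values), a fair coin after each at-bat decides whether the scout stays, so N is geometric with mean 2 and its distribution cannot depend on the batter's ability:

```python
# Sketch of the baseball example: N is ancillary because the stopping rule
# (a fair coin toss after each at-bat) ignores the batter's hit probability p.
import random

random.seed(1)

def observe(p):
    """Return (hits X, at-bats N) for a batter with hit probability p."""
    x = 0
    n = 0
    while True:
        n += 1
        if random.random() < p:  # the at-bat: hit with probability p
            x += 1
        if random.random() < 0.5:  # the coin toss: scout leaves
            return x, n

trials = 20000
ns_weak = [observe(0.2)[1] for _ in range(trials)]
ns_strong = [observe(0.4)[1] for _ in range(trials)]
# N has the same (geometric) distribution for both batters: both Monte Carlo
# means sit near E[N] = 2, regardless of p.
print(sum(ns_weak) / trials, sum(ns_strong) / trials)
```

By contrast, the hit counts X from the two batters differ markedly, which is why (X, N) together, and not X/N alone, carry all the information about p.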
See also
- Basu's theorem: any boundedly complete sufficient statistic is independent of any ancillary statistic (a 1955 result of Debabrata Basu)
- Prediction interval
- Group family
- Conditionality principle