Explained variation
In statistics, explained variation or explained randomness measures the proportion to which a mathematical model accounts for the variation (apparent randomness) of a given data set. Often, variation is quantified as variance; then, the more specific term explained variance can be used.

The complementary part of the total variation/randomness/variance is called unexplained or residual.

The simplifying assumption that explained variance equals the square of the Pearson correlation coefficient has been criticized; in the words of one critic, the resulting figure, "for most social scientists, is of doubtful meaning but great rhetorical value".

Definition

Explained variation is a relatively recent concept. The most authoritative source seems to be Kent (1983), who founded his definition on information theory.

Information gain by better modelling

Following Kent (1983), we use the Fraser information (Fraser 1965)
\[ F(\theta) = \int \mathrm{d}r\, g(r)\, \ln f(r;\theta), \]
where $g(r)$ is the probability density of a random variable $R$, and $f(r;\theta)$ with $\theta \in \Theta_i$ ($i = 0, 1$) are two families of parametric models. Model family 0 is the simpler one, with a restricted parameter space $\Theta_0 \subset \Theta_1$.

Parameters are determined by maximum likelihood estimation,
\[ \hat\theta_i = \operatorname*{arg\,max}_{\theta \in \Theta_i} F(\theta). \]

The information gain of model 1 over model 0 is written as
\[ \Gamma(\hat\theta_1 : \hat\theta_0) = 2\,[\,F(\hat\theta_1) - F(\hat\theta_0)\,] \geq 0, \]
where a factor of 2 is included for convenience. $\Gamma$ is always nonnegative; it measures the extent to which the best model of family 1 is better than the best model of family 0 in explaining $g(r)$.
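As an illustration (not part of Kent's exposition), the gain can be estimated empirically by replacing the density $g$ with the empirical distribution of a sample, so that $F(\theta)$ becomes the sample-average log-likelihood. A minimal sketch with nested Gaussian families, where family 0 fixes the mean at zero:

```python
import numpy as np

rng = np.random.default_rng(0)
r = rng.normal(loc=1.0, scale=2.0, size=10_000)  # observed sample

def gaussian_loglik(r, mu, sigma):
    """Average Gaussian log-density, an empirical stand-in for F(theta)."""
    return np.mean(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (r - mu) ** 2 / (2 * sigma**2))

# Family 0 (restricted): mean fixed at 0, only sigma is free.
sigma0 = np.sqrt(np.mean(r**2))          # MLE of sigma under mu = 0
F0 = gaussian_loglik(r, 0.0, sigma0)

# Family 1 (full): both mu and sigma free.
mu1, sigma1 = np.mean(r), np.std(r)      # MLEs
F1 = gaussian_loglik(r, mu1, sigma1)

gamma = 2 * (F1 - F0)                    # information gain, always >= 0
print(gamma)
```

Because family 0 is nested in family 1, the fitted $F(\hat\theta_1)$ can never be smaller than $F(\hat\theta_0)$, so the computed gain is nonnegative by construction.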

Information gain by a conditional model

Assume a two-dimensional random variable $R = (X, Y)$ where $X$ shall be considered as an explanatory variable, and $Y$ as a dependent variable. Models of family 1 "explain" $Y$ in terms of $X$,
\[ f(x, y) = f(x)\, f(y \mid x; \theta), \]
whereas in family 0, $X$ and $Y$ are assumed to be independent. We define the randomness of $Y$ by $D(Y) = \exp[-2 F(\hat\theta_0)]$, and the randomness of $Y$, given $X$, by $D(Y \mid X) = \exp[-2 F(\hat\theta_1)]$. Then,
\[ \rho_C^2 = 1 - \frac{D(Y \mid X)}{D(Y)} \]
can be interpreted as the proportion of the randomness of $Y$ which is explained by $X$.
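To make the definition concrete, here is a small numerical sketch (illustrative data, not from Kent's paper) for linear-Gaussian data. Family 0 fits $Y$ with a marginal Gaussian; family 1 fits $Y$ given $X$ as a Gaussian around a least-squares line; the resulting $\rho_C^2$ coincides with the squared correlation of $X$ and $Y$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=1.0, size=n)   # linear-Gaussian data

def avg_gauss_loglik(resid, sigma):
    """Average Gaussian log-density of centered residuals."""
    return np.mean(-0.5 * np.log(2 * np.pi * sigma**2)
                   - resid**2 / (2 * sigma**2))

# Family 0: Y independent of X (marginal Gaussian fit).
F0 = avg_gauss_loglik(y - y.mean(), y.std())

# Family 1: Y | X Gaussian around a fitted line (least squares).
slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)
F1 = avg_gauss_loglik(resid, resid.std())

D_Y = np.exp(-2 * F0)                   # randomness of Y
D_Y_given_X = np.exp(-2 * F1)           # randomness of Y given X
rho2 = 1 - D_Y_given_X / D_Y            # proportion explained by X
print(rho2, np.corrcoef(x, y)[0, 1] ** 2)
```

With true slope 2 and unit noise, the marginal variance of $Y$ is 5 and the conditional variance is 1, so $\rho_C^2$ lands near $4/5$.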

Special cases and generalized usage

For special models, the above definition yields particularly appealing results. Regrettably, these simplified definitions of explained variance are used even in situations where the underlying assumptions do not hold.

Linear regression

The fraction of variance unexplained is an established concept in the context of linear regression. The usual definition of the coefficient of determination $R^2$ seems to be compatible with the fundamental definition of explained variance.
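A minimal sketch (with made-up data) of both quantities for a simple least-squares fit, using the standard sums-of-squares definitions:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=500)
y = 3.0 + 0.5 * x + rng.normal(scale=1.0, size=500)

slope, intercept = np.polyfit(x, y, 1)        # ordinary least squares fit
y_hat = slope * x + intercept

ss_res = np.sum((y - y_hat) ** 2)             # residual (unexplained) variation
ss_tot = np.sum((y - y.mean()) ** 2)          # total variation of y
r_squared = 1 - ss_res / ss_tot               # coefficient of determination
fvu = ss_res / ss_tot                         # fraction of variance unexplained
print(r_squared, fvu)
```

By construction the two quantities sum to one: whatever variation the model does not explain is exactly the fraction of variance unexplained.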

Correlation coefficient as measure of explained variance

Let $X$ be a random vector, and $Y$ a random variable that is modeled by a normal distribution with centre $\mu + \Psi^{\mathrm T} X$. In this case, the above-derived proportion of randomness $\rho_C^2$ equals the squared correlation coefficient $R^2$.

Note the strong model assumptions: the centre of the $Y$ distribution must be a linear function of $X$, and for any given $x$, the $Y$ distribution must be normal. In other situations, it is generally not justified to interpret $R^2$ as a proportion of explained variance.

Explained variance in principal component analysis

"Explained variance" is routinely used in principal component analysis. The relation to the Fraser-Kent information gain remains to be clarified.

Criticism

As "explained variance" essentially equals the correlation coefficient , it shares all the disadvantages of the latter: it reflects not only the quality of the regression, but also the distribution of the independent (conditioning) variables.

In the words of one critic: "Thus $R^2$ gives the 'percentage of variance explained' by the regression, an expression that, for most social scientists, is of doubtful meaning but great rhetorical value. If this number is large, the regression gives a good fit, and there is little point in searching for additional variables. Other regression equations on different data sets are said to be less satisfactory or less powerful if their $R^2$ is lower. Nothing about $R^2$ supports these claims." And, after constructing an example where $R^2$ is enhanced just by jointly considering data from two different populations: "'Explained variance' explains nothing."
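The pooling effect is easy to reproduce numerically. In the sketch below (illustrative data, not the critic's own example), two populations share the same regression line and the same noise level, each over a narrow range of $x$; $R^2$ is small within each population but large on the pooled data, solely because pooling widens the spread of $x$:

```python
import numpy as np

def r_squared(x, y):
    """R^2 of a simple least-squares fit of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    return 1 - np.sum(resid**2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(4)
n = 2_000
# Two populations with the SAME relation y = x + noise, over narrow x-ranges.
x_a = rng.uniform(0, 1, n)
y_a = x_a + rng.normal(scale=1.0, size=n)
x_b = rng.uniform(10, 11, n)
y_b = x_b + rng.normal(scale=1.0, size=n)

r2_a = r_squared(x_a, y_a)                    # weak within population A
r2_b = r_squared(x_b, y_b)                    # weak within population B
r2_pooled = r_squared(np.concatenate([x_a, x_b]),
                      np.concatenate([y_a, y_b]))
print(r2_a, r2_b, r2_pooled)
```

Nothing about the regression itself has improved between the within-population and pooled fits; only the distribution of the conditioning variable has changed.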

Further reading

  • D. A. S. Fraser (1965), "On Information in Statistics", Ann. Math. Statist., 36(3), 890–896.
  • C. H. Achen (1982), Interpreting and Using Regression, Beverly Hills: Sage.
  • J. T. Kent (1983), "Information gain and a general measure of correlation", Biometrika, 70(1), 163–173.
  • C. H. Achen (1990), "What Does 'Explained Variance' Explain?: Reply", Political Analysis, 2(1), 173–184.
