Cook's distance
In statistics, Cook's distance is a commonly used estimate of the influence of a data point when doing least squares regression analysis. In a practical ordinary least squares analysis, Cook's distance can be used in several ways: to indicate data points that are particularly worth checking for validity, and to indicate regions of the design space where it would be good to be able to obtain more data points.

Definition

Cook's distance measures the effect of deleting a given observation. Data points with large residuals (outliers) and/or high leverage may distort the outcome and accuracy of a regression. Points with a large Cook's distance are considered to merit closer examination in the analysis.

For observation i, it is computed as:

$$D_i = \frac{\sum_{j=1}^{n} \left( \hat{Y}_j - \hat{Y}_{j(i)} \right)^2}{p \, \mathrm{MSE}}$$

The following is an algebraically equivalent expression:

$$D_i = \frac{e_i^2}{p \, \mathrm{MSE}} \left[ \frac{h_{ii}}{(1 - h_{ii})^2} \right]$$

In the above equations: $\hat{Y}_j$ is the prediction from the full regression model for observation j; $\hat{Y}_{j(i)}$ is the prediction for observation j from a refitted regression model in which observation i has been omitted; $h_{ii}$ is the i-th diagonal element of the hat matrix; $e_i$ is the crude residual (i.e., the difference between the observed value and the value fitted by the proposed model); MSE is the mean square error of the regression model; $p$ is the number of fitted parameters in the model; and $n$ is the number of observations.
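
As an illustration of the two expressions, the following minimal sketch computes Cook's distance for a simple linear regression (one predictor plus an intercept, so p = 2) in both ways: by refitting with each observation deleted, and from residuals and leverages. The function names and the data are made up for the example. It assumes MSE is the residual sum of squares divided by n - p, and it uses the simple-regression leverage formula h_ii = 1/n + (x_i - mean(x))^2 / Sxx, which is not stated in the text above.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def mean_square_error(xs, ys, a, b, p=2):
    """Assumed definition: residual sum of squares divided by (n - p)."""
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return sse / (len(xs) - p)


def cooks_distance_by_deletion(xs, ys, p=2):
    """Deletion form: refit without observation i, compare all n predictions."""
    a, b = fit_line(xs, ys)
    fitted = [a + b * x for x in xs]
    mse = mean_square_error(xs, ys, a, b, p)
    distances = []
    for i in range(len(xs)):
        a_i, b_i = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        shift = sum((f - (a_i + b_i * x)) ** 2 for x, f in zip(xs, fitted))
        distances.append(shift / (p * mse))
    return distances


def cooks_distance_closed_form(xs, ys, p=2):
    """Closed form: uses the residual e_i and leverage h_ii of each point."""
    n = len(xs)
    a, b = fit_line(xs, ys)
    mse = mean_square_error(xs, ys, a, b, p)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    distances = []
    for x, y in zip(xs, ys):
        e = y - (a + b * x)                    # crude residual
        h = 1.0 / n + (x - x_bar) ** 2 / sxx   # leverage in simple regression
        distances.append(e ** 2 / (p * mse) * h / (1.0 - h) ** 2)
    return distances


# Made-up data: the last point lies far from the others in x and off the trend.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 4.0]

by_deletion = cooks_distance_by_deletion(xs, ys)
closed_form = cooks_distance_closed_form(xs, ys)
most_influential = max(range(len(xs)), key=lambda i: closed_form[i])
```

On this data the two lists agree up to floating-point rounding, and the last observation has the largest distance. The closed form needs only one fit, which is why it is usually preferred over refitting n times.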

Detecting highly influential observations using Cook's distance

There are different opinions regarding what cut-off values to use for spotting highly influential points. A simple operational guideline of $D_i > 1$ has been suggested. Others have indicated that $D_i > 4/n$, where $n$ is the number of observations, might be used.
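
The two guidelines can be applied mechanically once the distances are known. The sketch below is only illustrative: `flag_influential` is a hypothetical helper, and the distance values are invented rather than taken from any real regression.

```python
def flag_influential(distances, cutoff=None):
    """Return the indices whose Cook's distance exceeds the cut-off.

    If no cut-off is given, the 4/n guideline is used, where n is the
    number of observations; pass cutoff=1.0 for the simpler guideline.
    """
    n = len(distances)
    threshold = 4.0 / n if cutoff is None else cutoff
    return [i for i, d in enumerate(distances) if d > threshold]


# Invented Cook's distances for five observations.
example_distances = [0.05, 0.30, 0.02, 1.40, 0.90]

flagged_by_one = flag_influential(example_distances, cutoff=1.0)
flagged_by_four_over_n = flag_influential(example_distances)
```

With five observations the 4/n cut-off is 0.8, so it flags more points here than the cut-off of 1 does. Either way, a flagged point is a candidate for closer examination, not automatically an error.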

Interpreting Cook's distance

Specifically, $D_i$ can be interpreted as the distance one's estimates move within the confidence ellipsoid that represents a region of plausible values for the parameters. This is shown by an alternative but equivalent representation of Cook's distance in terms of changes to the estimates of the regression parameters between the cases where the particular observation is either included or excluded from the regression analysis.
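
The alternative representation mentioned above can be written out as follows. The notation is an assumption introduced here, not taken from the text: $X$ is the design matrix, $\hat{\beta}$ is the vector of least squares estimates from the full data, and $\hat{\beta}_{(i)}$ is the vector of estimates with observation i omitted. Because the fitted values are $\hat{Y} = X\hat{\beta}$, the sum of squared changes in the predictions is a quadratic form in the change of the estimates:

```latex
\sum_{j=1}^{n} \left( \hat{Y}_j - \hat{Y}_{j(i)} \right)^2
  = \left\| X\hat{\beta} - X\hat{\beta}_{(i)} \right\|^2
  = \left( \hat{\beta} - \hat{\beta}_{(i)} \right)^{\mathsf{T}} X^{\mathsf{T}} X
    \left( \hat{\beta} - \hat{\beta}_{(i)} \right),
\qquad\text{so}\qquad
D_i = \frac{\left( \hat{\beta} - \hat{\beta}_{(i)} \right)^{\mathsf{T}} X^{\mathsf{T}} X
            \left( \hat{\beta} - \hat{\beta}_{(i)} \right)}{p \, \mathrm{MSE}} .
```

Under the usual normal-errors assumptions, the same quadratic form defines the joint confidence region for the parameters, $\{\beta : (\beta - \hat{\beta})^{\mathsf{T}} X^{\mathsf{T}} X (\beta - \hat{\beta}) \le p \, \mathrm{MSE} \, F_{p,\,n-p;\,1-\alpha}\}$, which is why $D_i$ can be read as how far deleting observation i moves the estimates on the scale of that ellipsoid.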

See also

  • Outlier
  • Leverage (statistics)
  • Partial leverage
  • DFFITS
  • Studentized residual

External links

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 