Mallows' Cp
Encyclopedia
In statistics
Statistics
Statistics is the study of the collection, organization, analysis, and interpretation of data. It deals with all aspects of this, including the planning of data collection in terms of the design of surveys and experiments....

, Mallows' Cp, named for Colin L. Mallows, is used to assess the fit
Goodness of fit
The goodness of fit of a statistical model describes how well it fits a set of observations. Measures of goodness of fit typically summarize the discrepancy between observed values and the values expected under the model in question. Such measures can be used in statistical hypothesis testing, e.g...

 of a regression model
Regression analysis
In statistics, regression analysis includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables...

 that has been estimated using ordinary least squares
Ordinary least squares
In statistics, ordinary least squares or linear least squares is a method for estimating the unknown parameters in a linear regression model. This method minimizes the sum of squared vertical distances between the observed responses in the dataset and the responses predicted by the linear...

. It is applied in the context of model selection
Model selection
Model selection is the task of selecting a statistical model from a set of candidate models, given data. In the simplest cases, a pre-existing set of data is considered...

, where a number of predictor variables
Dependent and independent variables
The terms "dependent variable" and "independent variable" are used in similar but subtly different ways in mathematics and statistics as part of the standard terminology in those subjects...

 are available for predicting some outcome, and the goal is to find the best model involving a subset of these predictors. For example, one may be interested in predicting by how much a particular cholesterol-lowering drug will lower a particular person's cholesterol level, based on the person's age, gender, weight, and various dietary and lifestyle factors.

Definition and properties

Mallows' Cp addresses the issue of overfitting
Overfitting
In statistics, overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship. Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations...

, in which model selection statistics such as the residual sum of squares always get smaller as more variables are added to a model. Thus, if we aim to select the model giving the smallest residual sum of squares, the model including all variables would always be selected. The
Cp statistic calculated on a sample
Sample (statistics)
In statistics, a sample is a subset of a population. Typically, the population is very large, making a census or a complete enumeration of all the values in the population impractical or impossible. The sample represents a subset of manageable size...

 of data estimates the mean squared prediction error (MSPE) as its population
Statistical population
A statistical population is a set of entities concerning which statistical inferences are to be drawn, often based on a random sample taken from the population. For example, if we were interested in generalizations about crows, then we would describe the set of crows that is of interest...

 target


where is the fitted value from the regression model for the
jth case, E(Yj | Xj) is the expected value for the jth case, and σ2 is the error variance (assumed constant across the cases). The MSPE will not automatically get smaller as more variables are added. The optimum model under this criterion is a compromise influenced by the sample size, the effect size
Effect size
In statistics, an effect size is a measure of the strength of the relationship between two variables in a statistical population, or a sample-based estimate of that quantity...

s of the different predictors, and the degree of collinearity
Collinearity
A set of points is collinear if they lie on a single line. Related concepts include:In mathematics:...

 between them.

If
P regressors are selected from a set of K > P, the Cp statistic for that particular set of regressors is defined as :


where
  • is the error sum of squares
    Sum of squares
    The partition of sums of squares is a concept that permeates much of inferential statistics and descriptive statistics. More properly, it is the partitioning of sums of squared deviations or errors. Mathematically, the sum of squared deviations is an unscaled, or unadjusted measure of dispersion...

     for the model with
    P regressors,
  • Ypi is the predicted value of the ith observation of Y from the P regressors,
  • S2 is the residual mean square after regression
    Regression analysis
    In statistics, regression analysis includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables...

     on the complete set of K regressors and can be estimated by mean square error MSE,
  • and N is the sample size
    Sample size
    Sample size determination is the act of choosing the number of observations to include in a statistical sample. The sample size is an important feature of any empirical study in which the goal is to make inferences about a population from a sample...

    .

Practical use

The Cp statistic is often used as a stopping rule for various forms of stepwise regression
Stepwise regression
In statistics, stepwise regression includes regression models in which the choice of predictive variables is carried out by an automatic procedure...

. Mallows proposed the statistic as a criterion for selecting among many alternative subset regressions. Under a model not suffering from appreciable lack of fit (bias), Cp has expectation nearly equal to P; otherwise the expectation is roughly P plus a positive bias term. Nevertheless, even though it has expectation greater than or equal to P, there is nothing to prevent Cp < P or even Cp < 0 in extreme cases. It is suggested that one should choose a subset that has Cp approaching P, from above, for a list of subsets ordered by increasing P. In practice, the positive bias can be adjusted for by selecting a model from the ordered list of subsets, such that Cp < 2P.

Since the sample-based Cp statistic is an estimate of the MSPE, using Cp for model selection does not completely guard against overfitting. For instance, it is possible that the selected model will be one in which the sample Cp was a particularly severe underestimate of the MSPE.

Model selection statistics such as Cp are generally not used blindly, but rather information about the field of application, the intended use of the model, and any known biases in the data are taken into account in the process of model selection.
The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK