Regression toward the mean
In statistics, regression toward the mean (also known as regression to the mean) is the phenomenon that if a variable is extreme on its first measurement, it will tend to be closer to the average on a second measurement, and (a fact that may superficially seem paradoxical) if it is extreme on a second measurement, it will tend to have been closer to the average on the first measurement. To avoid making wrong inferences, the possibility of regression toward the mean must be considered when designing experiments and interpreting experimental, survey, and other empirical data in the physical, life, behavioral and social sciences.

The conditions under which regression toward the mean occurs depend on the way the term is mathematically defined. Sir Francis Galton first observed the phenomenon in the context of simple linear regression of data points. However, a less restrictive approach is possible. Regression towards the mean can be defined for any bivariate distribution with identical marginal distributions. Two such definitions exist. One definition accords closely with the common usage of the term “regression towards the mean”. Not all such bivariate distributions show regression towards the mean under this definition. However, all such bivariate distributions show regression towards the mean under the other definition.

Historically, what is now called regression toward the mean has also been called reversion to the mean and reversion to mediocrity.

Conceptual background

Consider a simple example: a class of students takes a 100-item true/false test on a subject. Suppose that all students choose randomly on all questions. Then, each student’s score would be a realization of one of a set of independent and identically distributed random variables, with a mean of 50. Naturally, some students will score substantially above 50 and some substantially below 50 just by chance. If one takes only the top scoring 10% of the students and gives them a second test on which they again choose randomly on all items, the mean score would again be expected to be close to 50. Thus the mean of these students would “regress” all the way back to the mean of all students who took the original test. No matter what a student scores on the original test, the best prediction of that student's score on the second test is 50.
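
The pure-guessing example can be checked with a small simulation. The following is a minimal sketch, not a definitive model: the class size of 2,000, the seed, and the function name `simulate_guessing` are assumptions made only for illustration.

```python
import random


def simulate_guessing(n_students=2000, n_items=100, top_fraction=0.10, seed=0):
    """Return (mean first-test score of the top scorers, their mean retest score)."""
    rng = random.Random(seed)

    def guess_score():
        # Each true/false item is answered correctly with probability 1/2.
        return sum(rng.random() < 0.5 for _ in range(n_items))

    first = [guess_score() for _ in range(n_students)]

    # Indices of the top-scoring fraction on the first test.
    ranked = sorted(range(n_students), key=lambda i: first[i], reverse=True)
    top = ranked[: int(n_students * top_fraction)]

    # The retest is pure guessing again, independent of the first test.
    second = {i: guess_score() for i in top}

    top_first_mean = sum(first[i] for i in top) / len(top)
    top_second_mean = sum(second[i] for i in top) / len(top)
    return top_first_mean, top_second_mean


top_first_mean, top_second_mean = simulate_guessing()
print(f"top 10% on the first test: {top_first_mean:.1f}")
print(f"same students on the retest: {top_second_mean:.1f}")
```

Under these assumptions the selected group's first-test mean sits well above 50, while its retest mean should land close to 50, because nothing about the first score carries over to the second.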

If there were no luck or random guessing involved in the answers supplied by students to the test questions, then all students would score the same on the second test as they scored on the original test, and there would be no regression toward the mean.

Most realistic situations fall between these two extremes: for example, one might consider exam scores as a combination of skill and luck. In this case, the subset of students scoring above average would be composed of those who were skilled and had not especially bad luck, together with those who were unskilled, but were extremely lucky. On a retest of this subset, the unskilled will be unlikely to repeat their lucky break, while the skilled will have a second chance to have bad luck. Hence, those who did well previously are unlikely to do quite as well in the second test.

The following is a second example of regression toward the mean. A class of students takes two editions of the same test on two successive days. It has frequently been observed that the worst performers on the first day will tend to improve their scores on the second day, and the best performers on the first day will tend to do worse on the second day. The phenomenon occurs because student scores are determined in part by underlying ability and in part by chance. For the first test, some will be lucky, and score more than their ability, and some will be unlucky and score less than their ability. Some of the lucky students on the first test will be lucky again on the second test, but more of them will have (for them) average or below average scores. Therefore a student who was lucky on the first test is more likely to have a worse score on the second test than a better score. Similarly, students who score less than the mean on the first test will tend to see their scores increase on the second test.
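
The skill-plus-luck account can be sketched as a simulation. This is an illustration under stated assumptions, not a model of real exams: scores are taken to be normally distributed skill plus independent normally distributed luck on each day, and the mean of 70, the standard deviations, and the name `retest_change` are all hypothetical.

```python
import random


def retest_change(luck_sd, skill_sd=10.0, n_students=5000, seed=1):
    """Mean day-2 minus day-1 change for the worst and best 10% on day 1."""
    rng = random.Random(seed)
    # Assumed model: observed score = fixed skill + fresh luck on each day.
    skill = [rng.gauss(70, skill_sd) for _ in range(n_students)]
    day1 = [s + rng.gauss(0, luck_sd) for s in skill]
    day2 = [s + rng.gauss(0, luck_sd) for s in skill]

    ranked = sorted(range(n_students), key=lambda i: day1[i])
    k = n_students // 10
    worst, best = ranked[:k], ranked[-k:]

    def mean_change(group):
        return sum(day2[i] - day1[i] for i in group) / len(group)

    return mean_change(worst), mean_change(best)


for luck_sd in (0.0, 10.0):
    worst_change, best_change = retest_change(luck_sd)
    print(f"luck_sd={luck_sd}: worst 10% change {worst_change:+.1f}, "
          f"best 10% change {best_change:+.1f}")
```

With `luck_sd=0.0` every student repeats the same score, so neither group moves; with luck present, the worst performers should improve on average and the best performers should decline, even though no student's skill has changed.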

History

The concept of regression comes from genetics and was popularized by Sir Francis Galton during the late 19th century with the publication of Regression towards mediocrity in hereditary stature. Galton observed that extreme characteristics (e.g., height) in parents are not passed on completely to their offspring. Rather, the characteristics in the offspring regress towards a mediocre point (a point which has since been identified as the mean). By measuring the heights of hundreds of people, he was able to quantify regression to the mean, and estimate the size of the effect. Galton wrote that, “the average regression of the offspring is a constant fraction of their respective mid-parental deviations”. This means that the difference between a child and its parents for some characteristic is proportional to its parents' deviation from typical people in the population. So if its parents are each two inches taller than the averages for men and women, on average it will be shorter than its parents by some factor (which, today, we would call one minus the regression coefficient) times two inches. For height, Galton estimated this coefficient to be about 2/3: the height of an individual will measure around a mid-point that is two thirds of the parents’ deviation from the population average.
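
As a rough worked example of this arithmetic, take Galton's estimate of 2/3 at face value and a mid-parental deviation of two inches. The symbols b (the regression coefficient) and d (the mid-parental deviation) are introduced here only for illustration.

```latex
\[
b = \tfrac{2}{3}, \qquad d = 2\ \text{in}
\]
\[
\text{expected offspring deviation} = b\,d = \tfrac{4}{3} \approx 1.33\ \text{in}
\]
\[
\text{expected shortfall relative to the parents} = (1 - b)\,d = \tfrac{2}{3} \approx 0.67\ \text{in}
\]
```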

Galton coined the term regression to describe an observable fact in the inheritance of multi-factorial quantitative genetic traits: namely that the offspring of parents who lie at the tails of the distribution will tend to lie closer to the centre, the mean, of the distribution. He quantified this trend, and in doing so invented linear regression analysis, thus laying the groundwork for much of modern statistical modelling. Since then, the term "regression" has taken on a variety of meanings, and it may be used by modern statisticians to describe phenomena of sampling bias which have little to do with Galton's original observations in the field of genetics.

Also, Galton's explanation for the regression phenomenon he observed is now known to be incorrect. He stated: “A child inherits partly from his parents, partly from his ancestors. Speaking generally, the further his genealogy goes back, the more numerous and varied will his ancestry become, until they cease to differ from any equally numerous sample taken at haphazard from the race at large.” This is incorrect, since a child receives its genetic makeup entirely and exclusively from its parents. There is no generation skipping in genetic material: any genetic material from ancestors other than (before) the parents must have passed through the parents. Instead, the phenomenon is easy to understand if we assume that the inherited trait (e.g. height) is controlled by a large number of recessive genes. Exceptionally tall individuals must be homozygous for increased height mutations on a large proportion of these loci. But the loci which carry these mutations are not necessarily shared between two tall individuals, and if these individuals mate, their offspring will be on average homozygous for "tall" mutations on fewer loci than either of their parents. In addition, height is not entirely genetically determined, but also subject to numerous random environmental influences during development, which are also likely to make offspring of exceptional parents on average more average than their parents.

In sharp contrast to this population genetic phenomenon of regression to the mean, which is best thought of as a combination of a binomially distributed process of inheritance and normally distributed environmental influences, the term "regression to the mean" is now often used to describe completely different phenomena in which an initial sampling bias may disappear as new, repeated or larger samples display sample means which are closer to the true underlying population mean.

Importance

Regression toward the mean is a significant consideration in the design of experiments.

Take a hypothetical example of 1,000 individuals of a similar age who were examined and scored on the risk of experiencing a heart attack. Statistics could be used to measure the success of an intervention on the 50 who were rated at the greatest risk. The intervention could be a change in diet, exercise, or a drug treatment. Even if the interventions are worthless, the test group would be expected to show an improvement on their next physical exam, because of regression toward the mean. The best way to combat this effect is to divide the group randomly into a treatment group that receives the treatment, and a control group that does not. The treatment would then be judged effective only if the treatment group improves more than the control group.
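
The value of a randomized control group can be illustrated with a simulation of a worthless intervention. This is a sketch under assumptions that are not in the example itself: a measured risk score is modeled as a stable true risk plus independent measurement noise at each exam, and the score scale, noise level, and the name `trial` are hypothetical.

```python
import random


def trial(n=1000, n_selected=50, true_sd=10.0, noise_sd=10.0, seed=2):
    """Mean drop in measured risk for treated and control halves of the top group."""
    rng = random.Random(seed)
    # Assumed model: measured risk = stable true risk + noise at each exam.
    true_risk = [rng.gauss(50, true_sd) for _ in range(n)]
    exam1 = [t + rng.gauss(0, noise_sd) for t in true_risk]

    # Select the individuals rated at greatest risk on the first exam.
    selected = sorted(range(n), key=lambda i: exam1[i], reverse=True)[:n_selected]

    # Randomly split them into a treatment group and a control group.
    rng.shuffle(selected)
    half = n_selected // 2
    treated, control = selected[:half], selected[half:]

    # Worthless intervention: the second exam is generated identically for both groups.
    exam2 = {i: true_risk[i] + rng.gauss(0, noise_sd) for i in selected}

    def mean_drop(group):
        return sum(exam1[i] - exam2[i] for i in group) / len(group)

    return mean_drop(treated), mean_drop(control)


treated_drop, control_drop = trial()
print(f"treated group mean drop: {treated_drop:.1f}")
print(f"control group mean drop: {control_drop:.1f}")
```

Both groups should show a sizeable apparent improvement even though the treatment does nothing, which is why only the difference between the groups, not the raw change in the treated group, is evidence about the treatment.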

Alternatively, a group of disadvantaged children could be tested to identify the ones with most college potential. The top 1% could be identified and supplied with special enrichment courses, tutoring, counseling and computers. Even if the program is effective, their average scores may well be less when the test is repeated a year later. However, in these circumstances it may be considered unfair to have a control group of disadvantaged children whose special needs are ignored. A mathematical calculation for shrinkage can adjust for this effect, although it will not be as reliable as the control group method (see also Stein's example).

The effect can also be exploited for general inference and estimation. The hottest place in the country today is more likely to be cooler tomorrow than hotter. The best performing mutual fund over the last three years is more likely to see relative performance decline than improve over the next three years. The most successful Hollywood actor of this year is more likely to gross less on his or her next movie than to gross more. The baseball player with the greatest batting average by the All-Star break is more likely to have a lower average than a higher average over the second half of the season.

Misunderstandings

The concept of regression toward the mean can be misused very easily.

In the student test example above, it was assumed implicitly that what was being measured did not change between the two measurements.
But suppose it was a pass/fail course and you had to score above 70 on both tests to pass. Then the students who scored under 70 the first time would have no incentive to do well, and might score worse on average the second time. The students just over 70, on the other hand, would have a strong incentive to study overnight and concentrate while taking the test. In that case you might see movement away from 70, scores below it getting lower and scores above it getting higher. It is possible for changes between the measurement times to augment, offset or reverse the statistical tendency to regress toward the mean.

Statistical regression toward the mean is not a causal phenomenon. A student with the worst score on the test on the first day will not necessarily increase her score substantially on the second day due to the effect. On average the worst scorers improve, but that's only true because the worst scorers are more likely to have been unlucky than lucky. To the extent that a score is determined randomly, or that a score has random variation or error, as opposed to being determined by the student's academic ability or being a "true value", the phenomenon will have an effect. A classic mistake in this regard was in education. The students that received praise for good work were noticed to do more poorly on the next measure, and the students who were punished for poor work were noticed to do better on the next measure. The educators decided to stop praising and keep punishing on this basis. Such a decision was a mistake, because regression toward the mean is not based on cause and effect, but rather on random error in a natural distribution around a mean.

Although individual measurements regress toward the mean, the second sample of measurements will be no closer to the mean than the first. Consider the students again. Suppose their tendency is to regress 10% of the way toward the mean of 80, so a student who scored 100 the first day is expected to score 98 the second day, and a student who scored 70 the first day is expected to score 71 the second day. Those expectations are closer to the mean, on average, than the first day scores. But the second day scores will vary around their expectations; some will be higher and some will be lower. This will make the second set of measurements farther from the mean, on average, than their expectations. The effect is the exact reverse of regression toward the mean, and exactly offsets it. So for every individual, we expect the second score to be closer to the mean than the first score, but for all individuals, we expect the average distance from the mean to be the same on both sets of measurements.
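
The offsetting effect can be made concrete with a sketch. The assumptions here go beyond the example: scores are taken to be normally distributed around 80 with a standard deviation of 8 (ignoring the cap at 100), the second-day score is the 10%-regressed expectation plus normal noise sized so that both days have the same variance, and the name `two_days` is hypothetical.

```python
import math
import random


def two_days(n=20000, mean=80.0, sd=8.0, r=0.9, seed=3):
    """Average distance from the mean for day 1, day-2 expectations, and day 2."""
    rng = random.Random(seed)
    day1 = [rng.gauss(mean, sd) for _ in range(n)]

    # Each expectation regresses 10% of the way toward the mean (r = 0.9).
    expected2 = [mean + r * (x - mean) for x in day1]

    # Noise around the expectation, sized so day 2 has the same variance as day 1.
    resid_sd = sd * math.sqrt(1 - r * r)
    day2 = [e + rng.gauss(0, resid_sd) for e in expected2]

    def mean_abs_dev(xs):
        return sum(abs(x - mean) for x in xs) / len(xs)

    return mean_abs_dev(day1), mean_abs_dev(expected2), mean_abs_dev(day2)


dist_day1, dist_expected2, dist_day2 = two_days()
print(f"day 1 scores:        {dist_day1:.2f}")
print(f"day 2 expectations:  {dist_expected2:.2f}")
print(f"day 2 scores:        {dist_day2:.2f}")
```

In this model the expectations are closer to 80 than the first-day scores, but the realized second-day scores should be about as far from 80, on average, as the first-day scores were.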

Related to the point above, regression toward the mean works equally well in both directions. We expect the student with the highest test score on the second day to have done worse on the first day. And if we compare the best student on the first day to the best student on the second day, regardless of whether it is the same individual or not, there is a tendency to regress toward the mean going in either direction. We expect the best scores on both days to be equally far from the mean.

Regression fallacies

Many phenomena tend to be attributed to the wrong causes when regression to the mean is not taken into account.

An extreme example is Horace Secrist’s 1933 book The Triumph of Mediocrity in Business, in which the statistics professor collected mountains of data to prove that the profit rates of competitive businesses tend toward the average over time. In fact, there is no such effect; the variability of profit rates is almost constant over time. Secrist had only described the common regression toward the mean. One exasperated reviewer, Harold Hotelling, likened the book to “proving the multiplication table by arranging elephants in rows and columns, and then doing the same for numerous other kinds of animals”.

The calculation and interpretation of “improvement scores” on standardized educational tests in Massachusetts probably provides another example of the regression fallacy. In 1999, schools were given improvement goals. For each school, the Department of Education tabulated the difference in the average score achieved by students in 1999 and in 2000. It was quickly noted that most of the worst-performing schools had met their goals, which the Department of Education took as confirmation of the soundness of their policies. However, it was also noted that many of the supposedly best schools in the Commonwealth, such as Brookline High School (with 18 National Merit Scholarship finalists) were declared to have failed. As in many cases involving statistics and public policy, the issue is debated, but “improvement scores” were not announced in subsequent years and the findings appear to be a case of regression to the mean.

The psychologist Daniel Kahneman, winner of the 2002 Nobel prize in economics, pointed out that regression to the mean might explain why rebukes can seem to improve performance, while praise seems to backfire.

UK law enforcement policies have encouraged the visible siting of static or mobile speed cameras at accident blackspots. This policy was justified by a perception that there is a corresponding reduction in serious road traffic accidents after a camera is set up. However, statisticians have pointed out that, although there is a net benefit in lives saved, failure to take into account the effects of regression to the mean results in the beneficial effects being overstated. It is thus claimed that some of the money currently spent on traffic cameras could be more productively directed elsewhere.

Statistical analysts have long recognized the effect of regression to the mean in sports; they even have a special name for it: the “Sophomore Slump”. For example, Carmelo Anthony of the NBA’s Denver Nuggets had an outstanding rookie season in 2004. It was so outstanding, in fact, that he couldn’t possibly be expected to repeat it: in 2005, Anthony’s numbers had dropped from his rookie season. The reasons for the “sophomore slump” abound, as sports are all about adjustment and counter-adjustment, but luck-based excellence as a rookie is as good a reason as any.

Regression to the mean in sports performance may be the reason for the “Sports Illustrated Cover Jinx” and the “Madden Curse”. John Hollinger has an alternate name for the phenomenon of regression to the mean: the “fluke rule”, while Bill James calls it the “Plexiglas Principle”.

Because popular lore has focused on “regression toward the mean” as an account of declining performance of athletes from one season to the next, it has usually overlooked the fact that such regression can also account for improved performance. For example, if one looks at the batting average of Major League Baseball players in one season, those whose batting average was above the league mean tend to regress downward toward the mean the following year, while those whose batting average was below the mean tend to progress upward toward the mean the following year.

Other statistical phenomena

Regression toward the mean simply says that, following an extreme random event, the next random event is likely to be less extreme. In no sense does the future event "compensate for" or "even out" the previous event, though this is assumed in the gambler's fallacy (and its variant, the law of averages). Similarly, the law of large numbers states that in the long term, the average will tend towards the expected value, but makes no statement about individual trials. For example, following a run of 10 heads on a flip of a fair coin (a rare, extreme event), regression to the mean states that the next run of heads will likely be less than 10, while the law of large numbers states that in the long term, this event will likely average out, and the average fraction of heads will tend to 1/2. By contrast, the gambler's fallacy incorrectly assumes that the coin is now "due" for a run of tails, to balance out.
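
The contrast with the gambler's fallacy can be checked numerically. The sketch below simulates a fair coin and looks at the flip that follows each streak of heads; the streak length of 5 (shorter than the 10 in the example, so that enough streaks occur), the number of flips, and the name `next_flip_after_streak` are assumptions for illustration.

```python
import random


def next_flip_after_streak(n_flips=200_000, streak=5, seed=4):
    """Fraction of heads right after a streak of heads, and overall fraction of heads."""
    rng = random.Random(seed)
    flips = [rng.random() < 0.5 for _ in range(n_flips)]  # True means heads

    # Collect every flip that is immediately preceded by `streak` heads in a row.
    followers = [
        flips[i]
        for i in range(streak, n_flips)
        if all(flips[i - streak:i])
    ]
    return sum(followers) / len(followers), sum(flips) / n_flips


after_streak, overall = next_flip_after_streak()
print(f"fraction of heads right after a streak: {after_streak:.3f}")
print(f"overall fraction of heads:              {overall:.3f}")
```

Both fractions should come out near 1/2: the coin is not "due" for tails after a streak, and the long-run average settles toward the expected value without any compensation on individual flips.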

Definition for simple linear regression of data points

This is the definition of regression toward the mean that closely follows Sir Francis Galton's original usage.

Suppose there are n data points {yi, xi}, where i = 1, 2, …, n. We want to find the equation of the regression line, i.e. the straight line

\[ y = \alpha + \beta x \]

which would provide a “best” fit for the data points. (Note that a straight line may not be the appropriate regression curve for the given data points.) Here the “best” will be understood as in the least-squares approach: such a line that minimizes the sum of squared residuals of the linear regression model. In other words, numbers α and β solve the following minimization problem:

Find $\min_{\alpha,\,\beta} Q(\alpha, \beta)$, where

\[ Q(\alpha, \beta) = \sum_{i=1}^n (y_i - \alpha - \beta x_i)^2 \]

Using simple calculus it can be shown that the values of α and β that minimize the objective function Q are

\[ \hat\beta = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - \bar{x}^2} = r_{xy} \frac{s_y}{s_x}, \qquad \hat\alpha = \bar{y} - \hat\beta\,\bar{x} \]

where rxy is the sample correlation coefficient between x and y, sx is the standard deviation of x, and sy is correspondingly the standard deviation of y. A horizontal bar over a variable means the sample average of that variable. For example: $\overline{xy} = \tfrac{1}{n}\sum_{i=1}^n x_i y_i$.

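Substituting the above expressions for $\hat\alpha$ and $\hat\beta$ into $y = \alpha + \beta x$ yields fitted values

\[ \hat{y} = \hat\alpha + \hat\beta x = \bar{y} + r_{xy} \frac{s_y}{s_x} (x - \bar{x}) \]

which yields

\[ \frac{\hat{y} - \bar{y}}{s_y} = r_{xy} \frac{x - \bar{x}}{s_x} \]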

This shows the role rxy plays in the regression line of standardized data points.

If −1 < rxy < 1, then we say that the data points exhibit regression toward the mean. In other words, if linear regression is the appropriate model for a set of data points whose sample correlation coefficient is not perfect, then there is regression toward the mean. The predicted (or fitted) standardized value of y is closer to its mean than the standardized value of x is to its mean.
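
The relationship between the least-squares slope, the correlation coefficient, and the standardized fitted values can be verified on a small data set. The following is a minimal sketch: the five data points are made up purely for illustration, and the helper name `ols_summary` is hypothetical.

```python
import math


def ols_summary(xs, ys):
    """Return (alpha, beta, r, s_x, s_y) for a simple least-squares fit."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

    beta = sxy / sxx                  # least-squares slope
    alpha = y_bar - beta * x_bar      # least-squares intercept
    r = sxy / math.sqrt(sxx * syy)    # sample correlation coefficient
    s_x = math.sqrt(sxx / n)          # standard deviations (same divisor for both)
    s_y = math.sqrt(syy / n)
    return alpha, beta, r, s_x, s_y


# Made-up illustrative data, not taken from any real study.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]
alpha, beta, r, s_x, s_y = ols_summary(xs, ys)

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
x_new = 5.0
z_x = (x_new - x_bar) / s_x                        # standardized x
z_pred = (alpha + beta * x_new - y_bar) / s_y      # standardized fitted y

print(f"slope = {beta:.3f}, r * s_y / s_x = {r * s_y / s_x:.3f}")
print(f"standardized x = {z_x:.3f}, standardized fitted y = {z_pred:.3f}")
```

The slope agrees with r times the ratio of standard deviations, and the standardized fitted value equals r times the standardized x, so it lies closer to zero whenever the correlation is not perfect.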

Restrictive definition

Let X1, X2 be random variables with identical marginal distributions with mean μ. In this formalization, the bivariate distribution of X1 and X2 is said to exhibit regression toward the mean if, for every number c > μ, we have
μ ≤ E[X2 | X1 = c] < c,

with the reverse inequalities holding for c < μ.

The following is an informal description of the above definition. Consider a population of widgets. Each widget has two numbers, X1 and X2 (say, its left span (X1) and right span (X2)). Suppose that the probability distributions of X1 and X2 in the population are identical, and that the means of X1 and X2 are both μ. We now take a random widget from the population, and denote its X1 value by c. (Note that c may be greater than, equal to, or smaller than μ.) We have no access to the value of this widget's X2 yet. Let d denote the expected value of X2 of this particular widget (i.e. let d denote the average value of X2 of all widgets in the population with X1 = c). If the following condition is true:
Whatever the value c is, d lies between μ and c (i.e. d is closer to μ than c is),

then we say that X1 and X2 show regression toward the mean.

This definition accords closely with the current common usage, evolved from Galton's original usage, of the term "regression toward the mean." It is "restrictive" in the sense that not every bivariate distribution with identical marginal distributions exhibits regression toward the mean (under this definition).

Theorem

If a pair (X, Y) of random variables follows a bivariate normal distribution, then the conditional mean E(Y|X) is a linear function of X. The correlation coefficient r between X and Y, along with the marginal means and variances of X and Y, determines this linear relationship:

\[ \frac{E(Y \mid X) - EY}{\sigma_y} = r\,\frac{X - EX}{\sigma_x} \]

where EX and EY are the expected values of X and Y, respectively, and σx and σy are the standard deviations of X and Y, respectively.

Hence the conditional expected value of Y, given that X is t standard deviations above its mean (and that includes the case where it's below its mean, when t < 0), is rt standard deviations above the mean of Y. Since |r| ≤ 1, Y is no farther from the mean than X is, as measured in the number of standard deviations.

Hence, if 0 ≤ r < 1, then (X, Y) shows regression toward the mean (by this definition).
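
As a small worked instance of the theorem, suppose (as an assumption chosen only for illustration) that r = 0.5 and that X is observed t = 2 standard deviations above its mean:

```latex
\[
\frac{E(Y \mid X) - EY}{\sigma_y} = r\,t = 0.5 \times 2 = 1
\]
```

The conditional mean of Y is then one standard deviation above EY, half as far from its mean (in standard units) as X is from its own.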

General definition

The following definition of reversion toward the mean has been proposed by Samuels as an alternative to the more restrictive definition of regression toward the mean above.

Let X1, X2 be random variables with identical marginal distributions with mean μ. In this formalization, the bivariate distribution of X1 and X2 is said to exhibit reversion toward the mean if, for every number c, we have
μ ≤ E[X2 | X1 > c] < E[X1 | X1 > c], and

μ ≥ E[X2 | X1 < c] > E[X1 | X1 < c]


This definition is "general" in the sense that every bivariate distribution with identical marginal distributions exhibits reversion toward the mean.
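
The upper-tail inequality of this definition can be examined empirically on a non-normal example. The sketch below is an illustration only: the construction (a shared exponential component plus an independent exponential component for each variable, giving identical marginals with mean μ = 2), the thresholds, and the name `tail_means` are assumptions, and only the first of the two inequalities is checked.

```python
import random


def tail_means(c, n=50_000, seed=5):
    """Sample estimates of E[X1 | X1 > c] and E[X2 | X1 > c]."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        # Assumed construction: X1 and X2 share one component, so they are
        # dependent but have identical marginal distributions with mean 2.
        shared = rng.expovariate(1.0)
        x1 = shared + rng.expovariate(1.0)
        x2 = shared + rng.expovariate(1.0)
        pairs.append((x1, x2))

    above = [(x1, x2) for x1, x2 in pairs if x1 > c]
    m1 = sum(x1 for x1, _ in above) / len(above)
    m2 = sum(x2 for _, x2 in above) / len(above)
    return m1, m2


mu = 2.0
for c in (1.0, 2.0, 3.0, 4.0):
    m1, m2 = tail_means(c)
    print(f"c={c}: mu={mu}  E[X2|X1>c]~{m2:.2f}  E[X1|X1>c]~{m1:.2f}")
```

For each threshold the estimate of E[X2 | X1 > c] should fall between μ and the estimate of E[X1 | X1 > c], matching the first inequality above.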



The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 