Misuse of statistics
A misuse of statistics occurs when a statistical argument asserts a falsehood. In some cases, the misuse may be accidental. In others, it is purposeful and for the gain of the perpetrator. When the statistical reason involved is false or misapplied, this constitutes a statistical fallacy.
The false statistics trap can be quite damaging to the quest for knowledge. For example, in medical science, correcting a falsehood may take decades and cost lives.
Misuses can be easy to fall into. Professional scientists, even mathematicians and professional statisticians, can be fooled by even simple methods, even if they are careful to check everything. Scientists have been known to fool themselves with statistics due to lack of knowledge of probability theory and lack of standardization of their tests.
Discarding unfavorable data
All a company has to do to promote a neutral (useless) product is to find or conduct, for example, 40 studies with a confidence level of 95%. If the product is really useless, this would on average produce one study showing the product was beneficial, one study showing it was harmful, and thirty-eight inconclusive studies (38 is 95% of 40). This tactic becomes more effective the more studies there are available. Organizations that do not publish every study they carry out, such as tobacco companies denying a link between smoking and cancer, anti-smoking advocacy groups and media outlets trying to prove a link between smoking and various ailments, or miracle pill vendors, are likely to use this tactic.
Another common technique is to perform a study that tests a large number of dependent (response) variables at the same time. For example, a study testing the effect of a medical treatment might use as dependent variables the probability of survival, the average number of days spent in the hospital, the patient's self-reported level of pain, etc. This also increases the likelihood that at least one of the variables will by chance show a correlation with the independent (explanatory) variable.
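The arithmetic behind both tactics can be sketched in a few lines. This is a minimal illustration, assuming each study (or each response variable) is an independent two-sided test at the 5% level on a product with no real effect; the function names are made up for this example.

```python
# Chance of spurious "findings" when a useless product is tested many times.
# Assumption: every test is independent and run at a two-sided 5% level.
ALPHA = 0.05


def prob_at_least_one_false_positive(k, alpha=ALPHA):
    """Probability that at least one of k independent null tests looks significant."""
    return 1.0 - (1.0 - alpha) ** k


def expected_outcomes(k, alpha=ALPHA):
    """Expected study counts when the product truly has no effect."""
    return {
        "beneficial": k * alpha / 2,      # false positives in the favorable tail
        "harmful": k * alpha / 2,         # false positives in the unfavorable tail
        "inconclusive": k * (1 - alpha),  # tests that correctly show nothing
    }


print(expected_outcomes(40))
print(round(prob_at_least_one_false_positive(40), 3))
```

With 40 studies the expected split matches the one-beneficial, one-harmful, thirty-eight-inconclusive figures in the text, and the chance that at least one study looks significant is well above one half.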
Loaded questions
The answers to surveys can often be manipulated by wording the question so as to steer the respondent towards a certain answer. For example, in polling support for a war, the questions:
- Do you support the attempt by the USA to bring freedom and democracy to other places in the world?
- Do you support the unprovoked military action by the USA?
will likely produce results skewed in different directions, although both are polling support for the same war.
Another way to do this is to precede the question by information that supports the "desired" answer. For example, more people will likely answer "yes" to the question "Given the increasing burden of taxes on middle-class families, do you support cuts in income tax?" than to the question "Considering the rising federal budget deficit and the desperate need for more revenue, do you support cuts in income tax?"
Overgeneralization
Overgeneralization is a fallacy occurring when a statistic about a particular population is asserted to hold among members of a group for which the original population is not a representative sample.
For example, suppose 100% of apples are observed to be red in summer. The assertion "All apples are red" would be an instance of overgeneralization because the original statistic was true only of a specific subset of apples (those in summer), which is not expected to be representative of the population of apples as a whole.
A real-world example of the overgeneralization fallacy can be observed as an artifact of modern polling techniques, which prohibit calling cell phones for over-the-phone political polls. Young people are more likely than other demographic groups to have only a cell phone rather than a conventional "landline" phone. Young people are also more likely to be liberal, and young people who do not own a landline phone are even more likely to be liberal than their demographic as a whole. Such polls therefore effectively exclude many voters who are more likely to be liberal.
Thus, a poll examining the voting preferences of young people using this technique could not claim to be representative of young people's true voting preferences as a whole without overgeneralizing, because the sample used is not representative of the population as a whole.
Overgeneralization often occurs when information is passed through nontechnical sources, in particular mass media.
Misreporting or misunderstanding of estimated error
If a research team wants to know how 300 million people feel about a certain topic, it would be impractical to ask all of them. However, if the team picks a random sample of about 1000 people, they can be fairly certain that the results given by this group are representative of what the larger group would have said if they had all been asked.
This confidence can actually be quantified by the central limit theorem and other mathematical results. Confidence is expressed as a probability of the true result (for the larger group) being within a certain range of the estimate (the figure for the smaller group). This is the "plus or minus" figure often quoted for statistical surveys. The probability part of the confidence level is usually not mentioned; if so, it is assumed to be a standard number like 95%.
The two numbers are related. If a survey has an estimated error of ±5% at 95% confidence, it also has an estimated error of ±6.6% at 99% confidence. In general, ±x% at 95% confidence corresponds to roughly ±1.3x% at 99% confidence.
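The link between the two confidence levels can be checked numerically. The sketch below assumes the survey estimate is approximately normally distributed, so that the margin of error scales with the standard normal quantile; `rescale_margin` is a hypothetical helper name, and `NormalDist` comes from the Python standard library.

```python
from statistics import NormalDist


def rescale_margin(margin, conf_from, conf_to):
    """Convert a margin of error between two-sided confidence levels.

    Assumes an approximately normal sampling distribution.
    """
    z = NormalDist()  # standard normal distribution
    z_from = z.inv_cdf(0.5 + conf_from / 2)  # e.g. about 1.96 for 95%
    z_to = z.inv_cdf(0.5 + conf_to / 2)      # e.g. about 2.58 for 99%
    return margin * z_to / z_from


# A ±5% margin at 95% confidence, restated at 99% confidence.
print(round(rescale_margin(5.0, 0.95, 0.99), 1))
```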

The smaller the estimated error, the larger the required sample, at a given confidence level.
At 95.4% confidence (two standard deviations, by the 68-95-99.7 rule):
±1% would require 10,000 people.
±2% would require 2,500 people.
±3% would require 1,111 people.
±4% would require 625 people.
±5% would require 400 people.
±10% would require 100 people.
±20% would require 25 people.
±25% would require 16 people.
±50% would require 4 people.
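The figures listed above are consistent with the usual sample-size formula for a proportion, n = z² · p(1 − p) / E², evaluated in the worst case p = 0.5 with z = 2 (two standard deviations). The following sketch reproduces them under those assumptions; the function name is illustrative only.

```python
def required_sample_size(margin, z=2.0, p=0.5):
    """Sample size for a given margin of error on a proportion.

    Uses n = z^2 * p * (1 - p) / margin^2, rounded to the nearest person.
    z = 2 corresponds to about 95.4% confidence; p = 0.5 is the worst case.
    """
    return round(z * z * p * (1.0 - p) / margin ** 2)


for pct in (1, 2, 3, 4, 5, 10, 20, 25, 50):
    print(f"±{pct}% would require {required_sample_size(pct / 100)} people")
```

Because the required sample grows with the inverse square of the margin, halving the estimated error quadruples the number of people who must be polled.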
Most people assume, because the confidence figure is omitted, that there is a 100% certainty that the true result is within the estimated error. This is not mathematically correct.
Many people may not realize that the randomness of the sample is very important. In practice, many opinion polls are conducted by phone, which distorts the sample in several ways, including exclusion of people who do not have phones, favoring the inclusion of people who have more than one phone, favoring the inclusion of people who are willing to participate in a phone survey over those who refuse, etc. Non-random sampling makes the estimated error unreliable.
On the other hand, many people consider that statistics are inherently unreliable because not everybody is called, or because they themselves are never polled. Many people think that it is impossible to get data on the opinion of tens of millions of people by polling just a few thousand. This is also inaccurate. A poll with perfect unbiased sampling and truthful answers has a mathematically determined margin of error, which depends only on the number of people polled.
However, often only one margin of error is reported for a survey. When results are reported for population subgroups, a larger margin of error will apply, but this may not be made clear. For example, a survey of 1000 people may contain 100 people from a certain ethnic or economic group. The results focusing on that group will be much less reliable than results for the full population. If the margin of error for the full sample was 4%, say, then the margin of error for such a subgroup could be around 13%.
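The subgroup figure follows from the fact that, for simple random sampling, the margin of error shrinks with the square root of the sample size. A minimal sketch, assuming the subgroup behaves like a simple random sample of its own size (the helper name is hypothetical):

```python
from math import sqrt


def subgroup_margin(full_margin, n_full, n_sub):
    """Margin of error for a subgroup, scaled from the full-sample margin.

    Assumes margin is proportional to 1 / sqrt(sample size).
    """
    return full_margin * sqrt(n_full / n_sub)


# 100 respondents out of 1000, with a 4% margin for the full sample.
print(round(subgroup_margin(4.0, 1000, 100), 1))
```

A tenfold smaller sample widens the margin by a factor of about 3.2, which takes 4% to roughly 13%.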
There are also many other measurement problems in population surveys.
The problems mentioned above apply to all statistical experiments, not just population surveys.
False causality
When a statistical test shows a correlation between A and B, there are usually five possibilities:
- A causes B.
- B causes A.
- A and B both partly cause each other.
- A and B are both caused by a third factor, C.
- The observed correlation was due purely to chance.
The fifth possibility can be quantified by statistical tests that can calculate the probability that the correlation observed would be as large as it is just by chance if, in fact, there is no relationship between the variables. However, even if that possibility has a small probability, there are still the four others.
If the number of people buying ice cream at the beach is statistically related to the number of people who drown at the beach, then nobody would claim ice cream causes drowning because it's obvious that it isn't so. (In this case, both drowning and ice cream buying are clearly related by a third factor: the number of people at the beach).
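The role of the third factor can be illustrated with a small simulation. Everything below is invented for the illustration: the crowd sizes, the coefficients, and the noise levels are arbitrary assumptions, and the "drowning" series is a made-up risk index rather than real data. Both series depend only on crowd size, yet they correlate strongly until crowd size is adjusted for.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def residuals(ys, xs):
    """Residuals of ys after a least-squares straight-line fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
crowd = [rng.uniform(100, 1000) for _ in range(2000)]       # beachgoers per day
ice_cream = [0.3 * c + rng.gauss(0, 20) for c in crowd]     # driven by crowd only
drownings = [0.002 * c + rng.gauss(0, 0.3) for c in crowd]  # driven by crowd only

raw = pearson(ice_cream, drownings)
adjusted = pearson(residuals(ice_cream, crowd), residuals(drownings, crowd))
print(f"raw correlation: {raw:.2f}")
print(f"after adjusting for crowd size: {adjusted:.2f}")
```

The adjustment only works here because the confounder is known and measured; in observational data the third factor is often neither.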
This fallacy can be used, for example, to "prove" that exposure to a chemical causes cancer. Replace "number of people buying ice cream" with "number of people exposed to chemical X", and "number of people who drown" with "number of people who get cancer", and many people will believe you. In such a situation, there may be a statistical correlation even if there is no real effect. For example, if there is a perception that a chemical site is "dangerous" (even if it really isn't), property values in the area will decrease, which will entice more low-income families to move to that area. If low-income families are more likely to get cancer than high-income families (this can happen for many reasons, such as a poorer diet or less access to medical care), then rates of cancer will go up, even though the chemical itself is not dangerous. It is believed that this is exactly what happened with some of the early studies showing a link between EMF (electromagnetic fields) from power lines and cancer.
In well-designed studies, the effect of false causality can be eliminated by assigning some people into a "treatment group" and some people into a "control group" at random, and giving the treatment group the treatment and not giving the control group the treatment. In the above example, a researcher might expose one group of people to chemical X and leave a second group unexposed. If the first group had higher cancer rates, the researcher knows that there is no third factor that affected whether a person was exposed, because the researcher controlled who was exposed and assigned people to the exposed and non-exposed groups at random. However, in many applications, actually doing an experiment in this way is either prohibitively expensive, infeasible, unethical, illegal, or downright impossible. For example, it is highly unlikely that an IRB (institutional review board) would accept an experiment that involved intentionally exposing people to a dangerous substance in order to test its toxicity. The obvious ethical implications of such types of experiments limit researchers' ability to empirically test causation.
Proof of the null hypothesis
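In a statistical test, the null hypothesis (H₀) is considered valid until enough data proves it wrong. Then H₀ is rejected and the alternative hypothesis (H₁) is considered to be proven as correct. By chance this can happen, although H₀ is true, with a probability denoted alpha, the significance level. This can be compared to the judicial process, where the accused is considered innocent (H₀) until proven guilty (H₁) beyond reasonable doubt (alpha).
But if data does not give us enough proof to reject H₀, this does not automatically prove that H₀ is correct. If, for example, a tobacco producer wishes to demonstrate that its products are safe, it can easily conduct a test with a small sample of smokers versus a small sample of non-smokers. It is unlikely that any of them will develop lung cancer (and even if they do, the difference between the groups has to be very big in order to reject H₀). Therefore it is likely, even when smoking is dangerous, that our test will not reject H₀. If H₀ is accepted, it does not automatically follow that smoking is proven harmless. The test has insufficient power to reject H₀, so the test is useless and the value of the "proof" of H₀ is also null.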
Using the judicial analogy above, this can be compared with the truly guilty defendant who is released just because the proof is not enough for a verdict. This does not prove the defendant's innocence, but only that there is not proof enough for a verdict. In other words, "absence of evidence" does not imply "evidence of absence".
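To see why a small study can fail to find a real effect, consider how often a small group contains no cases at all. The sketch below uses a purely hypothetical per-person risk of 1% (not a real epidemiological figure) and assumes people are independent; when no one in the exposed group becomes ill, no test can detect a difference.

```python
def prob_no_cases(n, risk):
    """Probability that none of n independent people develops the disease."""
    return (1.0 - risk) ** n


HYPOTHETICAL_RISK = 0.01  # assumed for illustration only

for n in (20, 200, 2000):
    print(f"n = {n}: P(no cases at all) = {prob_no_cases(n, HYPOTHETICAL_RISK):.3f}")
```

Under these assumptions a study of 20 people sees no cases most of the time, so failing to reject the null hypothesis says very little, whereas a study of 2000 people would almost never come up empty.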
Data dredging
Data dredging is an abuse of data mining. In data dredging, large compilations of data are examined in order to find a correlation, without any pre-defined choice of a hypothesis to be tested. Since the confidence level required to establish a relationship between two parameters is usually chosen to be 95% (meaning that a relationship this strong would arise by chance alone only 5% of the time if no real relationship existed), there is thus a 5% chance of finding an apparently significant correlation between any two sets of completely random variables. Given that data dredging efforts typically examine large datasets with many variables, and hence even larger numbers of pairs of variables, spurious but apparently statistically significant results are almost certain to be found by any such study.
Note that data dredging is a valid way of finding a possible hypothesis but that hypothesis must then be tested with data not used in the original dredging. The misuse comes in when that hypothesis is stated as fact without further validation.
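The scale of the problem is easy to estimate: the number of variable pairs grows quadratically with the number of variables, and each pair has about a 5% chance of looking significant by accident. A rough sketch, with a hypothetical function name and the 5% level taken as an assumption:

```python
from math import comb


def expected_spurious_pairs(n_variables, alpha=0.05):
    """Expected number of falsely 'significant' pairs among unrelated variables."""
    return comb(n_variables, 2) * alpha


for k in (5, 20, 100):
    print(f"{k} variables: {comb(k, 2)} pairs, "
          f"about {expected_spurious_pairs(k):.1f} spurious correlations expected")
```

The tests on overlapping pairs are not independent of one another, so this is an expected count rather than a full probability model, but it shows why dredged correlations need confirmation on fresh data.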
Data manipulation
Informally called "fudging the data," this practice includes selective reporting (see also publication bias) and even simply making up false data.
Examples of selective reporting abound. The easiest and most common examples involve choosing a group of results that follow a pattern consistent with the preferred hypothesis while ignoring other results or "data runs" that contradict the hypothesis.
Psychic researchers have long disputed studies showing people with ESP ability. Critics accuse ESP proponents of only publishing experiments with positive results and shelving those that show negative results. A "positive result" is a test run (or data run) in which the subject guesses a hidden card, etc., at a much higher frequency than random chance.
The deception involved in both cases is that the hypothesis is not confirmed by the totality of the experiments - only by a tiny, selected group of "successful" tests.
Scientists, in general, question the validity of study results that cannot be reproduced by other investigators. However, some scientists refuse to publish their data and methods.
Other fallacies
- N = 1 fallacy
Also, the post facto fallacy assumes that an event for which a future likelihood can be measured had the same likelihood of happening once it has already occurred. Thus, if someone has already tossed 9 coins and each has come up heads, people tend to assume that the likelihood of a tenth toss also being heads is 1023 to 1 against (which were the odds against ten heads in a row before the first coin was tossed), when in fact the chance of the tenth head is 1 to 1. This error has led, in the UK, to the false imprisonment of women for murder when the courts were given the prior statistical likelihood of a woman's 3 children dying from Sudden Infant Death Syndrome as being the chances that their already dead children died from the syndrome. Roy Meadow stated that the chances they had died of Sudden Infant Death Syndrome were millions to one against, and convictions were then handed down in spite of the statistical inevitability that a few women would suffer this tragedy. Meadow was subsequently struck off the U.K. Medical Register for giving "erroneous" and "misleading" evidence, although this was later reversed by the courts.
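The coin-toss figures can be verified with exact fractions. This sketch assumes a fair coin and independent tosses; the function names are illustrative.

```python
from fractions import Fraction

P_HEADS = Fraction(1, 2)  # fair coin, assumed


def odds_against_all_heads(n_tosses):
    """Odds against n heads in a row, judged before any coin is tossed."""
    p = P_HEADS ** n_tosses
    return (1 - p) / p


def prob_next_head_given_history(n_previous_heads):
    """Chance of heads on the next toss; past tosses do not change it."""
    return P_HEADS


print(odds_against_all_heads(10))        # odds against ten heads in a row
print(prob_next_head_given_history(9))   # chance of the tenth head alone
```

The first number describes the whole sequence before it starts; the second describes only the remaining toss, which is the relevant quantity once nine heads have already been observed.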