Testing hypotheses suggested by the data

In statistics, hypotheses suggested by the data, if tested using the data set that suggested them, are likely to be accepted even when they are not true. This is because circular reasoning would be involved: something seems true in the limited data set, therefore we hypothesize that it is true in general, therefore we (wrongly) test it on the same limited data set, which seems to confirm that it is true. Generating hypotheses based on data already observed, in the absence of testing them on new data, is referred to as post hoc theorizing.

The correct procedure is to test any hypothesis on a data set that was not used to generate the hypothesis.

Example of fallacious acceptance of a hypothesis

Suppose fifty different researchers, unaware of each other's work, run clinical trials to test whether Vitamin X is efficacious in treating cancer. Forty-nine of them find no significant differences between measurements done on patients who have taken Vitamin X and those who have taken a placebo. The fiftieth study finds a big difference, but the difference is of a size that one would expect to see in about one of every fifty studies even if Vitamin X has no effect at all, just due to chance (with patients who were going to get better anyway disproportionately ending up in the Vitamin X group instead of the control group, which can happen since the entire population of cancer patients cannot be included in the study). When all fifty studies are pooled, one would say no effect of Vitamin X was found, because the positive result was not more frequent than chance, i.e. it was not statistically significant. However, it would be reasonable for the investigators running the fiftieth study to consider it likely that they have found an effect, at least until they learn of the other forty-nine studies. Now suppose that the one anomalous study was in Denmark. The data suggest a hypothesis that Vitamin X is more efficacious in Denmark than elsewhere. But Denmark was by chance the one-in-fifty in which an extreme value of the test statistic happened; one expects such extreme cases one time in fifty on average if no effect is present. It would therefore be fallacious to cite the data as serious evidence for this particular hypothesis suggested by the data.

However, if another study is then done in Denmark and again finds a difference between the vitamin and the placebo, then the first study strengthens the case provided by the second study. Or, if a second series of studies is done on fifty countries, and Denmark stands out in the second study as well, the two series together constitute important evidence even though neither by itself is at all impressive.
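
The arithmetic behind the fifty-study example can be sketched in a few lines of Python. The sketch assumes the studies are independent and that each has a 1-in-50 chance of producing an extreme result when Vitamin X has no effect; the function names and the number of simulated repetitions are illustrative choices, not part of the original example.

```python
import random


def prob_at_least_one_false_positive(n_studies: int, alpha: float) -> float:
    """Chance that at least one of n independent null studies looks extreme."""
    return 1.0 - (1.0 - alpha) ** n_studies


def simulate(n_studies: int, alpha: float, n_repeats: int, seed: int = 0) -> float:
    """Monte Carlo estimate of the same probability.

    Under the null hypothesis a p-value is uniform on [0, 1], so each study
    is modelled (as an assumption) by one uniform draw compared with alpha.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_repeats):
        if any(rng.random() < alpha for _ in range(n_studies)):
            hits += 1
    return hits / n_repeats


exact = prob_at_least_one_false_positive(50, 1 / 50)
estimate = simulate(50, 1 / 50, n_repeats=20_000)
print(f"exact: {exact:.3f}, simulated: {estimate:.3f}")
```

Under these assumptions the exact probability is about 0.64, so it is more likely than not that some country will play the role of Denmark even if Vitamin X does nothing.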

The general problem

Testing a hypothesis suggested by the data can very easily result in false positives (type I errors). If one looks long enough and in enough different places, eventually data can be found to support any hypothesis. Unfortunately, these positive data do not by themselves constitute evidence that the hypothesis is correct. The negative test data that were thrown out are just as important, because they give one an idea of how common the positive results are compared to chance. Running an experiment, seeing a pattern in the data, proposing a hypothesis from that pattern, then using the same experimental data as evidence for the new hypothesis is extremely suspect, because data from all other experiments, completed or potential, have essentially been "thrown out" by choosing to look only at the experiments that suggested the new hypothesis in the first place.
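
A small simulation can make the size of the problem concrete. The sketch below is an illustration under stated assumptions, not a prescribed method: fifty hypothetical countries each run a null trial (both arms drawn from the same standard normal distribution, with the variance treated as known), the most extreme country is selected, and its result is then "tested" either on the data that selected it or on a fresh trial. All names and parameter values are invented for the example.

```python
import math
import random
from statistics import NormalDist, fmean


def z_statistic(treated, control):
    """Two-sample z statistic for equal-sized arms, assuming known unit variance."""
    n = len(treated)
    return (fmean(treated) - fmean(control)) / math.sqrt(2.0 / n)


def run_trial(rng, n_patients):
    """One null trial: both arms come from the same N(0, 1) distribution."""
    treated = [rng.gauss(0.0, 1.0) for _ in range(n_patients)]
    control = [rng.gauss(0.0, 1.0) for _ in range(n_patients)]
    return z_statistic(treated, control)


def false_positive_rates(n_countries=50, n_patients=20, n_repeats=400,
                         alpha=0.05, seed=1):
    """Return (rate when testing on the selecting data, rate on fresh data)."""
    rng = random.Random(seed)
    critical = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided cutoff
    same_data_hits = 0
    new_data_hits = 0
    for _ in range(n_repeats):
        # Exploratory phase: look at every country and keep the most extreme.
        z_values = [run_trial(rng, n_patients) for _ in range(n_countries)]
        best_z = max(z_values, key=abs)
        # Fallacious: test the selected country on the data that selected it.
        if abs(best_z) > critical:
            same_data_hits += 1
        # Sound: run a fresh trial in the selected country and test that.
        if abs(run_trial(rng, n_patients)) > critical:
            new_data_hits += 1
    return same_data_hits / n_repeats, new_data_hits / n_repeats


same_rate, new_rate = false_positive_rates()
print(f"same data: {same_rate:.2f}, fresh data: {new_rate:.2f}")
```

With fifty independent null trials, the chance that the most extreme one passes a two-sided 5% test is 1 - 0.95^50, roughly 0.92, whereas a fresh trial in the selected country should pass only about 5% of the time. The simulated rates should land near those two values.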

A large set of tests as described above greatly inflates the probability of type I error, because all but the data most favorable to the hypothesis are discarded. This is a risk not only in hypothesis testing but in all statistical inference, as it is often problematic to accurately describe the process that has been followed in searching and discarding data. In other words, one wants to keep all data (regardless of whether they tend to support or refute the hypothesis) from "good tests", but it is sometimes difficult to figure out what a "good test" is. It is a particular problem in statistical modelling, where many different models are rejected by trial and error before publishing a result (see also overfitting, publication bias).

The error is particularly prevalent in data mining and machine learning. It also commonly occurs in academic publishing, where only reports of positive, rather than negative, results tend to be accepted, resulting in the effect known as publication bias.

Correct procedures

All strategies for sound testing of hypotheses suggested by the data involve including a wider range of tests in an attempt to validate or refute the new hypothesis. These include:

Collecting confirmation samples
Cross-validation
Methods of compensation for multiple comparisons
Simulation studies including adequate representation of the multiple-testing actually involved
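
As one illustration of compensating for multiple comparisons, the sketch below applies the Bonferroni correction, which is only one of several such methods and is chosen here for its simplicity. The p-values are hypothetical numbers invented to mirror the fifty-study example (one study at p = 0.02, the rest unremarkable), and the function name is an assumption of this sketch.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject hypothesis i only if p_i <= alpha / m, where m is the number of tests."""
    m = len(p_values)
    threshold = alpha / m
    return [p <= threshold for p in p_values]


# Hypothetical p-values: one "Denmark" study and forty-nine unremarkable ones.
p_values = [0.02] + [0.5] * 49

uncorrected = [p <= 0.05 for p in p_values]
corrected = bonferroni_reject(p_values, alpha=0.05)
print(sum(uncorrected), sum(corrected))  # prints "1 0"
```

Judged alone, the p = 0.02 study passes a 5% test, but once all fifty tests are accounted for the threshold drops to 0.05 / 50 = 0.001 and nothing is declared significant.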

Henry Scheffé's simultaneous test of all contrasts in multiple comparison problems is the best-known remedy in the case of analysis of variance. It is a method designed for testing hypotheses suggested by the data while avoiding the fallacy described above.
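
For reference, the usual textbook form of Scheffé's criterion for a one-way analysis of variance can be written as follows. The notation is an assumption of this sketch: k groups with means \mu_i, sample means \bar{y}_i and sizes n_i, N observations in total, MSE the within-group mean square, and F_{\alpha; k-1, N-k} the upper \alpha quantile of the F distribution.

```latex
\[
\psi = \sum_{i=1}^{k} c_i \mu_i, \qquad \sum_{i=1}^{k} c_i = 0, \qquad
\hat{\psi} = \sum_{i=1}^{k} c_i \bar{y}_i
\]
\[
\text{reject } H_0\colon \psi = 0 \quad \text{if} \quad
\frac{|\hat{\psi}|}{\sqrt{\mathrm{MSE} \sum_{i=1}^{k} c_i^{2} / n_i}}
> \sqrt{(k-1)\, F_{\alpha;\, k-1,\, N-k}}
\]
```

Because the critical value holds for every contrast simultaneously, a contrast chosen after inspecting the group means is still tested with a family-wise type I error rate of at most \alpha.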
