P-value
In statistical significance testing, the p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. One often "rejects the null hypothesis" when the p-value is less than the significance level α (Greek alpha), which is often 0.05 or 0.1. When the null hypothesis is rejected, the result is said to be statistically significant.
A closely related concept is the E-value: the expected number of times in multiple testing that one would obtain a test statistic at least as extreme as the one actually observed, if the null hypothesis were true. The E-value is the product of the number of tests and the p-value.
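For example, if 100 tests are performed and one of them yields p = 0.02, the corresponding E-value is 100 × 0.02 = 2: under the null hypothesis, about two results at least that extreme would be expected to arise by chance alone across the 100 tests.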
Although there is often confusion, the p-value is not the probability of the null hypothesis being true, nor is the p-value the same as the Type I error rate, α.
Coin flipping example
For example, an experiment is performed to determine whether a coin flip is fair (50% chance, each, of landing heads or tails) or unfairly biased (≠ 50% chance of one of the outcomes).
Suppose that the experimental results show the coin turning up heads 14 times out of 20 total flips. The p-value of this result would be the chance of a fair coin landing on heads at least 14 times out of 20 flips. The probability that 20 flips of a fair coin would result in 14 or more heads can be computed from binomial coefficients as

$$\Pr(\text{14 or more heads out of 20}) = \sum_{k=14}^{20} \binom{20}{k} \left(\frac{1}{2}\right)^{20} = \frac{60{,}460}{1{,}048{,}576} \approx 0.058.$$
This probability is the (one-sided) p-value. It measures the chance that a fair coin would give a result at least this extreme.
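As a concrete check, here is a minimal Python sketch that evaluates this sum exactly using integer binomial coefficients (the function name is illustrative):

```python
from math import comb  # integer binomial coefficients (Python 3.8+)

def one_sided_p_value(heads: int, flips: int) -> float:
    """Chance of at least `heads` heads in `flips` flips of a fair coin."""
    favorable = sum(comb(flips, k) for k in range(heads, flips + 1))
    return favorable / 2 ** flips

print(one_sided_p_value(14, 20))  # ≈ 0.058
```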
Interpretation
Traditionally, one rejects the null hypothesis if the p-value is smaller than or equal to the significance level, often represented by the Greek letter α (alpha). (Greek α is also used for the Type I error rate; the connection is that a hypothesis test that rejects the null hypothesis for all samples with a p-value less than α will have a Type I error rate of α.) A significance level of 0.05 would deem as extraordinary any result that is within the most extreme 5% of all possible results under the null hypothesis. In this case a p-value less than 0.05 would result in the rejection of the null hypothesis at the 5% (significance) level.
When we ask whether a given coin is fair, often we are interested in the deviation of our result from the equality of numbers of heads and tails. In this case, the deviation can be in either direction, favoring either heads or tails. Thus, in this example of 14 heads and 6 tails, we may want to calculate the probability of getting a result deviating by at least 4 from parity in either direction (a two-sided test). This is the probability of getting at least 14 heads or at least 14 tails. As the binomial distribution is symmetrical for a fair coin, the two-sided p-value is simply twice the above calculated single-sided p-value; i.e., the two-sided p-value is 0.115.
In the above example we thus have:
- null hypothesis (H0): fair coin; P(heads) = 0.5
- observation O: 14 heads out of 20 flips; and
- p-value of observation O given H0 = Prob(≥ 14 heads or ≥ 14 tails) = 0.115.
The calculated p-value exceeds 0.05, so the observation is consistent with the null hypothesis — that the observed result of 14 heads out of 20 flips can be ascribed to chance alone — as it falls within the range of what would happen 95% of the time were the coin in fact fair. In our example, we fail to reject the null hypothesis at the 5% level. Although the coin did not fall evenly, the deviation from expected outcome is small enough to be consistent with chance.
However, had one more head been obtained, the resulting p-value (two-tailed) would have been 0.0414 (4.14%). This time the null hypothesis – that the observed result of 15 heads out of 20 flips can be ascribed to chance alone – is rejected when using a 5% cut-off.
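A short sketch of this decision rule, doubling the one-sided tail probability by symmetry (assuming, as above, a 5% cut-off):

```python
from math import comb

ALPHA = 0.05  # significance level, chosen before seeing the data

def two_sided_p_value(heads: int, flips: int) -> float:
    """Two-sided p-value for a fair coin: twice the more extreme tail."""
    tail = max(heads, flips - heads)  # deviation direction actually observed
    one_sided = sum(comb(flips, k) for k in range(tail, flips + 1)) / 2 ** flips
    return min(1.0, 2 * one_sided)

for heads in (14, 15):
    p = two_sided_p_value(heads, 20)
    verdict = "reject H0" if p <= ALPHA else "fail to reject H0"
    print(f"{heads} heads out of 20: p = {p:.4f} -> {verdict}")
# 14 heads out of 20: p = 0.1153 -> fail to reject H0
# 15 heads out of 20: p = 0.0414 -> reject H0
```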
To understand both the original purpose of the p-value p and the reasons p is so often misinterpreted, it helps to know that p constitutes the main result of statistical significance testing (not to be confused with hypothesis testing), popularized by Ronald A. Fisher. Fisher promoted this testing as a method of statistical inference. To call this testing inferential is misleading, however, since inference makes statements about general hypotheses based on observed data, such as the post-experimental probability a hypothesis is true. As explained above, p is instead a statement about data assuming the null hypothesis; consequently, indiscriminately considering p as an inferential result can lead to confusion, including many of the misinterpretations noted in the next section.
On the other hand, Bayesian inference, the main alternative to significance testing, generates probabilistic statements about hypotheses based on data (and a priori estimates), and therefore truly constitutes inference. Bayesian methods can, for instance, calculate the probability that the null hypothesis H0 above is true, assuming an a priori estimate of the probability that a coin is unfair. Since a priori we would be quite surprised that a coin could consistently give 75% heads, a Bayesian analysis would find the null hypothesis (that the coin is fair) quite probable even if a test gave 15 heads out of 20 tries (which, as we saw above, is considered a "significant" result at the 5% level according to its p-value).
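To make the contrast concrete, here is a minimal Bayesian sketch, assuming for illustration a prior probability of 0.9 that the coin is fair and, if it is not fair, a bias drawn uniformly from [0, 1] (both assumptions are chosen here for illustration, not dictated by the example):

```python
from math import comb

def posterior_prob_fair(heads: int, flips: int, prior_fair: float = 0.9) -> float:
    """Posterior P(coin is fair | data) under an illustrative model:
    prior_fair on H0 (bias = 1/2); otherwise bias ~ Uniform(0, 1)."""
    # Likelihood of the observed count under H0: Binomial(flips, 1/2).
    like_fair = comb(flips, heads) / 2 ** flips
    # Marginal likelihood under the uniform alternative:
    # integral over [0, 1] of C(n, h) p^h (1-p)^(n-h) dp = 1 / (n + 1).
    like_unfair = 1 / (flips + 1)
    numerator = prior_fair * like_fair
    return numerator / (numerator + (1 - prior_fair) * like_unfair)

print(f"P(fair | 15 heads of 20) = {posterior_prob_fair(15, 20):.2f}")  # 0.74
```

Even though the two-sided p-value for 15 heads (0.0414) is "significant" at the 5% level, this posterior still favors the fair coin; the divergence between the two numbers is the pattern behind the Jeffreys–Lindley paradox discussed below.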
Strictly speaking, then, p is a statement about data rather than about any hypothesis, and hence it is not inferential. This raises the question, though, of how science has been able to advance using significance testing. The reason is that, in many situations, p approximates some useful post-experimental probabilities about hypotheses, such as the post-experimental probability of the null hypothesis. When this approximation holds, it could help a researcher to judge the post-experimental plausibility of a hypothesis.
Even so, this approximation does not eliminate the need for caution in interpreting p inferentially, as shown in the Jeffreys–Lindley paradox mentioned below.
Misunderstandings
The data obtained by comparing the p-value to a significance level will yield one of two results: either the null hypothesis is rejected, or the null hypothesis cannot be rejected at that significance level (which, however, does not imply that the null hypothesis is true). A small p-value that indicates statistical significance does not indicate that an alternative hypothesis is ipso facto correct.
Despite the ubiquity of p-value tests, this particular test for statistical significance has come under heavy criticism due both to its inherent shortcomings and the potential for misinterpretation.
There are several common misunderstandings about p-values.
- The p-value is not the probability that the null hypothesis is true.
In fact, frequentist statistics does not, and cannot, attach probabilities to hypotheses. Comparison of Bayesian and classical approaches shows that a p-value can be very close to zero while the posterior probability of the null is very close to unity (if there is no alternative hypothesis with a large enough a priori probability that would explain the results more easily). This is the Jeffreys–Lindley paradox.
- The p-value is not the probability that a finding is "merely a fluke."
As the calculation of a p-value is based on the assumption that a finding is the product of chance alone, it patently cannot also be used to gauge the probability of that assumption being true. This is different from the real meaning, which is that the p-value is the chance of obtaining the observed results if the null hypothesis is true.
- The p-value is not the probability of falsely rejecting the null hypothesis. This error is a version of the so-called prosecutor's fallacy.
- The p-value is not the probability that a replicating experiment would not yield the same conclusion.
- 1 − (p-value) is not the probability of the alternative hypothesis being true (see the first point above).
- The significance level of the test is not determined by the p-value.
The significance level of a test is a value that should be decided upon by the agent interpreting the data before the data are viewed, and is compared against the p-value, or any other statistic, calculated after the test has been performed. (However, reporting a p-value is more useful than simply saying that the results were or were not significant at a given level, as it allows readers to decide for themselves whether to consider the results significant.)
- The p-value does not indicate the size or importance of the observed effect (compare with effect size). The two do vary together, however: the larger the effect, the smaller the sample size that will be required to obtain a significant p-value.
Problems
Critics of p-values point out that the criterion used to decide "statistical significance" is based on the somewhat arbitrary choice of level (often set at 0.05). If significance testing is applied to hypotheses that are known to be false in advance, a non-significant result will simply reflect an insufficient sample size. Another problem is that the definition of "more extreme" data depends on the intentions of the investigator; for example, the situation in which the investigator flips the coin 100 times has a set of extreme data that is different from the situation in which the investigator continues to flip the coin until 50 heads are achieved (see the sketch below).

As noted above, the p-value p is the main result of statistical significance testing. Fisher proposed p as an informal measure of evidence against the null hypothesis. He called on researchers to combine p in the mind with other types of evidence for and against that hypothesis, such as the a priori plausibility of the hypothesis and the relative strengths of results from previous studies. Many misunderstandings concerning p arise because statistics classes and instructional materials ignore or at least do not emphasize the role of prior evidence in interpreting p. A renewed emphasis on prior evidence could encourage researchers to place p in the proper context, evaluating a hypothesis by weighing p together with all the other evidence about it.
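The stopping-rule point can be made concrete with the coin example. The following minimal sketch computes one-sided p-values for the same data, 14 heads and 6 tails, under two sampling plans: flip exactly 20 times, or flip until the 6th tail appears. It uses the fact that "at least 14 heads before the 6th tail" is equivalent to "at most 5 tails among the first 19 flips":

```python
from math import comb

# Sampling plan 1: flip exactly 20 times (binomial model).
# "At least as extreme" means 14 or more heads.
p_fixed_n = sum(comb(20, k) for k in range(14, 21)) / 2 ** 20

# Sampling plan 2: flip until the 6th tail (negative binomial model).
# "At least as extreme" means 14 or more heads before the 6th tail,
# i.e. at most 5 tails among the first 19 flips.
p_until_tails = sum(comb(19, t) for t in range(0, 6)) / 2 ** 19

print(f"flip exactly 20 times:   p = {p_fixed_n:.4f}")      # 0.0577
print(f"flip until the 6th tail: p = {p_until_tails:.4f}")  # 0.0318
```

The same 14 heads are significant at the one-sided 5% level under one sampling plan but not the other, even though the observed flips are identical.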
See also
- Binomial test
- Counternull
- Fisher's method
- Generalized p-value
- p-rep
- Statistical hypothesis testing
- T-value
- Bayesian inference, use of prior estimates
- False discovery rate
Further reading
- Dallal GE (2007) Historical background to the origins of p-values and the choice of 0.05 as the cut-off for significance
- Hubbard R, Armstrong JS (2005) Historical background on the widespread confusion of the p-value (PDF)
- Fisher's method for combining independent tests of significance using their p-values
- Dallal GE (2007) The Little Handbook of Statistical Practice (a tutorial)
- 12 Misconceptions, a good overview given in the following article
External links
- Free online p-value calculators for various specific tests (chi-square, Fisher's F-test, etc.).
- Understanding P-values, including a Java applet that illustrates how the numerical values of p-values can give quite misleading impressions about the truth or falsity of the hypothesis under test.