Fisher's exact test
Fisher's exact test is a statistical significance test used in the analysis of contingency tables where sample sizes are small. It is named after its inventor, R. A. Fisher, and is one of a class of exact tests, so called because the significance of the deviation from a null hypothesis can be calculated exactly, rather than relying on an approximation that becomes exact in the limit as the sample size grows to infinity, as with many statistical tests. Fisher is said to have devised the test following a comment from Dr Muriel Bristol, who claimed to be able to detect whether the tea or the milk was added first to her cup; see lady tasting tea.
Purpose and scope
The test is useful for categorical data that result from classifying objects in two different ways; it is used to examine the significance of the association (contingency) between the two kinds of classification. So in Fisher's original example, one criterion of classification could be whether milk or tea was put in the cup first; the other could be whether Dr Bristol thinks that the milk or tea was put in first. We want to know whether these two classifications are associated – that is, whether Dr Bristol really can tell whether milk or tea was poured in first. Most uses of the Fisher test involve, like this example, a 2 × 2 contingency table. The p-value from the test is computed as if the margins of the table are fixed, i.e. as if, in the tea-tasting example, Dr Bristol knows the number of cups with each treatment (milk or tea first) and will therefore provide guesses with the correct number in each category. As pointed out by Fisher, this leads under a null hypothesis of independence to a hypergeometric distribution of the numbers in the cells of the table.
With large samples, a chi-squared test can be used in this situation. However, the significance value it provides is only an approximation, because the sampling distribution of the test statistic that is calculated is only approximately equal to the theoretical chi-squared distribution. The approximation is inadequate when sample sizes are small, or the data are very unequally distributed among the cells of the table, resulting in the cell counts predicted on the null hypothesis (the "expected values") being low. The usual rule of thumb for deciding whether the chi-squared approximation is good enough is that the chi-squared test is not suitable when the expected values in any of the cells of a contingency table are below 5, or below 10 when there is only one degree of freedom (this rule is now known to be overly conservative). In fact, for small, sparse, or unbalanced data, the exact and asymptotic p-values can be quite different and may lead to opposite conclusions concerning the hypothesis of interest. In contrast, the Fisher test is, as its name states, exact as long as the experimental procedure keeps the row and column totals fixed, and it can therefore be used regardless of the sample characteristics. It becomes difficult to calculate with large samples or well-balanced tables, but fortunately these are exactly the conditions where the chi-squared test is appropriate.
For hand calculations, the test is only feasible in the case of a 2 × 2 contingency table. However, the principle of the test can be extended to the general case of an m × n table, and some statistical packages provide a calculation (sometimes using a Monte Carlo method to obtain an approximation) for the more general case.
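The general m × n case requires the multivariate hypergeometric distribution, but the Monte Carlo idea is easy to illustrate for a 2 × 2 table: hold the margins fixed by randomly permuting one of the two classifications, rebuild the table, and count how often a table at least as improbable as the observed one occurs. The sketch below is an illustration of the idea, not any particular package's algorithm; `table_probability` and `monte_carlo_p` are our own helper names, and the input values come from the dieting example in the next section.

```python
import random
from math import comb

# Exact probability of a single 2x2 table with cells a, b, c, d,
# conditional on its margins (the hypergeometric formula given later).
def table_probability(a, b, c, d):
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)

def monte_carlo_p(a, b, c, d, trials=100_000, seed=0):
    rng = random.Random(seed)
    row1, col1, n = a + b, a + c, a + b + c + d
    p_obs = table_probability(a, b, c, d)
    labels = [1] * col1 + [0] * (n - col1)   # one classification per object
    hits = 0
    for _ in range(trials):
        rng.shuffle(labels)                  # permute under the null hypothesis
        x = sum(labels[:row1])               # cell a of the permuted table
        if table_probability(x, row1 - x, col1 - x, n - row1 - col1 + x) <= p_obs:
            hits += 1
    return hits / trials                     # approximates the two-sided p-value

print(monte_carlo_p(1, 9, 11, 3))  # approaches the exact value of about 0.0028
```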
Example
For example, a sample of teenagers might be divided into male and female on the one hand, and those that are and are not currently dieting on the other. We hypothesize, for example, that the proportion of dieting individuals is higher among the women than among the men, and we want to test whether any difference of proportions that we observe is significant. The data might look like this:

|             | Men | Women | Row total |
| ----------- | --- | ----- | --------- |
| Dieting     | 1   | 9     | 10        |
| Non-dieting | 11  | 3     | 14        |
| Col. total  | 12  | 12    | 24        |
These data would not be suitable for analysis by a chi-squared test, because the expected values in the table are all below 10, and in a 2 × 2 contingency table, the number of degrees of freedom is always 1.
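To see the difference in practice, one can compare the approximate and the exact p-value with SciPy; this comparison is our illustration, not part of the original example, and note that `chi2_contingency` applies Yates' continuity correction to 2 × 2 tables by default.

```python
from scipy.stats import chi2_contingency, fisher_exact

table = [[1, 9], [11, 3]]  # the dieting data above

chi2, p_approx, dof, expected = chi2_contingency(table)  # Yates-corrected by default
odds_ratio, p_exact = fisher_exact(table, alternative="two-sided")

print(expected)           # expected counts are 5 and 7 -- all below 10
print(p_approx, p_exact)  # approximate ~0.0037 vs exact ~0.0028
```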
The question we ask about these data is: knowing that 10 of these 24 teenagers are dieters, and that 12 of the 24 are female, what is the probability that these 10 dieters would be so unevenly distributed between the women and the men? If we were to choose 10 of the teenagers at random, what is the probability that 9 of them would be among the 12 women, and only 1 from among the 12 men?
Before we proceed with the Fisher test, we first introduce some notation. We represent the cells by the letters a, b, c and d, call the totals across rows and columns marginal totals, and represent the grand total by n. So the table now looks like this:
|             | Men   | Women | Total               |
| ----------- | ----- | ----- | ------------------- |
| Dieting     | a     | b     | a + b               |
| Non-dieting | c     | d     | c + d               |
| Totals      | a + c | b + d | a + b + c + d (= n) |
Fisher showed that the probability of obtaining any such set of values was given by the hypergeometric distribution:

$$p = \frac{\dbinom{a+b}{a}\dbinom{c+d}{c}}{\dbinom{n}{a+c}} = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,n!}$$

where $\tbinom{n}{k}$ is the binomial coefficient and the symbol ! indicates the factorial operator.
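As a quick check of the formula, here is a minimal sketch in Python (the function name `table_probability` is ours, not from any package), applied to the observed table with a = 1, b = 9, c = 11, d = 3:

```python
from math import comb

# Hypergeometric probability of a 2x2 table with cells a, b, c, d,
# conditional on its marginal totals, exactly as in the formula above.
def table_probability(a, b, c, d):
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)

print(f"{table_probability(1, 9, 11, 3):.6f}")  # 0.001346
```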
This formula gives the exact probability of observing this particular arrangement of the data, assuming the given marginal totals, on the null hypothesis that men and women are equally likely to be dieters. To put it another way, if we assume that the probability that a man is a dieter is p, the probability that a woman is a dieter is p, and we assume that both men and women enter our sample independently of whether or not they are dieters, then this hypergeometric formula gives the conditional probability of observing the values a, b, c, d in the four cells, conditionally on the observed marginals. We can calculate the exact probability of any arrangement of the 24 teenagers into the four cells of the table, but Fisher showed that to generate a significance level, we need consider only the cases where the marginal totals are the same as in the observed table, and among those, only the cases where the arrangement is as extreme as the observed arrangement, or more so. (Barnard's test relaxes this constraint on one set of the marginal totals.) In the example, there are 11 such cases. Of these only one is more extreme in the same direction as our data; it looks like this:
|             | Men | Women | Total |
| ----------- | --- | ----- | ----- |
| Dieting     | 0   | 10    | 10    |
| Non-dieting | 12  | 2     | 14    |
| Totals      | 12  | 12    | 24    |
In order to calculate the significance of the observed data, i.e. the total probability of observing data as extreme or more extreme if the null hypothesis is true, we have to calculate the values of p for both these tables, and add them together. This gives a one-tailed test; for a two-tailed test we must also consider tables that are equally extreme, but in the opposite direction. Unfortunately, classification of the tables according to whether or not they are 'as extreme' is problematic. An approach used in the R statistical system is to compute the p-value by summing the probabilities for all tables with probabilities less than or equal to that of the observed table. For tables with small counts, the two-sided p-value can differ substantially from twice the one-sided value, unlike the case with test statistics that have a symmetric sampling distribution.
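A short enumeration makes both rules concrete. Since the margins determine the whole table once one cell is fixed, we can walk over all 11 admissible tables; the helper names below are our own, and the two-sided rule is the probability-based one just described:

```python
from math import comb

def table_probability(a, b, c, d):
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)

def fisher_p_values(a, b, c, d):
    row1, col1, n = a + b, a + c, a + b + c + d
    p_obs = table_probability(a, b, c, d)
    one_sided = two_sided = 0.0
    # Cell a alone determines the table; walk over its admissible range.
    for x in range(max(0, col1 - (n - row1)), min(row1, col1) + 1):
        p = table_probability(x, row1 - x, col1 - x, n - row1 - col1 + x)
        if x <= a:      # observed direction: here a lies below its expected count
            one_sided += p
        if p <= p_obs:  # all tables no more probable than the observed one
            two_sided += p
    return one_sided, two_sided

print(fisher_p_values(1, 9, 11, 3))  # about (0.00138, 0.00276)
```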
As noted above, most modern statistical packages will calculate the significance of Fisher tests, in some cases even where the chi-squared approximation would also be acceptable. The actual computations as performed by statistical software packages will as a rule differ from those described above, because numerical difficulties may result from the large values taken by the factorials. A simple, somewhat better computational approach relies on a gamma function or log-gamma function, but methods for accurate computation of hypergeometric and binomial probabilities remain an active research area.
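For instance, one can evaluate the probability through log-factorials via the log-gamma function, which avoids overflowing intermediate factorials; this is a sketch of the general idea rather than what any particular package does, and `table_probability_stable` is our own name:

```python
from math import lgamma, exp

def log_factorial(k):
    # lgamma(k + 1) == log(k!) without ever forming the huge integer k!.
    return lgamma(k + 1)

def table_probability_stable(a, b, c, d):
    n = a + b + c + d
    log_p = (log_factorial(a + b) + log_factorial(c + d)
             + log_factorial(a + c) + log_factorial(b + d)
             - log_factorial(a) - log_factorial(b)
             - log_factorial(c) - log_factorial(d)
             - log_factorial(n))
    return exp(log_p)

print(f"{table_probability_stable(1, 9, 11, 3):.6f}")  # 0.001346, as before
```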
Controversies
Despite the fact that Fisher's test gives exact p-values, some authors have argued that it is conservative, i.e. that its actual rejection rate is below the nominal significance level. The apparent contradiction stems from the combination of a discrete statistic with fixed significance levels. To be more precise, consider the following proposal for a significance test at the 5% level: reject the null hypothesis for each table to which Fisher's test assigns a p-value equal to or smaller than 5%. Because the set of all tables is discrete, there may not be a table for which equality is achieved. If $\alpha_e$ is the largest p-value smaller than 5% which can actually occur for some table, then the proposed test effectively tests at the $\alpha_e$-level. For small sample sizes, $\alpha_e$ might be significantly lower than 5%. While this effect occurs for any discrete statistic (not just in contingency tables, or for Fisher's test), it has been argued that the problem is compounded by the fact that Fisher's test conditions on the marginals. To avoid the problem, many authors discourage the use of fixed significance levels when dealing with discrete problems.

Another early discussion revolved around the necessity to condition on the marginals. Fisher's test gives exact p-values both for fixed and for random marginals. Other tests, most prominently Barnard's, require random marginals. Some authors (including, later, Barnard himself) have criticized Barnard's test based on this property. They argue that the marginal totals are an (almost) ancillary statistic, containing (almost) no information about the tested property.
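The discreteness effect is easy to exhibit numerically. For the margins of the dieting example, only a handful of two-sided p-values are attainable at all, so a nominal 5% test actually rejects at the largest attainable p-value below 5%. The sketch below (again with our own helper names, and using the probability-based two-sided rule described earlier) finds that effective level:

```python
from math import comb

def table_probability(a, b, c, d):
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)

def effective_level(row1, col1, n, alpha=0.05):
    xs = range(max(0, col1 - (n - row1)), min(row1, col1) + 1)
    probs = [table_probability(x, row1 - x, col1 - x, n - row1 - col1 + x)
             for x in xs]
    # Every attainable two-sided p-value for tables with these margins.
    attainable = {sum(q for q in probs if q <= p) for p in probs}
    return max((p for p in attainable if p <= alpha), default=0.0)

print(effective_level(10, 12, 24))  # about 0.036 rather than the nominal 0.05
```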
Alternatives
An alternative exact test, Barnard's exact test, has been developed, and proponents of it suggest that this method is more powerful, particularly in 2 × 2 tables. Another alternative is to use maximum likelihood estimates to calculate a p-value from the exact binomial or multinomial distributions and accept or reject based on the p-value.
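Both tests are available in SciPy (`barnard_exact` was added in version 1.7, to the best of our knowledge), so a practical comparison on the dieting table looks like this:

```python
from scipy.stats import fisher_exact, barnard_exact

table = [[1, 9], [11, 3]]

_, p_fisher = fisher_exact(table, alternative="two-sided")
result = barnard_exact(table, alternative="two-sided")

# Barnard's test does not condition on both sets of margins and is often
# the more powerful of the two for 2x2 tables.
print(p_fisher, result.pvalue)
```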
See also
- Barnard's exact test