Odds ratio
The odds ratio is a measure of effect size, describing the strength of association or non-independence between two binary data values. It is used as a descriptive statistic, and plays an important role in logistic regression. Unlike other measures of association for paired binary data such as the relative risk, the odds ratio treats the two variables being compared symmetrically, and can be estimated using some types of non-random samples.

Definition in terms of group-wise odds

The odds ratio is the ratio of the odds of an event occurring in one group to the odds of it occurring in another group. The term is also used to refer to sample-based estimates of this ratio. These groups might be men and women, an experimental group and a control group, or any other dichotomous classification. If the probabilities of the event in each of the groups are p1 (first group) and p2 (second group), then the odds ratio is:

$$\mathrm{OR} = \frac{p_1/(1-p_1)}{p_2/(1-p_2)} = \frac{p_1/q_1}{p_2/q_2} = \frac{p_1 q_2}{p_2 q_1},$$

where qx = 1 − px. An odds ratio of 1 indicates that the condition or event under study is equally likely to occur in both groups. An odds ratio greater than 1 indicates that the condition or event is more likely to occur in the first group, and an odds ratio less than 1 indicates that it is less likely to occur in the first group. The odds ratio must be nonnegative if it is defined. It is undefined if p2q1 equals zero, i.e., if p2 equals zero or p1 equals one.
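
As a minimal sketch of this definition, the following Python computes the odds ratio from the two group probabilities and reports the undefined case explicitly. The function name `odds_ratio` is illustrative and does not come from any particular library.

```python
def odds_ratio(p1: float, p2: float) -> float:
    """Odds ratio p1*q2 / (p2*q1) for event probabilities p1 and p2."""
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    q1 = 1.0 - p1
    q2 = 1.0 - p2
    # The ratio is undefined when p2 * q1 is zero (p2 == 0 or p1 == 1).
    if p2 * q1 == 0.0:
        raise ValueError("odds ratio is undefined when p2 = 0 or p1 = 1")
    return (p1 * q2) / (p2 * q1)


if __name__ == "__main__":
    print(odds_ratio(0.5, 0.5))  # equal probabilities give an odds ratio of 1
    print(odds_ratio(0.9, 0.2))  # event more likely in the first group
```

Working with the cross-product form p1q2 / (p2q1) rather than dividing two separately computed odds keeps the case p2 = 1 (where the odds ratio is zero, not undefined) well behaved.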

Definition in terms of joint and conditional probabilities

The odds ratio can also be defined in terms of the joint probability distribution of two binary random variables. The joint distribution of binary random variables X and Y can be written

         Y = 1    Y = 0
X = 1    p11      p10
X = 0    p01      p00


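where p11, p10, p01 and p00 are non-negative "cell probabilities" that sum to one. The odds for Y within the two subpopulations defined by X = 1 and X = 0 are defined in terms of the conditional probabilities given X:

         Y = 1                Y = 0
X = 1    p11 / (p11 + p10)    p10 / (p11 + p10)
X = 0    p01 / (p01 + p00)    p00 / (p01 + p00)

Thus the odds ratio is

$$\frac{p_{11}/(p_{11}+p_{10})}{p_{10}/(p_{11}+p_{10})} \bigg/ \frac{p_{01}/(p_{01}+p_{00})}{p_{00}/(p_{01}+p_{00})} = \frac{p_{11}\,p_{00}}{p_{10}\,p_{01}}.$$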

The simple expression on the right, above, is easy to remember as the product of the probabilities of the "concordant cells" (X = Y) divided by the product of the probabilities of the "discordant cells" (X ≠ Y). However note that in some applications the labeling of categories as zero and one is arbitrary, so there is nothing special about concordant versus discordant values in these applications.

Symmetry

If we had calculated the odds ratio based on the conditional probabilities given Y,

         Y = 1                Y = 0
X = 1    p11 / (p11 + p01)    p10 / (p10 + p00)
X = 0    p01 / (p11 + p01)    p00 / (p10 + p00)

we would have obtained the same result:

$$\frac{p_{11}/(p_{11}+p_{01})}{p_{01}/(p_{11}+p_{01})} \bigg/ \frac{p_{10}/(p_{10}+p_{00})}{p_{00}/(p_{10}+p_{00})} = \frac{p_{11}\,p_{00}}{p_{10}\,p_{01}}.$$

Other measures of effect size for binary data, such as the relative risk, do not have this symmetry property.
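
The symmetry can be checked numerically. The sketch below, with illustrative function names and an arbitrarily chosen set of cell probabilities, computes the odds ratio once by conditioning on X and once by conditioning on Y, and does the same for the relative risk to show that it lacks the property.

```python
import math


def or_given_x(p11, p10, p01, p00):
    """Odds of Y = 1 within X = 1, divided by odds of Y = 1 within X = 0."""
    odds_x1 = (p11 / (p11 + p10)) / (p10 / (p11 + p10))
    odds_x0 = (p01 / (p01 + p00)) / (p00 / (p01 + p00))
    return odds_x1 / odds_x0


def or_given_y(p11, p10, p01, p00):
    """Odds of X = 1 within Y = 1, divided by odds of X = 1 within Y = 0."""
    odds_y1 = (p11 / (p11 + p01)) / (p01 / (p11 + p01))
    odds_y0 = (p10 / (p10 + p00)) / (p00 / (p10 + p00))
    return odds_y1 / odds_y0


def rr_given_x(p11, p10, p01, p00):
    """P(Y = 1 | X = 1) / P(Y = 1 | X = 0)."""
    return (p11 / (p11 + p10)) / (p01 / (p01 + p00))


def rr_given_y(p11, p10, p01, p00):
    """P(X = 1 | Y = 1) / P(X = 1 | Y = 0)."""
    return (p11 / (p11 + p01)) / (p10 / (p10 + p00))


# Example cell probabilities (p11, p10, p01, p00); chosen only for illustration.
cells = (0.1, 0.3, 0.2, 0.4)

if __name__ == "__main__":
    print("OR given X:", or_given_x(*cells), " OR given Y:", or_given_y(*cells))
    print("RR given X:", rr_given_x(*cells), " RR given Y:", rr_given_y(*cells))
```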

Relation to statistical independence

If X and Y are independent, their joint probabilities can be expressed in terms of their marginal probabilities px = P(X = 1) and py = P(Y = 1), as follows:

         Y = 1          Y = 0
X = 1    px py          px (1 − py)
X = 0    (1 − px) py    (1 − px)(1 − py)

In this case, the odds ratio equals one, and conversely the odds ratio can only equal one if the joint probabilities can be factored in this way. Thus the odds ratio equals one if and only if X and Y are independent.
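
The forward direction can be verified in one line. Under independence the cell probabilities are p11 = px py, p10 = px(1 − py), p01 = (1 − px)py and p00 = (1 − px)(1 − py); assuming all four are nonzero, substituting them into the cross-product form of the odds ratio gives

$$\frac{p_{11}\,p_{00}}{p_{10}\,p_{01}} = \frac{p_x p_y \,(1-p_x)(1-p_y)}{p_x (1-p_y)\,(1-p_x) p_y} = 1.$$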

Recovering the cell probabilities from the odds ratio and marginal probabilities

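The odds ratio is a function of the cell probabilities, and conversely, the cell probabilities can be recovered given knowledge of the odds ratio and the marginal probabilities P(X = 1) = p11 + p10 and P(Y = 1) = p11 + p01. If the odds ratio R differs from 1, then

$$p_{11} = \frac{1 + (p_{1\bullet} + p_{\bullet 1})(R - 1) - S}{2(R - 1)},$$

where p1• = p11 + p10, p•1 = p11 + p01, and

$$S = \sqrt{\bigl(1 + (p_{1\bullet} + p_{\bullet 1})(R - 1)\bigr)^2 + 4R(1 - R)\,p_{1\bullet}\,p_{\bullet 1}}.$$

In the case where R = 1, we have independence, so p11 = p1•p•1.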

Once we have p11, the other three cell probabilities can easily be recovered from the marginal probabilities.
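
A minimal Python sketch of this recovery follows. It solves the quadratic in p11 obtained from R(p1• − p11)(p•1 − p11) = p11(1 − p1• − p•1 + p11), taking the smaller root; the function name `recover_cells` and the use of `math.isclose` to detect R = 1 are choices made for this illustration.

```python
import math


def recover_cells(R: float, p1_dot: float, p_dot1: float):
    """Return (p11, p10, p01, p00) from the odds ratio R and the marginals
    p1_dot = P(X = 1) and p_dot1 = P(Y = 1)."""
    if R < 0.0:
        raise ValueError("an odds ratio cannot be negative")
    if math.isclose(R, 1.0):
        # Independence: the joint probability factors into the marginals.
        p11 = p1_dot * p_dot1
    else:
        t = 1.0 + (p1_dot + p_dot1) * (R - 1.0)
        s = math.sqrt(t * t + 4.0 * R * (1.0 - R) * p1_dot * p_dot1)
        p11 = (t - s) / (2.0 * (R - 1.0))
    # The remaining cells follow from the marginal probabilities.
    p10 = p1_dot - p11
    p01 = p_dot1 - p11
    p00 = 1.0 - p11 - p10 - p01
    return p11, p10, p01, p00


if __name__ == "__main__":
    print(recover_cells(16.0, 0.5, 0.5))
```

For example, an odds ratio of 16 with both marginals equal to 0.5 gives back cell probabilities close to 0.4, 0.1, 0.1 and 0.4, up to floating-point rounding.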

Example

Suppose that in a sample of 100 men, 90 drank wine in the previous week, while in a sample of 100 women only 20 drank wine in the same period. The odds of a man drinking wine are 90 to 10, or 9:1, while the odds of a woman drinking wine are only 20 to 80, or 1:4 = 0.25:1. The odds ratio is thus 9/0.25, or 36, showing that in this sample men are much more likely to drink wine than women. The detailed calculation is:

$$\mathrm{OR} = \frac{0.9/0.1}{0.2/0.8} = \frac{0.9 \times 0.8}{0.1 \times 0.2} = \frac{0.72}{0.02} = 36.$$

This example also shows how an odds ratio can give a very different impression of relative position than a ratio of probabilities: in this sample men are 90/20 = 4.5 times as likely as women to have drunk wine, but have 36 times the odds. The logarithm of the odds ratio, which is the difference of the logits of the probabilities, tempers this effect, and also makes the measure symmetric with respect to the ordering of groups. For example, using natural logarithms, an odds ratio of 36/1 maps to 3.584, and an odds ratio of 1/36 maps to −3.584.
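
The arithmetic of the wine example can be reproduced with a few lines of Python. This is only a sketch using the counts quoted above; the variable names are illustrative.

```python
import math

men_drinkers, men_total = 90, 100
women_drinkers, women_total = 20, 100

# Odds = number with the event / number without the event.
odds_men = men_drinkers / (men_total - men_drinkers)
odds_women = women_drinkers / (women_total - women_drinkers)

sample_or = odds_men / odds_women
relative_risk = (men_drinkers / men_total) / (women_drinkers / women_total)

# The log odds ratio changes sign, but not magnitude, when the groups are swapped.
log_or_men_vs_women = math.log(sample_or)
log_or_women_vs_men = math.log(1.0 / sample_or)

if __name__ == "__main__":
    print("odds ratio:", sample_or)
    print("relative risk:", round(relative_risk, 3))
    print("log odds ratios:", round(log_or_men_vs_women, 3), round(log_or_women_vs_men, 3))
```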

Statistical inference

Several approaches to statistical inference for odds ratios have been developed.

One approach to inference uses large sample approximations to the sampling distribution of the log odds ratio (the natural logarithm of the odds ratio). If we use the joint probability notation defined above, the population log odds ratio is

$$\log\left(\frac{p_{11}\,p_{00}}{p_{01}\,p_{10}}\right) = \log p_{11} + \log p_{00} - \log p_{10} - \log p_{01}.$$

If we observe data in the form of a contingency table

         Y = 1    Y = 0
X = 1    n11      n10
X = 0    n01      n00

then the probabilities in the joint distribution can be estimated as

         Y = 1    Y = 0
X = 1    p̂11      p̂10
X = 0    p̂01      p̂00

where p̂ij = nij / n, with n = n11 + n10 + n01 + n00 being the sum of all four cell counts. The sample log odds ratio is

$$L = \log\left(\frac{\hat p_{11}\,\hat p_{00}}{\hat p_{10}\,\hat p_{01}}\right) = \log\left(\frac{n_{11}\,n_{00}}{n_{10}\,n_{01}}\right).$$

The distribution of the sample log odds ratio L is approximately normal with:

$$L \sim \mathcal{N}\bigl(\log(\mathrm{OR}),\ \sigma^2\bigr).$$

The standard error for the log odds ratio is approximately

$$\mathrm{SE} = \sqrt{\frac{1}{n_{11}} + \frac{1}{n_{10}} + \frac{1}{n_{01}} + \frac{1}{n_{00}}}.$$

This is an asymptotic approximation, and will not give a meaningful result if any of the cell counts are very small. If L is the sample log odds ratio, an approximate 95% confidence interval for the population log odds ratio is L ± 1.96SE. This can be mapped to exp(L − 1.96SE), exp(L + 1.96SE) to obtain a 95% confidence interval for the odds ratio. If we wish to test the hypothesis that the population odds ratio equals one, the two-sided p-value is 2P(Z < −|L|/SE), where P denotes a probability, and Z denotes a standard normal random variable.
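
The large-sample procedure can be sketched in Python using only the standard library. The function name is illustrative, the standard error uses the usual sum-of-reciprocal-counts approximation, and the normal tail probability is computed through `math.erfc`, using the identity 2P(Z < −z) = erfc(z/√2).

```python
import math


def log_odds_ratio_inference(n11, n10, n01, n00, z_crit=1.96):
    """Return (L, SE, (lower, upper), p_value) for a 2x2 table of counts.

    L is the sample log odds ratio, SE its approximate standard error,
    (lower, upper) an approximate confidence interval for the odds ratio,
    and p_value the two-sided p-value for the hypothesis OR = 1.
    """
    if min(n11, n10, n01, n00) <= 0:
        raise ValueError("the approximation requires all cell counts to be positive")
    L = math.log((n11 * n00) / (n10 * n01))
    se = math.sqrt(1.0 / n11 + 1.0 / n10 + 1.0 / n01 + 1.0 / n00)
    interval = (math.exp(L - z_crit * se), math.exp(L + z_crit * se))
    # 2 * P(Z < -|L| / SE) for a standard normal Z.
    p_value = math.erfc(abs(L) / (se * math.sqrt(2.0)))
    return L, se, interval, p_value


if __name__ == "__main__":
    print(log_odds_ratio_inference(20, 10, 10, 20))
```

Because the approximation is asymptotic, a result from a table with very small counts should not be relied on even though the function will return one.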

An alternative approach to inference for odds ratios looks at the distribution of the data conditionally on the marginal frequencies of X and Y. An advantage of this approach is that the sampling distribution of the odds ratio can be expressed exactly.

Role in logistic regression

Logistic regression is one way to generalize the odds ratio beyond two binary variables. Suppose we have a binary response variable Y and a binary predictor variable X, and in addition we have other predictor variables Z1, ..., Zp that may or may not be binary. If we use multiple logistic regression to regress Y on X, Z1, ..., Zp, then the estimated coefficient β̂x for X is related to a conditional odds ratio. Specifically, at the population level

$$\exp(\beta_x) = \frac{P(Y=1 \mid X=1, Z_1, \ldots, Z_p)\,/\,P(Y=0 \mid X=1, Z_1, \ldots, Z_p)}{P(Y=1 \mid X=0, Z_1, \ldots, Z_p)\,/\,P(Y=0 \mid X=0, Z_1, \ldots, Z_p)},$$

so exp(β̂x) is an estimate of this conditional odds ratio. The interpretation of exp(β̂x) is as an estimate of the odds ratio between Y and X when the values of Z1, ..., Zp are held fixed.
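
The population-level identity can be illustrated numerically. In the sketch below the coefficients are made-up values chosen for illustration, not estimates from any data set; the point is only that the odds ratio between Y and X, computed from the model probabilities, equals the exponential of the coefficient of X whatever values the other predictors take.

```python
import math


def prob_y1(x, z, beta0, beta_x, beta_z):
    """P(Y = 1 | X = x, Z = z) under a logistic regression model."""
    eta = beta0 + beta_x * x + sum(b * zi for b, zi in zip(beta_z, z))
    return 1.0 / (1.0 + math.exp(-eta))


def conditional_odds_ratio(z, beta0, beta_x, beta_z):
    """Odds of Y = 1 at X = 1 divided by odds of Y = 1 at X = 0, with Z fixed at z."""
    p1 = prob_y1(1, z, beta0, beta_x, beta_z)
    p0 = prob_y1(0, z, beta0, beta_x, beta_z)
    return (p1 / (1.0 - p1)) / (p0 / (1.0 - p0))


# Hypothetical coefficients, for illustration only.
BETA0, BETA_X, BETA_Z = -1.0, 0.7, (0.5, -0.2)

if __name__ == "__main__":
    for z in [(0.0, 0.0), (1.5, -2.0), (3.0, 4.0)]:
        print(z, conditional_odds_ratio(z, BETA0, BETA_X, BETA_Z), math.exp(BETA_X))
```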

Insensitivity to the type of sampling

If the data form a "population sample", then the cell probabilities p̂ij are interpreted as the frequencies of each of the four groups in the population as defined by their X and Y values. In many settings it is impractical to obtain a population sample, so a selected sample is used. For example, we may choose to sample units with X = 1 with a given probability f, regardless of their frequency in the population (which would necessitate sampling units with X = 0 with probability 1 − f). In this situation, our data would follow the following joint probabilities:

         Y = 1                        Y = 0
X = 1    f p11 / (p11 + p10)          f p10 / (p11 + p10)
X = 0    (1 − f) p01 / (p01 + p00)    (1 − f) p00 / (p01 + p00)


The odds ratio p11p00 / p01p10 for this distribution does not depend on the value of f. This shows that the odds ratio (and consequently the log odds ratio) is invariant to non-random sampling based on one of the variables being studied. Note however that the standard error of the log odds ratio does depend on the value of f. This fact is exploited in two important situations:
  • Suppose it is inconvenient or impractical to obtain a population sample, but it is practical to obtain a convenience sample of units with different X values, such that within the X = 0 and X = 1 subsamples the Y values are representative of the population (i.e. they follow the correct conditional probabilities).

  • Suppose the marginal distribution of one variable, say X, is very skewed. For example, if we are studying the relationship between high alcohol consumption and pancreatic cancer in the general population, the incidence of pancreatic cancer would be very low, so it would require a very large population sample to get a modest number of pancreatic cancer cases. However, we could use data from hospitals to contact most or all of their pancreatic cancer patients, and then randomly sample an equal number of subjects without pancreatic cancer (this is called a "case-control study").


In both these settings, the odds ratio can be calculated from the selected sample, without biasing the results relative to what would have been obtained for a population sample.
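
The invariance to the sampling fraction f can be checked directly. The following sketch, with illustrative function names and an arbitrary set of population cell probabilities, builds the joint distribution of a sample in which units with X = 1 are drawn with probability f and recomputes the odds ratio for several values of f.

```python
import math


def cross_product_or(p11, p10, p01, p00):
    """Odds ratio p11*p00 / (p10*p01)."""
    return (p11 * p00) / (p10 * p01)


def selected_sample_cells(p11, p10, p01, p00, f):
    """Joint probabilities when X = 1 units are sampled with probability f
    and X = 0 units with probability 1 - f, keeping P(Y | X) as in the population."""
    px1 = p11 + p10
    px0 = p01 + p00
    return (f * p11 / px1, f * p10 / px1,
            (1.0 - f) * p01 / px0, (1.0 - f) * p00 / px0)


# Example population cell probabilities; chosen only for illustration.
population = (0.1, 0.3, 0.2, 0.4)

if __name__ == "__main__":
    print("population OR:", cross_product_or(*population))
    for f in (0.1, 0.5, 0.9):
        print("f =", f, "OR =", cross_product_or(*selected_sample_cells(*population, f)))
```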

Use in quantitative research

Due to the widespread use of logistic regression, the odds ratio is widely used in many fields of medical and social science research. The odds ratio is commonly used in survey research, in epidemiology (for example, in case-control studies), and to express the results of some clinical trials. It is often abbreviated "OR" in reports. When data from multiple surveys are combined, the result is often expressed as a "pooled OR".

Relation to relative risk

In clinical studies, as well as in some other settings, the parameter of greatest interest is often the relative risk rather than the odds ratio. The relative risk is best estimated using a population sample, but if the rare disease assumption holds, the odds ratio is a good approximation to the relative risk. The odds are p / (1 − p), so when p moves towards zero, 1 − p moves towards 1, meaning that the odds approach the risk, and the odds ratio approaches the relative risk. When the rare disease assumption does not hold, the odds ratio lies further from 1 than the relative risk, so for a harmful exposure it overestimates the relative risk.

If the absolute risk in the unexposed (control) group is available, the conversion between the two is:

$$\mathrm{RR} = \frac{\mathrm{OR}}{1 - R_C + R_C \times \mathrm{OR}}$$

where:
  • RR = relative risk
  • OR = odds ratio
  • RC = absolute risk in the unexposed group, given as a fraction (for example: fill in 10% risk as 0.1)
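
The conversion RR = OR / (1 − RC + RC × OR) and its algebraic inverse, OR = RR(1 − RC) / (1 − RC × RR), can be sketched in Python as follows. The function names are illustrative, and the inverse is derived here by solving the conversion formula for OR.

```python
def odds_ratio_to_relative_risk(odds_ratio: float, rc: float) -> float:
    """Convert an odds ratio to a relative risk, given the absolute risk rc
    in the unexposed group (a fraction, e.g. 0.1 for 10%)."""
    if not 0.0 <= rc < 1.0:
        raise ValueError("rc must be a fraction in [0, 1)")
    return odds_ratio / (1.0 - rc + rc * odds_ratio)


def relative_risk_to_odds_ratio(relative_risk: float, rc: float) -> float:
    """Inverse conversion; requires rc * relative_risk < 1, i.e. an exposed-group
    risk below 1."""
    if not 0.0 <= rc < 1.0:
        raise ValueError("rc must be a fraction in [0, 1)")
    if rc * relative_risk >= 1.0:
        raise ValueError("implied risk in the exposed group must be below 1")
    return relative_risk * (1.0 - rc) / (1.0 - rc * relative_risk)


if __name__ == "__main__":
    # Odds ratio 36 with a 20% risk in the unexposed group.
    print(round(odds_ratio_to_relative_risk(36.0, 0.2), 3))
    # With a rare outcome (2% baseline risk) the two measures nearly coincide.
    print(round(odds_ratio_to_relative_risk(2.0, 0.02), 3))
```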

Invertible Property and Invariance of the Odds Ratio

The odds ratio has the further property of being directly invertible depending on whether the outcome is analyzed as disease onset or as disease-free survival: the OR for survival is the reciprocal of the OR for risk. This is known as the 'invariance of the odds ratio'. In contrast, the relative risk does not possess this property. The difference between OR invertibility and RR non-invertibility is best illustrated with an example:

Suppose that in a clinical trial the adverse event risk is 4/100 in the drug group and 2/100 in the placebo group, yielding RR = 2 and OR = 2.04166 for drug-vs-placebo adverse risk. If the analysis is inverted and the same data are analyzed as event-free survival, the drug group has a rate of 96/100 and the placebo group a rate of 98/100, yielding a drug-vs-placebo RR = 0.9796 for survival, but an OR = 0.48979. An RR of 0.9796 is clearly not the reciprocal of an RR of 2, which would be 0.5. In contrast, an OR of 0.48979 is the reciprocal of an OR of 2.04166.

This is why an RR for survival is not interchangeable with an RR for risk, while the OR is symmetric with respect to analyzing either survival or adverse risk. The danger to clinical interpretation of the OR arises when the adverse event rate is not rare: the rare-disease assumption is then not met, and the OR exaggerates differences relative to the RR. On the other hand, when the disease is rare, using an RR for survival (e.g. the RR = 0.9796 from the example above) can conceal an important doubling of adverse risk associated with a drug or exposure.
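
The figures in the clinical trial example can be reproduced with the short sketch below; the function names are illustrative.

```python
def risk_ratio(events_a, total_a, events_b, total_b):
    """Ratio of the event proportions in group A and group B."""
    return (events_a / total_a) / (events_b / total_b)


def odds_ratio_from_counts(events_a, total_a, events_b, total_b):
    """Ratio of the event odds in group A and group B."""
    odds_a = events_a / (total_a - events_a)
    odds_b = events_b / (total_b - events_b)
    return odds_a / odds_b


# Adverse events: 4/100 on drug, 2/100 on placebo.
rr_event = risk_ratio(4, 100, 2, 100)
or_event = odds_ratio_from_counts(4, 100, 2, 100)

# The same data analyzed as event-free survival: 96/100 and 98/100.
rr_survival = risk_ratio(96, 100, 98, 100)
or_survival = odds_ratio_from_counts(96, 100, 98, 100)

if __name__ == "__main__":
    print("RR event:", rr_event, " RR survival:", round(rr_survival, 4))
    print("OR event:", round(or_event, 5), " OR survival:", round(or_survival, 5))
    print("OR product:", or_event * or_survival, " RR product:", rr_event * rr_survival)
```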

Alternative estimators of the odds ratio

The sample odds ratio n11n00 / (n10n01) is easy to calculate, and for moderate and large samples performs well as an estimator of the population odds ratio. When one or more of the cells in the contingency table can have a small value, the sample odds ratio can be biased and exhibit high variance. A number of alternative estimators of the odds ratio have been proposed to address this issue. One alternative estimator is the conditional maximum likelihood estimator, which conditions on the row and column margins when forming the likelihood to maximize (as in Fisher's exact test). Another alternative estimator is the Mantel-Haenszel estimator.
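
The text above names the Mantel-Haenszel estimator without defining it. As a hedged sketch, its commonly stated form for a set of 2×2 tables (strata) is the sum over strata of n11n00 / n divided by the sum over strata of n10n01 / n, where n is the total count of the stratum. The function name and the example strata below are made up for illustration.

```python
def mantel_haenszel_or(strata):
    """Mantel-Haenszel odds ratio for an iterable of (n11, n10, n01, n00) tables."""
    numerator = 0.0
    denominator = 0.0
    for n11, n10, n01, n00 in strata:
        n = n11 + n10 + n01 + n00
        if n == 0:
            continue  # an empty stratum carries no information
        numerator += n11 * n00 / n
        denominator += n10 * n01 / n
    if denominator == 0.0:
        raise ValueError("estimator is undefined: every stratum has n10 * n01 = 0")
    return numerator / denominator


if __name__ == "__main__":
    # Two hypothetical strata, each with a sample odds ratio of 4.
    print(mantel_haenszel_or([(20, 10, 10, 20), (40, 20, 20, 40)]))
```

With a single stratum the estimator reduces to the ordinary sample odds ratio, and a stratum containing a zero cell still contributes to the sums rather than making the whole estimate undefined.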

Numerical examples

The following four contingency tables contain observed cell counts, along with the corresponding sample odds ratio (OR) and sample log odds ratio (LOR):

         OR = 1, LOR = 0    OR = 1, LOR = 0    OR = 4, LOR = 1.39    OR = 0.25, LOR = −1.39
         Y = 1    Y = 0     Y = 1    Y = 0     Y = 1    Y = 0        Y = 1    Y = 0
X = 1    10       10        100      100       20       10           10       20
X = 0    5        5         50       50        10       20           20       10


The following joint probability distributions contain the population cell probabilities, along with the corresponding population odds ratio (OR) and population log odds ratio (LOR):

         OR = 1, LOR = 0    OR = 1, LOR = 0    OR = 16, LOR = 2.77    OR = 0.67, LOR = −0.41
         Y = 1    Y = 0     Y = 1    Y = 0     Y = 1    Y = 0         Y = 1    Y = 0
X = 1    0.2      0.2       0.4      0.4       0.4      0.1           0.1      0.3
X = 0    0.3      0.3       0.1      0.1       0.1      0.4           0.2      0.4
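
The OR and LOR values quoted for these tables can be recomputed with a short Python sketch. Each table is written as a tuple (cell for X = 1, Y = 1; cell for X = 1, Y = 0; cell for X = 0, Y = 1; cell for X = 0, Y = 0), and the function name is illustrative.

```python
import math


def or_and_lor(c11, c10, c01, c00):
    """Return the odds ratio and log odds ratio of a 2x2 table of counts
    or probabilities."""
    odds_ratio = (c11 * c00) / (c10 * c01)
    return odds_ratio, math.log(odds_ratio)


count_tables = [(10, 10, 5, 5), (100, 100, 50, 50), (20, 10, 10, 20), (10, 20, 20, 10)]
probability_tables = [(0.2, 0.2, 0.3, 0.3), (0.4, 0.4, 0.1, 0.1),
                      (0.4, 0.1, 0.1, 0.4), (0.1, 0.3, 0.2, 0.4)]

if __name__ == "__main__":
    for table in count_tables + probability_tables:
        odds_ratio, log_odds_ratio = or_and_lor(*table)
        print(table, "OR =", round(odds_ratio, 2), "LOR =", round(log_odds_ratio, 2))
```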

The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 