Rule of succession
In probability theory, the rule of succession is a formula introduced in the 18th century by Pierre-Simon Laplace in the course of treating the sunrise problem.
The formula is still used, particularly to estimate underlying probabilities when there are few observations, or for events which have not been observed to occur at all in (finite) sample data. Assigning such events a zero probability would contravene Cromwell's rule: a zero probability can never be strictly justified in physical situations, although it sometimes has to be assumed in practice.
Statement of the rule of succession
If we repeat an experiment that we know can result in a success or failure, n times independently, and get s successes, then what is the probability that the next repetition will again be a success?
More abstractly: If $X_1, \dots, X_{n+1}$ are conditionally independent random variables that each can assume the value 0 or 1, then, if we know nothing more about them,
$$P(X_{n+1} = 1 \mid X_1 + \cdots + X_n = s) = \frac{s+1}{n+2}.$$
Interpretation
Since we have the prior knowledge that we are looking at an experiment for which both success and failure are possible, our estimate is as if we had observed one success and one failure for sure before we even started the experiments. In a sense we made n+2 observations (the two extra ones are known as pseudocounts) with s+1 successes. Beware: although this may seem the simplest and most reasonable assumption, and it happens to give the correct result and so is a useful mnemonic, it still requires a proof! Indeed, assuming a pseudocount of one per possibility is one way to generalise the binary result, but it has unexpected consequences; see Generalization to any number of possibilities, below.
Nevertheless, if we had not known from the start that both success and failure are possible, then we would have had to assign
$$P'(X_{n+1} = 1 \mid X_1 + \cdots + X_n = s) = \frac{s}{n}.$$
But see Mathematical details, below, for an analysis of its validity. In particular it is not valid when s=0, or s=n.
If the number of observations increases, the two estimates $(s+1)/(n+2)$ and $s/n$ get more and more similar, which is intuitively clear: the more data we have, the less importance should be assigned to our prior information.
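The convergence of the two estimates can be checked numerically. The following is a minimal sketch in Python; the function names and the fixed 30% success ratio are illustrative assumptions, not part of the original text.

```python
from fractions import Fraction

def rule_of_succession(s, n):
    """Laplace's estimate (s + 1) / (n + 2) for success on the next trial."""
    return Fraction(s + 1, n + 2)

def observed_frequency(s, n):
    """Plain relative frequency s / n; undefined for n = 0."""
    return Fraction(s, n)

# Hold the success ratio at 30% (an arbitrary choice) and grow the sample.
for n in (10, 100, 1000, 10000):
    s = 3 * n // 10
    gap = rule_of_succession(s, n) - observed_frequency(s, n)
    print(n, float(rule_of_succession(s, n)), float(observed_frequency(s, n)), float(gap))
```

With no data at all (s = 0, n = 0) the first function returns 1/2, while the second is undefined, which mirrors the restriction on the frequency estimate noted above.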
Historical application to the sunrise problem
Laplace used the rule of succession to calculate the probability that the sun will rise tomorrow, given that it has risen every day for the past 5000 years. One obtains a very large factor of approximately 5000 × 365.25, which gives odds of about 1826250:1 in favour of the sun rising tomorrow.
However, as the mathematical details below show, the basic assumption for using the rule of succession would be that we have no prior knowledge about the question whether the sun will or will not rise tomorrow, except that it can do either. This assumption is of course complete nonsense if we are talking about sunrises!
Laplace knew this well, and himself wrote to conclude the sunrise example: “But this number is far greater for him who, seeing in the totality of phenomena the principle regulating the days and seasons, realises that nothing at the present moment can arrest the course of it.” Yet Laplace was ridiculed for this calculation; his opponents gave no heed to that sentence, or failed to understand its importance.
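The arithmetic behind the sunrise figure can be reproduced as follows. This is only a sketch: it assumes exactly 5000 years of 365.25 days, all with an observed sunrise, and the variable names are illustrative.

```python
from fractions import Fraction

days = round(5000 * 365.25)             # number of observed sunrises, n = s
p_rise = Fraction(days + 1, days + 2)   # rule of succession with s = n = days
odds_in_favour = p_rise / (1 - p_rise)  # odds of a sunrise tomorrow, to 1

print(days, float(p_rise), odds_in_favour)
```

Under these assumptions the exact odds come out as n + 1 to 1, which differs from the rounded figure quoted above only in the last digit.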
Mathematical details
The proportion p is assigned a uniform distribution to describe the uncertainty about its true value. (Note: This proportion is not random, but uncertain. We assign a probability distribution to p to express our uncertainty, not to attribute randomness to p. But this amounts mathematically to the same thing as treating p as if it were random.)
Let $X_i$ be the number of "successes" on the ith trial, with probability p of success on each trial. Thus each X is 0 or 1; each X has a Bernoulli distribution. Suppose these Xs are conditionally independent given p.
Bayes' theorem says that in order to get the conditional probability distribution of p given the data $X_i$, i = 1, ..., n, one multiplies the "prior" (i.e., marginal) probability measure assigned to p by the likelihood function
$$L(p) = P(X_1 = x_1, \dots, X_n = x_n \mid p) = \prod_{i=1}^{n} p^{x_i} (1-p)^{1-x_i} = p^s (1-p)^{n-s},$$
where $s = x_1 + \cdots + x_n$ is the number of "successes" and n is of course the number of trials, and then normalizes, to get the "posterior" (i.e., conditional on the data) probability distribution of p. (We are using capital X to denote a random variable and lower-case x either as the dummy in the definition of a function or as the data actually observed.)
The prior probability density function that expresses total ignorance of p except for the certain knowledge that it is neither 1 nor 0 (i.e., that we know that the experiment can in fact succeed or fail) is equal to 1 for 0 < p < 1 and equal to 0 otherwise. To get the normalizing constant, we find
$$\int_0^1 p^s (1-p)^{n-s}\,dp = \frac{s!\,(n-s)!}{(n+1)!}$$
(see beta function for more on integrals of this form).
The posterior probability density function is therefore
$$f(p) = \frac{(n+1)!}{s!\,(n-s)!}\, p^s (1-p)^{n-s}.$$
This is a beta distribution with expected value
$$\int_0^1 p\, f(p)\,dp = \frac{s+1}{n+2}.$$
Since the conditional probability for success in the next experiment, given the value of p, is just p, the law of total probability tells us that the probability of success in the next experiment is just the expected value of p. Since all of this is conditional on the observed data $X_i$ for i = 1, ..., n, we have
$$P(X_{n+1} = 1 \mid X_i = x_i \text{ for } i = 1, \dots, n) = \frac{s+1}{n+2}.$$
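As a sanity check on the derivation, the normalizing integral and the posterior mean can be approximated numerically and compared with the closed forms s!(n−s)!/(n+1)! and (s+1)/(n+2). The sketch below uses a simple midpoint rule; the helper names, the step count, and the sample values s = 3, n = 10 are assumptions chosen for illustration.

```python
from math import factorial

def beta_integral_closed_form(s, n):
    """Closed form of the integral of p^s (1-p)^(n-s) over [0, 1]."""
    return factorial(s) * factorial(n - s) / factorial(n + 1)

def midpoint_integral(f, steps=100_000):
    """Midpoint-rule approximation of the integral of f over [0, 1]."""
    h = 1.0 / steps
    return h * sum(f((k + 0.5) * h) for k in range(steps))

s, n = 3, 10  # hypothetical data: 3 successes in 10 trials

# Unnormalized posterior under the uniform prior is p^s (1-p)^(n-s).
norm = midpoint_integral(lambda p: p**s * (1 - p)**(n - s))
mean = midpoint_integral(lambda p: p * p**s * (1 - p)**(n - s)) / norm

print(norm, beta_integral_closed_form(s, n))  # normalizing constant, two ways
print(mean, (s + 1) / (n + 2))                # posterior mean vs rule of succession
```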
The same calculation can be performed with the prior that expresses total ignorance of p, including ignorance with regard to the question whether the experiment can succeed, or can fail. This prior, except for a normalizing constant, is 1/(p(1 − p)) for 0 ≤ p ≤ 1 and 0 otherwise. If the calculation above is repeated with this prior, we get
$$P'(X_{n+1} = 1 \mid X_i = x_i \text{ for } i = 1, \dots, n) = \frac{s}{n}.$$
Thus, with the prior specifying total ignorance, the probability of success is governed by the observed frequency of success. However, the posterior distribution which led to this result is the Beta(s,n-s) distribution, which is not proper when s=n or s=0 (i.e. the normalisation constant is infinite when s=0 or s=n). This means that we cannot use this form of the posterior distribution to calculate the probability of the next observation being a success when s=0 or s=n. This sheds light on the information contained in the rule of succession: it can be thought of as expressing the prior assumption that if sampling were continued indefinitely, we would eventually observe at least one success, and at least one failure in the sample. The prior expressing total ignorance does not assume this knowledge.
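The "complete ignorance" case when s=0 or s=n can be dealt with by first going back to the hypergeometric distribution, denoted by $\mathrm{Hyp}(s \mid N, n, S)$. This is the approach taken in Jaynes (2003). The binomial $\mathrm{Bin}(s \mid n, p)$ can be derived as a limiting form, where $N, S \to \infty$ in such a way that their ratio $p = S/N$ remains fixed. One can think of S as the number of successes in the total population, of size N.
The equivalent prior to $1/(p(1-p))$ is $1/(S(N-S))$, with a domain of $1 \le S \le N-1$. Working conditional on N means that estimating p is equivalent to estimating S, and then dividing this estimate by N. The posterior for S can be given as:
$$P(S \mid N, n, s) \propto \frac{1}{S(N-S)} \binom{S}{s} \binom{N-S}{n-s} \propto \frac{S!\,(N-S)!}{S(N-S)\,(S-s)!\,(N-S-[n-s])!}$$
And it can be seen that, if s=n or s=0, then one of the factorials in the numerator will cancel exactly with one in the denominator. Taking the s=0 case, we have:
$$P(S \mid N, n, s=0) \propto \frac{(N-S-1)!}{S\,(N-S-n)!} = \frac{\prod_{j=1}^{n-1}(N-S-j)}{S}$$
Adding in the normalising constant, which is always finite (because there are no singularities in the range of the posterior, and there are a finite number of terms) gives:
$$P(S \mid N, n, s=0) = \frac{\prod_{j=1}^{n-1}(N-S-j)/S}{\sum_{R=1}^{N-n} \prod_{j=1}^{n-1}(N-R-j)/R}$$
So the posterior expectation for $p = S/N$ is:
$$E\left(\frac{S}{N} \,\Big|\, n, s=0, N\right) = \frac{1}{N} \cdot \frac{\sum_{S=1}^{N-n} \prod_{j=1}^{n-1}(N-S-j)}{\sum_{R=1}^{N-n} \prod_{j=1}^{n-1}(N-R-j)/R}$$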
An approximate analytical expression for large N is given by first making the approximation to the product term:
$$\prod_{j=1}^{n-1}(N-R-j) \approx (N-R)^{n-1}$$
and then replacing the summation in the numerator with an integral:
$$\sum_{S=1}^{N-n} \prod_{j=1}^{n-1}(N-S-j) \approx \int_1^{N-n} (N-S)^{n-1}\,dS = \frac{(N-1)^n - n^n}{n} \approx \frac{N^n}{n}$$
The same procedure is followed for the denominator, but the process is a bit more tricky, as the integral is harder to evaluate:
$$\sum_{R=1}^{N-n} \frac{\prod_{j=1}^{n-1}(N-R-j)}{R} \approx \int_1^{N-n} \frac{(N-R)^{n-1}}{R}\,dR \approx N^{n-1} \ln(N)$$
where ln(.) is the natural logarithm. Plugging these approximations into the expectation gives
$$E\left(\frac{S}{N} \,\Big|\, n, s=0, N\right) \approx \frac{1}{N} \cdot \frac{N^n / n}{N^{n-1} \ln(N)} = \frac{1}{n \ln(N)} = \frac{\log_{10}(e)}{n \log_{10}(N)} \approx \frac{0.434294}{n \log_{10}(N)}$$
where the base 10 logarithm has been used in the final answer for ease of calculation. For instance, if the population is of size $10^k$ then the probability of success on the next sample is given by:
$$E\left(\frac{S}{N} \,\Big|\, n, s=0, N=10^k\right) \approx \frac{0.434294}{nk}$$
So for example, if the population is on the order of ten billion, so that k=10, and we observe n=10 results without success, then the expected proportion in the population is approximately 0.43%. If the population is smaller, so that n=10, k=5 (on the order of a hundred thousand), the expected proportion rises to approximately 0.87%, and so on. Similarly, if the number of observations is smaller, so that n=5, k=10, the proportion again rises to approximately 0.87%.
This probability has no positive lower bound, and can be made arbitrarily small for larger and larger choices of N, or k. This means that the probability depends on the size of the population from which one is sampling. In passing to the limit of infinite N (for the simpler analytic properties) we are "throwing away" a piece of very important information. Note that this ignorance relationship holds only as long as no successes are observed. It is revised straight back up to the observed frequency rule as soon as 1 success is observed. The corresponding results are found for the s=n case by switching labels, and then subtracting the probability from 1.
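The dependence on the population size can be explored directly for modest N. The sketch below assumes the finite-population setup described above (no successes in n draws, prior weight proportional to 1/(S(N−S))) and compares a brute-force posterior mean of S/N with the large-N approximation 1/(n ln N). The function names and the chosen values of N and n are illustrative.

```python
import math

def exact_expected_fraction(N, n):
    """Posterior mean of S/N after n draws with zero successes (s = 0).

    The posterior weight of S is prod_{j=1}^{n-1}(N - S - j) / S for
    S = 1 .. N - n. Each factor is divided by N to avoid huge numbers;
    that common scale cancels in the ratio.
    """
    num = 0.0
    den = 0.0
    for S in range(1, N - n + 1):
        weight = 1.0
        for j in range(1, n):
            weight *= (N - S - j) / N
        num += weight        # sum of S * (weight / S)
        den += weight / S    # normalising sum
    return num / den / N

def approx_expected_fraction(N, n):
    """Leading-order large-N approximation 1 / (n ln N)."""
    return 1.0 / (n * math.log(N))

n = 5
for N in (10**3, 10**4):
    print(N, exact_expected_fraction(N, n), approx_expected_fraction(N, n))
```

In this sketch both columns shrink as N grows, which illustrates the absence of a lower bound. The approximation keeps only the leading ln N term, so at such small N it can differ noticeably from the brute-force value.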
Generalization to any number of possibilities
This section gives a heuristic derivation, as an alternative to the full derivation given in Probability Theory: The Logic of Science.
The rule of succession has many different intuitive interpretations, and depending on which intuition one uses, the generalisation may be different. Thus, the way to proceed from here is very carefully, re-deriving the results from first principles rather than introducing an intuitively sensible generalisation. The full derivation can be found in Jaynes' book, but it does admit an easier to understand alternative derivation, once the solution is known. Another point to emphasise is that the prior state of knowledge described by the rule of succession is given as an enumeration of the possibilities, with the additional information that it is possible for each category to be observed. This can be equivalently stated as observing each category once prior to gathering the data. To denote that this is the knowledge being used, an $I_m$ will be put as part of the conditions in the probability assignments.
The rule of succession comes from setting a binomial likelihood, and a uniform prior distribution. Thus a straightforward generalisation is just the multivariate extensions of these two distributions: 1) setting a uniform prior over the initial m categories, and 2) using the multinomial distribution as the likelihood function (which is the multivariate generalisation of the binomial distribution). It can be shown that the uniform distribution is a special case of the Dirichlet distribution with all of its parameters equal to 1 (just as the uniform is Beta(1,1) in the binary case). The Dirichlet distribution is the conjugate prior for the multinomial distribution, which means that the posterior distribution will also be a Dirichlet distribution with different parameters. Let $p_i$ denote the probability that category i will be observed, and let $n_i$ denote the number of times category i (i=1,...,m) actually was observed. Then the joint posterior distribution of the probabilities $p_1, \dots, p_m$ is given by:
$$f(p_1, \dots, p_m \mid n_1, \dots, n_m, I_m) = \begin{cases} \dfrac{\Gamma\left(\sum_{i=1}^{m}(n_i+1)\right)}{\prod_{i=1}^{m} \Gamma(n_i+1)}\, p_1^{n_1} \cdots p_m^{n_m}, & \sum_{i=1}^{m} p_i = 1 \\ 0, & \text{otherwise} \end{cases}$$
To get the generalised rule of succession, note that the probability of observing category i on the next observation, conditional on the $p_i$, is just $p_i$, so we simply require its expectation. Let $A_i$ denote the event that the next observation is in category i (i=1,...,m), and let $n = n_1 + \cdots + n_m$ be the total number of observations made. The result, using the properties of the Dirichlet distribution, is:
$$P(A_i \mid n_1, \dots, n_m, I_m) = \frac{n_i + 1}{n + m}$$
Note that this solution reduces to the probability which would be assigned using the principle of indifference before any observations are made (i.e. n=0), consistent with the original rule of succession. It also contains the rule of succession as a special case, when m=2, as a generalisation should.
Because the propositions or events $A_i$ are mutually exclusive, it is possible to collapse the m categories into 2. Simply add up the $A_i$ probabilities which correspond to "success" to get the probability of success. Suppose that this aggregates c categories as "success" and m-c categories as "failure". Let s denote the sum of the relevant $n_i$ values which have been termed "success". The probability of "success" at the next trial is then:
$$P(\text{success} \mid n_1, \dots, n_m, I_m) = \frac{s + c}{n + m}$$
which is different from the original rule of succession. But note that the original rule of succession is based on $I_2$, whereas the generalisation is based on $I_m$. This means that the information contained in $I_m$ is different from that contained in $I_2$. This indicates that mere knowledge of more than two outcomes which we are sure are possible is relevant information when collapsing these categories down into just two. This illustrates the subtlety in describing the prior information, and why it is important to specify which prior information one is using.
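The generalised rule and the effect of collapsing categories can be illustrated with a short sketch. It assumes the estimate (n_i + 1)/(n + m) for each of m categories; the function names and the counts are hypothetical.

```python
from fractions import Fraction

def generalized_rule(counts):
    """P(next observation in category i) = (n_i + 1) / (n + m)."""
    n, m = sum(counts), len(counts)
    return [Fraction(n_i + 1, n + m) for n_i in counts]

def collapsed_success(counts, success_categories):
    """Add up the probabilities of the categories labelled 'success'."""
    probs = generalized_rule(counts)
    return sum(probs[i] for i in success_categories)

counts = [3, 1, 0]  # hypothetical observations for m = 3 categories

print(generalized_rule(counts))           # per-category probabilities, summing to 1
print(collapsed_success(counts, [0, 1]))  # (s + c) / (n + m) with s = 4, c = 2
print(Fraction(4 + 1, 4 + 2))             # binary rule with s = 4, n = 4, for contrast
```

Treating the first two categories as "success" gives 6/7 here, whereas the binary rule applied to the same four observations gives 5/6. The difference comes from the extra category known to be possible.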
Further analysis
A good model is essential (i.e., a good compromise between accuracy and practicality). To paraphrase Laplace on the sunrise problem: Although we have a huge number of samples of the sun rising, there are far better models of the sun than assuming it has a certain probability of rising each day, e.g., simply having a half-life.
Given a good model, it is best to make as many observations as practicable, depending on the expected reliability of prior knowledge, cost of observations, time and resources available, and accuracy required.
One of the most difficult aspects of the rule of succession is not the mathematical formulas, but answering the question: When does the rule of succession apply? In the generalisation section, this was noted very explicitly by adding the prior information $I_m$ into the calculations. Thus when all that is known about a phenomenon is that there are m outcomes which are known to be possible prior to observing any data, then and only then does the rule of succession apply. If the rule of succession is applied in problems where this does not accurately describe the prior state of knowledge, then it may give counter-intuitive results. This is not because the rule of succession is defective, but because it is effectively answering a different question, based on different prior information.
In principle (see Cromwell's rule), no possibility should have its probability (or its pseudocount) set to zero, since nothing in the physical world should be assumed to be strictly impossible (though it may be), even if it would be contrary to all observations to date and current theories. Indeed, Bayes' rule takes absolutely no account of an observation that was previously believed to have zero probability: it is still declared impossible. However, considering only a fixed set of the possibilities is an acceptable route; one just needs to remember that the results are conditional on (or restricted to) the set being considered, and not some "universal" set. In fact Larry Bretthorst shows that including the possibility of "something else" in the hypothesis space makes no difference to the relative probabilities of the other hypotheses: it simply renormalises them to add up to a value less than 1. Until "something else" is specified, the likelihood function conditional on this "something else" is indeterminate, for how is one to determine $P(\text{data} \mid \text{something else}, I)$? Thus no updating of the prior probability for "something else" can occur until it is more accurately defined.
However, it is sometimes debatable whether prior knowledge should affect only the relative probabilities, or also the total weight of the prior knowledge compared to actual observations. This does not have a clear-cut answer, for it depends on what prior knowledge one is considering. In fact, an alternative prior state of knowledge could be of the form "I have specified m potential categories, but I am sure that only one of them is possible prior to observing the data. However, I do not know which particular category this is." A mathematical way to describe this prior is the Dirichlet distribution with all parameters equal to $m^{-1}$, which then gives a pseudocount of 1 to the denominator instead of m, and adds a pseudocount of $m^{-1}$ to each category. This gives a slightly different probability in the binary case of $\frac{s + 0.5}{n + 1}$.
Prior probabilities are only worth spending significant effort estimating when they are likely to have a significant effect. They may be important when there are few observations, especially when so few that there have been few, if any, observations of some possibilities, such as a rare animal in a given region. They are also important when there are many observations but it is believed that the expectation should be heavily weighted towards the prior estimates, in spite of many observations to the contrary, such as for a roulette wheel in a well-respected casino. In the latter case, at least some of the pseudocounts may need to be very large; they are not always small, and thereby soon outweighed by actual observations, as is often assumed. However, although a last resort, for everyday purposes, prior knowledge is usually vital. So most decisions must be subjective to some extent (dependent upon the analyst and analysis used).
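How much the choice of pseudocounts matters can be seen by computing the same estimate under different prior weights. The sketch below assumes a Dirichlet prior with per-category pseudocounts a_i, for which the estimate is (n_i + a_i)/(n + sum of a_i). The function name, the data, and the pseudocount values (including the "strong prior" of 500 per outcome) are illustrative assumptions.

```python
from fractions import Fraction

def smoothed_estimate(counts, pseudocounts):
    """Posterior mean (n_i + a_i) / (n + sum(a)) under a Dirichlet(a) prior."""
    total = sum(counts) + sum(pseudocounts)
    return [(Fraction(c) + Fraction(a)) / total
            for c, a in zip(counts, pseudocounts)]

data = [7, 3]  # hypothetical binary data: 7 successes, 3 failures

# One pseudocount per outcome: the rule of succession, (s + 1) / (n + 2).
print(smoothed_estimate(data, [1, 1]))
# Pseudocounts summing to 1 (one half each): (s + 0.5) / (n + 1).
print(smoothed_estimate(data, [Fraction(1, 2), Fraction(1, 2)]))
# Very large pseudocounts: the estimate barely moves away from 1/2.
print(smoothed_estimate(data, [500, 500]))
```

With only ten observations the first two priors give similar answers, while the heavily weighted prior dominates the data, in the spirit of the roulette-wheel case described above.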
See also
- Additive smoothing
- Krichevsky–Trofimov estimator
- Principle of indifference