Kullback's inequality
In information theory and statistics, Kullback's inequality is a lower bound on the Kullback–Leibler divergence expressed in terms of the large deviations rate function. If $P$ and $Q$ are probability distributions on the real line, such that $P$ is absolutely continuous with respect to $Q$, i.e. $P \ll Q$, and whose first moments exist, then

$$D_{KL}(P \parallel Q) \ge \Psi_Q^*\!\left(\mu'_1(P)\right),$$

where $\Psi_Q^*$ is the rate function, i.e. the convex conjugate of the cumulant-generating function, of $Q$, and $\mu'_1(P)$ is the first moment of $P$.
The Cramér–Rao bound is a corollary of this result.
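As a concrete illustration (a sketch added here, not part of the original article), the inequality can be checked numerically for Gaussian distributions, where both the divergence and the rate function have closed forms; the parameter ranges below are arbitrary choices:

```python
import numpy as np

# Minimal numerical check of Kullback's inequality for Gaussians
# (an illustrative sketch, not from the article). Closed forms used:
#   D_KL(N(mu_p, s_p^2) || N(mu_q, s_q^2))
#       = log(s_q/s_p) + (s_p^2 + (mu_p - mu_q)^2) / (2 s_q^2) - 1/2
#   Psi_Q(theta) = mu_q*theta + s_q^2*theta^2/2   (cumulant-generating function)
#   Psi_Q^*(x)   = (x - mu_q)^2 / (2 s_q^2)       (its convex conjugate)

def kl_gauss(mu_p, s_p, mu_q, s_q):
    return np.log(s_q / s_p) + (s_p**2 + (mu_p - mu_q)**2) / (2 * s_q**2) - 0.5

def rate_gauss(x, mu_q, s_q):
    # Convex conjugate of Psi_Q, evaluated at the first moment of P.
    return (x - mu_q)**2 / (2 * s_q**2)

rng = np.random.default_rng(0)
for _ in range(1000):
    mu_p, mu_q = rng.normal(size=2)
    s_p, s_q = rng.uniform(0.1, 3.0, size=2)
    lhs = kl_gauss(mu_p, s_p, mu_q, s_q)
    rhs = rate_gauss(mu_p, mu_q, s_q)      # first moment of P is mu_p
    assert lhs >= rhs - 1e-12, (lhs, rhs)
print("inequality held in all sampled cases")
```

When $\sigma_P = \sigma_Q$ the bound is attained with equality, since exponentially tilting a Gaussian only shifts its mean.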
Proof
Let $P$ and $Q$ be probability distributions (measures) on the real line, whose first moments exist, and such that $P \ll Q$. Consider the natural exponential family of $Q$ given by

$$Q_\theta(A) = \frac{\int_A e^{\theta x}\,dQ(x)}{\int_{-\infty}^{\infty} e^{\theta x}\,dQ(x)} = \frac{1}{M_Q(\theta)} \int_A e^{\theta x}\,dQ(x)$$

for every measurable set $A$, where $M_Q$ is the moment-generating function of $Q$. (Note that $Q_0 = Q$.) Then

$$D_{KL}(P \parallel Q) = D_{KL}(P \parallel Q_\theta) + \int_{\operatorname{supp} P} \left( \log \frac{\mathrm{d}Q_\theta}{\mathrm{d}Q} \right) \mathrm{d}P.$$

By Gibbs' inequality we have $D_{KL}(P \parallel Q_\theta) \ge 0$, so that

$$D_{KL}(P \parallel Q) \ge \int_{\operatorname{supp} P} \left( \log \frac{\mathrm{d}Q_\theta}{\mathrm{d}Q} \right) \mathrm{d}P = \int_{\operatorname{supp} P} \log \left( \frac{e^{\theta x}}{M_Q(\theta)} \right) \mathrm{d}P(x).$$

Simplifying the right side, we have, for every real $\theta$ for which $M_Q(\theta) < \infty$,

$$D_{KL}(P \parallel Q) \ge \mu'_1(P)\,\theta - \Psi_Q(\theta),$$

where $\mu'_1(P)$ is the first moment, or mean, of $P$, and $\Psi_Q = \log M_Q$ is called the cumulant-generating function. Taking the supremum over $\theta$ completes the process of convex conjugation and yields the rate function:

$$D_{KL}(P \parallel Q) \ge \sup_{\theta} \left\{ \mu'_1(P)\,\theta - \Psi_Q(\theta) \right\} = \Psi_Q^*\!\left(\mu'_1(P)\right).$$
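The two displayed steps of the proof can be checked numerically on a finite sample space, where the Radon-Nikodym derivative reduces to a ratio of probability vectors. The following is a minimal sketch under that assumption; the support points, weights, and tilt parameter are arbitrary:

```python
import numpy as np

# Sketch of the proof's decomposition and Gibbs step in a discrete setting
# (assumed example, not from the article): here dQ_theta/dQ at a point x
# is simply e^{theta*x} / M_Q(theta).
rng = np.random.default_rng(1)
x = rng.normal(size=6)              # arbitrary support points
P = rng.dirichlet(np.ones(6))       # probability vector of P
Q = rng.dirichlet(np.ones(6))       # probability vector of Q

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

theta = 0.7                                  # arbitrary tilt parameter
M_Q = float(np.sum(Q * np.exp(theta * x)))   # moment-generating function at theta
Q_theta = Q * np.exp(theta * x) / M_Q        # member of the natural exponential family

# Decomposition: D_KL(P||Q) = D_KL(P||Q_theta) + E_P[theta*X - log M_Q(theta)].
assert abs(kl(P, Q) - (kl(P, Q_theta) + theta * np.sum(P * x) - np.log(M_Q))) < 1e-12

# Gibbs' inequality drops the first (nonnegative) term, giving, for this theta,
# D_KL(P||Q) >= mu_1'(P)*theta - Psi_Q(theta).
assert kl(P, Q) >= float(np.sum(P * x)) * theta - np.log(M_Q) - 1e-12
```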
Corollary: the Cramér–Rao bound
Start with Kullback's inequality
Let $X_\theta$ be a family of probability distributions on the real line indexed by the real parameter $\theta$, and satisfying certain regularity conditions. Then

$$\lim_{h \to 0} \frac{D_{KL}(X_{\theta+h} \parallel X_\theta)}{h^2} \ge \lim_{h \to 0} \frac{1}{h^2} \Psi_\theta^*\!\left(\mu_{\theta+h}\right),$$

where $\Psi_\theta^*$ is the convex conjugate of the cumulant-generating function of $X_\theta$ and $\mu_{\theta+h}$ is the first moment of $X_{\theta+h}$.
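As an illustrative sanity check (an example added here, not in the source), take the Gaussian location family $X_\theta = N(\theta, \sigma^2)$. Then $D_{KL}(X_{\theta+h} \parallel X_\theta) = h^2/(2\sigma^2)$, so the left side equals $1/(2\sigma^2)$. On the right, $\Psi_\theta(t) = \theta t + \sigma^2 t^2/2$ gives $\Psi_\theta^*(x) = (x-\theta)^2/(2\sigma^2)$ and $\mu_{\theta+h} = \theta + h$, so $\frac{1}{h^2}\Psi_\theta^*(\mu_{\theta+h}) = 1/(2\sigma^2)$ as well. Both sides equal half the Fisher information $\mathcal{I}_X(\theta) = 1/\sigma^2$, consistent with the two computations below.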
Left side
The left side of this inequality can be simplified as follows:

$$\begin{aligned}
\lim_{h \to 0} \frac{D_{KL}(X_{\theta+h} \parallel X_\theta)}{h^2}
&= \lim_{h \to 0} \frac{1}{h^2} \int \log \left( \frac{\mathrm{d}X_{\theta+h}}{\mathrm{d}X_\theta} \right) \mathrm{d}X_{\theta+h} \\
&= -\lim_{h \to 0} \frac{1}{h^2} \int \log \left( 1 - \left( 1 - \frac{\mathrm{d}X_\theta}{\mathrm{d}X_{\theta+h}} \right) \right) \mathrm{d}X_{\theta+h} \\
&= \lim_{h \to 0} \frac{1}{h^2} \int \left[ \left( 1 - \frac{\mathrm{d}X_\theta}{\mathrm{d}X_{\theta+h}} \right) + \frac{1}{2} \left( 1 - \frac{\mathrm{d}X_\theta}{\mathrm{d}X_{\theta+h}} \right)^2 + o\!\left( \left( 1 - \frac{\mathrm{d}X_\theta}{\mathrm{d}X_{\theta+h}} \right)^2 \right) \right] \mathrm{d}X_{\theta+h} \\
&= \lim_{h \to 0} \frac{1}{2h^2} \int \left( 1 - \frac{\mathrm{d}X_\theta}{\mathrm{d}X_{\theta+h}} \right)^2 \mathrm{d}X_{\theta+h} \\
&= \frac{1}{2} \mathcal{I}_X(\theta),
\end{aligned}$$

where we have expanded the logarithm in a Taylor series in $1 - \frac{\mathrm{d}X_\theta}{\mathrm{d}X_{\theta+h}}$. The first-order term integrates to zero, since both measures have total mass 1, and the remaining quadratic term is half the Fisher information of the parameter $\theta$.
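This limit can be seen numerically for a concrete family (an assumed example, not from the article): for $X_\theta$ exponential with rate $\theta$, the divergence has a closed form and $D_{KL}/h^2$ converges to half the Fisher information $1/(2\theta^2)$:

```python
import numpy as np

# Illustrative check of the left-side limit for X_theta ~ Exponential(rate=theta).
# Closed forms:
#   D_KL(Exp(theta+h) || Exp(theta)) = log((theta+h)/theta) + theta/(theta+h) - 1
#   Fisher information: I(theta) = 1/theta^2
theta = 1.7
for h in (1e-1, 1e-2, 1e-3):
    kl = np.log((theta + h) / theta) + theta / (theta + h) - 1
    print(h, kl / h**2)          # approaches I(theta)/2
print("limit:", 0.5 / theta**2)
```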
Right side
The right side of the inequality can be developed as follows:

$$\lim_{h \to 0} \frac{1}{h^2} \Psi_\theta^*\!\left(\mu_{\theta+h}\right) = \lim_{h \to 0} \frac{1}{h^2} \sup_t \left\{ \mu_{\theta+h}\,t - \Psi_\theta(t) \right\}.$$

This supremum is attained at a value of $t = \tau$ where the first derivative of the cumulant-generating function satisfies $\Psi_\theta'(\tau) = \mu_{\theta+h}$, but we have $\Psi_\theta'(0) = \mu_\theta$, so that

$$\Psi_\theta''(0) = \frac{\mathrm{d}\mu_\theta}{\mathrm{d}\theta} \lim_{h \to 0} \frac{h}{\tau}.$$

Moreover,

$$\lim_{h \to 0} \frac{1}{h^2} \Psi_\theta^*\!\left(\mu_{\theta+h}\right) = \frac{1}{2\,\Psi_\theta''(0)} \left( \frac{\mathrm{d}\mu_\theta}{\mathrm{d}\theta} \right)^2 = \frac{1}{2\,\operatorname{Var}(X_\theta)} \left( \frac{\mathrm{d}\mu_\theta}{\mathrm{d}\theta} \right)^2.$$
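A matching sketch for the right side, under the same assumed exponential family: here $\Psi_\theta(t) = \log(\theta/(\theta-t))$ for $t < \theta$, whose convex conjugate is $\Psi_\theta^*(x) = x\theta - 1 - \log(x\theta)$, and the scaled conjugate converges to the same limit $1/(2\theta^2)$:

```python
import numpy as np

# Illustrative check of the right-side limit for X_theta ~ Exponential(rate=theta)
# (assumed example family, not from the article). Closed forms:
#   Psi_theta^*(x) = x*theta - 1 - log(x*theta)
#   mu_{theta+h}   = 1 / (theta + h)
theta = 1.7
for h in (1e-1, 1e-2, 1e-3):
    x = 1 / (theta + h)                    # first moment of X_{theta+h}
    psi_star = x * theta - 1 - np.log(x * theta)
    print(h, psi_star / h**2)              # approaches 1/(2*theta^2) as well
print("limit:", 0.5 / theta**2)
```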
Putting both sides back together
We have

$$\frac{1}{2} \mathcal{I}_X(\theta) \ge \frac{1}{2\,\operatorname{Var}(X_\theta)} \left( \frac{\mathrm{d}\mu_\theta}{\mathrm{d}\theta} \right)^2,$$

which can be rearranged as

$$\operatorname{Var}(X_\theta) \ge \frac{\left( \frac{\mathrm{d}\mu_\theta}{\mathrm{d}\theta} \right)^2}{\mathcal{I}_X(\theta)}.$$
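The rearranged bound can be illustrated with a Monte Carlo sketch (a hypothetical example; the family, parameter value, and sample size are arbitrary choices). For the exponential family with rate $\theta$ the bound holds with equality, so the two sides should agree up to sampling error:

```python
import numpy as np

# Monte Carlo sketch of Var(X_theta) >= (d mu/d theta)^2 / I(theta).
# For X_theta ~ Exponential(rate=theta): Var = 1/theta^2, mu_theta = 1/theta,
# I(theta) = 1/theta^2, so the bound is attained with equality.
theta, h = 1.7, 1e-5
rng = np.random.default_rng(2)
samples = rng.exponential(scale=1/theta, size=200_000)

# Fisher information estimated as the variance of the score
# d/dtheta log f(x; theta) = 1/theta - x.
fisher_hat = (1/theta - samples).var()

# d mu_theta / d theta via a central difference of mu(theta) = 1/theta.
dmu = (1/(theta + h) - 1/(theta - h)) / (2*h)

var_hat = samples.var()
print(var_hat, dmu**2 / fisher_hat)           # close to each other (equality case)
assert var_hat >= 0.95 * dmu**2 / fisher_hat  # slack for sampling noise
```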
See also
- Kullback–Leibler divergence
- Cramér–Rao bound
- Fisher information
- Large deviations theory
- Convex conjugate
- Rate function
- Moment-generating function