Mutual information
In probability theory and information theory, the mutual information (sometimes known by the archaic term transinformation) of two random variables is a quantity that measures the mutual dependence of the two random variables. The most common unit of measurement of mutual information is the bit, when logarithms to the base 2 are used.

Definition of mutual information

Formally, the mutual information of two discrete random variables X and Y can be defined as:

I(X;Y) = \sum_{y \in Y} \sum_{x \in X} p(x,y) \log\!\left(\frac{p(x,y)}{p(x)\,p(y)}\right)

where p(x,y) is the joint probability distribution function of X and Y, and p(x) and p(y) are the marginal probability distribution functions of X and Y respectively.
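As a concrete illustration (not part of the original article), the following minimal Python sketch computes this sum directly from a joint probability table; the dictionary-based representation and the function name are illustrative choices, and base-2 logarithms are used so the result is in bits.

    from math import log2

    def mutual_information(joint):
        """Mutual information, in bits, of a joint distribution given as a
        dict mapping (x, y) pairs to probabilities p(x, y)."""
        px, py = {}, {}
        for (x, y), p in joint.items():      # marginals p(x) and p(y)
            px[x] = px.get(x, 0.0) + p
            py[y] = py.get(y, 0.0) + p
        # I(X;Y) = sum over (x, y) of p(x,y) * log2( p(x,y) / (p(x) p(y)) )
        return sum(p * log2(p / (px[x] * py[y]))
                   for (x, y), p in joint.items() if p > 0)

    # Illustrative joint table for a dependent pair of binary variables.
    joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
    print(mutual_information(joint))   # roughly 0.278 bits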

In the case of continuous random variables, the summation is replaced by a definite double integral:

I(X;Y) = \int_Y \int_X p(x,y) \log\!\left(\frac{p(x,y)}{p(x)\,p(y)}\right) \,dx\,dy

where p(x,y) is now the joint probability density function of X and Y, and p(x) and p(y) are the marginal probability density functions of X and Y respectively.

These definitions are ambiguous because the base of the log function is not specified. To disambiguate, the function I could be parameterized as I(X,Y,b) where b is the base. Alternatively, since the most common unit of measurement of mutual information is the bit, a base of 2 could be specified.

Intuitively, mutual information measures the information that X and Y share: it measures how much knowing one of these variables reduces uncertainty about the other. For example, if X and Y are independent, then knowing X does not give any information about Y and vice versa, so their mutual information is zero. At the other extreme, if X and Y are identical then all information conveyed by X is shared with Y: knowing X determines the value of Y and vice versa. As a result, in the case of identity the mutual information is the same as the uncertainty contained in Y (or X) alone, namely the entropy of Y (or X: clearly if X and Y are identical they have equal entropy).

Mutual information quantifies the dependence between the joint distribution of X and Y and what the joint distribution would be if X and Y were independent. Mutual information is a measure of dependence in the following sense: I(X;Y) = 0 if and only if X and Y are independent random variables. This is easy to see in one direction: if X and Y are independent, then p(x,y) = p(x) p(y), and therefore:

\log\!\left(\frac{p(x,y)}{p(x)\,p(y)}\right) = \log 1 = 0.
Moreover, mutual information is nonnegative (i.e. I(X;Y) ≥ 0; see below) and symmetric (i.e. I(X;Y) = I(Y;X)).

Relation to other quantities

Mutual information can be equivalently expressed as

I(X;Y) = H(X) - H(X|Y)
       = H(Y) - H(Y|X)
       = H(X) + H(Y) - H(X,Y)
       = H(X,Y) - H(X|Y) - H(Y|X)

where H(X) and H(Y) are the marginal entropies, H(X|Y) and H(Y|X) are the conditional entropies, and H(X,Y) is the joint entropy of X and Y. Since H(X) ≥ H(X|Y), this characterization is consistent with the nonnegativity property stated above.
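A short numerical check of these identities, reusing the same illustrative joint table as in the sketch above (the helper names are my own, not from the article):

    from math import log2

    def entropy(dist):
        """Shannon entropy, in bits, of a dict mapping outcomes to probabilities."""
        return -sum(p * log2(p) for p in dist.values() if p > 0)

    joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p

    hx, hy, hxy = entropy(px), entropy(py), entropy(joint)
    # H(X|Y) = H(X,Y) - H(Y) and H(Y|X) = H(X,Y) - H(X), so the identities
    # I = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y) all coincide.
    print(hx + hy - hxy)                        # same value as the direct sum
    print(hx - (hxy - hy), hy - (hxy - hx))     # H(X) - H(X|Y), H(Y) - H(Y|X)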

Intuitively, if entropy H(X) is regarded as a measure of uncertainty about a random variable, then H(X|Y) is a measure of what Y does not say about X. This is "the amount of uncertainty remaining about X after Y is known", and thus the right side of the first of these equalities can be read as "the amount of uncertainty in X, minus the amount of uncertainty in X which remains after Y is known", which is equivalent to "the amount of uncertainty in X which is removed by knowing Y". This corroborates the intuitive meaning of mutual information as the amount of information (that is, reduction in uncertainty) that knowing either variable provides about the other.

Note that in the discrete case H(X|X) = 0 and therefore H(X) = I(X;X). Thus I(X;X) ≥ I(X;Y), and one can formulate the basic principle that a variable contains at least as much information about itself as any other variable can provide.

Mutual information can also be expressed as a Kullback-Leibler divergence of the product p(x) × p(y) of the marginal distributions of the two random variables X and Y, from p(x,y), the random variables' joint distribution:

I(X;Y) = D_{\mathrm{KL}}\big(p(x,y) \,\|\, p(x)\,p(y)\big)
Furthermore, let p(x|y) = p(x, y) / p(y). Then

I(X;Y) = \sum_{y} p(y) \sum_{x} p(x|y) \log \frac{p(x|y)}{p(x)} = \sum_{y} p(y)\, D_{\mathrm{KL}}\big(p(x|y) \,\|\, p(x)\big) = \mathbb{E}_{Y}\big\{ D_{\mathrm{KL}}\big(p(x|y) \,\|\, p(x)\big) \big\}.
Thus mutual information can also be understood as the expectation of the Kullback-Leibler divergence of the univariate distribution p(x) of X from the conditional distribution p(x|y) of X given Y: the more different the distributions p(x|y) and p(x), the greater the information gain.
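This reformulation can be checked numerically; the sketch below (illustrative helper names, same toy joint table as before) averages D_KL(p(x|y) || p(x)) over y and recovers the same value as the direct definition.

    from math import log2

    def kl_divergence(p, q):
        """D_KL(p || q), in bits, for two dicts over the same outcomes."""
        return sum(pi * log2(pi / q[i]) for i, pi in p.items() if pi > 0)

    joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p

    # I(X;Y) as the expectation over Y of D_KL( p(x|y) || p(x) ).
    mi = 0.0
    for y, p_y in py.items():
        p_x_given_y = {x: joint.get((x, y), 0.0) / p_y for x in px}
        mi += p_y * kl_divergence(p_x_given_y, px)
    print(mi)   # again roughly 0.278 bits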

Variations of mutual information

Several variations on mutual information have been proposed to suit various needs. Among these are normalized variants and generalizations to more than two variables.

Metric

Many applications require a metric, that is, a distance measure between points. The quantity

d(X,Y) = H(X,Y) - I(X;Y) = H(X|Y) + H(Y|X)

satisfies the properties of a metric (triangle inequality, non-negativity, indiscernibility and symmetry). This distance metric is also known as the variation of information.

Since one has d(X,Y) ≤ H(X,Y), a natural normalized variant is

D(X,Y) = \frac{d(X,Y)}{H(X,Y)} \le 1.

The metric D is a universal metric, in that if any other distance measure places X and Y close by, then D will also judge them close.

A set-theoretic interpretation of information (see the figure for conditional entropy) shows that

D(X,Y) = 1 - \frac{I(X;Y)}{H(X,Y)}

which is effectively the Jaccard distance between X and Y.
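A minimal sketch of these distances for the same toy joint distribution (the variable names are illustrative):

    from math import log2

    def entropy(dist):
        return -sum(p * log2(p) for p in dist.values() if p > 0)

    joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p

    hx, hy, hxy = entropy(px), entropy(py), entropy(joint)
    mi = hx + hy - hxy          # I(X;Y)
    d = hxy - mi                # variation of information d(X,Y) = H(X|Y) + H(Y|X)
    D = d / hxy                 # normalized: D(X,Y) = 1 - I(X;Y) / H(X,Y)
    print(d, D)                 # D is 0 for identical variables and 1 for independent ones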

Conditional mutual information

Sometimes it is useful to express the mutual information of two random variables conditioned on a third:

I(X;Y|Z) = \mathbb{E}_Z\big[ I(X;Y) \,|\, Z \big] = \sum_{z \in Z} \sum_{y \in Y} \sum_{x \in X} p_Z(z)\, p_{X,Y|Z}(x,y|z) \log \frac{p_{X,Y|Z}(x,y|z)}{p_{X|Z}(x|z)\, p_{Y|Z}(y|z)},

which can be simplified as

I(X;Y|Z) = \sum_{z \in Z} \sum_{y \in Y} \sum_{x \in X} p_{X,Y,Z}(x,y,z) \log \frac{p_Z(z)\, p_{X,Y,Z}(x,y,z)}{p_{X,Z}(x,z)\, p_{Y,Z}(y,z)}.

Conditioning on a third random variable may either increase or decrease the mutual information, but it is always true that

I(X;Y|Z) \ge 0

for discrete, jointly distributed random variables X, Y, Z. This result has been used as a basic building block for proving other inequalities in information theory.
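A small sketch of the simplified form above, assuming the three-variable joint distribution is given as a dictionary mapping (x, y, z) to p(x, y, z) (an illustrative representation, not one prescribed by the article):

    from math import log2

    def conditional_mutual_information(joint_xyz):
        """I(X;Y|Z), in bits, from a dict mapping (x, y, z) to p(x, y, z)."""
        pz, pxz, pyz = {}, {}, {}
        for (x, y, z), p in joint_xyz.items():
            pz[z] = pz.get(z, 0.0) + p
            pxz[(x, z)] = pxz.get((x, z), 0.0) + p
            pyz[(y, z)] = pyz.get((y, z), 0.0) + p
        # I(X;Y|Z) = sum p(x,y,z) log[ p(z) p(x,y,z) / (p(x,z) p(y,z)) ]
        return sum(p * log2(pz[z] * p / (pxz[(x, z)] * pyz[(y, z)]))
                   for (x, y, z), p in joint_xyz.items() if p > 0)

    # Example: X and Y independent fair bits, Z = X XOR Y.
    xor = {(x, y, x ^ y): 0.25 for x in (0, 1) for y in (0, 1)}
    print(conditional_mutual_information(xor))   # 1.0 bit, although I(X;Y) = 0

The XOR example also illustrates that conditioning can increase mutual information: X and Y are independent, yet become fully dependent once Z is known.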

Multivariate mutual information

Several generalizations of mutual information to more than two random variables have been proposed, such as total correlation and interaction information. If Shannon entropy is viewed as a signed measure in the context of information diagrams, as explained in the article Information theory and measure theory, then the only definition of multivariate mutual information that makes sense is as follows:

I(X_1) = H(X_1)

and for n > 1,

I(X_1;\,\ldots;\,X_n) = I(X_1;\,\ldots;\,X_{n-1}) - I(X_1;\,\ldots;\,X_{n-1}\,|\,X_n),

where (as above) we define

I(X_1;\,\ldots;\,X_{n-1}\,|\,X_n) = \mathbb{E}_{X_n}\big( I(X_1;\,\ldots;\,X_{n-1}) \,|\, X_n \big).

(This definition of multivariate mutual information is identical to that of interaction information except for a change in sign when the number of random variables is odd.)

Applications

Applying information diagrams blindly to derive the above definition has been criticised, and indeed it has found rather limited practical application, since it is difficult to visualize or grasp the significance of this quantity for a large number of random variables. It can be zero, positive, or negative for any odd number of variables n ≥ 3.
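For example, under the recursive definition above, I(X;Y;Z) = I(X;Y) - I(X;Y|Z); the following sketch (a toy XOR construction, in the same style as the earlier illustrative helpers) shows the three-variable quantity coming out negative.

    from math import log2

    def mi(joint_xy):
        px, py = {}, {}
        for (x, y), p in joint_xy.items():
            px[x] = px.get(x, 0.0) + p
            py[y] = py.get(y, 0.0) + p
        return sum(p * log2(p / (px[x] * py[y]))
                   for (x, y), p in joint_xy.items() if p > 0)

    def cmi(joint_xyz):
        pz, pxz, pyz = {}, {}, {}
        for (x, y, z), p in joint_xyz.items():
            pz[z] = pz.get(z, 0.0) + p
            pxz[(x, z)] = pxz.get((x, z), 0.0) + p
            pyz[(y, z)] = pyz.get((y, z), 0.0) + p
        return sum(p * log2(pz[z] * p / (pxz[(x, z)] * pyz[(y, z)]))
                   for (x, y, z), p in joint_xyz.items() if p > 0)

    # X and Y independent fair bits, Z = X XOR Y.
    xyz = {(x, y, x ^ y): 0.25 for x in (0, 1) for y in (0, 1)}
    xy = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
    print(mi(xy) - cmi(xyz))   # I(X;Y;Z) = 0 - 1 = -1 bit for this example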

One high-dimensional generalization scheme which maximizes the mutual information between the joint distribution and other target variables is found to be useful in feature selection.

Normalized variants

Normalized variants of the mutual information are provided by the coefficients of constraint (Coombs, Dawes and Tversky 1970) or uncertainty coefficient (Press & Flannery 1988):

C_{XY} = \frac{I(X;Y)}{H(Y)}, \qquad C_{YX} = \frac{I(X;Y)}{H(X)}.
The two coefficients are not necessarily equal. A more useful and symmetric scaled information measure is the redundancy

R = \frac{I(X;Y)}{H(X) + H(Y)}

which attains a minimum of zero when the variables are independent and a maximum value of

R_{\max} = \frac{\min\big(H(X),\,H(Y)\big)}{H(X) + H(Y)}
when one variable becomes completely redundant with the knowledge of the other. See also Redundancy (information theory). Another symmetrical measure is the symmetric uncertainty (Witten & Frank 2005), given by

U(X,Y) = 2R = 2\,\frac{I(X;Y)}{H(X) + H(Y)}

which represents a weighted average of the two uncertainty coefficients (Press & Flannery 1988).

If we consider mutual information as a special case of the total correlation or dual total correlation, the normalized versions are, respectively,

\frac{I(X;Y)}{\min\big(H(X),\,H(Y)\big)} \qquad \text{and} \qquad \frac{I(X;Y)}{H(X,Y)}.

Other normalized versions are provided by the following expressions (Yao 2003, Strehl & Ghosh 2002):

\frac{I(X;Y)}{\min\big(H(X),\,H(Y)\big)}, \qquad \frac{I(X;Y)}{\sqrt{H(X)\,H(Y)}}.
The quantity

D'(X,Y) = 1 - \frac{I(X;Y)}{\max\big(H(X),\,H(Y)\big)}

is a metric, i.e. it satisfies the triangle inequality, etc. The metric D' is also a universal metric.
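The sketch below computes several of these normalized quantities for the same toy joint table; the variable names are my own labels rather than notation from the cited sources.

    from math import log2

    def entropy(dist):
        return -sum(p * log2(p) for p in dist.values() if p > 0)

    joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p

    hx, hy, hxy = entropy(px), entropy(py), entropy(joint)
    i = hx + hy - hxy                          # I(X;Y)

    c_xy, c_yx = i / hy, i / hx                # uncertainty coefficients
    redundancy = i / (hx + hy)                 # R, zero iff X and Y are independent
    symmetric_uncertainty = 2 * i / (hx + hy)  # U(X,Y) = 2R
    d_prime = 1 - i / max(hx, hy)              # the metric D'(X,Y)
    print(c_xy, c_yx, redundancy, symmetric_uncertainty, d_prime)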

Weighted variants

In the traditional formulation of the mutual information,

I(X;Y) = \sum_{y \in Y} \sum_{x \in X} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)},

each event or object specified by (x, y) is weighted by the corresponding probability p(x, y). This assumes that all objects or events are equivalent apart from their probability of occurrence. However, in some applications it may be the case that certain objects or events are more significant than others, or that certain patterns of association are more semantically important than others.

For example, the deterministic mapping {(1,1), (2,2), (3,3)} may be viewed as stronger than the deterministic mapping {(1,3), (2,1), (3,2)}, although these relationships would yield the same mutual information. This is because the mutual information is not sensitive at all to any inherent ordering in the variable values (Cronbach 1954, Coombs & Dawes 1970, Lockhead 1970), and is therefore not sensitive at all to the form of the relational mapping between the associated variables. If it is desired that the former relation, showing agreement on all variable values, be judged stronger than the latter relation, then it is possible to use the following weighted mutual information (Guiasu 1977)

I(X;Y) = \sum_{y \in Y} \sum_{x \in X} w(x,y)\, p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)},

which places a weight w(x, y) on the probability of each variable value co-occurrence, p(x, y). This allows that certain probabilities may carry more or less significance than others, thereby allowing the quantification of relevant holistic or prägnanz factors. In the above example, using larger relative weights for p(1,1), p(2,2), and p(3,3) would have the effect of assessing greater informativeness for the relation {(1,1), (2,2), (3,3)} than for the relation {(1,3), (2,1), (3,2)}, which may be desirable in some cases of pattern recognition, and the like. There has been little mathematical work done on the weighted mutual information and its properties, however.
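A minimal sketch of this weighted sum, with an arbitrary illustrative weight table that favours agreement on variable values:

    from math import log2

    def weighted_mutual_information(joint, w):
        """Weighted MI: sum of w(x,y) p(x,y) log2[ p(x,y) / (p(x) p(y)) ]."""
        px, py = {}, {}
        for (x, y), p in joint.items():
            px[x] = px.get(x, 0.0) + p
            py[y] = py.get(y, 0.0) + p
        return sum(w.get((x, y), 1.0) * p * log2(p / (px[x] * py[y]))
                   for (x, y), p in joint.items() if p > 0)

    # Two deterministic mappings of a uniform X over {1, 2, 3}.
    identity = {(1, 1): 1/3, (2, 2): 1/3, (3, 3): 1/3}
    permuted = {(1, 3): 1/3, (2, 1): 1/3, (3, 2): 1/3}
    weights = {(1, 1): 2.0, (2, 2): 2.0, (3, 3): 2.0}   # favour agreement on values
    print(weighted_mutual_information(identity, weights))   # about 3.17, inflated by the weights
    print(weighted_mutual_information(permuted, weights))   # about 1.58 = log2(3), unchanged

With these weights the agreement-preserving mapping scores higher, while the permuted mapping keeps its ordinary mutual information of log2(3) bits.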

Absolute mutual information

Using the ideas of Kolmogorov complexity, one can consider the mutual information of two sequences independent of any probability distribution:

I_K(X;Y) = K(X) - K(X|Y).

To establish that this quantity is symmetric up to a logarithmic factor (I_K(X;Y) \approx I_K(Y;X)) requires the chain rule for Kolmogorov complexity.
Approximations of this quantity via compression can be used to define a distance measure to perform a hierarchical clustering of sequences without having any domain knowledge of the sequences.
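One common compression-based proxy is the normalized compression distance; the sketch below uses Python's zlib as a rough stand-in for the stronger compressors typically used in practice, and the formula shown is the usual NCD rather than anything specific to this article.

    import zlib

    def c(data: bytes) -> int:
        """Compressed length, used as a rough stand-in for Kolmogorov complexity K."""
        return len(zlib.compress(data, 9))

    def ncd(x: bytes, y: bytes) -> float:
        """Normalized compression distance between two byte sequences."""
        cx, cy, cxy = c(x), c(y), c(x + y)
        return (cxy - min(cx, cy)) / max(cx, cy)

    s1 = b"the quick brown fox jumps over the lazy dog " * 20
    s2 = b"the quick brown fox jumps over the lazy cat " * 20
    s3 = bytes(range(256)) * 4
    print(ncd(s1, s2))   # small: the two texts share almost all their structure
    print(ncd(s1, s3))   # larger: little structure is shared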

Applications of mutual information

In many applications, one wants to maximize mutual information (thus increasing dependencies), which is often equivalent to minimizing conditional entropy. Examples include:
  • In telecommunications, the channel capacity is equal to the mutual information, maximized over all input distributions.
  • Discriminative training procedures for hidden Markov models have been proposed based on the maximum mutual information (MMI) criterion.
  • RNA secondary structure prediction from a multiple sequence alignment.
  • Phylogenetic profiling prediction from the pairwise presence and absence of functionally linked genes.
  • Mutual information has been used as a criterion for feature selection and feature transformations in machine learning. It can be used to characterize both the relevance and redundancy of variables, such as the minimum redundancy feature selection (a small ranking sketch follows this list).
  • Mutual information is often used as a significance function for the computation of collocations in corpus linguistics.
  • Mutual information is used in medical imaging for image registration. Given a reference image (for example, a brain scan), and a second image which needs to be put into the same coordinate system as the reference image, this image is deformed until the mutual information between it and the reference image is maximized.
  • Detection of phase synchronization in time series analysis.
  • In the infomax method for neural-net and other machine learning, including the infomax-based independent component analysis algorithm.
  • Average mutual information in the delay embedding theorem is used for determining the embedding delay parameter.
  • Mutual information between genes in expression microarray data is used by the ARACNE algorithm for reconstruction of gene networks.
  • In statistical mechanics, Loschmidt's paradox may be expressed in terms of mutual information. Loschmidt noted that it must be impossible to determine a physical law which lacks time reversal symmetry (e.g. the second law of thermodynamics) only from physical laws which have this symmetry. He pointed out that the H-theorem of Boltzmann made the assumption that the velocities of particles in a gas were permanently uncorrelated, which removed the time symmetry inherent in the H-theorem. It can be shown that if a system is described by a probability density in phase space, then Liouville's theorem implies that the joint information (negative of the joint entropy) of the distribution remains constant in time. The joint information is equal to the mutual information plus the sum of all the marginal informations (negative of the marginal entropies) for each particle coordinate. Boltzmann's assumption amounts to ignoring the mutual information in the calculation of entropy, which yields the thermodynamic entropy (divided by Boltzmann's constant).

  • The mutual information is used to learn the structure of Bayesian networks/dynamic Bayesian networks, which explain the causal relationship between random variables, as exemplified by the GlobalMIT toolkit (http://code.google.com/p/globalmit/): learning the globally optimal dynamic Bayesian network with the Mutual Information Test criterion.
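As an illustration of the feature-selection use mentioned in the list above, this sketch (synthetic data, plain Python; a toy stand-in for methods such as minimum redundancy feature selection) ranks candidate discrete features by their empirical mutual information with a class label:

    from math import log2
    from collections import Counter
    import random

    def mi(xs, ys):
        """Empirical mutual information, in bits, between two discrete sequences."""
        n = len(xs)
        pxy = Counter(zip(xs, ys))
        px, py = Counter(xs), Counter(ys)
        return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
                   for (x, y), c in pxy.items())

    random.seed(0)
    labels = [random.randint(0, 1) for _ in range(5000)]
    features = {
        "copy_of_label": labels[:],                                          # fully informative
        "noisy_label": [l if random.random() < 0.8 else 1 - l for l in labels],
        "pure_noise": [random.randint(0, 1) for _ in labels],
    }
    ranking = sorted(features, key=lambda f: mi(features[f], labels), reverse=True)
    print(ranking)   # expected order: copy_of_label, noisy_label, pure_noise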