Covariance matrix - AbsoluteAstronomy.com

In probability theory and statistics, a covariance matrix (also known as a dispersion matrix) is a matrix whose element in the i, j position is the covariance between the i-th and j-th elements of a random vector (that is, of a vector of random variables). Each element of the vector is a scalar random variable, either with a finite number of observed empirical values or with a finite or infinite number of potential values specified by a theoretical joint probability distribution of all the random variables.

Intuitively, the covariance matrix generalizes the notion of variance to multiple dimensions. As an example, the variation in a collection of random points in two-dimensional space cannot be characterized fully by a single number, nor would the variances in the x and y directions contain all of the necessary information; a 2×2 matrix would be necessary to fully characterize the two-dimensional variation. Analogous to the fact that it is necessary to build a Hessian matrix to fully describe the concavity of a multivariate function, a covariance matrix is necessary to fully describe the variation in a distribution.

Definition

Throughout this article, boldfaced unsubscripted X and Y are used to refer to random vectors, and unboldfaced subscripted X_i and Y_i are used to refer to random scalars. If the entries in the column vector

\mathbf{X} = \begin{bmatrix}X_1 \\  \vdots \\ X_n \end{bmatrix}

are random variables, each with finite variance, then the covariance matrix Σ is the matrix whose (i, j) entry is the covariance

\Sigma_{ij} = \mathrm{cov}(X_i, X_j) = \mathrm{E}\begin{bmatrix} (X_i - \mu_i)(X_j - \mu_j) \end{bmatrix}

where

\mu_i = \mathrm{E}(X_i)

is the expected value of the i-th entry in the vector X. In other words, we have

\Sigma = \begin{bmatrix} \mathrm{E}[(X_1 - \mu_1)(X_1 - \mu_1)] & \mathrm{E}[(X_1 - \mu_1)(X_2 - \mu_2)] & \cdots & \mathrm{E}[(X_1 - \mu_1)(X_n - \mu_n)] \\ \mathrm{E}[(X_2 - \mu_2)(X_1 - \mu_1)] & \mathrm{E}[(X_2 - \mu_2)(X_2 - \mu_2)] & \cdots & \mathrm{E}[(X_2 - \mu_2)(X_n - \mu_n)] \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{E}[(X_n - \mu_n)(X_1 - \mu_1)] & \mathrm{E}[(X_n - \mu_n)(X_2 - \mu_2)] & \cdots & \mathrm{E}[(X_n - \mu_n)(X_n - \mu_n)] \end{bmatrix}.

The inverse of this matrix, $\Sigma^{-1}$, is the inverse covariance matrix, also known as the concentration matrix or precision matrix. The elements of the precision matrix have an interpretation in terms of partial correlations and partial variances.
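The entrywise definition can be checked numerically. The following is a minimal sketch, assuming NumPy and simulated data; the mixing matrix and sample size are arbitrary choices for illustration, and expectations are replaced by sample averages.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 10_000 draws of a 3-dimensional random vector,
# stored one observation per row. The mixing matrix is arbitrary.
mixing = np.array([[1.0, 0.0, 0.0],
                   [0.5, 1.0, 0.0],
                   [0.2, 0.3, 1.0]])
samples = rng.standard_normal((10_000, 3)) @ mixing.T

# Entry (i, j) is the sample average of (X_i - mu_i)(X_j - mu_j).
mu = samples.mean(axis=0)
centered = samples - mu
sigma = centered.T @ centered / len(samples)

# The same matrix computed as E[X X^T] - mu mu^T.
sigma_alt = samples.T @ samples / len(samples) - np.outer(mu, mu)

# Precision (concentration) matrix: the inverse of sigma.
precision = np.linalg.inv(sigma)
```

Here `sigma` and `sigma_alt` agree up to round-off, and `sigma` matches `np.cov(samples, rowvar=False, bias=True)`.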

Generalization of the variance

The definition above is equivalent to the matrix equality

\Sigma=\mathrm{E}
\left[
 \left(
 \textbf{X} - \mathrm{E}[\textbf{X}]
 \right)
 \left(
 \textbf{X} - \mathrm{E}[\textbf{X}]
 \right)^\top
\right]

This form can be seen as a generalization of the scalar-valued variance to higher dimensions. Recall that for a scalar-valued random variable X

\sigma^2 = \mathrm{var}(X) = \mathrm{E}[(X - \mathrm{E}(X))^2]

Conflicting nomenclatures and notations

Nomenclatures differ. Some statisticians, following the probabilist William Feller, call this matrix the variance of the random vector X, because it is the natural generalization to higher dimensions of the 1-dimensional variance. Others call it the covariance matrix, because it is the matrix of covariances between the scalar components of the vector X. Thus

\operatorname{var}(\textbf{X}) = \mathrm{E} \left[ (\textbf{X} - \mathrm{E} [\textbf{X}]) (\textbf{X} - \mathrm{E} [\textbf{X}])^\top \right].

However, the notation for the cross-covariance between two vectors is standard:

\operatorname{cov}(\textbf{X},\textbf{Y}) = \mathrm{E} \left[ (\textbf{X} - \mathrm{E}[\textbf{X}]) (\textbf{Y} - \mathrm{E}[\textbf{Y}])^\top \right].

The var notation is found in William Feller's two-volume book An Introduction to Probability Theory and Its Applications, but both forms are quite standard and there is no ambiguity between them. The matrix $\Sigma$ is also often called the variance-covariance matrix since the diagonal terms are in fact variances.

Properties

For $\Sigma=\mathrm{E} \left[ \left( \textbf{X} - \mathrm{E}[\textbf{X}] \right) \left( \textbf{X} - \mathrm{E}[\textbf{X}] \right)^\top \right]$ and $\mu = \mathrm{E}(\textbf{X})$, where X is a p-dimensional random vector and Y a q-dimensional random vector, the following basic properties apply:

$\Sigma = \mathrm{E}(\mathbf{X X^\top}) - \mathbf{\mu}\mathbf{\mu^\top}$
$\Sigma \,$ is positive-semidefinite and symmetric.
$\operatorname{cov}(\mathbf{A X} + \mathbf{a}) = \mathbf{A}\, \operatorname{cov}(\mathbf{X})\, \mathbf{A^\top}$
$\operatorname{cov}(\mathbf{X},\mathbf{Y}) = \operatorname{cov}(\mathbf{Y},\mathbf{X})^\top$
$\operatorname{cov}(\mathbf{X}_1 + \mathbf{X}_2,\mathbf{Y}) = \operatorname{cov}(\mathbf{X}_1,\mathbf{Y}) + \operatorname{cov}(\mathbf{X}_2, \mathbf{Y})$
If p = q, then $\operatorname{var}(\mathbf{X} + \mathbf{Y}) = \operatorname{var}(\mathbf{X}) + \operatorname{cov}(\mathbf{X},\mathbf{Y}) + \operatorname{cov}(\mathbf{Y}, \mathbf{X}) + \operatorname{var}(\mathbf{Y})$
$\operatorname{cov}(\mathbf{AX}, \mathbf{B}^\top\mathbf{Y}) = \mathbf{A}\, \operatorname{cov}(\mathbf{X}, \mathbf{Y}) \,\mathbf{B}$
If $\mathbf{X}$ and $\mathbf{Y}$ are independent, then $\operatorname{cov}(\mathbf{X}, \mathbf{Y}) = 0$

where $\mathbf{X}$, $\mathbf{X}_1$ and $\mathbf{X}_2$ are random p×1 vectors, $\mathbf{Y}$ is a random q×1 vector, $\mathbf{a}$ is a q×1 vector, and $\mathbf{A}$ and $\mathbf{B}$ are q×p matrices.

This covariance matrix is a useful tool in many different areas. From it a transformation matrix can be derived that allows one to completely decorrelate the data or, from a different point of view, to find an optimal basis for representing the data in a compact way (see Rayleigh quotient for a formal proof and additional properties of covariance matrices). This is called principal components analysis (PCA), also known as the Karhunen-Loève transform (KL-transform).
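The decorrelating transformation can be sketched with an eigendecomposition of the sample covariance matrix. This is only an illustration assuming NumPy and simulated Gaussian data; `true_cov` and the sample size are hypothetical values, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical correlated 2-D data, one observation per row.
true_cov = np.array([[4.0, 1.5],
                     [1.5, 1.0]])
data = rng.multivariate_normal(mean=[0.0, 0.0], cov=true_cov, size=5_000)

sigma = np.cov(data, rowvar=False)

# Eigendecomposition of the symmetric matrix sigma; eigh returns
# eigenvalues in ascending order, so reorder to put the largest first.
eigvals, eigvecs = np.linalg.eigh(sigma)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project the centered data onto the eigenvectors (principal axes).
scores = (data - data.mean(axis=0)) @ eigvecs

# By cov(AX + a) = A cov(X) A^T with A = eigvecs^T, the covariance of
# the scores is diagonal, with the eigenvalues on the diagonal.
scores_cov = np.cov(scores, rowvar=False)
```

The off-diagonal entries of `scores_cov` vanish up to round-off, which is the sense in which the transformed variables are decorrelated.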

As a linear operator

Applied to one vector, the covariance matrix maps a linear combination, c, of the random variables, X, onto a vector of covariances with those variables:

\mathbf c^\top\Sigma = \operatorname{cov}(\mathbf c^\top\mathbf X,\mathbf X)

Treated as a bilinear form, it yields the covariance between two linear combinations:

\mathbf d^\top\Sigma\mathbf c=\operatorname{cov}(\mathbf d^\top\mathbf X,\mathbf c^\top\mathbf X)

The variance of a linear combination is then $\mathbf c^\top\Sigma\mathbf c$, its covariance with itself. Similarly, the (pseudo-)inverse covariance matrix provides an inner product,

\langle c-\mu|\Sigma^+|c-\mu\rangle

which induces the Mahalanobis distance, a measure of the "unlikelihood" of c.
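These identities translate directly into matrix-vector products. A small sketch, assuming NumPy; the matrix `sigma`, the mean `mu`, the weight vectors `c` and `d`, and the test `point` are all made-up values for illustration.

```python
import numpy as np

# Hypothetical covariance matrix and mean of a 3-dimensional vector X.
sigma = np.array([[2.0, 0.6, 0.0],
                  [0.6, 1.0, 0.3],
                  [0.0, 0.3, 0.5]])
mu = np.array([1.0, 0.0, -1.0])

c = np.array([1.0, -1.0, 2.0])   # weights of one linear combination
d = np.array([0.5, 0.5, 0.0])    # weights of another

cov_with_components = c @ sigma      # cov(c^T X, X_k) for each k
cov_between = d @ sigma @ c          # cov(d^T X, c^T X)
var_of_combination = c @ sigma @ c   # var(c^T X)

# Squared Mahalanobis distance of a point from the mean, using the
# pseudo-inverse so that a singular sigma would also be handled.
point = np.array([2.0, 1.0, -1.0])
diff = point - mu
mahalanobis_sq = diff @ np.linalg.pinv(sigma) @ diff
mahalanobis = np.sqrt(mahalanobis_sq)
```

Using `np.linalg.pinv` rather than `np.linalg.inv` mirrors the "(pseudo-)inverse" wording; for a nonsingular `sigma` the two give the same result.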

Which matrices are covariance matrices?

From the identity just above (let $\mathbf{b}$ be a p×1 real-valued vector)

\operatorname{var}(\mathbf{b}^\top\mathbf{X}) = \mathbf{b}^\top \operatorname{var}(\mathbf{X}) \mathbf{b},

the fact that the variance of any real-valued random variable is nonnegative, and the symmetry of the covariance matrix's definition, it follows that only a symmetric positive-semidefinite matrix can be a covariance matrix. The answer to the converse question, whether every symmetric positive-semidefinite matrix is a covariance matrix, is "yes." To see this, suppose M is a p×p symmetric positive-semidefinite matrix. From the finite-dimensional case of the spectral theorem, it follows that M has a nonnegative symmetric square root, which we call $M^{1/2}$. Let $\mathbf{X}$ be any p×1 column vector-valued random variable whose covariance matrix is the p×p identity matrix. Then

\operatorname{var}(M^{1/2}\mathbf{X}) = M^{1/2} (\operatorname{var}(\mathbf{X})) M^{1/2} = M.
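The construction in this argument can be tried out numerically. The sketch below assumes NumPy, picks an arbitrary symmetric positive-semidefinite matrix `M`, and uses standard normal draws as one convenient choice of a vector with identity covariance; any other such vector would serve.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical symmetric positive-semidefinite target matrix M.
M = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])

# Symmetric square root from the spectral decomposition M = V diag(w) V^T.
w, V = np.linalg.eigh(M)
w = np.clip(w, 0.0, None)          # guard against tiny negative round-off
M_half = (V * np.sqrt(w)) @ V.T

# X has identity covariance; M_half @ X then has covariance M.
X = rng.standard_normal((3, 200_000))   # one column per draw
Y = M_half @ X
sample_cov = np.cov(Y)                  # rows are variables by default
```

With this many draws, `sample_cov` should be close to `M`, up to sampling error.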

How to find a valid covariance matrix

In some applications (e.g. building data models from only partially observed data) one wants to find the “nearest” covariance matrix to a given symmetric matrix (e.g. of observed covariances). In 2002, Higham formalized the notion of nearness using a weighted Frobenius norm and provided a method for computing the nearest covariance matrix.
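As a rough illustration of the idea, the sketch below repairs an indefinite symmetric matrix by clipping its negative eigenvalues to zero. This is a simpler stand-in, not the method described above: it is the projection onto the positive-semidefinite matrices in the plain (unweighted) Frobenius norm, with no weights and no further constraints. It assumes NumPy; the function name `nearest_psd` and the matrix `observed` are hypothetical.

```python
import numpy as np

def nearest_psd(a: np.ndarray) -> np.ndarray:
    """Project a symmetric matrix onto the PSD cone (unweighted Frobenius norm)."""
    sym = (a + a.T) / 2.0              # remove any asymmetry from round-off
    w, V = np.linalg.eigh(sym)
    w_clipped = np.clip(w, 0.0, None)  # zero out negative eigenvalues
    return (V * w_clipped) @ V.T

# Hypothetical "observed" matrix: symmetric but indefinite.
observed = np.array([[1.0, 0.9, -0.9],
                     [0.9, 1.0, 0.9],
                     [-0.9, 0.9, 1.0]])
repaired = nearest_psd(observed)
```

Note that clipping changes the diagonal as well, so if a unit diagonal must be preserved a more elaborate procedure is needed.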

Complex random vectors

The variance of a complex scalar-valued random variable z with expected value μ is conventionally defined using complex conjugation:

\operatorname{var}(z) = \operatorname{E} \left[ (z-\mu)(z-\mu)^{*} \right]

where the complex conjugate of a complex number $z$ is denoted $z^{*}$; thus the variance of a complex random variable is a real number. If $Z$ is a column vector of complex-valued random variables, then we take the conjugate transpose by both transposing and conjugating, getting a square matrix:

\operatorname{E} \left[ (Z-\mu)(Z-\mu)^{H} \right]

where $Z^{H}$ denotes the conjugate transpose, which is applicable to the scalar case since the transpose of a scalar is still a scalar. The matrix so obtained will be Hermitian positive-semidefinite, with real numbers in the main diagonal and complex numbers off-diagonal.
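The Hermitian structure can be seen in a small simulation. This is a sketch assuming NumPy; the mixing matrix `mix` and the sample size are arbitrary, and the expectation is replaced by a sample average.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical complex data: 2 components, 50_000 draws (one per column).
n = 50_000
base = rng.standard_normal((2, n)) + 1j * rng.standard_normal((2, n))
mix = np.array([[1.0, 0.0],
                [0.5 + 0.5j, 1.0]])
Z = mix @ base

mu = Z.mean(axis=1, keepdims=True)
centered = Z - mu

# E[(Z - mu)(Z - mu)^H]: conjugate the second factor before transposing.
sigma = centered @ centered.conj().T / n
```

The result equals its own conjugate transpose, has a real diagonal, and agrees with `np.cov(Z, bias=True)`, which applies the same conjugation.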

Estimation

The derivation of the maximum-likelihood estimator of the covariance matrix of a multivariate normal distribution is perhaps surprisingly subtle. See estimation of covariance matrices.
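For a multivariate normal sample with unknown mean, the maximum-likelihood estimate divides the centered scatter matrix by the sample size n, whereas the unbiased sample covariance divides by n - 1. A brief sketch assuming NumPy and a made-up sample:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical sample: 50 observations of a 3-dimensional vector (rows).
x = rng.standard_normal((50, 3))
n = x.shape[0]
centered = x - x.mean(axis=0)
scatter = centered.T @ centered

sigma_mle = scatter / n             # maximum-likelihood estimate (normal model)
sigma_unbiased = scatter / (n - 1)  # unbiased sample covariance

# np.cov reproduces both: bias=True divides by n, the default by n - 1.
sigma_mle_np = np.cov(x, rowvar=False, bias=True)
sigma_unbiased_np = np.cov(x, rowvar=False)
```

The two estimates differ only by the factor n / (n - 1), which matters mainly for small samples.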

Probability density function

If a vector of n possibly correlated random variables is jointly normally distributed, or more generally elliptically distributed, then its probability density function can be expressed in terms of the covariance matrix.
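For the jointly normal case, the density takes the following standard form. It is written here under the assumption that Σ is nonsingular (positive-definite), with $\mathbf{x}$ denoting a point in n-dimensional space and $\mu$ the mean vector:

```latex
f(\mathbf{x}) = (2\pi)^{-n/2} \, \det(\Sigma)^{-1/2}
\exp\left( -\tfrac{1}{2} (\mathbf{x} - \mu)^\top \Sigma^{-1} (\mathbf{x} - \mu) \right)
```

The exponent is minus one half of the squared Mahalanobis distance of $\mathbf{x}$ from $\mu$, so the contours of equal density are ellipsoids whose shape is determined by Σ.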
