
Neighbourhood components analysis
Neighbourhood components analysis is a supervised learning method for classifying multivariate data into distinct classes according to a given distance metric over the data. Functionally, it serves the same purposes as the k-nearest neighbours algorithm, and makes direct use of a related concept termed stochastic nearest neighbours.

Definition

Neighbourhood components analysis aims at "learning" a distance metric by finding a linear transformation of input data such that the average leave-one-out (LOO) classification performance is maximized in the transformed space. The key insight of the algorithm is that a matrix $A$ corresponding to the transformation can be found by defining a differentiable objective function for $A$, followed by use of an iterative solver such as conjugate gradient descent. One of the benefits of this algorithm is that the number of classes $k$ can be determined as a function of $A$, up to a scalar constant. This use of the algorithm therefore addresses the issue of model selection.

Explanation

In order to define $A$, we define an objective function describing classification accuracy in the transformed space and try to determine $A^*$ such that this objective function is maximized:

$$A^* = \operatorname*{argmax}_A f(A).$$

Leave-one-out (LOO) classification

Consider predicting the class label of a single data point by consensus of its $k$ nearest neighbours with a given distance metric. This is known as leave-one-out classification. However, the set of nearest neighbours $C_i$ can be quite different after passing all the points through a linear transformation. Specifically, the set of neighbours for a point can undergo discrete changes in response to smooth changes in the elements of $A$, implying that any objective function $f(\cdot)$ based on the neighbours of a point will be piecewise-constant, and hence not differentiable.

Solution

We can resolve this difficulty by using an approach inspired by stochastic gradient descent. Rather than considering the $k$ nearest neighbours at each transformed point in LOO-classification, we'll consider the entire transformed data set as stochastic nearest neighbours. We define these using a softmax function of the squared Euclidean distance between a given LOO-classification point and each other point in the transformed space:

$$p_{ij} = \frac{\exp(-\|Ax_i - Ax_j\|^2)}{\sum_{k \neq i} \exp(-\|Ax_i - Ax_k\|^2)}, \qquad p_{ii} = 0.$$
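As a concrete illustration, the following NumPy sketch computes the matrix $p_{ij}$ for a given transformation $A$ (the helper name `stochastic_neighbour_probs` and the array conventions are ours, not from the original paper); subtracting the row-wise maximum before exponentiating is a standard numerical-stability trick:

```python
import numpy as np

def stochastic_neighbour_probs(A, X):
    """Stochastic-neighbour matrix p[i, j] for a linear map A.

    X: (n, d) array of data points; A: (m, d) transformation matrix.
    p[i, j] is the softmax over negative squared Euclidean distances
    in the transformed space, with p[i, i] = 0 by definition.
    """
    Z = X @ A.T                                          # transformed points, (n, m)
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)  # squared distances, (n, n)
    np.fill_diagonal(d2, np.inf)                         # excludes j = i from the softmax
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)          # shift for numerical stability
    p = np.exp(logits)                                   # exp(-inf) = 0 on the diagonal
    return p / p.sum(axis=1, keepdims=True)
```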

The probability of correctly classifying data point $i$ is the probability of classifying the points of each of its neighbours $C_i$:

$$p_i = \sum_{j \in C_i} p_{ij}, \qquad C_i = \{ j \mid c_j = c_i \},$$

where $p_{ij}$ is the probability of classifying neighbour $j$ of point $i$.

Define the objective function using LOO classification, this time using the entire data set as stochastic nearest neighbours:

$$f(A) = \sum_i \sum_{j \in C_i} p_{ij} = \sum_i p_i.$$
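Continuing the sketch above (with `y` holding the class labels $c_i$, and reusing the hypothetical `stochastic_neighbour_probs` helper), $p_i$ and $f(A)$ take only a few lines:

```python
import numpy as np

def nca_objective(A, X, y):
    """f(A) = sum_i p_i: the expected number of correctly classified points."""
    p = stochastic_neighbour_probs(A, X)    # p_ij from the previous sketch
    same_class = y[:, None] == y[None, :]   # mask encoding C_i = { j : c_j = c_i }
    # The i = j entries of the mask are True, but contribute nothing since p[i, i] = 0.
    return (p * same_class).sum()           # sum_i sum_{j in C_i} p_ij
```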

Note that under stochastic nearest neighbours, the consensus class for a single point $i$ is the expected value of a point's class in the limit of an infinite number of samples drawn from the distribution over its neighbours $D$, i.e. $P(\mathrm{Class}(x_i) = \mathrm{Class}(x_j)) = p_{ij}$. Thus the predicted class is an affine combination of the classes of every other point, weighted by the softmax function for each $j$, where $D$ is now the entire transformed data set.

This choice of objective function is preferable as it is differentiable with respect to $A$ (denote $x_{ij} = x_i - x_j$):

$$\frac{\partial f}{\partial A} = -2A \sum_i \sum_{j \in C_i} p_{ij} \left( x_{ij} x_{ij}^\top - \sum_k p_{ik}\, x_{ik} x_{ik}^\top \right)$$

$$= 2A \sum_i \left( p_i \sum_k p_{ik}\, x_{ik} x_{ik}^\top - \sum_{j \in C_i} p_{ij}\, x_{ij} x_{ij}^\top \right).$$

Obtaining a gradient for $A$ means that it can be found with an iterative solver such as conjugate gradient descent. Note that in practice, most of the innermost terms of the gradient evaluate to insignificant contributions, owing to the rapidly diminishing influence of points distant from the point of interest. This means that the inner sum of the gradient can be truncated, resulting in reasonable computation times even for large data sets.
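A direct, untruncated NumPy translation of the second form of the gradient might look as follows; this is a sketch under the same assumed conventions, not the authors' implementation. Pairing it with an off-the-shelf conjugate-gradient routine such as `scipy.optimize.minimize(..., method='CG')` on the negated objective is one way to carry out the optimization described above (the driver `fit_nca` is likewise hypothetical):

```python
import numpy as np
from scipy.optimize import minimize

def nca_gradient(A, X, y):
    """Gradient of f(A) = sum_i p_i with respect to A (dense, untruncated)."""
    p = stochastic_neighbour_probs(A, X)             # p_ij from the first sketch
    same = (y[:, None] == y[None, :]).astype(float)  # indicator of j in C_i
    p_i = (p * same).sum(axis=1)                     # probability of a correct LOO classification
    n, d = X.shape
    inner = np.zeros((d, d))
    for i in range(n):
        diff = X[i] - X                              # rows are x_ik = x_i - x_k, shape (n, d)
        outer = p[i][:, None, None] * diff[:, :, None] * diff[:, None, :]  # p_ik x_ik x_ik^T
        inner += p_i[i] * outer.sum(0) - (same[i][:, None, None] * outer).sum(0)
    return 2 * A @ inner

def fit_nca(X, y, m, seed=0):
    """Hypothetical driver: maximize f(A) by minimizing -f(A) with conjugate gradients."""
    n, d = X.shape
    A0 = np.random.default_rng(seed).normal(size=(m, d))
    fun = lambda a: -nca_objective(a.reshape(m, d), X, y)
    jac = lambda a: -nca_gradient(a.reshape(m, d), X, y).ravel()
    return minimize(fun, A0.ravel(), jac=jac, method='CG').x.reshape(m, d)
```

The truncation mentioned above corresponds to skipping terms of the inner loop whose $p_{ik}$ has already decayed to a negligible value.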
Alternative formulation

"Maximizing $f(\cdot)$ is equivalent to minimizing the $L_1$-distance between the predicted class distribution and the true class distribution (i.e. where the $p_i$ induced by $A$ are all equal to 1). A natural alternative is the KL-divergence, which induces the following objective function and gradient:" (Goldberger 2005)

$$g(A) = \sum_i \log \left( \sum_{j \in C_i} p_{ij} \right) = \sum_i \log(p_i)$$

$$\frac{\partial g}{\partial A} = 2A \sum_i \left( \sum_k p_{ik}\, x_{ik} x_{ik}^\top - \frac{\sum_{j \in C_i} p_{ij}\, x_{ij} x_{ij}^\top}{\sum_{j \in C_i} p_{ij}} \right).$$

In practice, optimization of $A$ using this function tends to give similar performance results as with the original.
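With the same assumed helpers, the KL-based variant is a one-line change to the objective (its gradient follows the formula above analogously):

```python
import numpy as np

def nca_log_objective(A, X, y):
    """g(A) = sum_i log(p_i): the KL-divergence-based variant."""
    p = stochastic_neighbour_probs(A, X)                # p_ij from the first sketch
    p_i = (p * (y[:, None] == y[None, :])).sum(axis=1)  # p_i = sum_{j in C_i} p_ij
    return np.log(p_i).sum()
```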
History and background
Neighbourhood components analysis was developed by Jacob Goldberger, Sam Roweis, Ruslan Salakhutdinov, and Geoffrey Hinton at the University of Toronto's department of computer science in 2004.

External links
- Neighbourhood Components Analysis (University of Toronto DCS)