Kernel trick
For machine learning
algorithms, the kernel trick is a way of mapping observations from a general set S into an inner product space
V (equipped with its natural norm), without ever having to compute the mapping explicitly, in the hope that the observations will gain meaningful linear structure in V. Linear classifications
in V are equivalent to generic classifications in S.
The trick to avoid the explicit mapping is to use learning algorithms that only require dot products between the vectors in V, and choose the mapping such that these high-dimensional dot products can be computed within the original space, by means of a kernel function.
For x, y in S, certain functions K(x, y) can be expressed as an inner product (usually in a different space). K is often referred to as a kernel
or a kernel function. The word kernel is used in different ways throughout mathematics.
If one is lucky or insightful regarding a particular machine learning problem, one may manually construct a map φ : S → V such that

K(x, y) = \langle \varphi(x), \varphi(y) \rangle_V

and verify that \langle \cdot, \cdot \rangle_V is indeed an inner product.
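For example, with S = \mathbb{R}^2 and the kernel K(x, y) = (x \cdot y)^2, the direct expansion

(x_1 y_1 + x_2 y_2)^2 = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = \langle \varphi(x), \varphi(y) \rangle

shows that K is realized by the explicit map \varphi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2), with V = \mathbb{R}^3 under the standard inner product.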
Furthermore, one does not even require an explicit representation for φ: it suffices to know that V is an inner product space. Conveniently, based on Mercer's theorem, it suffices to equip S with one's choice of measure and verify that K : S × S → R in fact satisfies Mercer's condition, namely that

\iint K(x, y)\, g(x)\, g(y)\, dx\, dy \geq 0

for all square-integrable functions g.
Mercer's theorem is stated in a general mathematical setting with implications in the theory of integral equations. However, the general statement is more than is required for understanding the kernel trick. Given a finite observation set S, one can simply select the counting measure \mu(T) = |T| for all T \subset S. Then the integral in Mercer's theorem reduces to a simple summation

\sum_{i=1}^n \sum_{j=1}^n K(x_i, x_j)\, c_i c_j \geq 0

for all finite sequences of points x_1, \ldots, x_n of S and all choices of real numbers c_1, \ldots, c_n (cf. positive definite kernel).
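For a finite sample, this condition can be checked numerically: it amounts to the Gram matrix G with entries G_{ij} = K(x_i, x_j) being positive semi-definite. The following is a minimal sketch in Python assuming NumPy; the Gaussian RBF kernel and the numerical tolerance are illustrative choices, not part of the theorem.

```python
import numpy as np

def gram_matrix(X, kernel):
    # G[i, j] = K(x_i, x_j) for every pair of observations
    return np.array([[kernel(a, b) for b in X] for a in X])

def satisfies_mercer(X, kernel, tol=1e-10):
    G = gram_matrix(X, kernel)
    # For a symmetric kernel, sum_ij c_i c_j K(x_i, x_j) >= 0 for all c
    # is equivalent to all eigenvalues of G being nonnegative
    # (up to numerical tolerance).
    return bool(np.all(np.linalg.eigvalsh(G) >= -tol))

rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2))  # Gaussian RBF kernel
X = np.random.default_rng(1).standard_normal((20, 3))
print(satisfies_mercer(X, rbf))  # True: the RBF kernel is a Mercer kernel
```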
Some algorithms that depend on arbitrary relationships in the native space would, in fact, have a linear interpretation in a different setting: the range space of φ. The linear interpretation gives us insight about the algorithm. Furthermore, there is often no need to compute φ directly during computation, as is the case with support vector machines. Some cite this running-time shortcut as the primary benefit. Researchers also use it to justify the meanings and properties of existing algorithms.
The kernel trick was first published in 1964 by Aizerman et al.
Theoretically, a kernel matrix K must be positive semi-definite (PSD). Empirically, for machine learning heuristics, choices of K that do not satisfy Mercer's condition may still perform reasonably if K at least approximates the intuitive idea of similarity. Regardless of whether K is a Mercer kernel, K may still be referred to as a "kernel". If K is any real square matrix, then K^T K is a PSD matrix.
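A quick numerical illustration of the closing claim, assuming NumPy (the matrix size and tolerance are arbitrary): for any real square matrix K, c^T (K^T K) c = ||Kc||^2 \geq 0 for every vector c, so K^T K is PSD.

```python
import numpy as np

# Any real square matrix K yields a PSD matrix K^T K, since
# c^T (K^T K) c = ||K c||^2 >= 0 for every vector c.
K = np.random.default_rng(2).standard_normal((6, 6))
eigenvalues = np.linalg.eigvalsh(K.T @ K)  # K^T K is symmetric
print(np.all(eigenvalues >= -1e-10))       # True, up to numerical tolerance
```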
The kernel trick has been applied to several kinds of algorithms in machine learning and statistics, including the following (a kernelized perceptron is sketched after the list):
- Perceptrons
- Support vector machines
- Principal component analysis
- Canonical correlation analysis
- Fisher's linear discriminant analysis
- Cluster analysis
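As an illustration, here is a minimal sketch of a kernelized perceptron in Python, assuming NumPy and a Gaussian RBF kernel (the kernel choice, training loop, and epoch count are illustrative assumptions, not the historical Aizerman et al. formulation). The algorithm never computes φ; it only evaluates K.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # Gaussian RBF kernel: K(a, b) = exp(-gamma * ||a - b||^2)
    return np.exp(-gamma * np.sum((a - b) ** 2))

def train_kernel_perceptron(X, y, kernel=rbf_kernel, epochs=10):
    # X: (m, n) array of observations; y: labels in {-1, +1}.
    # Dual weights alpha define the decision function
    # f(x) = sign(sum_i alpha_i y_i K(x_i, x)), so the feature map
    # phi never has to be formed explicitly.
    m = len(X)
    alpha = np.zeros(m)
    gram = np.array([[kernel(a, b) for b in X] for a in X])  # K(x_i, x_j)
    for _ in range(epochs):
        for i in range(m):
            pred = 1.0 if np.dot(alpha * y, gram[:, i]) >= 0 else -1.0
            if pred != y[i]:
                alpha[i] += 1.0  # mistake-driven update on example i
    return alpha

def predict(X, y, alpha, x, kernel=rbf_kernel):
    score = sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, X))
    return 1 if score >= 0 else -1
```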
Commonly used kernels in such algorithms include the polynomial kernel, representing a mapping of vectors in \mathbb{R}^n into a much richer feature space over degree-d polynomials of the original variables:

K(x, y) = (x \cdot y + c)^d

where c \geq 0 is a constant trading off the influence of higher-order versus lower-order terms in the polynomial. For the case of the quadratic kernel (d = 2), we have:

K(x, y) = \left( \sum_{i=1}^n x_i y_i + c \right)^2 = \sum_{i=1}^n \left( x_i^2 \right) \left( y_i^2 \right) + \sum_{i=2}^n \sum_{j=1}^{i-1} \left( \sqrt{2}\, x_i x_j \right) \left( \sqrt{2}\, y_i y_j \right) + \sum_{i=1}^n \left( \sqrt{2c}\, x_i \right) \left( \sqrt{2c}\, y_i \right) + c^2

From this we see that K is the inner product in a feature space induced by the mapping

\varphi(x) = \left( x_n^2, \ldots, x_1^2, \sqrt{2}\, x_n x_{n-1}, \ldots, \sqrt{2}\, x_2 x_1, \sqrt{2c}\, x_n, \ldots, \sqrt{2c}\, x_1, c \right)

The kernel trick here lies in working in a \binom{n+2}{2}-dimensional space, without ever explicitly transforming the original data points into that space, but instead relying on algorithms that only need to compute inner products \langle \varphi(x), \varphi(y) \rangle within that space, which are identical to K(x, y) and can thus be computed cheaply in the original space using only n + 1 multiplications.
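The identity above can be confirmed numerically. The following Python sketch (assuming NumPy; the dimension, constant c, and random inputs are arbitrary test choices) builds the explicit feature map φ and checks that its inner product matches the kernel evaluated in the original space.

```python
import numpy as np

def phi(x, c):
    # Explicit degree-2 feature map: squares, sqrt(2)*cross terms,
    # sqrt(2c)*linear terms, and the constant c.
    n = len(x)
    sq = [x[i] ** 2 for i in range(n)]
    cross = [np.sqrt(2) * x[i] * x[j] for i in range(n) for j in range(i + 1, n)]
    lin = [np.sqrt(2 * c) * x[i] for i in range(n)]
    return np.array(sq + cross + lin + [c])

rng = np.random.default_rng(0)
x, y, c = rng.standard_normal(5), rng.standard_normal(5), 1.0
lhs = (x @ y + c) ** 2       # kernel evaluated in the original space
rhs = phi(x, c) @ phi(y, c)  # inner product in the feature space
assert np.isclose(lhs, rhs)  # the two agree up to floating-point error
```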
See also
- Kernel methods
- Integral transforms
- Hilbert space, specifically reproducing kernel Hilbert space
- Mercer kernel