Kernel trick

For machine learning algorithms, the kernel trick is a way of mapping observations from a general set S into an inner product space V (equipped with its natural norm), without ever having to compute the mapping explicitly, in the hope that the observations will gain meaningful linear structure in V. Linear classifications in V are equivalent to generic classifications in S.
The trick to avoid the explicit mapping is to use learning algorithms that only require dot products between the vectors in V, and choose the mapping such that these high-dimensional dot products can be computed within the original space, by means of a kernel function.
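
As a minimal sketch of this idea (the Gaussian kernel and the function names below are illustrative choices, not part of the article): any quantity an algorithm needs that can be written in terms of dot products in V, such as the squared distance ||φ(x) − φ(y)||^2 = K(x, x) − 2 K(x, y) + K(y, y), can be evaluated with kernel calls alone.

```python
import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    # Gaussian (RBF) kernel: an inner product in an implicit feature space.
    return np.exp(-gamma * np.sum((x - y) ** 2))

def feature_space_sq_distance(x, y, k):
    # ||phi(x) - phi(y)||^2 = k(x,x) - 2 k(x,y) + k(y,y); phi is never formed.
    return k(x, x) - 2.0 * k(x, y) + k(y, y)

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])
print(feature_space_sq_distance(x, y, rbf_kernel))
```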

For x, y in S, certain functions K(x, y) can be expressed as an inner product (usually in a different space). K is often referred to as a kernel or a kernel function. The word kernel is used in different ways throughout mathematics.

If one is lucky or insightful regarding a particular machine learning problem, one may manually construct a mapping φ : S → V such that

K(x, y) = ⟨φ(x), φ(y)⟩_V

and verify that ⟨·, ·⟩_V is indeed an inner product.
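
A tiny check of such a manual construction (the specific φ and K below are our own example, not from the article): with φ(x) = (x, x^2) on S = R, the function K(x, y) = xy + x^2 y^2 is by construction the inner product ⟨φ(x), φ(y)⟩.

```python
import numpy as np

def phi(x):
    # Hand-constructed mapping from S = R into V = R^2.
    return np.array([x, x ** 2])

def k(x, y):
    # Candidate kernel, built so that k(x, y) = <phi(x), phi(y)>.
    return x * y + (x ** 2) * (y ** 2)

rng = np.random.default_rng(0)
for _ in range(5):
    x, y = rng.normal(size=2)
    assert np.isclose(k(x, y), phi(x) @ phi(y))
```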

Furthermore, one does not even require an explicit representation for φ: it suffices to know that V is an inner product space. Conveniently, based on Mercer's theorem, it suffices to equip S with one's choice of measure and to verify that K in fact satisfies Mercer's condition.

Mercer's theorem is stated in a general mathematical setting with implications in the theory of integral equations. However, the general statement is more than is required for understanding the kernel trick. Given a finite observation set S, one can simply select the counting measure μ(T) = |T| for all subsets T of S. Then the integral in Mercer's theorem reduces to a simple summation:

∑_{i=1}^{n} ∑_{j=1}^{n} c_i c_j K(x_i, x_j) ≥ 0

for all finite sequences of points x_1, ..., x_n of S and all choices of real numbers c_1, ..., c_n (cf. positive definite kernel).
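
In the finite setting this condition is just positive semi-definiteness of the kernel (Gram) matrix, which can be checked numerically. A sketch (the function names are ours, and the quadratic kernel is only an example):

```python
import numpy as np

def gram_matrix(points, k):
    # Kernel (Gram) matrix K_ij = k(x_i, x_j) over a finite observation set.
    n = len(points)
    return np.array([[k(points[i], points[j]) for j in range(n)] for i in range(n)])

def finite_mercer_condition_holds(K, tol=1e-10):
    # sum_ij c_i c_j K_ij >= 0 for all c  <=>  the symmetric matrix K is PSD.
    return bool(np.all(np.linalg.eigvalsh((K + K.T) / 2.0) >= -tol))

points = [np.array([0.0, 1.0]), np.array([1.0, 1.0]), np.array([2.0, -1.0])]
K = gram_matrix(points, lambda x, y: (x @ y + 1.0) ** 2)  # quadratic polynomial kernel
print(finite_mercer_condition_holds(K))  # True
```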

Some algorithms that depend on arbitrary relationships in the native space would, in fact, have a linear interpretation in a different setting: the range space of φ. The linear interpretation gives us insight about the algorithm. Furthermore, there is often no need to compute φ directly during computation, as is the case with support vector machines. Some cite this running-time shortcut as the primary benefit. Researchers also use it to justify the meanings and properties of existing algorithms.
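
As an illustrative (not article-specified) example of such a kernelized algorithm, here is a sketch of a kernel perceptron: the weight vector w = ∑_i α_i y_i φ(x_i) lives in V but is never formed; training and prediction touch the data only through kernel evaluations.

```python
import numpy as np

def train_kernel_perceptron(X, y, k, epochs=10):
    # Dual perceptron: alpha[i] counts the mistakes made on example i.
    n = len(X)
    alpha = np.zeros(n)
    K = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])
    for _ in range(epochs):
        for i in range(n):
            score = np.sum(alpha * y * K[:, i])  # <w, phi(x_i)> via the kernel
            if y[i] * score <= 0:
                alpha[i] += 1.0
    return alpha

def predict(x_new, X, y, alpha, k):
    return np.sign(sum(a * yi * k(xi, x_new) for a, yi, xi in zip(alpha, y, X)))
```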

The kernel trick was first published by Aizerman et al.

Theoretically, a kernel matrix K must be positive semi-definite (PSD). Empirically, for machine learning heuristics, choices of K that do not satisfy Mercer's condition may still perform reasonably if K at least approximates the intuitive idea of similarity. Regardless of whether K is a Mercer kernel, K can still be referred to as a "kernel". If K is any square matrix, then K^T K is a PSD matrix.
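
A quick numerical confirmation of the last statement (variable names are ours): for any real square matrix K, c^T (K^T K) c = ||Kc||^2 ≥ 0, so K^T K has no negative eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)
K = rng.normal(size=(5, 5))       # arbitrary square matrix, not necessarily a Mercer kernel
G = K.T @ K                       # always symmetric positive semi-definite
print(np.all(np.linalg.eigvalsh(G) >= -1e-10))  # True
```
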
It has been applied to several kinds of algorithm in machine learning and statistics, including:
  • Perceptrons
  • Support vector machines
  • Principal component analysis
  • Canonical correlation analysis
  • Fisher's linear discriminant analysis
  • Cluster analysis


Commonly used kernels in such algorithms include the polynomial kernel, representing a mapping of vectors in R^n into a much richer feature space over degree-d polynomials of the original variables:

K(x, y) = (x · y + c)^d

where c ≥ 0 is a constant trading off the influence of higher-order versus lower-order terms in the polynomial. For the case of the quadratic kernel (d = 2), we have:

K(x, y) = (∑_{i=1}^{n} x_i y_i + c)^2
        = ∑_{i=1}^{n} (x_i^2)(y_i^2) + ∑_{i=2}^{n} ∑_{j=1}^{i-1} (√2 x_i x_j)(√2 y_i y_j) + ∑_{i=1}^{n} (√(2c) x_i)(√(2c) y_i) + c^2

From this we see that K(x, y) is the inner product in a feature space induced by the mapping

φ(x) = (x_n^2, ..., x_1^2, √2 x_n x_{n-1}, ..., √2 x_2 x_1, √(2c) x_n, ..., √(2c) x_1, c)

The kernel trick here lies in working in an (n+1)(n+2)/2-dimensional space, without ever explicitly transforming the original data points into that space, but instead relying on algorithms that only need to compute inner products ⟨φ(x), φ(y)⟩ within that space, which are identical to K(x, y) and can thus cheaply be computed in the original space (for the quadratic kernel, using only n + 1 multiplications: n for the dot product and one more to square the result).
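
A minimal numerical check of this identity (function names are ours): the kernel value (x · y + c)^2 matches the inner product of the explicit degree-2 feature vectors, computed here for n = 3 so that φ(x) has (n+1)(n+2)/2 = 10 components.

```python
import numpy as np
from itertools import combinations

def quadratic_kernel(x, y, c=1.0):
    # Inhomogeneous quadratic polynomial kernel: n + 1 multiplications.
    return (x @ y + c) ** 2

def quadratic_feature_map(x, c=1.0):
    # Explicit phi(x); its dimension is (n+1)(n+2)/2.
    squares = x ** 2
    cross = np.array([np.sqrt(2.0) * x[i] * x[j]
                      for i, j in combinations(range(len(x)), 2)])
    linear = np.sqrt(2.0 * c) * x
    return np.concatenate([squares, cross, linear, [c]])

x = np.array([1.0, 2.0, 3.0])
y = np.array([-1.0, 0.5, 2.0])
print(np.isclose(quadratic_kernel(x, y),
                 quadratic_feature_map(x) @ quadratic_feature_map(y)))  # True
print(quadratic_feature_map(x).shape)  # (10,)
```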

See also

  • Kernel methods
  • Integral transforms
  • Hilbert space, specifically reproducing kernel Hilbert space
  • Mercer kernel