Dimensionality reduction
In machine learning, dimensionality reduction (also called dimension reduction) is the process of reducing the number of random variables under consideration, and can be divided into feature selection and feature extraction.
Feature selection
Feature selection approaches try to find a subset of the original variables (also called features or attributes). Two strategies are the filter approach (e.g. information gain) and the wrapper approach (e.g. a search guided by classification accuracy). See also combinatorial optimization problems.
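As a sketch of the filter strategy, the example below scores each feature by its mutual information with the class label (an information-gain-style criterion) and keeps only the highest-scoring ones. It assumes scikit-learn is available; the synthetic dataset and the choice of k=5 are illustrative, not part of any particular method described above.

    # Filter-style feature selection sketch: rank features by an
    # information-gain-like score (mutual information with the label)
    # and keep the k best. Assumes scikit-learn; dataset and k=5 are
    # illustrative choices.
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, mutual_info_classif

    # Synthetic data: 20 features, only a few of which are informative.
    X, y = make_classification(n_samples=500, n_features=20,
                               n_informative=4, random_state=0)

    selector = SelectKBest(score_func=mutual_info_classif, k=5)
    X_reduced = selector.fit_transform(X, y)   # shape (500, 5)
    print("kept feature indices:", selector.get_support(indices=True))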
In some cases, data analysis tasks such as regression or classification can be carried out more accurately in the reduced space than in the original space.
Feature extraction
Feature extraction transforms the data from the high-dimensional space to a space of fewer dimensions. The data transformation may be linear, as in principal component analysis (PCA), but many nonlinear dimensionality reduction techniques also exist.
The main linear technique for dimensionality reduction, principal component analysis, performs a linear mapping of the data to a lower-dimensional space in such a way that the variance of the data in the low-dimensional representation is maximized. In practice, the covariance (or correlation) matrix of the data is constructed and the eigenvectors of this matrix are computed. The eigenvectors that correspond to the largest eigenvalues (the principal components) can then be used to reconstruct a large fraction of the variance of the original data. Moreover, the first few eigenvectors can often be interpreted in terms of the large-scale physical behavior of the system. The original space has been reduced (with data loss, but hopefully retaining the most important variance) to the space spanned by a few eigenvectors.
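The construction described above can be sketched in a few lines of NumPy: center the data, form the covariance matrix, and project onto the eigenvectors with the largest eigenvalues. This is a minimal illustration under the assumption of a plain NumPy environment; the choice of two components is arbitrary.

    import numpy as np

    def pca(X, n_components=2):
        # Minimal PCA sketch: project X onto the eigenvectors of its
        # covariance matrix with the largest eigenvalues.
        X_centered = X - X.mean(axis=0)            # center each feature
        cov = np.cov(X_centered, rowvar=False)     # covariance matrix
        eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: symmetric matrix
        order = np.argsort(eigvals)[::-1]          # decreasing eigenvalue order
        components = eigvecs[:, order[:n_components]]
        return X_centered @ components             # low-dimensional coordinates

    X = np.random.rand(100, 10)                    # 100 points in 10 dimensions
    X_low = pca(X, n_components=2)                 # 100 points in 2 dimensions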
Principal component analysis can be employed in a nonlinear way by means of the kernel trick. The resulting technique, kernel PCA, is capable of constructing nonlinear mappings that maximize the variance in the data. Other prominent nonlinear techniques include manifold learning methods such as locally linear embedding (LLE), Hessian LLE, Laplacian eigenmaps, and local tangent space alignment (LTSA). These techniques construct a low-dimensional data representation using a cost function that retains local properties of the data, and can be viewed as defining a graph-based kernel for kernel PCA. More recently, techniques have been proposed that, instead of defining a fixed kernel, try to learn the kernel using semidefinite programming. The most prominent example of such a technique is maximum variance unfolding (MVU). The central idea of MVU is to exactly preserve all pairwise distances between nearest neighbors (in the inner product space), while maximizing the distances between points that are not nearest neighbors.
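In kernel PCA the mapping is defined implicitly by a kernel function rather than computed explicitly. The sketch below uses scikit-learn's KernelPCA estimator with an RBF kernel on a toy two-circle dataset; the kernel choice and the gamma value are illustrative assumptions, not prescriptions.

    # Kernel PCA sketch: the kernel trick replaces explicit feature maps
    # with a kernel function. Assumes scikit-learn; the RBF kernel and
    # gamma=10 are illustrative choices for this toy dataset.
    from sklearn.datasets import make_circles
    from sklearn.decomposition import KernelPCA

    X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

    kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10)
    X_kpca = kpca.fit_transform(X)    # nonlinear 2-D embedding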
An alternative approach to neighborhood preservation is through the minimization of a cost function that measures differences between distances in the input and output spaces. Important examples of such techniques include classical multidimensional scaling (which is equivalent to PCA when Euclidean distances are used), Isomap (which uses geodesic distances in the data space), diffusion maps (which use diffusion distances in the data space), t-SNE (which minimizes the divergence between distributions over pairs of points), and curvilinear component analysis.
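These distance-preserving embeddings are available as standard estimators. The sketch below, assuming scikit-learn, embeds a synthetic "Swiss roll" with Isomap (geodesic distances over a nearest-neighbor graph) and with t-SNE (divergence between pairwise distributions); the neighborhood size and perplexity are arbitrary illustrative settings.

    # Neighborhood/distance-based embeddings: Isomap and t-SNE.
    # Assumes scikit-learn; n_neighbors=10 and perplexity=30 are
    # arbitrary illustrative settings.
    from sklearn.datasets import make_swiss_roll
    from sklearn.manifold import Isomap, TSNE

    X, _ = make_swiss_roll(n_samples=1000, random_state=0)

    X_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
    X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)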
A different approach to nonlinear dimensionality reduction is through the use of autoencoders, a special kind of feed-forward neural network with a bottleneck hidden layer. The training of deep encoders is typically performed using greedy layer-wise pre-training (e.g., using a stack of restricted Boltzmann machines), followed by a fine-tuning stage based on backpropagation.
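A minimal autoencoder sketch is given below, assuming PyTorch. It trains a single bottleneck network end-to-end with backpropagation rather than the layer-wise RBM pre-training described above, so it illustrates only the bottleneck idea; the layer sizes and hyperparameters are arbitrary.

    # Minimal autoencoder sketch (assumes PyTorch). A bottleneck hidden
    # layer of size 2 forces a 2-D code; training minimizes reconstruction
    # error by backpropagation. Layer sizes and hyperparameters are
    # illustrative, not the pre-trained deep encoders described above.
    import torch
    import torch.nn as nn

    X = torch.rand(500, 20)                      # 500 points in 20 dimensions

    encoder = nn.Sequential(nn.Linear(20, 8), nn.ReLU(), nn.Linear(8, 2))
    decoder = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 20))
    model = nn.Sequential(encoder, decoder)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    for _ in range(200):                         # a short training loop
        optimizer.zero_grad()
        loss = loss_fn(model(X), X)              # reconstruction error
        loss.backward()
        optimizer.step()

    codes = encoder(X).detach()                  # 2-D representation of the data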
See also
- Dimension (metadata)
- Curse of dimensionality
- Nearest neighbor search
- Cluster analysis
- Feature space
- Data mining
- Machine learning
- Feature selection
- Information gain
- Chi-squared distribution
- Recursive feature elimination (SVM based)
- Feature extraction
- Principal components analysis
- Semidefinite embedding
- Multifactor dimensionality reduction
- Multilinear subspace learning
- Nonlinear dimensionality reduction
- Isomap
- Kernel PCA
- Multilinear PCA
- Singular value decomposition
- Latent semantic analysis
- Semantic mapping
- Wavelet
- Wavelet compression
- Haar wavelet
- Fourier-related transforms
- Fast Fourier transform
- Discrete Fourier transform
- Discrete cosine transform
- Linear least squares
- Topological data analysis
- Physics
- Magnetism
- Time series
- Locality sensitive hashing
- Signal subspace
- Sufficient dimension reduction
External links
- A survey of dimension reduction techniques (US DOE Office of Scientific and Technical Information, 2002)
- Technical Report on Dimension Reduction (24 pages)
- JMLR Special Issue on Variable and Feature Selection
- ELastic MAPs
- Locally Linear Embedding
- A Global Geometric Framework for Nonlinear Dimensionality Reduction
- Dimensional reduction at a quantum critical point (realisation of dimensional reduction in a magnet)
- Matlab Toolbox for Dimensionality Reduction
- kernlab - R package for kernel-based machine learning (includes kernel PCA, and SVM)
- DD-HDS homepage
- When Is "Nearest Neighbor" Meaningful? - Exploring the effect of dimensionality on nearest neighbor problem