
Locality sensitive hashing
Encyclopedia
Locality-sensitive hashing (LSH) is a method of performing probabilistic dimension reduction of high-dimensional data. The basic idea is to hash
the input items so that similar items are mapped to the same buckets with high probability (the number of buckets being much smaller than the universe of possible input items).
is defined for
a metric space
, a threshold
and an approximation factor
.
An LSH family
is a family of functions
satisfying the following conditions for any two points
, and a function
chosen uniformly at random from
:
A family is interesting when
. Such a family
is called
-sensitive.
An alternative definition is defined with respect to a universe of items
that have a similarity
function
. An LSH scheme is a family of hash function
s
coupled with a probability distribution
over the functions such that a function
chosen according to
satisfies the property that
for any
.
This approach works for the Hamming distance
over d-dimensional vectors
. Here, the family
of hash functions is simply the family of all the projections of points on
one of the
coordinates, i.e.,
,
where
is the
th coordinate of
. A random function
from
simply selects a random bit from the input point. This family has the following parameters:
,
.
is composed of subsets of some ground set of enumerable items
and the similarity function of interest is the Jaccard index
. If
is a permutation on the indices of
, for
let
. Each possible choice of
defines a single hash function
mapping input sets to integers.
Define the function family
to be the set of all such functions and let
be the uniform distribution. Given two sets
the event that
corresponds exactly to the event that the minimizer of
lies inside
. As
was chosen uniformly at random,
and
define an LSH scheme for the Jaccard index.
Because the symmetric group on n elements has size n!, choosing a truly random permutation from the full symmetric group is infeasible for even moderately sized n. Because of this fact, there has been significant work on finding a family of permutations that is "min-wise independent" - a permutation family for which each element of the domain has equal probability of being the minimum under a randomly chosen
. It has been established that a min-wise independent family of permutations is at least of size
. and that this boundary is tight
Because min-wise independent families are too big for practical applications, two variant notions of min-wise independence are introduced: restricted min-wise independent permutations families, and approximate min-wise independent families.
Restricted min-wise independence is the min-wise independence property restricted to certain sets of cardinality at most k .
Approximate min-wise independence differs from the property by at most a fixed
.
(defined by a normal unit vector
) at the outset and use the hyperplane to hash input vectors.
Given an input vector
and a hyperplane defined by
, we let
. That is,
depending on which side of the hyperplane
lies.
Each possible choice of
defines a single function. Let
be the set of all such functions and let
be the uniform distribution once again. It is not difficult to prove that, for two vectors
,
, where
is the angle between
and
.
is closely related to
.
In this instance hashing produces only a single bit. Two vectors' bits match with probability proportional to the cosine of the angle between them.
maps a d dimensional vector
onto a set of integers. Each hash function
in the family is indexed by a choice of random
and
where
is a d dimensional
vector with
entries chosen independently from a stable distribution and
is
a real number chosen uniformly from the range [0,r]. For a fixed
the hash function
is
given by
.
Other construction methods for hash functions have been proposed to better fit the data.
In particular k-means hash functions are better in practice than projection-based hash functions, but without any theoretical guarantee.
algorithms.
Consider any LSH family
. The
algorithm has two main parameters: the width parameter
and the
number of hash tables
.
In the first step, we define a new family
of hash
functions
, where each function
is
obtained by concatenating
functions
from
, i.e.,
.
In other words, a random hash function
is obtained by
concatenating
randomly chosen hash functions from
.
The algorithm then constructs
hash tables, each
corresponding to
a different randomly chosen hash function
.
In the preprocessing step we hash all
points from the data set
into each of
the
hash tables.
Given that the resulting hash tables have only
non-zero
entries,
one can reduce the amount of memory used per each hash table to
using standard hash functions.
Given a query point
, the algorithm iterates over the
hash functions
.
For each
considered, it retrieves the data points that
are hashed
into the same bucket as
.
The process is stopped as soon as a point within distance
from
is found.
Given the parameters
and
, the algorithm has
the following performance guarantees:
For a fixed approximation ratio
and probabilities
and
, one can set
and
, where
.
Then one obtains the following performance guarantees:
Hash function
A hash function is any algorithm or subroutine that maps large data sets to smaller data sets, called keys. For example, a single integer can serve as an index to an array...
the input items so that similar items are mapped to the same buckets with high probability (the number of buckets being much smaller than the universe of possible input items).
Definition
An LSH family
a metric space
Metric space
In mathematics, a metric space is a set where a notion of distance between elements of the set is defined.The metric space which most closely corresponds to our intuitive understanding of space is the 3-dimensional Euclidean space...



An LSH family





- if
, then
(i.e.,
and
collide) with probability at least
,
- if
, then
with probability at most
.
A family is interesting when



An alternative definition is defined with respect to a universe of items

Similarity
-Specific definitions:Different fields provide differing definitions of similarity:-In computer science:* string metric, aka string similarity* semantic similarity in computational linguistics-In other fields:...
function

Hash function
A hash function is any algorithm or subroutine that maps large data sets to smaller data sets, called keys. For example, a single integer can serve as an index to an array...
s






Applications
LSH has been applied to several problem domains including- Near-duplicate detection
- Image similarity identification
- Gene expression similarity identification
- Audio similarity identification
Bit sampling for Hamming distance
One of the easiest ways to construct an LSH family is by bit sampling .This approach works for the Hamming distance
Hamming distance
In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different...
over d-dimensional vectors


one of the


where







Min-wise independent permutations
Suppose

Jaccard index
The Jaccard index, also known as the Jaccard similarity coefficient , is a statistic used for comparing the similarity and diversity of sample sets....







Define the function family









Because the symmetric group on n elements has size n!, choosing a truly random permutation from the full symmetric group is infeasible for even moderately sized n. Because of this fact, there has been significant work on finding a family of permutations that is "min-wise independent" - a permutation family for which each element of the domain has equal probability of being the minimum under a randomly chosen


Because min-wise independent families are too big for practical applications, two variant notions of min-wise independence are introduced: restricted min-wise independent permutations families, and approximate min-wise independent families.
Restricted min-wise independence is the min-wise independence property restricted to certain sets of cardinality at most k .
Approximate min-wise independence differs from the property by at most a fixed

Random projection
The random projection method of LSH is designed to approximate the cosine distance between vectors. The basic idea of this technique is to choose a random hyperplaneHyperplane
A hyperplane is a concept in geometry. It is a generalization of the plane into a different number of dimensions.A hyperplane of an n-dimensional space is a flat subset with dimension n − 1...
(defined by a normal unit vector

Given an input vector





Each possible choice of










In this instance hashing produces only a single bit. Two vectors' bits match with probability proportional to the cosine of the angle between them.
Stable distributions
The hash function

in the family is indexed by a choice of random



vector with
entries chosen independently from a stable distribution and

a real number chosen uniformly from the range [0,r]. For a fixed


given by

Other construction methods for hash functions have been proposed to better fit the data.
In particular k-means hash functions are better in practice than projection-based hash functions, but without any theoretical guarantee.
LSH algorithm for nearest neighbor search
One of the main application of LSH is to provide efficient nearest neighbor searchNearest neighbor search
Nearest neighbor search , also known as proximity search, similarity search or closest point search, is an optimization problem for finding closest points in metric spaces. The problem is: given a set S of points in a metric space M and a query point q ∈ M, find the closest point in S to q...
algorithms.
Consider any LSH family

algorithm has two main parameters: the width parameter

number of hash tables

In the first step, we define a new family

functions


obtained by concatenating




In other words, a random hash function

concatenating


The algorithm then constructs

corresponding to
a different randomly chosen hash function

In the preprocessing step we hash all


the

Given that the resulting hash tables have only

entries,
one can reduce the amount of memory used per each hash table to

Given a query point



For each

are hashed
into the same bucket as

The process is stopped as soon as a point within distance


Given the parameters


the following performance guarantees:
- preprocessing time:
, where
is the time to evaluate a function
on an input point
;
- space:
, plus the space for storing data points;
- query time:
;
- the algorithm succeeds in finding a point within distance
from
(if it exists) with probability at least
;
For a fixed approximation ratio






Then one obtains the following performance guarantees:
- preprocessing time:
;
- space:
, plus the space for storing data points;
- query time:
;
See also
- Nearest neighbor searchNearest neighbor searchNearest neighbor search , also known as proximity search, similarity search or closest point search, is an optimization problem for finding closest points in metric spaces. The problem is: given a set S of points in a metric space M and a query point q ∈ M, find the closest point in S to q...
- Dimension reduction
- Principal component analysis
- Singular value decompositionSingular value decompositionIn linear algebra, the singular value decomposition is a factorization of a real or complex matrix, with many useful applications in signal processing and statistics....
- Curse of dimensionalityCurse of dimensionalityThe curse of dimensionality refers to various phenomena that arise when analyzing and organizing high-dimensional spaces that do not occur in low-dimensional settings such as the physical space commonly modeled with just three dimensions.There are multiple phenomena referred to by this name in...
- Cluster analysis
- Wavelet compression
- Fourier-related transforms
Further reading
- Samet, H. (2006) Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann. ISBN 0-12-369446-9
External links
- Alex Andoni's LSH homepage
- LSHKIT: A C++ Locality Sensitive Hashing Library
- Caltech Large Scale Image Search Toolbox: a Matlab toolbox implementing several LSH hash functions, in addition to Kd-Trees, Hierarchical K-Means, and Inverted File search algorithms.