Locality sensitive hashing - AbsoluteAstronomy.com

Locality-sensitive hashing (LSH) is a method of performing probabilistic dimension reduction of high-dimensional data. The basic idea is to hash

Hash function

A hash function is any algorithm or subroutine that maps large data sets to smaller data sets, called keys. For example, a single integer can serve as an index to an array...

the input items so that similar items are mapped to the same buckets with high probability (the number of buckets being much smaller than the universe of possible input items).

Definition

An LSH family is defined for
a metric space

Metric space

In mathematics, a metric space is a set where a notion of distance between elements of the set is defined.The metric space which most closely corresponds to our intuitive understanding of space is the 3-dimensional Euclidean space...

, a threshold

and an approximation factor

.
An LSH family

is a family of functions

satisfying the following conditions for any two points

, and a function

chosen uniformly at random from

if , then (i.e., and collide) with probability at least ,
if , then with probability at most .

A family is interesting when

. Such a family

is called -sensitive.

An alternative definition is defined with respect to a universe of items

that have a similarity

Similarity

-Specific definitions:Different fields provide differing definitions of similarity:-In computer science:* string metric, aka string similarity* semantic similarity in computational linguistics-In other fields:...

function

. An LSH scheme is a family of hash function

Hash function

A hash function is any algorithm or subroutine that maps large data sets to smaller data sets, called keys. For example, a single integer can serve as an index to an array...

coupled with a probability distribution

over the functions such that a function

chosen according to

satisfies the property that

for any

Applications

LSH has been applied to several problem domains including

Near-duplicate detection
Image similarity identification
Gene expression similarity identification
Audio similarity identification

Bit sampling for Hamming distance

One of the easiest ways to construct an LSH family is by bit sampling .
This approach works for the Hamming distance

Hamming distance

In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different...

over d-dimensional vectors

. Here, the family

of hash functions is simply the family of all the projections of points on
one of the

coordinates, i.e.,

,
where

is the

th coordinate of

. A random function

from

simply selects a random bit from the input point. This family has the following parameters:

Min-wise independent permutations

Suppose

is composed of subsets of some ground set of enumerable items

and the similarity function of interest is the Jaccard index

Jaccard index

The Jaccard index, also known as the Jaccard similarity coefficient , is a statistic used for comparing the similarity and diversity of sample sets....

. If

is a permutation on the indices of

, for

let

. Each possible choice of

defines a single hash function

mapping input sets to integers.

Define the function family

to be the set of all such functions and let

be the uniform distribution. Given two sets

the event that

corresponds exactly to the event that the minimizer of

lies inside

. As

was chosen uniformly at random,

and

define an LSH scheme for the Jaccard index.

Because the symmetric group on n elements has size n!, choosing a truly random permutation from the full symmetric group is infeasible for even moderately sized n. Because of this fact, there has been significant work on finding a family of permutations that is "min-wise independent" - a permutation family for which each element of the domain has equal probability of being the minimum under a randomly chosen

. It has been established that a min-wise independent family of permutations is at least of size

. and that this boundary is tight

Because min-wise independent families are too big for practical applications, two variant notions of min-wise independence are introduced: restricted min-wise independent permutations families, and approximate min-wise independent families.
Restricted min-wise independence is the min-wise independence property restricted to certain sets of cardinality at most k .
Approximate min-wise independence differs from the property by at most a fixed

Random projection

The random projection method of LSH is designed to approximate the cosine distance between vectors. The basic idea of this technique is to choose a random hyperplane

Hyperplane

A hyperplane is a concept in geometry. It is a generalization of the plane into a different number of dimensions.A hyperplane of an n-dimensional space is a flat subset with dimension n − 1...

(defined by a normal unit vector

) at the outset and use the hyperplane to hash input vectors.

Given an input vector

and a hyperplane defined by

, we let

. That is,

depending on which side of the hyperplane

lies.

Each possible choice of

defines a single function. Let

be the set of all such functions and let

be the uniform distribution once again. It is not difficult to prove that, for two vectors

, where

is the angle between

and

is closely related to

.

In this instance hashing produces only a single bit. Two vectors' bits match with probability proportional to the cosine of the angle between them.

Stable distributions

The hash function

maps a d dimensional vector

onto a set of integers. Each hash function
in the family is indexed by a choice of random

and

where

is a d dimensional
vector with
entries chosen independently from a stable distribution and

is
a real number chosen uniformly from the range [0,r]. For a fixed

the hash function

is
given by

.

Other construction methods for hash functions have been proposed to better fit the data.

In particular k-means hash functions are better in practice than projection-based hash functions, but without any theoretical guarantee.

LSH algorithm for nearest neighbor search

One of the main application of LSH is to provide efficient nearest neighbor search

Nearest neighbor search

Nearest neighbor search , also known as proximity search, similarity search or closest point search, is an optimization problem for finding closest points in metric spaces. The problem is: given a set S of points in a metric space M and a query point q ∈ M, find the closest point in S to q...

algorithms.
Consider any LSH family

. The
algorithm has two main parameters: the width parameter

and the
number of hash tables

.

In the first step, we define a new family

of hash
functions

, where each function

is
obtained by concatenating

functions

from

, i.e.,

.
In other words, a random hash function

is obtained by
concatenating

randomly chosen hash functions from

.
The algorithm then constructs

hash tables, each
corresponding to
a different randomly chosen hash function

.

In the preprocessing step we hash all

points from the data set

into each of
the

hash tables.
Given that the resulting hash tables have only

non-zero
entries,
one can reduce the amount of memory used per each hash table to

using standard hash functions.

Given a query point

, the algorithm iterates over the

hash functions

.
For each

considered, it retrieves the data points that
are hashed
into the same bucket as

.
The process is stopped as soon as a point within distance

from

is found.

Given the parameters

and

, the algorithm has
the following performance guarantees:

preprocessing time: , where is the time to evaluate a function on an input point ;
space: , plus the space for storing data points;
query time: ;
the algorithm succeeds in finding a point within distance from (if it exists) with probability at least ;