Locality sensitive search is often used in searching for similar objects in a large dataset. An example is to implement the K nearest neighbors (KNN) algorithm for big data.

For a small dataset, it is feasible to compute pairwise similarities or distances for all data instances, but for a large dataset, it is impossible. For example, given a dataset of N instances, if we compute the pairwise similarities, there will be N^2 values. The output similarity matrix is symmetric, even we only take the upper or the lower part of this matrix, there are N^(N-1)/2 values. When N is small, this computation is doable. However, if there are 1 million data instances, it will take 16 years to get all similarities, see here for detailed computation.

In fact, in most cases, we only care about data instances that are close to each other, thus large similarities or small distances are in our interest. As to those data instances who are far form each other, we usually don't need to waste time on them. So before we come to thoroughly compute the expensive pairwise similarities, it would be great if we could use some method to get an idea on which group of data instances might be similar to each other. Then we merely compute pairwise similarities of these data instances found in this group. In this case, we will save much time in looking for the similar objects. The method which is capable to smartly, rapidly but approximately find some groups of similar data instances, is the Locality Sensitive Hashing (LSH).

- A brief theory:

A detailed introduction of LSH can be found here. In brief, the input data is firstly converted into signatures, which can represent the data. Then LSH assigns similar data instances into different "bucket" via their signatures. Depending on the similarity metric adopted to compute similarities, either the Jaccard similarity, the cosine similarity, or others, the way of generating dataset's signatures vary. This paper details more on this subject.

In searching similar web-pages or documents, the Jaccard similarity is commonly used. In this case, the MinHashing is then required to compute the signatures before LSH is applied. Some good illustration can be found in this video and in this article. In brief, given a binary set {0,1,1}. A hashing function will generate a digit for each element in this set. MinHashing requires several different hash functions, let's say these functions are h1, h2, h3, ..., h8. Each hash function will operation on every element in the set, the smallest hash value is selected as the signature. In our case, there are three elements in the set, h1 generates three hash values, the smallest is the signature. The same for h2, h3, ... h8. In the end, there will be 4 signatures for this set. The pseudo-code is like follows:

MinHash(s) = min(h(x) for all x in s) # x: an element, s: a binary set

I think this process is quite similar to projecting the input dataset into another space. If the input dataset matrix is of shape n-by-m, n is the number of instances, m is the number of features. Given k hash functions, the signature matrix is of shape n-by-k.

Once the signature matrix is obtained, it is time to apply LSH to assign data instances (objects) into different buckets. Each bucket contain several approximately similar objects. There is some probability knowledge behind this, for details, please see the text book of Mining Massive Datasets. Briefly, the first thing you should do is to separate the signatures into a few bands, each band contains an equal number of signatures. For example, in our simple example, we have 8 signatures output for the set of {0,1,1}. We can give 2 bands, thus each band contains 4 signatures. Next is to choose another function to hash all the signatures in each band. In return, two more hash values are generated, these can be considered as the index of buckets. When there are several sets instead of one, we will see a few of them collide into one bucket, some fall into the others. After this step, what is needed is to compute pairwise similarities in each bucket.

The output of LSH will be a "bucket" matrix of n-by-b.

- An detailed small example:

Well, after a lot of theory above, I hope a simple example can explain all. (The following example is written in Pyhon2.7)

Let's say we work on this binary dataset generated by:

import numpy as np mat = np.random.choice([0,1], size=(10,6), p=(.1, .9)) # randomly generate a binary matrix, with probability=0.1 to generate 0. print mat

This is a small binary data matrix of shape 10-by-6. And we have 8 hash functions, generated by:

def hash_factory(n): return lambda x: hash("salt" + unicode(n) + unicode(x) + "salt") n_hashf = 8 hashes = [hash_factory(_) for _ in range(n_hashf)] # a list 8 hash functions

First step, we need to do is to generate the signature matrix for the input, this is done by:

def min_hash_sig(s, hf): """ input a set s output it's minhash signature of a hash function """ return min(hf(val) for val in s) sig_ls = [] # a list of signatures, a list of 10 list, each list contain 8 MinHash signatures for s in mat: # for each row in mat sig = map(lambda f: min_hash_sig(s,f), hashes) # compute MinHash signature sig_ls.append(sig) print sig_ls[0] # print MinHash sigmatures for the 1st row in mat [1180994143573573947, 8073458713907736788, 2087231956330381401, -2440012934056873006, -2872639764063778585, 8912896952121485472, 2926734194800130405, -8627865310335260834]

Second, define the number of bands, b, the number of signatures in each band, r, and apply LSH.

r = 4 # number of signatures in each band b = 2 # number of bands t = (1. / b) ** (1. / r) # t is related to the probability that the signatures all agree in at least one band print "t:", t lsh = [] for s in range(len(sig_ls)): sig = sig_ls[s] for i, band in enumerate(range(b)): lsh.append(hash(tuple(sig[i*r:i*r+r]))) # hash signatures in one band lsh_arr = np.array(lsh).reshape(10,b) # convert lsh from a list to a array of shape n-by-b print lsh_arr

t: 0.840896415254

[[ 4687445064954392987 1930194277869531077] [-4502860058849100603 -3875320495035685115] [-4502860058849100603 -3875320495035685115] [-4502860058849100603 -3875320495035685115] [ 4687445064954392987 1930194277869531077] [ 4687445064954392987 1930194277869531077] [ 4687445064954392987 1930194277869531077] [-4502860058849100603 -3875320495035685115] [ 4687445064954392987 1930194277869531077] [ 4687445064954392987 1930194277869531077]]

The lsh_arr as 10 rows and two columns, each maps to a row in the input binary matrix, each column maps to a band, it contains two different bucket indexes. From this input, we can see which bucket each data instance is assigned to. A better view can be achieved by:

from collections import defaultdict

` `

hashmap = [defaultdict(list) for i in range(b)] for j in range(b): lsh_b = lsh_arr[:,j] n = lsh_arr.shape[0] for i in range(n): hashmap[j][lsh_b[i]].append(i)

` `

print hashmap [defaultdict(<type 'list'>, {4687445064954392987: [0, 4, 5, 6, 8, 9], -4502860058849100603: [1, 2, 3, 7]}), defaultdict(<type 'list'>, {1930194277869531077: [0, 4, 5, 6, 8, 9], -3875320495035685115: [1, 2, 3, 7]})]

defaultdict is used here to generate a list of dictionaries, each dictionary has a bucket index as key, and a list of row indexes as value. If you don't know much about python defaultdict, here is a very nice explanation on it.

It is more clear to see that rows at indexes 1, 2, 3 and 7 (in the binary input matrix) are considered as similar, that's why they are assigned to the same bucket. And rows at indexes of 0, 4, 5, 6, 8, 9 are assigned to a different bucket.

With an objective, which is either to perform KNN or to find the closest pair, instead of computing 45 pairwise similarities (=10x(10-1)/2), we just need to compute 21 pairwise similarities (=4x(4-1)/2 + 6x(6-1)/2).

To verify the results, we can thouroughly compute the pairwise Jaccard similarities of the binary input matrix by:

# compute pairwise jaccard similarity, it outpouts n(n-1)/2 similarities from sklearn.metrics import jaccard_similarity_score as js val_dict = {} n = mat.shape[0] for r in range(n-1): for r_ in range(r+1, n): count +=1 sim = js(mat[r], mat[r_]) if sim not in val_dict: val_dict[sim] = [r, r_] else: val_dict[sim].append([r, r_]) for key in val_dict: print key, val_dict[key] 0.5 [[0, 4], [0, 5], [0, 6]] 1.0 [[1, 2], [1, 3], [1, 7], [2, 3], [2, 7], [3, 7], [4, 5], [8, 9]] 0.833333333333 [[0, 8], [0, 9], [1, 4], [1, 5], [1, 6], [1, 8], [1, 9], [2, 4], [2, 5], [2, 6], [2, 8], [2, 9], [3, 4], [3, 5], [3, 6], [3, 8], [3, 9], [4, 7], [5, 7], [6, 7], [7, 8], [7, 9]] 0.666666666667 [[0, 1], [0, 2], [0, 3], [0, 7], [4, 6], [4, 8], [4, 9], [5, 6], [5, 8], [5, 9], [6, 8], [6, 9]]

The final output is formated to be a dictionary, whose key is the Jaccard similarity, and value is a list of pairs that have this similarity. It is clear to see that pairs [1,2], [1,3], [1,7], [2,3], [2,7] and [3,7] do have Jaccard similarity equal to 1. This verifies that the assignment achieved by LSH is correct. However, it incorrectly assign rows 4, 5, 8 and 9 into the other bucket. Recall that it is a method that approximately allows us to find similar objects. Details on how to decrease false positive can be find here.

- Comments several Github implementations:

I found several Github implementations of MinHashing and LSH. After spending some time in looking into these codes, I can give some comments that might be useful.

(1) The scikit-learn LSHForest package, a simple user guide can be found here.

This package does not implement the conventional LSH method that I explained before. Instead, it implements an algorithm called LSHForest, which is independent from using parameters b and r. Basically, having k hash functions, it builds k trees, each is constructed from the hash values of the dataset. Similar objects (or say candidate pairs) are not searched in buckets but in these trees, by first a bottom-up search to find a maximal depth among trees, then a top-down search to find candidate pairs using this maximal depth. Details of this method can be found in this paper.

In my opinion, this algorithm is not very suitable to be rewritten into a MapReduce program, because I see that the bottom-up and the top-down searches are not trivial to implement using map and reduce functions. But what I want so far is to find something that can be done into a MapReduce program, so I moved to other implementations.

(2) mattdennewitz's LSH v.s. madeleineudell's LSH

Both follows the common steps: generating MinHash signatures, performing LSH to find buckets, and looking for candidate pairs in the bucket. However, these two implementations differ in how MinHashing signatures are generated. In my opinion, the former gives an easier method to do. A drawback of the former implementation is that it doesn't present any examples, so in the end, I wasn't sure what the input should be. Good thing is that the latter does give an example using a set of documents. Thanks to these two implementations, I came to understand MinHash and LSH and wrote the example explained above.

(3) MinHashing and CosineHashing

The former explains MinHashing in a simple and clear way, and the Github code it provides is really well commented. But it doesn't implement LSH.

The same for latter, thought the title contains LSH, but it doesn't implement LSH. Instead, it implements something like MinHashing but works for cosine similarity. Recall that MinHashing is to compute the Jaccard similarity.

- Conclusion: