Sunday, July 30, 2017

MinHashing and Locality Sensitive Hashing: a brief theory, a simple example and comments on several open-source implementations.


Locality sensitive search is often used to find similar objects in a large dataset, for example to implement the K nearest neighbors (KNN) algorithm for big data.

For a small dataset, it is feasible to compute pairwise similarities or distances for all data instances, but for a large dataset, it is not. For example, given a dataset of N instances, computing all pairwise similarities produces N^2 values. Since the similarity matrix is symmetric, even if we only take its upper or lower triangular part, there are still N(N-1)/2 values. When N is small, this computation is doable. However, if there are 1 million data instances, it can take something like 16 years to get all similarities, see here for the detailed computation.
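As a back-of-the-envelope sketch (the 1 ms per similarity is my own hypothetical assumption, not a measured figure), the arithmetic looks like this:

n = 10 ** 6                      # 1 million data instances
n_pairs = n * (n - 1) / 2        # number of unique pairs: roughly 5e11
sec_per_pair = 1e-3              # assumed cost of one similarity computation (hypothetical)
total_sec = n_pairs * sec_per_pair
print "pairs:", n_pairs
print "years:", total_sec / (3600 * 24 * 365)  # comes out to roughly 16 years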

In fact, in most cases we only care about data instances that are close to each other, so large similarities (or small distances) are what interest us. For data instances that are far from each other, we usually don't want to waste time on them. So before computing the expensive pairwise similarities exhaustively, it would be great if we could use some method to get an idea of which groups of data instances might be similar to each other. Then we only compute pairwise similarities within each of these groups, which saves a lot of time in the search for similar objects. The method that can smartly, rapidly, but approximately find such groups of similar data instances is Locality Sensitive Hashing (LSH).

  • A brief theory: 

A detailed introduction of LSH can be found here. In brief, the input data is first converted into signatures, which represent the data. Then LSH assigns data instances to different "buckets" via their signatures, so that similar instances are likely to land in the same bucket. Depending on the similarity metric adopted, either the Jaccard similarity, the cosine similarity, or others, the way of generating the dataset's signatures varies. This paper details more on this subject.

In searching for similar web pages or documents, the Jaccard similarity is commonly used. In this case, MinHashing is required to compute the signatures before LSH is applied. Some good illustrations can be found in this video and in this article. In brief, consider a binary set {0,1,1}. A hash function generates a digit for each element in this set. MinHashing requires several different hash functions, say h1, h2, h3, ..., h8. Each hash function operates on every element in the set, and the smallest hash value is kept as the signature. In our case, there are three elements in the set, so h1 generates three hash values, and the smallest is the signature. The same goes for h2, h3, ..., h8. In the end, there will be 8 signatures for this set. The pseudo-code is as follows:


MinHash(s) = min(h(x) for all x in s) # x: an element, s: a binary set

I think this process is quite similar to projecting the input dataset into another space. If the input data matrix is of shape n-by-m, where n is the number of instances and m is the number of features, then given k hash functions the signature matrix is of shape n-by-k.

Once the signature matrix is obtained, it is time to apply LSH to assign data instances (objects) to different buckets. Each bucket contains several approximately similar objects. There is some probability theory behind this; for details, please see the textbook Mining Massive Datasets. Briefly, the first thing to do is to separate the signatures into a few bands, each band containing an equal number of signatures. For example, in our simple example we have 8 signatures for the set {0,1,1}. We can use 2 bands, so each band contains 4 signatures. Next, choose another hash function to hash all the signatures in each band. In return, one more hash value is generated per band (two in total here), and these can be considered as the indexes of buckets. When there are several sets instead of one, we will see some of them collide into the same bucket while others fall into different ones. After this step, what remains is to compute pairwise similarities within each bucket.
The output of LSH will be a "bucket" matrix of shape n-by-b.
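To give a feel for the probability argument (this sketch is mine, following the analysis in Mining Massive Datasets): two sets with Jaccard similarity s become candidates, i.e. share at least one bucket, with probability 1 - (1 - s^r)^b, where r is the number of signatures per band and b is the number of bands. Using the b = 2, r = 4 setting of the example below:

b, r = 2, 4  # number of bands, number of signatures per band (same values as in the example below)

def p_candidate(s, b, r):
    # probability that two sets with Jaccard similarity s collide in at least one band
    return 1. - (1. - s ** r) ** b

for s in (0.2, 0.5, 0.8, 0.9):
    print "similarity %.1f -> candidate probability %.3f" % (s, p_candidate(s, b, r))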

  • A detailed small example: 

Well, after all the theory above, I hope a simple example can explain it all. (The following example is written in Python 2.7.)

Let's say we work on this binary dataset generated by:


import numpy as np

mat = np.random.choice([0,1], size=(10,6), p=(.1, .9)) # randomly generate a binary matrix, with probability=0.1 to generate 0. 
print mat

[[0 1 1 1 1 0]
 [1 1 1 1 1 1]
 [1 1 1 1 1 1]
 [1 1 1 1 1 1]
 [1 1 0 1 1 1]
 [1 1 0 1 1 1]
 [1 1 1 1 0 1]
 [1 1 1 1 1 1]
 [1 1 1 1 1 0]
 [1 1 1 1 1 0]]


This is a small binary data matrix of shape 10-by-6. We also define 8 hash functions, generated by:


def hash_factory(n):
    return lambda x: hash("salt" + unicode(n) + unicode(x) + "salt")

n_hashf = 8
hashes = [hash_factory(_) for _ in range(n_hashf)] # a list of 8 hash functions

The first step is to generate the signature matrix for the input. This is done by:


def min_hash_sig(s, hf):
    """
    input: a set s
    output: its MinHash signature for a given hash function hf
    """
    return min(hf(val) for val in s)

sig_ls = [] # a list of 10 lists, each containing 8 MinHash signatures

for s in mat: # for each row in mat
    sig = map(lambda f: min_hash_sig(s,f), hashes) # compute MinHash signature
    sig_ls.append(sig)

print sig_ls[0] # print MinHash signatures for the 1st row in mat

[1180994143573573947, 8073458713907736788, 2087231956330381401, -2440012934056873006, -2872639764063778585, 8912896952121485472, 2926734194800130405, -8627865310335260834]

Second, define the number of bands, b, the number of signatures in each band, r, and apply LSH.


r = 4 # number of signatures in each band
b = 2 # number of bands
t = (1. / b) ** (1. / r) # approximate similarity threshold: pairs with Jaccard similarity above t are likely to share at least one bucket
print "t:", t

lsh = []

for sig in sig_ls: # for each row's list of MinHash signatures
    for i in range(b):
        lsh.append(hash(tuple(sig[i*r:i*r+r]))) # hash the r signatures in band i

lsh_arr = np.array(lsh).reshape(10,b) # convert lsh from a list to an array of shape n-by-b
print lsh_arr 
 
t: 0.840896415254
 

[[ 4687445064954392987  1930194277869531077]
 [-4502860058849100603 -3875320495035685115]
 [-4502860058849100603 -3875320495035685115]
 [-4502860058849100603 -3875320495035685115]
 [ 4687445064954392987  1930194277869531077]
 [ 4687445064954392987  1930194277869531077]
 [ 4687445064954392987  1930194277869531077]
 [-4502860058849100603 -3875320495035685115]
 [ 4687445064954392987  1930194277869531077]
 [ 4687445064954392987  1930194277869531077]]

lsh_arr has 10 rows and two columns; each row maps to a row in the input binary matrix, and each column maps to a band and contains two different bucket indexes. From this output, we can see which bucket each data instance is assigned to. A better view can be achieved by:


from collections import defaultdict 
 
hashmap = [defaultdict(list) for i in range(b)]

for j in range(b):
    lsh_b = lsh_arr[:,j]
    n = lsh_arr.shape[0]
    for i in range(n):
        hashmap[j][lsh_b[i]].append(i) 
 
print hashmap

[defaultdict(<type 'list'>, {4687445064954392987: [0, 4, 5, 6, 8, 9], -4502860058849100603: [1, 2, 3, 7]}), defaultdict(<type 'list'>, {1930194277869531077: [0, 4, 5, 6, 8, 9], -3875320495035685115: [1, 2, 3, 7]})]


defaultdict is used here to generate a list of dictionaries; each dictionary has a bucket index as key and a list of row indexes as value. If you don't know much about Python's defaultdict, here is a very nice explanation of it.

It is now clearer that rows at indexes 1, 2, 3 and 7 (in the binary input matrix) are considered similar, which is why they are assigned to the same bucket, while rows at indexes 0, 4, 5, 6, 8 and 9 are assigned to a different bucket.

With an objective such as performing KNN or finding the closest pair, instead of computing 45 pairwise similarities (=10x(10-1)/2), we only need to compute 21 pairwise similarities (=4x(4-1)/2 + 6x(6-1)/2).
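As an extra illustration (not part of the original snippets above), the candidate pairs could be enumerated from the hashmap built earlier like this, so that only these pairs need their similarity computed:

from itertools import combinations

candidate_pairs = set()
for j in range(b):                        # for each band
    for bucket, rows in hashmap[j].items():
        for pair in combinations(rows, 2):
            candidate_pairs.add(pair)     # only pairs sharing a bucket are kept

print len(candidate_pairs)  # 21 candidate pairs instead of 45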

To verify the results, we can exhaustively compute the pairwise Jaccard similarities of the binary input matrix by:


# compute pairwise Jaccard similarities, which outputs n(n-1)/2 values

from sklearn.metrics import jaccard_similarity_score as js

val_dict = {}

n = mat.shape[0]
for r in range(n-1):
    for r_ in range(r+1, n):
        sim = js(mat[r], mat[r_])
        if sim not in val_dict:
            val_dict[sim] = [[r, r_]] # start a new list of pairs for this similarity value
        else:
            val_dict[sim].append([r, r_])
            
for key in val_dict:
    print key, val_dict[key]

0.5 [[0, 4], [0, 5], [0, 6]]
1.0 [[1, 2], [1, 3], [1, 7], [2, 3], [2, 7], [3, 7], [4, 5], [8, 9]]
0.833333333333 [[0, 8], [0, 9], [1, 4], [1, 5], [1, 6], [1, 8], [1, 9], [2, 4], [2, 5], [2, 6], [2, 8], [2, 9], [3, 4], [3, 5], [3, 6], [3, 8], [3, 9], [4, 7], [5, 7], [6, 7], [7, 8], [7, 9]]
0.666666666667 [[0, 1], [0, 2], [0, 3], [0, 7], [4, 6], [4, 8], [4, 9], [5, 6], [5, 8], [5, 9], [6, 8], [6, 9]]

The final output is formatted as a dictionary whose keys are Jaccard similarities and whose values are lists of pairs having that similarity. It is clear that pairs [1,2], [1,3], [1,7], [2,3], [2,7] and [3,7] do have a Jaccard similarity equal to 1, which verifies that the assignment achieved by LSH is correct for that bucket. However, it places rows 4, 5, 8 and 9 in the other bucket, even though they are quite similar (0.833) to rows 1, 2, 3 and 7. Recall that LSH is a method that only approximately finds similar objects. Details on how to reduce false positives and false negatives can be found here.

  • Comments on several GitHub implementations:

I found several GitHub implementations of MinHashing and LSH. After spending some time looking into their code, I can give some comments that might be useful.

(1) The scikit-learn LSHForest package, a simple user guide can be found here.

This package does not implement the conventional LSH method that I explained above. Instead, it implements an algorithm called LSHForest, which does not depend on the parameters b and r. Basically, having k hash functions, it builds k trees, each constructed from the hash values of the dataset. Similar objects (candidate pairs) are not searched for in buckets but in these trees: first a bottom-up search finds a maximal matching depth among the trees, then a top-down search collects candidate pairs using this maximal depth. Details of this method can be found in this paper.
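For reference, a minimal usage sketch of this package might look as follows (note that LSHForest has since been deprecated and removed in newer scikit-learn releases, so this only applies to the older versions available when this post was written):

import numpy as np
from sklearn.neighbors import LSHForest

X = np.random.choice([0, 1], size=(10, 6), p=(.1, .9))  # the same kind of binary data as above

lshf = LSHForest(n_estimators=10, n_candidates=50, random_state=42)
lshf.fit(X)

# approximate 3 nearest neighbors of the first two rows
distances, indices = lshf.kneighbors(X[:2], n_neighbors=3)
print indices
print distances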

In my opinion, this algorithm is not very suitable for rewriting as a MapReduce program, because the bottom-up and top-down searches are not trivial to implement with map and reduce functions. Since what I wanted was something that could be turned into a MapReduce program, I moved on to other implementations.

(2) mattdennewitz's LSH v.s. madeleineudell's LSH

Both follow the common steps: generating MinHash signatures, performing LSH to find buckets, and looking for candidate pairs within the buckets. However, the two implementations differ in how the MinHash signatures are generated; in my opinion, the former uses an easier method. A drawback of the former implementation is that it doesn't provide any examples, so in the end I wasn't sure what the input should be. The good thing is that the latter does give an example using a set of documents. Thanks to these two implementations, I came to understand MinHash and LSH and wrote the example explained above.

(3) MinHashing and CosineHashing

The former explains MinHashing in a simple and clear way, and the Github code it provides is really well commented. But it doesn't implement LSH.

The same goes for the latter: though its title contains LSH, it doesn't implement LSH either. Instead, it implements something like MinHashing that works for the cosine similarity. Recall that MinHashing is designed to approximate the Jaccard similarity.
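To illustrate the idea behind hashing for cosine similarity (often called SimHash or random-hyperplane hashing; this is my own sketch, not code from that repository): each random hyperplane contributes one signature bit, namely the sign of the dot product with the data vector, and the fraction of matching bits between two signatures reflects how small the angle between the two vectors is.

import numpy as np

np.random.seed(0)
n_features = 6
n_planes = 16                              # signature length in bits: one bit per random hyperplane
planes = np.random.randn(n_planes, n_features)

def cosine_signature(x):
    # 1 if x falls on the positive side of a hyperplane, else 0
    return (planes.dot(x) >= 0).astype(int)

u = np.array([1, 1, 0, 1, 1, 1])           # two example vectors
v = np.array([1, 1, 1, 1, 1, 0])
sig_u, sig_v = cosine_signature(u), cosine_signature(v)
print "fraction of matching bits:", np.mean(sig_u == sig_v)  # approximates 1 - angle(u, v) / pi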

  • Conclusion: 
On my way to understanding how LSH works exactly and how to make it work for my own use case, I spent some time studying the theory and examining several open-source implementations. Now I have a better understanding of this subject, and I think it might be useful to share what I've learned with those of you interested in this topic. I hope it helps.







Monday, June 27, 2016

How to configure the IPython notebook for Spark in single-node mode

Programming in an IPython notebook offers much convenience thanks to its interactive interface. Tired of writing PySpark code in the terminal, here I will show how to link PySpark to the IPython notebook for Spark in single-node mode.

I already have the IPython notebook and the Spark 1.6 distribution installed and configured on my PC (OS: Ubuntu). What I need to do is (1) set environment variables and (2) configure an IPython profile.

For (1). Open terminal, and type two lines:

$ export SPARK_HOME="/home/xywang/spark-1.6.1-bin-hadoop2.6"

$ export PYSPARK_SUBMIT_ARGS="--master local[2]"

The 1st line tells your system where your Spark home directory is (change it according to your distribution), and the 2nd line specifies that Spark will run locally with 2 cores, i.e. in single-node mode.

For (2). In terminal, type:
$ ipython profile create pyspark

After this command, a directory "~/.ipython/profile_pyspark/" will be created. Go to its subdirectory "startup", create a file named "00-pyspark-setup.py", and fill in the following content:
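(The original post showed the file content as a screenshot, which is missing here. A typical version of this startup script, assuming the Spark 1.6.1 distribution and the py4j-0.9-src.zip mentioned below, looks roughly like this:)

# 00-pyspark-setup.py
import os
import sys

spark_home = os.environ.get("SPARK_HOME", None)
if not spark_home:
    raise ValueError("SPARK_HOME environment variable is not set")

# add pyspark and the py4j zip shipped with Spark to the Python path
sys.path.insert(0, os.path.join(spark_home, "python"))
sys.path.insert(0, os.path.join(spark_home, "python/lib/py4j-0.9-src.zip"))

# run Spark's own shell startup script, which creates the SparkContext "sc"
execfile(os.path.join(spark_home, "python/pyspark/shell.py"))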


As you might use a different Spark distribution than me, you might need to change the name of "py4j-0.9-src.zip" and the Spark 1.X paths accordingly.

After (1) and (2), you can start the IPython notebook with the pyspark profile by simply typing this command:

$ ipython notebook --profile=pyspark

If everything is alright, a SparkContext will be available in the IPython notebook. For example, you can try to load a textFile like this:
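(This was also shown as a screenshot in the original post; a minimal equivalent in code, with a hypothetical file path, would be:)

# "sc" is the SparkContext created by the startup script
lines = sc.textFile("/home/xywang/some_text_file.txt")  # hypothetical path
print lines.count()   # number of lines in the file
print lines.first()   # first line of the file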



I feel that using the IPython notebook is better than typing commands in the terminal: it is easier to keep track of my code, and to copy and paste, of course.

Credit to [1] and [2].

-------- Update for Spark-2.2.0 + Jupyter Notebook --------

An easier way to link pyspark with jupyter notebook is to add values to
PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS in ~/.profile such that:

$ echo "export PYSPARK_DRIVER_PYTHON=jupyter" >> .profile
$ echo "export PYSPARK_DRIVER_PYTHON_OPTS='notebook' pyspark" >> .profile
$ source .profile
With this setting, typing "pyspark" in the terminal will start a Jupyter notebook.

Reference: [3]