Sunday, July 30, 2017

MinHashing and Locality Sensitive Hashing: a brief theory, an simple example and comments on several open-source implementations.

Locality sensitive search is often used in searching for similar objects in a large dataset. An example is to implement the K nearest neighbors (KNN) algorithm for big data.

For a small dataset, it is feasible to compute pairwise similarities or distances for all data instances, but for a large dataset, it is impossible. For example, given a dataset of N instances, if we compute the pairwise similarities, there will be N^2 values. The output similarity matrix is symmetric, even we only take the upper or the lower part of this matrix, there are N^(N-1)/2 values. When N is small, this computation is doable. However, if there are 1 million data instances, it will take 16 years to get all similarities, see here for detailed computation.

In fact, in most cases, we only care about data instances that are close to each other, thus large similarities or small distances are in our interest. As to those data instances who are far form each other, we usually don't need to waste time on them. So before we come to thoroughly compute the expensive pairwise similarities, it would be great if we could use some method to get an idea on which group of data instances might be similar to each other. Then we merely compute pairwise similarities of these data instances found in this group. In this case, we will save much time in looking for the similar objects. The method which is capable to smartly, rapidly but approximately find some groups of similar data instances, is the Locality Sensitive Hashing (LSH).

  • A brief theory: 

A detailed introduction of LSH can be found here. In brief,  the input data is firstly converted into signatures, which can represent the data. Then LSH assigns similar data instances into different "bucket" via their signatures. Depending on the similarity metric adopted to compute similarities, either the Jaccard similarity, the cosine similarity, or others, the way of generating dataset's signatures vary. This paper details more on this subject.

In searching similar web-pages or documents, the Jaccard similarity is commonly used. In this case, the MinHashing is then required to compute the signatures before LSH is applied. Some good illustration can be found in this video and in this article. In brief, given a binary set {0,1,1}. A hashing function will generate a digit for each element in this set. MinHashing requires several different hash functions, let's say these functions are h1, h2, h3, ..., h8. Each hash function will operation on every element in the set, the smallest hash value is selected as the signature. In our case, there are three elements in the set, h1 generates three hash values, the smallest is the signature. The same for h2, h3, ... h8.  In the end, there will be 4 signatures for this set. The pseudo-code is like follows:

MinHash(s) = min(h(x) for all x in s) # x: an element, s: a binary set

I think this process is quite similar to projecting the input dataset into another space. If the input dataset matrix is of shape n-by-m, n is the number of instances, m is the number of features. Given k hash functions, the signature matrix is of shape n-by-k. 

Once the signature matrix is obtained, it is time to apply LSH to assign data instances (objects) into different buckets. Each bucket contain several approximately similar objects. There is some probability knowledge behind this, for details, please see the text book of Mining Massive Datasets. Briefly, the first thing you should do is to separate the signatures into a few bands, each band contains an equal number of signatures. For example, in our simple example, we have 8 signatures output for the set of {0,1,1}. We can give 2 bands, thus each band contains 4 signatures. Next is to choose another function to hash all the signatures in each band. In return, two more hash values are generated, these can be considered as the index of buckets. When there are several sets instead of one, we will see a few of them collide into one bucket, some fall into the others. After this step, what is needed is to compute pairwise similarities in each bucket.
The output of LSH will be a "bucket" matrix of n-by-b.

  • An detailed small example: 

Well, after a lot of theory above, I hope a simple example can explain all. (The following example is written in Pyhon2.7)

Let's say we work on this binary dataset generated by:

import numpy as np

mat = np.random.choice([0,1], size=(10,6), p=(.1, .9)) # randomly generate a binary matrix, with probability=0.1 to generate 0. 
print mat

[[0 1 1 1 1 0]
 [1 1 1 1 1 1]
 [1 1 1 1 1 1]
 [1 1 1 1 1 1]
 [1 1 0 1 1 1]
 [1 1 0 1 1 1]
 [1 1 1 1 0 1]
 [1 1 1 1 1 1]
 [1 1 1 1 1 0]
 [1 1 1 1 1 0]]

This is a small binary data matrix of shape 10-by-6. And we have 8 hash functions, generated by:

def hash_factory(n):
    return lambda x: hash("salt" + unicode(n) + unicode(x) + "salt")

n_hashf = 8
hashes = [hash_factory(_) for _ in range(n_hashf)] # a list 8 hash functions

First step, we need to do is to generate the signature matrix for the input, this is done by:

def min_hash_sig(s, hf):
    input a set s
    output it's minhash signature of a hash function
    return min(hf(val) for val in s)

sig_ls = [] # a list of signatures, a list of 10 list, each list contain 8 MinHash signatures

for s in mat: # for each row in mat
    sig = map(lambda f: min_hash_sig(s,f), hashes) # compute MinHash signature

print sig_ls[0] # print MinHash sigmatures for the 1st row in mat

[1180994143573573947, 8073458713907736788, 2087231956330381401, -2440012934056873006, -2872639764063778585, 8912896952121485472, 2926734194800130405, -8627865310335260834]

Second, define the number of bands, b, the number of signatures in each band, r, and apply LSH.

r = 4 # number of signatures in each band
b = 2 # number of bands
t = (1. / b) ** (1. / r) # t is related to the probability that the signatures all agree in at least one band
print "t:", t

lsh = []

for s in range(len(sig_ls)):
    sig = sig_ls[s]
    for i, band in enumerate(range(b)):
        lsh.append(hash(tuple(sig[i*r:i*r+r]))) # hash signatures in one band

lsh_arr = np.array(lsh).reshape(10,b) # convert lsh from a list to a array of shape n-by-b
print lsh_arr 
t: 0.840896415254

[[ 4687445064954392987  1930194277869531077]
 [-4502860058849100603 -3875320495035685115]
 [-4502860058849100603 -3875320495035685115]
 [-4502860058849100603 -3875320495035685115]
 [ 4687445064954392987  1930194277869531077]
 [ 4687445064954392987  1930194277869531077]
 [ 4687445064954392987  1930194277869531077]
 [-4502860058849100603 -3875320495035685115]
 [ 4687445064954392987  1930194277869531077]
 [ 4687445064954392987  1930194277869531077]]

The lsh_arr as 10 rows and two columns, each maps to a row in the input binary matrix, each column maps to a band, it contains two different bucket indexes. From this input, we can see which bucket each data instance is assigned to. A better view can be achieved by:

from collections import defaultdict 
hashmap = [defaultdict(list) for i in range(b)]

for j in range(b):
    lsh_b = lsh_arr[:,j]
    n = lsh_arr.shape[0]
    for i in range(n):
print hashmap

[defaultdict(<type 'list'>, {4687445064954392987: [0, 4, 5, 6, 8, 9], -4502860058849100603: [1, 2, 3, 7]}), defaultdict(<type 'list'>, {1930194277869531077: [0, 4, 5, 6, 8, 9], -3875320495035685115: [1, 2, 3, 7]})]

defaultdict is used here to generate a list of dictionaries, each dictionary has a bucket index as key, and a list of row indexes as value. If you don't know much about python defaultdict, here is a very nice explanation on it.

It is more clear to see that rows at indexes 1, 2, 3 and 7 (in the binary input matrix) are considered as similar, that's why they are assigned to the same bucket. And rows at indexes of 0, 4, 5, 6, 8, 9 are assigned to a different bucket.

With an objective, which is either to perform KNN or to find the closest pair, instead of computing 45 pairwise similarities (=10x(10-1)/2), we just need to compute 21 pairwise similarities (=4x(4-1)/2 + 6x(6-1)/2).

To verify the results, we can thouroughly compute the pairwise Jaccard similarities of the binary input matrix by:

# compute pairwise jaccard similarity, it outpouts n(n-1)/2 similarities

from sklearn.metrics import jaccard_similarity_score as js

val_dict = {}

n = mat.shape[0]
for r in range(n-1):
    for r_ in range(r+1, n):
        count +=1

        sim = js(mat[r], mat[r_])
        if sim not in val_dict:
            val_dict[sim] = [r, r_]
            val_dict[sim].append([r, r_])
for key in val_dict:
    print key, val_dict[key]

0.5 [[0, 4], [0, 5], [0, 6]]
1.0 [[1, 2], [1, 3], [1, 7], [2, 3], [2, 7], [3, 7], [4, 5], [8, 9]]
0.833333333333 [[0, 8], [0, 9], [1, 4], [1, 5], [1, 6], [1, 8], [1, 9], [2, 4], [2, 5], [2, 6], [2, 8], [2, 9], [3, 4], [3, 5], [3, 6], [3, 8], [3, 9], [4, 7], [5, 7], [6, 7], [7, 8], [7, 9]]
0.666666666667 [[0, 1], [0, 2], [0, 3], [0, 7], [4, 6], [4, 8], [4, 9], [5, 6], [5, 8], [5, 9], [6, 8], [6, 9]]

The final output is formated to be a dictionary, whose key is the Jaccard similarity, and value is a list of pairs that have this similarity. It is clear to see that pairs [1,2], [1,3], [1,7], [2,3], [2,7] and [3,7] do have Jaccard similarity equal to 1. This verifies that the assignment achieved by LSH is correct. However, it incorrectly assign rows 4, 5, 8 and 9 into the other bucket. Recall that it is a method that approximately allows us to find similar objects. Details on how to decrease false positive can be find here.

  • Comments several Github implementations:

I found several Github implementations of MinHashing and LSH. After spending some time in looking into these codes, I can give some comments that might be useful.

(1) The scikit-learn LSHForest package, a simple user guide can be found here.

This package does not implement the conventional LSH method that I explained before. Instead, it implements an algorithm called LSHForest, which is independent from using parameters b and r. Basically, having k hash functions,  it builds k trees, each is constructed from the hash values of the dataset. Similar objects (or say candidate pairs) are not searched in buckets but in these trees, by first a bottom-up search to find a maximal depth among trees, then a top-down search to find candidate pairs using this maximal depth. Details of this method can be found in this paper.

In my opinion, this algorithm is not very suitable to be rewritten into a MapReduce program, because I see that the bottom-up and the top-down searches are not trivial to implement using map and reduce functions. But what I want so far is to find something that can be done into a MapReduce program, so I moved to other implementations.

(2) mattdennewitz's LSH v.s. madeleineudell's LSH

Both follows the common steps: generating MinHash signatures, performing LSH to find buckets, and looking for candidate pairs in the bucket. However, these two implementations differ in how MinHashing signatures are generated. In my opinion, the former gives an easier method to do. A drawback of the former implementation is that it doesn't present any examples, so in the end, I wasn't sure what the input should be. Good thing is that the latter does give an example using a set of documents. Thanks to these two implementations, I came to understand MinHash and LSH and wrote the example explained above.

(3) MinHashing and CosineHashing

The former explains MinHashing in a simple and clear way, and the Github code it provides is really well commented. But it doesn't implement LSH.

The same for latter, thought the title contains LSH, but it doesn't implement LSH. Instead, it implements something like MinHashing but works for cosine similarity. Recall that MinHashing is to compute the Jaccard similarity.

  • Conclusion: 
On my way of trying to understand how LSH exactly works and how to make it work for my own use case, I spent some time in studying the theory and in examining several open-source implementations. Now I have a better understanding on this subject, and I think it might be useful to share what I've learned with you, who are interested in this topic. I hope it helps.

Saturday, July 15, 2017

Install Hadoop standalone and pseudo-distributed modes on Ubuntu 16.04 without and with Yarn, run with examples.

I configured Hadoop on a cluster of nodes and on my own PC more than a year ago. But recently the OS crashed on my PC, so I had to redo the installation. I think it might be helpful for me to write down the from-beginning-to-end process, just in case the same situation happens again, and I won't have to surf on-line to look for difference tutorials and spend time trying them for another time. And it might also be time-saving for someone who is interested in installing Hadoop but bothers to spend time in reviewing different webpages.

1. Download Hadoop

First of all, download Hadoop. For each release, you can find binary version and source version. I guess the source version is for Hadoop developers, who can modify the source and compile it by themselves. For me, I only need to use Hadoop, so I chose the binary version. By following the link, I downloaded the "hadoop-2.8.0.tar.gz", then using command line to extract it

$ tar -xzvf hadoop-2.7.3.tar.gz

Then a folder named as "hadoop-2.8.0" will appear in the same directory. Move this folder to /usr/local, where stores locally installed applications, by
$ sudo mv hadoop-2.8.0 /usr/local/hadoop

2. Install Java (if you don't have it)

At this step, I need to specify the JAVA_HOME for Hadoop. If you don't have java, install it by
$ sudo apt-get update
$ sudo apt-get install default-jdk

Check the version of java, use
$ java -version

For Hadoop, open file "/usr/local/hadoop/etc/hadoop/" by
$ sudo nano /usr/local/hadoop/etc/hadoop/

3. Set up JAVA_HOME for Hadoop

Specifying the JAVA_HOME for Hadoop by
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
According to this blog, this is a dynamic way to set up JAVA_HOME.

If all goes well, you should be able to run
$ /usr/local/hadoop/bin/hadoop

But it would be easier, if we can just put "hadoop" instead of this long command, to do so, you need to go to ~/.bashrc and add a line in it
# add hadoop 
export PATH=$PATH:/usr/local/hadoop/bin/

Restart the terminal, then type "hadoop" to see if it works. At this moment,  HDFS and Yarn are not configured yet, but Hadoop can run a map-reduce job already, it is the basic standalone mode.

4. Run an example on the standalone mode

To test if it works fine, an example can be found here. Simply speaking, it runs a provided example of regular expression to find how many times the word "principal" has occurred in the Hadoop's configuration files. First, copy the files into a folder named as "input"
$ mkdir ~/input
$ cp /usr/local/hadoop/etc/hadoop/*.xml ~/input

Then run the program "grep" in the example jar file
$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.0.jar grep ~/input ~/grep_example 'principal[.]*'
In this command, the explicit path of the jar file need to be given, as well as the input and output path. In fact, I didn't create the output folder yet, but it was created automatically when the result was computed. After the running, you can find the folder "grep_example" is in your home directory. 

The terminal doesn't show results instead of running process, to see the results, type
$ cat ~/grep_example/*  

5. Configure the pseudo-distributed on a single machine

Till now, all the above explained is about the standalone mode of Hadoop, aka, HDFS isn't set up. On a single machine, we can set up a pseudo-distributed mode, where each Hadoop daemon runs in a separate Java process. To do so, the core-site.xml and the hdfs-site.xml files located in "/usr/local/hadoop/etc/hadoop" have to be modified to:

For core-site.xml

for hdfs-site.xml
Here it means that Hadoop keeps 1 replica on its HDFS. You can set it to be 2 or more, but it isn't necessary as it runs on one machine.

6. Set up Web interface

In order to see the Web interface of HDFS or of YARN, it is necessary to have ssh and sshd, so that the machine (my PC) is able to visit itself by "ssh localhost".
Actually I have ssh, but I didn't have sshd, and the connection was refused. So I installed OpenSSH server to have sshd, it solved my problem.

To visit localhost by ssh, we need a pair of keys, which can be generated by
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa_localhost
I renamed the keys as "id_rsa_localhost" to avoid conflict with other keys.

Then I copied the public key to the authorized_keys file by
$ cat ~/.ssh/ >> ~/.ssh/authorized_keys

You can check your rights on these files using
$ ls -l

Normally you should have write and read rights on the authorized_keys file, if you don't use this command:
$ chmod 0600 ~/.ssh/authorized_keys

I'm often confused by the number after chmod, and I found this summary is great!

Since HDFS is set up by editing the core-site.xml and the hdfs-site.xml files, it is time to format the namenode and to start the hdfs, by
$ bin/hdfs namenode -format
$ sbin/  

If all goes well, we can visit "http://localhost:50070/", like

7. Run an example on the pseudo-distributed mode

It implies that HDFS runs normally. At this point, we can create folder on HDFS and run a map-reduce job on HDSF. A few commands of HDFS can be useful in doing so. They are a lot like the terminal commands on Ubuntu, but they work on a distributed file system, for example

$ hdfs dfs -ls 
this command shows existing HDFS folders.

To run the same example as previously illustrated. Let's first create a folder names as hdfs_input on HDFS and move the content in folder ~/input to hdfs_input, this can be done by

$ hdfs dfs -mkdir hdfs_input
$ hdfs dfs -put ~/input/* hdfs_input

If all goes well, we can get these:

Now we can run the same example on these HDFS files, by

$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.0.jar grep hdfs_input hdfs_output 'principal[.]*'
"hdfs_output" is the output folder on hdfs, it is not created beforehand, but will be automatically created once the map-reduce job is done.

To see the results, run
$ hdfs dfs -cat hdfs_output/*

Till now YARN hasn't be set up yet. To do so, we need to stop all the running services by

$ sbin/  

To set up Yarn, the "mapred-site.xml" and the "yarn-site.xml" files in folder "/usr/local/hadoop/etc/hadoop" need to be edited to be

For mapred-site.xml

for yarn-site.xml

Then we can start the HDFS and Yarn service by
$ sbin/  

After Yarn is started, by visit "http://localhost:8088/" we can see a webpage like
With Yarn is on, we can run the same example using the "hdfs_input" folder to check how Yarn rearrange resources in the map-reduce job. Note that as this example has been run before, we need to delete the "hdfs_output" folder, otherwise, the same command will encounter error as the same output folder already exists.

To stop Yarn, use "$ sbin/" command. 

Some remarks:

In this post, I explained how to install Hadoop and set it up for the standalone mode and for the pseudo-distributed mode (with HDFS on), then I explained how to set up Yarn. Each setting up is provided with a simple example.
Though I'm not new to this process, but I still learned something new from this installation. I write these down in order to have a quick reference for my later use, and to maybe provide some help for newbies who are interested in playing Hadoop.

- Melissa Anderson's article 
- Apache Hadoop Documentation