Friday, February 26, 2016

How to run K-means clustering on iris dataset using pyspark on a Hadoop cluster through PyCharm and through Ubuntu terminal

I admit the title is a bit long, but it summarizes the content of this blog well.

I have a cluster of five nodes (1 master + 4 workers) running Ubuntu 14.04. Hadoop 2.6 and Spark 1.4.0 are installed and properly configured on all nodes. Python 2.7 and the important packages are also installed on all nodes (for instructions on installing the necessary packages on Ubuntu, see here).

I access the master node from my laptop via the Python IDE PyCharm. In my previous blog, I described how to enable PyCharm to execute a .py program remotely on the master (see here). In this blog, let's run K-means clustering on the iris dataset using pyspark on hdfs. We will go through:

(1) preparing the dataset
(2) loading the dataset onto hdfs
(3) configuring kmeans.py, running it, and monitoring it on the Spark Web UI

Here we go!

(1) The iris dataset is composed of 150 examples from 3 classes, each described by 4 attributes. More details on this dataset can be found in the UCI dataset repository. For my case, I need to trim the label column and keep only the data of the four attributes. Ideally, the figure below shows what I need as input:
To do this, I wrote a small .py program that does what I needed:
 import os.path
 import numpy as np
 from sklearn import datasets

 iris = datasets.load_iris()
 dt = iris.data.tolist()

 save_path = '/home/hduser/Documents/test_Spark'
 completeName = os.path.join(save_path, "iris.txt")
 file = open(completeName, "w")
 for i in range(150):
      file.write(str(dt[i][0])+' '+str(dt[i][1])+' '+str(dt[i][2])+' '+str(dt[i][3])+'\n')
 file.close()
I used the Python package "sklearn" to load the dataset. You can also download the dataset from the UCI repository and process it in many different ways.
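If you start from the raw UCI file instead of sklearn, the conversion is simple: UCI's iris.data is comma-separated with the class label in the last column. A minimal sketch (the function name and sample line below are my own, for illustration):

```python
# Convert one line of UCI's iris.data (comma-separated, label in the last
# column) into the space-separated, label-free format used above.
def uci_line_to_features(line):
    parts = line.strip().split(',')
    return ' '.join(parts[:4])  # keep the 4 attributes, drop the class label

print(uci_line_to_features("5.1,3.5,1.4,0.2,Iris-setosa"))  # -> 5.1 3.5 1.4 0.2
```

Applying this to every non-empty line of iris.data yields the same iris.txt as the sklearn script.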

I saved the dataset on master node, in the folder "test_Spark", which is synchronized with the "test_Spark" folder on my laptop.

(2) Next, let's copy the dataset "iris.txt" to hdfs. To do so, you have to turn on hdfs and make sure Hadoop works properly on the cluster. The Hadoop file system shell can do this work; simply type:

 $ hadoop fs -copyFromLocal ~/Documents/test_Spark/iris.txt /user/xywang  

/user/xywang is an existing folder on my hdfs. Change it to yours accordingly. To create a folder on hdfs, use
 $ hadoop fs -mkdir folder_name  
to view a folder content on hdfs, use
 $ hadoop fs -ls folder_name  
to view a file on hdfs, use
 $ hadoop fs -cat path/file_name  
Everything is just like using terminal commands on Ubuntu; you only need to prepend "hadoop fs".

By now, we should see "iris.txt" appear in the folder "xywang" on my hdfs.

(3) If you have the same distribution of Spark as I do, you can find a kmeans.py located in the examples folder of Spark. The full path on my master node is:
 spark-1.4.0-bin-hadoop2.6/examples/src/main/python/mllib  

The content of kmeans.py file is:
 from __future__ import print_function  
 import sys  
 import numpy as np  
 from pyspark import SparkContext  
 from pyspark.mllib.clustering import KMeans  
 def parseVector(line):  
   return np.array([float(x) for x in line.split(' ')])  
 if __name__ == "__main__":  
   if len(sys.argv) != 3:  
     print("Usage: kmeans <file> <k>", file=sys.stderr)  
     exit(-1)  
   sc = SparkContext(appName="KMeans")  
   lines = sc.textFile(sys.argv[1])  
   data = lines.map(parseVector)  
   k = int(sys.argv[2])  
   model = KMeans.train(data, k)  
   print("Final centers: " + str(model.clusterCenters))  
   print("Total Cost: " + str(model.computeCost(data)))  
   sc.stop()  
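The only nontrivial piece above is parseVector, which turns each text line of iris.txt into a NumPy float vector. You can sanity-check it locally, outside Spark (the sample line below is one iris example):

```python
import numpy as np

# Same logic as parseVector in kmeans.py: split on single spaces, cast to float.
def parse_vector(line):
    return np.array([float(x) for x in line.split(' ')])

v = parse_vector("5.1 3.5 1.4 0.2")
print(v.tolist())  # -> [5.1, 3.5, 1.4, 0.2]
```

Note that it splits on a single space, which is why the preparation script joins the four attributes with single spaces.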

There are two ways to execute this file: 1. through the Ubuntu terminal, or 2. through PyCharm.

For 1: As I access the master via ssh, I first need to enable communication between my laptop and the master node. It is more convenient here if you make it password-less. After connecting to the master node, on the terminal, type:
 python ~/spark-1.4.0-bin-hadoop2.6/examples/src/main/python/mllib/kmeans.py "hdfs://master:9000/user/xywang/iris.txt" "3"  
or go to the "bin" folder of your Spark, and type
 ./pyspark ~/spark-1.4.0-bin-hadoop2.6/examples/src/main/python/mllib/kmeans.py "hdfs://master:9000/user/xywang/iris.txt" "3"  

If everything is fine, you should see results like:
 Final centers: [array([ 5.006, 3.418, 1.464, 0.244]), array([ 6.85, 3.07368421, 5.74210526, 2.07105263]), array([ 5.9016129 , 2.7483871 , 4.39354839, 1.43387097])]  
 Total Cost: 78.9408414261    

But on the Spark web UI, you won't see any job info, because this job is actually executed only on the master node. The trick is that we need to add a parameter in "SparkContext()" to tell it that we want this job to be executed on the cluster:
sc = SparkContext(master="spark://IP_of_master:7077", appName="KMeans")  

Now run the command again; you should see "KMeans" under "Running Applications" on the Spark web UI while the job is running, and under "Completed Applications" when it is done.

For 2, to run it via PyCharm: I created a new Python file named "Kmeans.py" in the project "test_Spark" on my laptop. To distinguish it from the "kmeans.py" in the Spark examples folder, I used a capital "K". I copied the code of "kmeans.py" into Kmeans.py and modified "SparkContext" as described previously.

Next I need to send a copy of this file to the folder "test_Spark" on the master; otherwise an error will occur, saying "python: can't open file 'Kmeans.py': [Errno 2] No such file or directory". Simply right-click Kmeans.py in PyCharm and choose "upload to mini-cluster".

Then in the "terminal" tab of PyCharm, ssh to master node, and type the command to run Kmeans.py:
 $ python ~/Documents/test_Spark/Kmeans.py "hdfs://master:9000/user/xywang/iris.txt" "3"  

Here is the screenshot; you can see the code, the command, and the result in the terminal tab:
I used appName="Kmeans1", so after the job was done, it appeared under "Completed Applications" on my Spark Web UI, as shown in the previous screenshot.

Troubleshooting:
- The paths of $SPARK_HOME and PYTHONPATH should be set accordingly; otherwise the error "no module named pyspark" will occur.
- Numpy should be installed on all nodes (master + workers).
- Omitting master="spark://IP_of_master:7077" will run the job only on the master, not on the cluster.
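
For the first point, the usual fix is to export both variables in ~/.bashrc on every node. A sketch, assuming the install path from this post (the py4j zip name should match whatever actually ships in your Spark's python/lib folder):

```shell
# Example ~/.bashrc entries; adjust the paths to your own installation.
export SPARK_HOME=$HOME/spark-1.4.0-bin-hadoop2.6
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
```

After editing, run "source ~/.bashrc" (or reconnect) so the plain "python Kmeans.py ..." invocation can find the pyspark module.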


