I have a cluster of five nodes (1 master + 4 workers) running Ubuntu 14.04. Hadoop 2.6 and Spark 1.4.0 are installed and properly configured on all nodes. Python 2.7 and the important packages are also installed on all nodes (for instructions on installing the necessary packages on Ubuntu, see here).
I access the master node from my laptop via the Python IDE PyCharm. In my previous blog, I described how to enable PyCharm to execute a .py program remotely on the master (see here). In this blog, let's run K-means clustering on the iris dataset using pyspark on hdfs. We will go through:
(1) prepare dataset
(2) load dataset onto hdfs
(3) configure Kmeans.py, run it and monitor it on the Spark Web UI
Here we go!
(1) The iris dataset is composed of 150 examples from 3 classes, described by 4 attributes. More details on this dataset can be found in the UCI dataset repository. In my case, I need to trim the label column and keep only the data of the four attributes. Ideally, the figure below is what I need as input:
To do this, I wrote a small .py program:
import os.path
from sklearn import datasets

# load the iris data (150 rows x 4 attributes), without the label column
iris = datasets.load_iris()
dt = iris.data.tolist()

save_path = '/home/hduser/Documents/test_Spark'
completeName = os.path.join(save_path, "iris.txt")

# write one example per line, attributes separated by a single space
with open(completeName, "w") as f:
    for row in dt:
        f.write(' '.join(str(x) for x in row) + '\n')
I used the Python package "sklearn" to load the dataset. You can also download the dataset from the UCI repository and process it in many different ways. I saved the dataset on the master node, in the folder "test_Spark", which is synchronized with the "test_Spark" folder on my laptop.
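If you go the download route instead, the conversion could look roughly like the sketch below. It assumes the UCI file is the usual comma-separated "iris.data" with the class label in the last column; the function names strip_label and convert are made up for this example and are not part of the program above.

```python
def strip_label(csv_line):
    """Turn '5.1,3.5,1.4,0.2,Iris-setosa' into '5.1 3.5 1.4 0.2'."""
    fields = csv_line.strip().split(",")
    features = fields[:4]                 # the four numeric attributes
    for x in features:
        float(x)                          # raises ValueError on a malformed row
    return " ".join(features)


def convert(src_path, dst_path):
    """Write one space-separated row per non-empty line of src_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            if line.strip():              # skip blank lines, e.g. at the end of the file
                dst.write(strip_label(line) + "\n")


print(strip_label("5.1,3.5,1.4,0.2,Iris-setosa"))   # 5.1 3.5 1.4 0.2
```

Calling convert("iris.data", "iris.txt") would then produce a file in the same format as the one written by the sklearn version, with a single space between attributes.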
(2) Next, let's copy the dataset "iris.txt" to hdfs. To do so, hdfs has to be turned on and Hadoop must work properly on the cluster. The Hadoop file system shell does the job; simply type:
$ hadoop fs -copyFromLocal ~/Documents/test_Spark/iris.txt /user/xywang
/user/xywang is an existing folder on my hdfs. Change it to yours accordingly. To create a folder on hdfs, use
$ hadoop fs -mkdir folder_name
To view the content of a folder on hdfs, use $ hadoop fs -ls folder_name/*
To view a file on hdfs, use $ hadoop fs -cat path/file_name
Everything works just like the terminal commands on Ubuntu; you only need to add "hadoop fs" in front. By now, "iris.txt" should appear in the folder "xywang" on my hdfs.
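If you prefer to script these steps rather than type them, a thin Python wrapper around the same commands is one option. This is only a sketch: hadoop_fs_args and run_hadoop_fs are hypothetical helper names, and the example below only prints the commands so that it runs on a machine without Hadoop.

```python
import subprocess


def hadoop_fs_args(action, *paths):
    """Build the argument list for `hadoop fs -<action> <paths...>`."""
    return ["hadoop", "fs", "-" + action] + list(paths)


def run_hadoop_fs(action, *paths):
    """Run the command; raises CalledProcessError if hadoop exits non-zero."""
    subprocess.check_call(hadoop_fs_args(action, *paths))


# Commands equivalent to the ones typed above (printed here, not executed).
local_file = "/home/hduser/Documents/test_Spark/iris.txt"
hdfs_dir = "/user/xywang"
for args in (hadoop_fs_args("mkdir", hdfs_dir),
             hadoop_fs_args("copyFromLocal", local_file, hdfs_dir),
             hadoop_fs_args("ls", hdfs_dir)):
    print(" ".join(args))
```

On the master node, run_hadoop_fs("copyFromLocal", local_file, hdfs_dir) would perform the actual upload, provided the hadoop command is on the PATH.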
(3) If you have the same distribution of Spark as me, you can find a kmeans.py in the example folder of Spark. The full path on my master node is:
spark-1.4.0-bin-hadoop2.6/examples/src/main/python/mllib
The content of kmeans.py file is:
from __future__ import print_function

import sys

import numpy as np
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans


def parseVector(line):
    return np.array([float(x) for x in line.split(' ')])


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: kmeans <file> <k>", file=sys.stderr)
        exit(-1)
    sc = SparkContext(appName="KMeans")
    lines = sc.textFile(sys.argv[1])
    data = lines.map(parseVector)
    k = int(sys.argv[2])
    model = KMeans.train(data, k)
    print("Final centers: " + str(model.clusterCenters))
    print("Total Cost: " + str(model.computeCost(data)))
    sc.stop()
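To make the two printed quantities less of a black box, here is a small pure-Python sketch of what the script does to each line and how the "Total Cost" is defined (the sum of squared distances from each point to its nearest center). It is not Spark code: parse_vector returns a plain list instead of a numpy array, the centers are hand-picked rather than trained, and the data is a toy set, not iris.

```python
def parse_vector(line):
    """Same idea as parseVector, but returns a plain list of floats."""
    return [float(x) for x in line.split(' ')]


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal length."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def total_cost(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


lines = ["1.0 1.0", "1.0 3.0", "9.0 1.0", "9.0 3.0"]   # toy data, not iris
points = [parse_vector(l) for l in lines]
centers = [[1.0, 2.0], [9.0, 2.0]]                       # hand-picked centers
print(total_cost(points, centers))                       # 4.0
```

Each toy point lies at distance 1 from its nearest center, so the cost is 4.0. Note that parse_vector, like parseVector, splits on a single space, which is why iris.txt must use exactly one space between attributes.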
There are two ways to execute kmeans.py: 1. through the Ubuntu terminal, or 2. through PyCharm.
For 1: as I access the master via ssh, I first need to enable communication between my laptop and the master node. It is more convenient if you make it password-less. After connecting to the master node, type in the terminal:
python ~/spark-1.4.0-bin-hadoop2.6/examples/src/main/python/mllib/kmeans.py "hdfs://master:9000/user/xywang/iris.txt" "3"
or go to the "bin" folder of your Spark and type ./pyspark ~/spark-1.4.0-bin-hadoop2.6/examples/src/main/python/mllib/kmeans.py "hdfs://master:9000/user/xywang/iris.txt" "3"
If everything is fine, you should see results like:
Final centers: [array([ 5.006, 3.418, 1.464, 0.244]), array([ 6.85, 3.07368421, 5.74210526, 2.07105263]), array([ 5.9016129 , 2.7483871 , 4.39354839, 1.43387097])]
Total Cost: 78.9408414261
But on the Spark web UI, you won't see any job info, because this job is actually executed only on the master node. The trick is that we need to add a parameter to "SparkContext()" to tell it that we want this job to be executed on the cluster:
sc = SparkContext(
master="spark://IP_of_master:7077", appName="KMeans")
Now run the command again; you should see "KMeans" under "Running Applications" on the Spark web UI while the job is running, and under "Completed Applications" when it is done.
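If you would rather not hard-code the master address in the script, one possible variation is to build the SparkContext arguments from an environment variable. This is a sketch under my own assumptions: context_kwargs is a hypothetical helper, SPARK_MASTER_HOST_FOR_JOB is a made-up variable name, and 7077 is simply the port used in the snippet above.

```python
import os


def context_kwargs(app_name, master_host=None, port=7077):
    """Keyword arguments for SparkContext(); master is set only when a host is given."""
    kwargs = {"appName": app_name}
    if master_host:
        kwargs["master"] = "spark://%s:%d" % (master_host, port)
    return kwargs


# SPARK_MASTER_HOST_FOR_JOB is a made-up variable name for this sketch.
kwargs = context_kwargs("KMeans", os.environ.get("SPARK_MASTER_HOST_FOR_JOB"))
print(kwargs)

# In kmeans.py this would replace the hard-coded call:
#     sc = SparkContext(**kwargs)
```

With the variable unset, no master is passed and the script behaves like the unmodified example; with it set to the master's IP, the job is submitted to the cluster.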
For 2, running it via PyCharm: I created a new Python file named "Kmeans.py" in the project "test_Spark" on my laptop. To distinguish it from the "kmeans.py" in the Spark example folder, I used a capital "K". I copied the code of "kmeans.py" to Kmeans.py and modified "SparkContext" as described previously.
Next I need to send a copy of this file to the folder "test_Spark" on the master; otherwise an error will occur, saying "python: can't open file 'Kmeans.py': [Errno 2] No such file or directory". Simply right-click Kmeans.py in PyCharm and choose "upload to mini-cluster".
Then, in the "terminal" tab of PyCharm, ssh to the master node and type the command to run Kmeans.py:
$ python ~/Documents/test_Spark/Kmeans.py "hdfs://master:9000/user/xywang/iris.txt" "3"
Here is the screenshot, you can see the code, the command and the result in the terminal tab:
I used appName="Kmeans1"; after the job was done, it appeared under "Completed Applications" on my Spark Web UI, as shown in the previous screenshot.
Troubleshooting:
- $SPARK_HOME and PYTHONPATH should be set to match your installation; otherwise an error saying there is no module named pyspark will occur.
- NumPy should be installed on all nodes (master + workers).
- omitting master="spark://IP_of_master:7077" runs the job only on the master, not on the cluster.
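For the first point, an alternative to exporting PYTHONPATH in the shell is to extend sys.path at the top of the script. The sketch below assumes the layout of the Spark binary distributions I have seen, where the Python sources sit in $SPARK_HOME/python and the py4j archive in $SPARK_HOME/python/lib; pyspark_paths is a hypothetical helper name, and your layout may differ.

```python
import glob
import os
import sys


def pyspark_paths(spark_home):
    """Directory and zip files that must be importable for `import pyspark` to work."""
    python_dir = os.path.join(spark_home, "python")
    # The py4j version differs between Spark releases, so match it with a pattern.
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    return [python_dir] + py4j_zips


spark_home = os.environ.get("SPARK_HOME")
if spark_home:
    for path in pyspark_paths(spark_home):
        if path not in sys.path:
            sys.path.insert(0, path)
    print("added to sys.path: %s" % pyspark_paths(spark_home))
else:
    print("SPARK_HOME is not set")
```

Placed before "from pyspark import SparkContext", this lets the plain "python Kmeans.py ..." invocation find pyspark as long as SPARK_HOME is set.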