Monday, June 27, 2016

How to configure IPython notebook for Spark in single-node mode

Programming in an IPython notebook is much more convenient thanks to its interactive interface. Tired of writing PySpark code in the terminal, here I will show how to link PySpark to IPython notebook for Spark in single-node mode.

I already have IPython notebook and the Spark 1.6 distribution installed and configured on my PC (OS: Ubuntu). What I need to do is (1) set environment variables and (2) configure an IPython profile.

For (1), open a terminal and type these two lines:

$ export SPARK_HOME="/home/xywang/spark-1.6.1-bin-hadoop2.6"

$ export PYSPARK_SUBMIT_ARGS="--master local[2]"

The first line tells your system where your Spark home directory is (change it according to your distribution), and the second line specifies that Spark will run locally in single-node mode, using two worker threads. Note that these exports only last for the current terminal session; see the note below for making them permanent.
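To make these settings permanent, you can append the same exports to ~/.bashrc (a minimal sketch, reusing the paths above; adjust them to your own setup):

$ echo 'export SPARK_HOME="/home/xywang/spark-1.6.1-bin-hadoop2.6"' >> ~/.bashrc
$ echo 'export PYSPARK_SUBMIT_ARGS="--master local[2]"' >> ~/.bashrc
$ source ~/.bashrc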

For (2), in a terminal, type:
$ ipython profile create pyspark

After this command, a directory "~/.ipython/profile_pyspark/" will be created. Go to its subdirectory "startup", create a file named "00-pyspark-setup.py", and fill in the following content:
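(A minimal sketch of this startup script, assuming the Spark 1.6 layout used in this post; the py4j zip name and paths may differ for your distribution.)

import os
import sys

# Locate the Spark installation from the SPARK_HOME variable set in step (1).
spark_home = os.environ.get('SPARK_HOME', None)
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')

# Put PySpark and its bundled py4j dependency on the Python path.
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.9-src.zip'))

# Run the PySpark shell startup script, which creates the SparkContext as sc.
exec(open(os.path.join(spark_home, 'python/pyspark/shell.py')).read())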


As you might be using a different Spark distribution than mine, you may need to change the name of "py4j-0.9-src.zip" and the "Spark 1.X" version accordingly.

After (1) and (2), you can start IPython notebook with the pyspark profile by simply typing this command:

$ ipython notebook --profile=pyspark

If everything is alright, a SparkContext will be available as "sc" in the IPython notebook. For example, you can try to load a text file like this:
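(A minimal sketch; the README.md that ships with Spark is used here only as a convenient local text file.)

lines = sc.textFile("/home/xywang/spark-1.6.1-bin-hadoop2.6/README.md")
lines.count()   # number of lines in the file
lines.first()   # first line of the file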



I feel using IPython notebook is better than typing commands in the terminal; it is easier for me to track my code and, of course, to copy/paste.

Credit to [1] and [2].

-------- Update for Spark-2.2.0 + Jupyter Notebook --------

An easier way to link PySpark with Jupyter notebook is to set
PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS in ~/.profile, such that:

$ echo "export PYSPARK_DRIVER_PYTHON=jupyter" >> .profile
$ echo "export PYSPARK_DRIVER_PYTHON_OPTS='notebook' pyspark" >> .profile
$ source .profile
With this setting, typing "pyspark" in a terminal will start a Jupyter notebook with a SparkContext ready to use.
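Inside the notebook, the SparkContext is again available as "sc", so a quick sanity check (just a sketch) is:

sc.version                        # e.g. '2.2.0'
sc.parallelize(range(100)).sum()  # should return 4950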

Reference: [3]