I already have the IPython notebook and a Spark 1.6 distribution installed and configured on my PC (OS: Ubuntu). What I need to do is (1) set environment variables and (2) configure an IPython profile.
For (1). Open a terminal and type these two lines:
$ export SPARK_HOME="/home/xywang/spark-1.6.1-bin-hadoop2.6"
$ export PYSPARK_SUBMIT_ARGS="--master local[2]"
The 1st line tells your system where your Spark home directory is (change it to match your distribution), and the 2nd line tells Spark to run in local mode with two worker threads.
For (2). In a terminal, type:
$ ipython profile create pyspark
After this command, a directory "~/.ipython/profile_pyspark/" will be created. Go to its subdirectory "startup", create a file named "00-pyspark-setup.py", and fill in the following content:
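A minimal sketch of such a startup script, assuming SPARK_HOME is set as above and that your distribution ships "py4j-0.9-src.zip" (adjust both to your installation):

import os
import sys

# Read the Spark home directory from the environment variable set in step (1)
spark_home = os.environ.get("SPARK_HOME")
if not spark_home:
    raise ValueError("SPARK_HOME environment variable is not set")

# Put PySpark and its bundled py4j on the Python path
# (the py4j zip name depends on your Spark distribution)
sys.path.insert(0, os.path.join(spark_home, "python"))
sys.path.insert(0, os.path.join(spark_home, "python/lib/py4j-0.9-src.zip"))

# Run the PySpark shell initialization, which creates the SparkContext `sc`
shell_file = os.path.join(spark_home, "python/pyspark/shell.py")
exec(compile(open(shell_file, "rb").read(), shell_file, "exec"))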
As you might use a different Spark distribution than mine, you might need to change the name of "py4j-0.9-src.zip" and the "Spark 1.X" references accordingly.
After (1) and (2), you can start your IPython notebook with the pyspark profile; simply type this command:
$ ipython notebook --profile=pyspark
If everything is alright, you can use the SparkContext in the IPython notebook. For example, you can try to load a text file like this:
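A small example of what such a notebook cell might look like, assuming the startup script above has created the SparkContext sc (the file path below is just a placeholder):

# sc is created by the 00-pyspark-setup.py startup script
lines = sc.textFile("file:///home/xywang/data/sample.txt")  # placeholder path
print(lines.count())   # number of lines in the file
print(lines.first())   # first line of the file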
I feel that using the IPython notebook is better than typing commands in the terminal: it is easier for me to track my code and, of course, to copy/paste.
Credit to [1] and [2].
-------- Update for Spark-2.2.0 + Jupyter Notebook --------
An easier way to link pyspark with jupyter notebook is to add values to
PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS in ~/.profile such that:
$ echo "export PYSPARK_DRIVER_PYTHON=jupyter" >> ~/.profile
$ echo "export PYSPARK_DRIVER_PYTHON_OPTS='notebook'" >> ~/.profile
$ source ~/.profile
$ pyspark
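With these variables set, running pyspark launches a Jupyter notebook directly; inside a new notebook the PySpark shell has already created the SparkSession spark and the SparkContext sc, so a cell like the following should work (the file path is again just a placeholder):

# spark and sc are created automatically when the notebook is started via pyspark
df = spark.read.text("file:///home/xywang/data/sample.txt")  # placeholder path
df.show(5)   # print the first five lines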
Reference: [3]