Running python and pysparkling with Zeppelin and YARN on Hadoop

Apache Zeppelin provides cell-based notebooks (similar to Jupyter) that work with applications such as Spark, Python, Hive, and HBase through its interpreters.
With H2O and Sparkling Water you can run Zeppelin on a Hadoop cluster with YARN and then submit jobs using Python or pysparkling.
Here are the steps for using pysparkling with YARN on a Hadoop cluster.
1. Get the latest build of Sparkling Water from here
2. Download and unzip the Sparkling Water version compatible with your Spark version onto one of the edge nodes in your Hadoop cluster.
3. Set the following environment variables to the right paths before starting the services:
export MASTER="yarn-client"   # submit to the YARN cluster
export SPARK_HOME="path_to_the_directory_where_spark_unzipped"
export HADOOP_CONF_DIR="path_to_the_hadoop_installation"
export SPARK_SUBMIT_OPTIONS="--packages ai.h2o:sparkling-water-examples_2.11:2.1.0"
export PYTHONPATH="_path_to_where_python_installed"
# Update the path below so it points at your Sparkling Water egg file
export SPARKLING_EGG=$(ls -t /sparkling-water-2.1.0/py/build/dist/h2o_pysparkling*.egg | head -1)
Make sure the version values above match your setup:
  • 2.11 refers to the Scala version.
  • 2.1.0 refers to the Sparkling Water version, which tracks the Spark version (here Spark 2.1).
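To make the version check concrete, here is a minimal sketch that splits the package coordinate into its parts and compares it with a Spark version. The helper names (parse_coordinate, matches_spark) are made up for illustration, and the rule that only the major.minor parts need to agree is an assumption, not something the Sparkling Water documentation is quoted on here.

```python
from typing import NamedTuple


class Coordinate(NamedTuple):
    group: str
    artifact: str
    scala_version: str
    version: str


def parse_coordinate(coord: str) -> Coordinate:
    """Split 'group:artifact_scalaVersion:version' into its parts."""
    group, artifact_with_scala, version = coord.split(":")
    # The Scala binary version is the suffix after the last underscore.
    artifact, _, scala_version = artifact_with_scala.rpartition("_")
    return Coordinate(group, artifact, scala_version, version)


def matches_spark(coord: Coordinate, spark_version: str) -> bool:
    """Assumed rule: major.minor of the package must equal major.minor of Spark."""
    return coord.version.split(".")[:2] == spark_version.split(".")[:2]


coord = parse_coordinate("ai.h2o:sparkling-water-examples_2.11:2.1.0")
print(coord.scala_version, coord.version)
print(matches_spark(coord, "2.1.1"))
```

Running a check like this before exporting SPARK_SUBMIT_OPTIONS can catch a mismatched download early.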
4. If Error 143 appears while starting the Zeppelin server, set "spark.executor.memory 4g" in Zeppelin, either in the configuration file or in the Zeppelin UI.
Note: To configure it in the Zeppelin UI, open the dropdown next to the user name at the top right corner, select Interpreters, and in the Spark section either edit or add the setting.
5. Start the Zeppelin server using the command below. This starts Zeppelin in a YARN container.
bin/ -Pspark-2.1
6. In a Zeppelin notebook, create a new note containing the line below, using the path to your egg file. This adds the pysparkling dependency and its classes.
sc.addPyFile("_path_to_the egg_file_on_disk/h2o_pysparkling_2.1-2.1.99999-py2.7.egg")
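If you would rather not hard-code the egg file name, the lookup that SPARKLING_EGG does with ls -t and head -1 can be sketched in Python as below. The function name newest_egg is hypothetical, and the sketch assumes the eggs sit directly in the directory you pass in.

```python
import glob
import os


def newest_egg(dist_dir: str) -> str:
    """Return the most recently modified h2o_pysparkling egg in dist_dir."""
    pattern = os.path.join(dist_dir, "h2o_pysparkling*.egg")
    eggs = glob.glob(pattern)
    if not eggs:
        raise FileNotFoundError("no egg matching " + pattern)
    # Same idea as `ls -t ... | head -1`: pick the latest modification time.
    return max(eggs, key=os.path.getmtime)
```

In a note, the result could then be passed along as sc.addPyFile(newest_egg("/sparkling-water-2.1.0/py/build/dist")), with the directory adjusted to your installation.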
7. Now you can start calling pysparkling APIs like below:
sc.addPyFile("_path_to_the egg_file_on_disk/h2o_pysparkling_2.1-2.1.99999-py2.7.egg")
from pysparkling import *
from pyspark import SparkContext
from pyspark.sql import SQLContext
import h2o
hc = H2OContext.getOrCreate(sc)
8. To use Sparkling Water from Scala, you do not need to add the dependency explicitly in the Zeppelin note. A sample script looks like this:


import org.apache.spark.h2o._
val rdd = sc.parallelize(1 to 1000, 100).map( v => IntHolder(Some(v)))
val h2oContext = H2OContext.getOrCreate(sc)
That's all, enjoy!

Building Hadoop Source in OSX

Step 1. Select your desired Hadoop Branch from a list below:

Step 2. Use svn to check out the source from the branch, e.g.:

$ svn co hadoop-2.0.5

Note: The above command downloads the Hadoop branch 2.0.5 alpha source code into a folder named hadoop-2.0.5.

Step 3: Change your current folder to hadoop-2.0.5, which will be treated as the Hadoop source root folder.

Step 4: Now open pom.xml and verify the hadoop-main version as below to make sure this is the branch you are targeting:
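The same check can be done programmatically. Below is a minimal sketch that reads the artifactId and version from a POM using the standard Maven POM namespace. SAMPLE_POM is an illustrative stand-in written for this example, not the actual Hadoop pom.xml, and pom_identity is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Maven POM files declare this default XML namespace.
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}

# Illustrative stand-in for the top of a root pom.xml (assumed values).
SAMPLE_POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-main</artifactId>
  <version>2.0.5-alpha</version>
</project>
"""


def pom_identity(pom_text: str) -> tuple:
    """Return (artifactId, version) declared directly under <project>."""
    root = ET.fromstring(pom_text)
    artifact = root.findtext("m:artifactId", namespaces=POM_NS)
    version = root.findtext("m:version", namespaces=POM_NS)
    return artifact, version


print(pom_identity(SAMPLE_POM))
```

To check a real checkout, read the file first, for example pom_identity(open("pom.xml").read()) from the source root.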


Step 5: Now open the BUILDING.txt file and note the requirements listed there:

* JDK 1.6
* Maven 3.0
* Findbugs 1.3.9 (if running findbugs)
* ProtocolBuffer 2.4.1+ (for MapReduce and HDFS)
* CMake 2.6 or newer (if compiling native code)
* Internet connection for first build (to fetch all Maven and Hadoop dependencies)
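A quick way to see which of these tools are already installed is to look for them on PATH. The sketch below assumes the usual executable names (java, mvn, protoc, cmake); it only checks presence, not the versions required above, and missing_tools is a hypothetical helper.

```python
import shutil

# Executable names assumed for the tools listed in BUILDING.txt.
REQUIRED_TOOLS = ["java", "mvn", "protoc", "cmake"]


def missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    print("missing:", ", ".join(missing) if missing else "none")
```

The which parameter exists only so the lookup can be swapped out when testing; in normal use the default shutil.which is enough.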

Step 6: Make sure you have everything listed in step 5; if not, use the info below to install the required components:

  • Maven 3.0.4 works fine
  • For ProtocolBuffer just download it from here, then build and install it:
      $ ./configure
      $ make
      $ make install
  • For CMake you can use brew on OSX:
      $ brew install cmake

Step 7: From your Hadoop source root, run the following commands in order to compile the source and build the package:

  •  $ mvn -version
  •  $ mvn clean
  •  $ mvn install  -DskipTests
  •  $ mvn compile  -DskipTests
  •  $ mvn package  -DskipTests
  •  $ mvn package -Pdist -DskipTests -Dtar
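If you want to run the sequence unattended, the commands above can be driven from a small script that stops at the first failure. This is a sketch only: run_build is a hypothetical helper, and it assumes mvn is on PATH and that cwd is the Hadoop source root.

```python
import subprocess

# The Maven invocations from Step 7, in order.
BUILD_STEPS = [
    ["mvn", "-version"],
    ["mvn", "clean"],
    ["mvn", "install", "-DskipTests"],
    ["mvn", "compile", "-DskipTests"],
    ["mvn", "package", "-DskipTests"],
    ["mvn", "package", "-Pdist", "-DskipTests", "-Dtar"],
]


def run_build(steps, cwd=".", run=subprocess.call):
    """Run each step in cwd; return the first failing step, or None."""
    for step in steps:
        # subprocess.call returns the exit code; non-zero means failure.
        if run(step, cwd=cwd) != 0:
            return step
    return None
```

Calling run_build(BUILD_STEPS, cwd="hadoop-2.0.5") returns None when every step exits with code 0, and otherwise returns the command that failed so you know where to look.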

Now you can go to the hadoop-2.0.5/hadoop-dist/target/hadoop-2.0.5-alpha/bin folder and run the Hadoop commands (hadoop, hdfs, mapred, etc.) as below:

~/work/hadoop-2.0.5/hadoop-dist/target/hadoop-2.0.5-alpha/bin$ ./hadoop version
Hadoop 2.0.5-alpha
Subversion -r 1511192
Compiled by hadoopworld on 2013-08-07T07:01Z
From source with checksum c8f4bd45ac25c31b815f311b32ef17
This command was run using ~/work/hadoop-2.0.5/hadoop-dist/target/hadoop-2.0.5-alpha/share/hadoop/common/hadoop-common-2.0.5-alpha.jar
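To confirm from a script that the build produced the release you expected, the first line of that output can be parsed. The sketch below works on a captured string; parse_hadoop_version is a hypothetical helper, and the sample text is copied from the output above.

```python
import re


def parse_hadoop_version(output: str):
    """Extract the release string from `hadoop version` output, or None."""
    match = re.search(r"^Hadoop\s+(\S+)$", output, flags=re.MULTILINE)
    return match.group(1) if match else None


sample = "Hadoop 2.0.5-alpha\nSubversion -r 1511192\n"
print(parse_hadoop_version(sample))
```

On a real build, the string could come from subprocess.check_output(["./hadoop", "version"], text=True) run inside the bin folder.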