Launching H2O cluster on different port in pysparkling

In this example we will launch an H2O machine learning cluster using the pysparkling package. You can visit my GitHub and this article to learn more about the code execution explained here.

First, you need to install pysparkling in a Python 2.7 setup as below:

> pip install -U h2o_pysparkling_2.1

Now we can launch the pysparkling shell as below:

~/tools/sw2/sparkling-water-2.1.14 $ bin/pysparkling

Python script to launch the H2O cluster in pysparkling:

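## Importing libraries
from pysparkling import *
import h2o

## Creating the H2O conf object
h2oConf = H2OConf(sc)

## Setting the H2O conf for a different port
## (setter names as provided by H2OConf in pysparkling 2.1)
h2oConf.set_client_port_base(54300)
h2oConf.set_node_base_port(54300)

## Getting the H2O conf object to see the configuration
h2oConf

## Launching the H2O cluster
hc = H2OContext.getOrCreate(spark, h2oConf)

## Getting the H2O cluster status
h2o.cluster_status()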
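
Before picking a base port such as 54300, it can be worth checking that nothing else on the machine is already bound to it. The snippet below is a minimal sketch using only the Python standard library; `port_is_free` is a hypothetical helper name, not part of pysparkling or h2o, and it only checks the local host at the moment it runs.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can be bound to (host, port) right now.

    Hypothetical helper for illustration; not a pysparkling or h2o API.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        return True
    except socket.error:
        # Binding failed, most likely because the port is already in use.
        return False
    finally:
        sock.close()
```

In the pysparkling shell you could call `port_is_free(54300)` before creating the H2OContext and choose another base port if it returns False.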

Now if you check the Sparkling Water configuration, you will see that H2O is running on the given IP and port 54300 as configured:

Sparkling Water configuration:
  backend cluster mode : internal
  workers              : None
  cloudName            : Not set yet, it will be set automatically before starting H2OContext.
  flatfile             : true
  clientBasePort       : 54300
  nodeBasePort         : 54300
  cloudTimeout         : 60000
  h2oNodeLog           : INFO
  h2oClientLog         : WARN
  nthreads             : -1
  drddMulFactor        : 10
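
To check these values in a script rather than by eye, the printout can be parsed into a dictionary. This is a rough sketch that assumes the configuration text has already been captured as a string in the format shown above; `parse_sw_config` is a hypothetical helper, not a pysparkling API.

```python
def parse_sw_config(text):
    """Parse 'key : value' lines into a dict of strings.

    Hypothetical helper; lines without a value (such as the heading) are skipped.
    """
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep or not value.strip():
            continue
        settings[key.strip()] = value.strip()
    return settings


# Sample text copied from the configuration printout (shortened).
sample = """Sparkling Water configuration:
  backend cluster mode : internal
  clientBasePort       : 54300
  nodeBasePort         : 54300
  cloudTimeout         : 60000
"""

config = parse_sw_config(sample)
ports_ok = (config["clientBasePort"] == "54300"
            and config["nodeBasePort"] == "54300")
```

All values are kept as strings, so compare against `"54300"` or convert with `int()` as needed.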

That's it, enjoy!!

Using Sparkling Water and PySpark to log console output

Here is the command, option #1:

./pyspark --deploy-mode client --conf spark.dynamicAllocation.enabled=false --packages com.databricks:spark-csv_2.11:1.4.0 --py-files ../../sparkling-water-1.6.7/py/dist/h2o_pysparkling_1.6-1.6.7-py2.7.egg

Here is the command, option #2:

./pyspark --deploy-mode client --conf spark.dynamicAllocation.enabled=false --packages com.databricks:spark-csv_2.11:1.4.0,ai.h2o:sparkling-water-core_2.10:1.6.7 --py-files ../../sparkling-water-1.6.7/py/dist/h2o_pysparkling_1.6-1.6.7-py2.7.egg

We must make sure that both the H2O backend and the Python package use the same version of the API.

This parameter (from option #2) uses version 1.6.7 of the H2O backend API:
--packages ai.h2o:sparkling-water-core_2.10:1.6.7

This parameter uses version 1.6.7 of the Python API:
--py-files /mnt/app/sparkling-water-1.6.7/py/dist/h2o_pysparkling_1.6-1.6.7-py2.7.egg
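
The version check can also be done mechanically before launching pyspark. The following is a small sketch under the assumption that the package coordinate has the form `group:artifact:version` and the egg file follows the naming seen in the commands above; the function names are hypothetical and not part of Sparkling Water.

```python
import re


def package_version(coordinate):
    """Return the version, i.e. the last ':'-separated field of the coordinate."""
    return coordinate.rsplit(":", 1)[-1]


def egg_version(path):
    """Extract the version from a name like h2o_pysparkling_1.6-1.6.7-py2.7.egg.

    Returns None if the file name does not follow that pattern.
    """
    match = re.search(r"h2o_pysparkling_[\d.]+-([\d.]+)-py", path)
    return match.group(1) if match else None


def versions_match(coordinate, egg_path):
    """True when the backend package and the Python egg carry the same version."""
    return package_version(coordinate) == egg_version(egg_path)


backend = "ai.h2o:sparkling-water-core_2.10:1.6.7"
egg = "/mnt/app/sparkling-water-1.6.7/py/dist/h2o_pysparkling_1.6-1.6.7-py2.7.egg"
same_version = versions_match(backend, egg)
```

If `same_version` is False, fix the `--packages` or `--py-files` argument before starting the shell.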

Here is the script to test the overall scenario:

>>> from pysparkling import *
>>> from pyspark import SparkContext
>>> from pyspark.sql import SQLContext
>>> import h2o
>>> sqlContext = SQLContext(sc)
>>> hc = H2OContext.getOrCreate(sc)