Setting H2O FLOW directory path in Sparkling Water

Sometimes you may want to back up H2O FLOW files to a source code repo or to a backup location. For that reason, you may want to change the default FLOW directory.

In H2O, the -flow_dir flag is used to set the local folder for FLOW files.

Note: You can always specify any H2O property by using system properties on Spark driver/executors.

So to change the H2O FLOW directory, append the following to your Sparkling Water command line:

--conf spark.driver.extraJavaOptions="-Dai.h2o.flow_dir=/your/backup/location"
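
For example, a full launch could look like the following (a hedged sketch; bin/sparkling-shell and the backup path are placeholders for your own setup):

$ bin/sparkling-shell --conf spark.driver.extraJavaOptions="-Dai.h2o.flow_dir=/your/backup/location"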

That's it, thanks.

Binomial classification example in Scala with H2O GLM

Here is a sample for a binomial classification problem using the H2O GLM algorithm on the Credit Card dataset, written in Scala. This sample was created using Spark 2.1.0 with Sparkling Water 2.1.4.

import org.apache.spark.h2o._
import water.support.SparkContextSupport.addFiles
import org.apache.spark.SparkFiles
import java.io.File
import water.support.{H2OFrameSupport, SparkContextSupport, ModelMetricsSupport}
import water.Key
import _root_.hex.glm.GLMModel
import _root_.hex.ModelMetricsBinomial


val hc = H2OContext.getOrCreate(sc)
import hc._
import hc.implicits._

addFiles(sc, "/Users/avkashchauhan/learn/deepwater/credit_card_clients.csv")
val creditCardData = new H2OFrame(new File(SparkFiles.get("credit_card_clients.csv")))

val ratios = Array[Double](0.8)
val keys = Array[String]("train.hex", "valid.hex")
val frs = H2OFrameSupport.split(creditCardData, keys, ratios)
val (train, valid) = (frs(0), frs(1))
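
Note: GLM with Family.binomial expects a categorical response column. If default_payment_next_month is read in as a numeric 0/1 column, you may need to convert it before training; a hedged sketch, assuming the withLockAndUpdate helper from H2OFrameSupport:

// Hedged sketch: convert the numeric 0/1 response column to a categorical vec
H2OFrameSupport.withLockAndUpdate(train) { fr =>
  fr.replace(fr.find("default_payment_next_month"), fr.vec("default_payment_next_month").toCategoricalVec)
}
H2OFrameSupport.withLockAndUpdate(valid) { fr =>
  fr.replace(fr.find("default_payment_next_month"), fr.vec("default_payment_next_month").toCategoricalVec)
}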

def buildGLMModel(train: Frame, valid: Frame, response: String)
                 (implicit h2oContext: H2OContext): GLMModel = {
  import _root_.hex.glm.GLM
  import _root_.hex.glm.GLMModel.GLMParameters
  import _root_.hex.glm.GLMModel.GLMParameters.Family
  val glmParams = new GLMParameters(Family.binomial)
  glmParams._train = train
  glmParams._valid = valid
  glmParams._response_column = response
  glmParams._alpha = Array[Double](0.5)
  val glm = new GLM(glmParams, Key.make("glmModel.hex"))
  glm.trainModel().get()
}

val glmModel = buildGLMModel(train, valid, 'default_payment_next_month)(hc)

// Collect model metrics and evaluate model quality
val trainMetrics = ModelMetricsSupport.modelMetrics[ModelMetricsBinomial](glmModel, train)
val validMetrics = ModelMetricsSupport.modelMetrics[ModelMetricsBinomial](glmModel, valid)
println(s"Train RMSE: ${trainMetrics.rmse}")
println(s"Valid RMSE: ${validMetrics.rmse}")
println(s"Train MSE: ${trainMetrics.mse}")
println(s"Valid MSE: ${validMetrics.mse}")
println(s"Train R2: ${trainMetrics.r2}")
println(s"Valid R2: ${validMetrics.r2}")
println(s"Train AUC: ${trainMetrics.auc}")
println(s"Valid AUC: ${validMetrics.auc}")

// Prediction
addFiles(sc, "/Users/avkashchauhan/learn/deepwater/credit_card_predict.csv")
val creditPredictData = new H2OFrame(new File(SparkFiles.get("credit_card_predict.csv")))

val predictionFrame = glmModel.score(creditPredictData)
val predictionResults = asRDD[DoubleHolder](predictionFrame).collect.map(_.result.getOrElse(Double.NaN))
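
If you would rather inspect the predictions as a Spark DataFrame than as raw doubles, you can convert the scored frame back; a hedged sketch, assuming the spark session from the shell provides the implicit SQLContext:

// Hedged sketch: view the scored H2OFrame as a Spark DataFrame
implicit val sqlContext = spark.sqlContext
val predictionsDF = hc.asDataFrame(predictionFrame)
predictionsDF.show(10)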

That's it, enjoy!

Multinomial classification example in Scala with H2O Deep Learning

Here is a sample for a multinomial classification problem using the H2O Deep Learning algorithm on the Iris dataset, written in Scala. This sample was created using Spark 2.1.0 with Sparkling Water 2.1.4.

import org.apache.spark.h2o._
import water.support.SparkContextSupport.addFiles
import org.apache.spark.SparkFiles
import java.io.File
import water.support.{H2OFrameSupport, SparkContextSupport, ModelMetricsSupport}
import water.Key
import _root_.hex.deeplearning.DeepLearningModel
import _root_.hex.ModelMetricsMultinomial


val hc = H2OContext.getOrCreate(sc)
import hc._
import hc.implicits._

addFiles(sc, "/Users/avkashchauhan/smalldata/iris/iris.csv")
val irisData = new H2OFrame(new File(SparkFiles.get("iris.csv")))

val ratios = Array[Double](0.8)
val keys = Array[String]("train.hex", "valid.hex")
val frs = H2OFrameSupport.split(irisData, keys, ratios)
val (train, valid) = (frs(0), frs(1))

def buildDLModel(train: Frame, valid: Frame, response: String,
                 epochs: Int = 10, l1: Double = 0.001, l2: Double = 0.0,
                 hidden: Array[Int] = Array[Int](200, 200))
                (implicit h2oContext: H2OContext): DeepLearningModel = {
  import h2oContext.implicits._
  // Build a model
  import _root_.hex.deeplearning.DeepLearning
  import _root_.hex.deeplearning.DeepLearningModel.DeepLearningParameters
  val dlParams = new DeepLearningParameters()
  dlParams._train = train
  dlParams._valid = valid
  dlParams._response_column = response
  dlParams._epochs = epochs
  dlParams._l1 = l1
  dlParams._l2 = l2
  dlParams._hidden = hidden
  // Create a job
  val dl = new DeepLearning(dlParams, Key.make("dlModel.hex"))
  dl.trainModel.get
}


// Note: The response column name is C5 here, so pass it as the response:
val dlModel = buildDLModel(train, valid, 'C5)(hc)

// Collect model metrics and evaluate model quality
val trainMetrics = ModelMetricsSupport.modelMetrics[ModelMetricsMultinomial](dlModel, train)
val validMetrics = ModelMetricsSupport.modelMetrics[ModelMetricsMultinomial](dlModel, valid)
println(s"Train RMSE: ${trainMetrics.rmse}")
println(s"Valid RMSE: ${validMetrics.rmse}")
println(s"Train MSE: ${trainMetrics.mse}")
println(s"Valid MSE: ${validMetrics.mse}")
println(s"Train R2: ${trainMetrics.r2}")
println(s"Valid R2: ${validMetrics.r2}")
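
As with the GLM example above, you can score new data with the trained model; a minimal sketch (the scored frame holds a predict column plus per-class probability columns):

// Score the validation frame with the trained Deep Learning model
val dlPredictions = dlModel.score(valid)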

That's it, enjoy!

Running Python and PySparkling with Zeppelin and YARN on Hadoop

Apache Zeppelin provides cell-based notebooks (similar to Jupyter) for working with various applications, i.e. Spark, Python, Hive, HBase, etc., through its interpreters.
With H2O and Sparkling Water you can use Zeppelin on a Hadoop cluster with YARN, and then use Python or PySparkling to submit jobs.
Here are the steps for using PySparkling with YARN on a Hadoop cluster.
1. Get the latest build of Sparkling Water from the download page.
2. Download and unzip the Sparkling Water version compatible with your Spark version onto one of the edge nodes in your Hadoop cluster.
3. Set the following environment variables to the right paths before running services:
export MASTER="yarn-client" # To submit to the YARN cluster
export SPARK_HOME="path_to_the_directory_where_spark_unzipped"
export HADOOP_CONF_DIR="path_to_the_hadoop_installation"
export SPARK_SUBMIT_OPTIONS="--packages ai.h2o:sparkling-water-examples_2.11:2.1.0"
export PYTHONPATH="_path_to_where_python_installed"
export SPARKLING_EGG=$(ls -t /sparkling-water-2.1.0/py/build/dist/h2o_pysparkling*.egg | head -1)
# The path to the Sparkling Water egg file above needs to be updated for your setup
Please make sure the version values above reflect the following:
  • 2.11 -> refers to the Scala version.
  • 2.1.0 -> refers to the Spark version.
4. Set "spark.executor.memory 4g" in Zeppelin, either in the configuration file or in the Zeppelin UI, if Error 143 is seen while starting the Zeppelin server.
Note: To configure it in the Zeppelin UI, go to the dropdown next to the user name at the top-right corner, select Interpreter, and in the Spark section either edit or add the configuration.
5. Start the Zeppelin server using the command below. This starts Zeppelin in a YARN container.
bin/zeppelin.sh -Pspark-2.1
6. In a Zeppelin notebook, create a new note with the paragraph below and add the path to the egg file. This adds the PySparkling dependency and its classes.
%pyspark
sc.addPyFile("_path_to_the_egg_file_on_disk/h2o_pysparkling_2.1-2.1.99999-py2.7.egg")
7. Now you can start calling PySparkling APIs as below:
%pyspark
sc.addPyFile("_path_to_the_egg_file_on_disk/h2o_pysparkling_2.1-2.1.99999-py2.7.egg")
from pysparkling import *
from pyspark import SparkContext
from pyspark.sql import SQLContext
import h2o
hc = H2OContext.getOrCreate(sc)
8. To use Scala Sparkling Water, you do not need to add the dependency explicitly in the Zeppelin note. A sample script looks like this:

%spark

import org.apache.spark.h2o._
sc.version
val rdd = sc.parallelize(1 to 1000, 100).map( v => IntHolder(Some(v)))
val h2oContext = H2OContext.getOrCreate(sc)
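
To verify the integration end to end, you can also publish the RDD to H2O as a frame; a hedged sketch (the frame name "test_frame" is illustrative):

%spark
// Hedged sketch: convert the Spark RDD into an H2OFrame via the H2OContext
val h2oFrame = h2oContext.asH2OFrame(rdd, "test_frame")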
That's all, enjoy!

Sparkling Water 2.0 Walkthrough with pysparkling

My ENV:

SPARK_HOME=/Users/avkashchauhan/tools/spark-2.0.1-bin-hadoop2.6
H2O_HOME=/Users/avkashchauhan/src/github.com/h2oai/h2o-3
MASTER=local-cluster[3,2,1024]

PySparkling command:

$ bin/pysparkling --num-executors 2 --executor-memory 2g --driver-memory 2g --conf spark.dynamicAllocation.enabled=false
 Python 2.7.10 (default, Jul 30 2016, 18:31:42)
 [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
 Type "help", "copyright", "credits" or "license" for more information.
 Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
 Setting default log level to "WARN".
 To adjust logging level use sc.setLogLevel(newLevel).
 16/10/20 09:29:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
 Welcome to Spark version 2.0.1
 
 Using Python version 2.7.10 (default, Jul 30 2016 18:31:42)
 SparkSession available as 'spark'.

Now enter the commands:

>>> from pysparkling import *
>>> from pyspark import SparkContext
>>> from pyspark.sql import SQLContext
>>> import h2o
>>> sqlContext = SQLContext(sc)
>>> sqlContext

>>> hc = H2OContext.getOrCreate(sc)

Here is the successful output:

 16/10/20 09:31:10 WARN InternalH2OBackend: Increasing 'spark.locality.wait' to value 30000
 16/10/20 09:31:10 WARN InternalH2OBackend: The property 'spark.scheduler.minRegisteredResourcesRatio' is not specified!
 We recommend to pass `--conf spark.scheduler.minRegisteredResourcesRatio=1`
 Warning: if you don't want to start local H2O server, then use of `h2o.connect()` is preferred.
 Checking whether there is an H2O instance running at http://10.0.0.63:54327. connected.
 -------------------------- ----------------------------------------
 H2O cluster uptime: 09 secs
 H2O cluster version: 3.10.0.7
 H2O cluster version age: 1 month
 H2O cluster name: sparkling-water-avkashchauhan_2132345410
 H2O cluster total nodes: 3
 H2O cluster free memory: 2.364 Gb
 H2O cluster total cores: 24
 H2O cluster allowed cores: 24
 H2O cluster status: accepting new members, healthy
 H2O connection url: http://10.0.0.63:54327
 H2O connection proxy:
 Python version: 2.7.10 final
 -------------------------- ----------------------------------------

Now verify the Sparkling Water package, making sure the pysparkling reference points to the h2o_pysparkling_2.0-2.0.0 package shown below.

>>> h2o

<module 'h2o' from '/private/var/folders/x7/331tvwcd6p17jj9zdmhnkpyc0000gn/T/avkashchauhan/spark/work/spark-28af708d-a149-435a-9a53-41e63d9ba7f5/userFiles-cf7c7aaf-610f-439d-9037-2dcddca73524/h2o_pysparkling_2.0-2.0.0-py2.7.egg/h2o/__init__.pyc'>

Getting Help for h2o:

>>> help(h2o)

Getting Cluster Status:

>>> h2o.cluster_status()

Using Sparkling Water and PySpark to log console output

Here is the command Option #1:

./pyspark --deploy-mode client --conf spark.dynamicAllocation.enabled=false --packages com.databricks:spark-csv_2.11:1.4.0 --py-files ../../sparkling-water-1.6.7/py/dist/h2o_pysparkling_1.6-1.6.7-py2.7.egg

Here is the command Option #2:

./pyspark --deploy-mode client --conf spark.dynamicAllocation.enabled=false --packages com.databricks:spark-csv_2.11:1.4.0,ai.h2o:sparkling-water-core_2.10:1.6.7 --py-files ../../sparkling-water-1.6.7/py/dist/h2o_pysparkling_1.6-1.6.7-py2.7.egg

We must make sure that both the H2O backend and the Python package use the same version of the API.

This parameter uses H2O API backend version 1.6.7:
ai.h2o:sparkling-water-core_2.10:1.6.7

This parameter uses version 1.6.7 of the Python API:
--py-files /mnt/app/sparkling-water-1.6.7/py/dist/h2o_pysparkling_1.6-1.6.7-py2.7.egg

Here is the script to test the overall scenario:

>>> from pysparkling import *
>>> from pyspark import SparkContext
>>> from pyspark.sql import SQLContext
>>> import h2o
>>> sqlContext = SQLContext(sc)
>>> hc = H2OContext.getOrCreate(sc)

Sparkling Water – Tips and Tricks

You must set SPARK_HOME to the Spark version you want to use:

$ export SPARK_HOME=/home/ec2-user/spark-1.6.2-bin-hadoop2.6

This is how you launch the Sparkling Water shell (run from the Sparkling Water directory):

$ bin/sparkling-shell

Once the above command succeeds, you will see the following stdout:

-----
Spark master (MASTER)   : local[*]
Spark home (SPARK_HOME) : /home/ec2-user/spark-1.6.2-bin-hadoop2.6
H2O build version       : 3.10.0.6 (turing)
Spark build version     : 1.6.2
-----

16/09/20 21:04:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to Spark version 1.6.2

Using Scala version 2.10.5 (OpenJDK 64-Bit Server VM, Java 1.7.0_111)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
16/09/20 21:05:10 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/09/20 21:05:10 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
16/09/20 21:05:16 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/09/20 21:05:16 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
SQL context available as sqlContext.

To get verbose output, use the --verbose option as below:

 $ bin/sparkling-shell --verbose

In some cases you may get an error like this:

16/09/20 21:03:41 WARN Utils: Service 'sparkDriver' could not bind on port 7815. Attempting port 7816.
16/09/20 21:03:41 ERROR SparkContext: Error initializing SparkContext.

It means Spark is trying to bind to a hostname it cannot use.

Note: On an EC2 instance, if you set the hostname (e.g. hostname ec2-x-x-x-x.compute-1.amazonaws.com) you may see this problem. Setting SPARK_LOCAL_IP to localhost works around it:

$ export SPARK_LOCAL_IP="localhost"
$ bin/sparkling-shell

Once the shell is up, you can create the H2OContext as below (Sparkling Water 1.6.7):

scala> import org.apache.spark.h2o._
import org.apache.spark.h2o._
scala> val hc = H2OContext.getOrCreate(sc)

Evaluate the H2OContext as below and you will see its information:

scala> hc
res0: org.apache.spark.h2o.H2OContext =
Sparkling Water Context:
 * H2O name: sparkling-water-ec2-user_1664047936
 * cluster size: 1
 * list of used nodes:
 (executorId, host, port)
 ------------------------
 (driver,localhost,54321)
 ------------------------

Open H2O Flow in browser: http://127.0.0.1:54321 (CMD + click in Mac OSX)
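
You can also open FLOW directly from the shell; a minimal sketch, assuming the openFlow helper on H2OContext in this Sparkling Water version:

// Hedged sketch: open the H2O FLOW UI in the default browser
scala> hc.openFlow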