Generating a ROC curve in Scala from H2O binary classification models

You can use the following blog post to build a binomial classification GLM model:
To collect model metrics for training, use the following:
val trainMetrics = ModelMetricsSupport.modelMetrics[ModelMetricsBinomial](glmModel, train)
Now you can access the model AUC (the _auc object) as below.
Note: the _auc object holds an array of thresholds, plus the false positive and true positive counts at each threshold (use tab completion to list all of its members):
scala> trainMetrics._auc.
_auc   _gini      _n       _p     _tps      buildCM   defaultCM    defaultThreshold   forCriterion   frozenType   pr_auc   readExternal   reloadFromBytes   tn             tp      writeExternal   
_fps   _max_idx   _nBins   _ths   asBytes   clone     defaultErr   fn                 fp             maxF1        read     readJSON       threshold         toJsonString   write   writeJSON
In the above AUC object:
_fps  =  false positives
_tps  =  true positives
_ths  =  threshold values
_p    =  actual positives (trues)
_n    =  actual negatives (falses)
Now you can use the individual ROC-specific values as below to recreate the ROC curve:
trainMetrics._auc._fps
trainMetrics._auc._tps
trainMetrics._auc._ths
To print the whole array in the terminal for inspection, you just need the following:
val dd = trainMetrics._auc._fps
println(dd.mkString(" "))
You can also read the totals of actual positives (_p) and actual negatives (_n) directly:
scala> trainMetrics._auc._n
res42: Double = 2979.0

scala> trainMetrics._auc._p
res43: Double = 1711.0
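Putting these together, each threshold yields one point on the ROC curve: the true positive rate is _tps/_p and the false positive rate is _fps/_n. Here is a minimal sketch (assuming the trainMetrics object from above) that recreates the ROC points:

val auc = trainMetrics._auc
val rocPoints = auc._fps.zip(auc._tps).map { case (fp, tp) =>
  (fp / auc._n, tp / auc._p)   // one (FPR, TPR) point per threshold
}
rocPoints.foreach { case (fpr, tpr) => println(s"$fpr\t$tpr") }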
That's it, enjoy!!

Starter script for rsparkling (H2O on Spark with R)

The rsparkling R package is an extension package for sparklyr that creates an R front-end for the Sparkling Water Spark package from H2O. This provides an interface to H2O's high-performance, distributed machine learning algorithms on Spark, using R. Visit the GitHub project: https://github.com/h2oai/rsparkling

You must have the sparklyr and rsparkling packages installed in your R environment, and the latest Sparkling Water package downloaded and unzipped locally. I am using the following versions in my environment:

  • Spark 2.1
  • Sparkling Water 2.1.8
  • sparklyr 0.4.4
  • rsparkling 0.2.0

Now here is the rsparkling script to create the cluster locally:

options(rsparkling.sparklingwater.location="/tmp/sparkling-water-assembly_2.11-2.1.8-all.jar")
Sys.setenv(SPARK_HOME="/usr/hdp/current/spark2-client/")
library(sparklyr)
library(rsparkling)
config <- spark_config()
config$spark.executor.cores <- 4
config$spark.executor.memory <- "4G"
sc <- spark_connect(master = "local", config = config, version = '2.1.0')
print(sc)
h2o_context(sc, strict_version_check = FALSE)  # start the H2O context on the Spark cluster
h2o_flow(sc, strict_version_check = FALSE)     # open the H2O Flow web UI
spark_disconnect(sc)

Now here is the rsparkling script to create the Spark cluster with YARN:

options(rsparkling.sparklingwater.location="/tmp/sparkling-water-assembly_2.11-2.1.8-all.jar")
Sys.setenv(SPARK_HOME="/usr/hdp/current/spark2-client/")
library(sparklyr)
library(rsparkling)
config <- spark_config()
config$spark.executor.cores <- 4
config$spark.executor.memory <- "4G"
config$spark.executor.instances <- 2
sc <- spark_connect(master = "yarn-client", config = config, version = '2.1.0')
print(sc)
h2o_context(sc, strict_version_check = FALSE)
h2o_flow(sc, strict_version_check = FALSE)
spark_disconnect(sc)

That's it, enjoy!!

Binomial classification example in Scala and GBM with H2O

Here is a sample for a binomial classification problem using the H2O GBM algorithm on the credit card data set, written in Scala.

This sample was created using Spark 2.1.0 with Sparkling Water 2.1.4. The flow mirrors the GLM example in the next section, with GBM swapped in as the model builder.

import org.apache.spark.h2o._
import water.support.SparkContextSupport.addFiles
import org.apache.spark.SparkFiles
import java.io.File
import water.support.{H2OFrameSupport, SparkContextSupport, ModelMetricsSupport}
import water.Key
import _root_.hex.tree.gbm.GBMModel
import _root_.hex.ModelMetricsBinomial


val hc = H2OContext.getOrCreate(sc)
import hc._
import hc.implicits._

addFiles(sc, "/Users/avkashchauhan/learn/deepwater/credit_card_clients.csv")
val creditCardData = new H2OFrame(new File(SparkFiles.get("credit_card_clients.csv")))
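// Note (assumption): H2O GBM needs a categorical response for classification; if
// default_payment_next_month is a numeric 0/1 column, convert it (e.g. via toCategoricalVec) before training.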

val ratios = Array[Double](0.8)
val keys = Array[String]("train.hex", "valid.hex")
val frs = H2OFrameSupport.split(creditCardData, keys, ratios)
val (train, valid) = (frs(0), frs(1))

def buildGBMModel(train: Frame, valid: Frame, response: String)
 (implicit h2oContext: H2OContext): GBMModel = {
 import _root_.hex.tree.gbm.GBM
 import _root_.hex.tree.gbm.GBMModel.GBMParameters
 val gbmParams = new GBMParameters()
 gbmParams._train = train
 gbmParams._valid = valid
 gbmParams._response_column = response
 gbmParams._ntrees = 50   // number of trees (a reasonable default)
 val gbm = new GBM(gbmParams, Key.make("gbmModel.hex"))
 gbm.trainModel().get()
}

val gbmModel = buildGBMModel(train, valid, 'default_payment_next_month)(hc)

// Collect model metrics and evaluate model quality
val trainMetrics = ModelMetricsSupport.modelMetrics[ModelMetricsBinomial](gbmModel, train)
val validMetrics = ModelMetricsSupport.modelMetrics[ModelMetricsBinomial](gbmModel, valid)
println(trainMetrics.rmse)
println(validMetrics.rmse)
println(trainMetrics.mse)
println(validMetrics.mse)
println(trainMetrics.r2)
println(validMetrics.r2)
println(trainMetrics.auc)
println(validMetrics.auc)

// Prediction
addFiles(sc, "/Users/avkashchauhan/learn/deepwater/credit_card_predict.csv")
val creditPredictData = new H2OFrame(new File(SparkFiles.get("credit_card_predict.csv")))

val predictionFrame = gbmModel.score(creditPredictData)
val predictionResults = asRDD[DoubleHolder](predictionFrame).collect.map(_.result.getOrElse(Double.NaN))
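To eyeball the scored output, you can print the first few prediction values (a tiny sketch using the predictionResults array collected above):

predictionResults.take(10).foreach(println)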

That's it, enjoy!!

Binomial classification example in Scala and GLM with H2O

Here is a sample for a binomial classification problem using the H2O GLM algorithm on the credit card data set, written in Scala.

This sample was created using Spark 2.1.0 with Sparkling Water 2.1.4.

import org.apache.spark.h2o._
import water.support.SparkContextSupport.addFiles
import org.apache.spark.SparkFiles
import java.io.File
import water.support.{H2OFrameSupport, SparkContextSupport, ModelMetricsSupport}
import water.Key
import _root_.hex.glm.GLMModel
import _root_.hex.ModelMetricsBinomial


val hc = H2OContext.getOrCreate(sc)
import hc._
import hc.implicits._

addFiles(sc, "/Users/avkashchauhan/learn/deepwater/credit_card_clients.csv")
val creditCardData = new H2OFrame(new File(SparkFiles.get("credit_card_clients.csv")))

val ratios = Array[Double](0.8)
val keys = Array[String]("train.hex", "valid.hex")
val frs = H2OFrameSupport.split(creditCardData, keys, ratios)
val (train, valid) = (frs(0), frs(1))

def buildGLMModel(train: Frame, valid: Frame, response: String)
 (implicit h2oContext: H2OContext): GLMModel = {
 import _root_.hex.glm.GLMModel.GLMParameters.Family
 import _root_.hex.glm.GLM
 import _root_.hex.glm.GLMModel.GLMParameters
 val glmParams = new GLMParameters(Family.binomial)
 glmParams._train = train
 glmParams._valid = valid
 glmParams._response_column = response
 glmParams._alpha = Array[Double](0.5)
 val glm = new GLM(glmParams, Key.make("glmModel.hex"))
 glm.trainModel().get()
}

val glmModel = buildGLMModel(train, valid, 'default_payment_next_month)(hc)

// Collect model metrics and evaluate model quality
val trainMetrics = ModelMetricsSupport.modelMetrics[ModelMetricsBinomial](glmModel, train)
val validMetrics = ModelMetricsSupport.modelMetrics[ModelMetricsBinomial](glmModel, valid)
println(trainMetrics.rmse)
println(validMetrics.rmse)
println(trainMetrics.mse)
println(validMetrics.mse)
println(trainMetrics.r2)
println(validMetrics.r2)
println(trainMetrics.auc)
println(validMetrics.auc)
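These are the same ModelMetricsBinomial objects the ROC section at the top of this post works with, so the ROC inputs and the model's default decision threshold are available straight from the _auc member (defaultThreshold appears in the tab-completion listing shown earlier):

println(trainMetrics._auc.defaultThreshold())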

// Prediction
addFiles(sc, "/Users/avkashchauhan/learn/deepwater/credit_card_predict.csv")
val creditPredictData = new H2OFrame(new File(SparkFiles.get("credit_card_predict.csv")))

val predictionFrame = glmModel.score(creditPredictData)
val predictionResults = asRDD[DoubleHolder](predictionFrame).collect.map(_.result.getOrElse(Double.NaN))


That's it, enjoy!!

Multinomial classification example in Scala and Deep Learning with H2O

Here is a sample for a multinomial classification problem using the H2O Deep Learning algorithm on the iris data set, written in Scala.

This sample was created using Spark 2.1.0 with Sparkling Water 2.1.4.

import org.apache.spark.h2o._
import water.support.SparkContextSupport.addFiles
import org.apache.spark.SparkFiles
import java.io.File
import water.support.{H2OFrameSupport, SparkContextSupport, ModelMetricsSupport}
import water.Key
import _root_.hex.deeplearning.DeepLearningModel
import _root_.hex.ModelMetricsMultinomial


val hc = H2OContext.getOrCreate(sc)
import hc._
import hc.implicits._

addFiles(sc, "/Users/avkashchauhan/smalldata/iris/iris.csv")
val irisData = new H2OFrame(new File(SparkFiles.get("iris.csv")))

val ratios = Array[Double](0.8)
val keys = Array[String]("train.hex", "valid.hex")
val frs = H2OFrameSupport.split(irisData, keys, ratios)
val (train, valid) = (frs(0), frs(1))

def buildDLModel(train: Frame, valid: Frame, response: String,
 epochs: Int = 10, l1: Double = 0.001, l2: Double = 0.0,
 hidden: Array[Int] = Array[Int](200, 200))
 (implicit h2oContext: H2OContext): DeepLearningModel = {
 import h2oContext.implicits._
 // Build a model
 import _root_.hex.deeplearning.DeepLearning
 import _root_.hex.deeplearning.DeepLearningModel.DeepLearningParameters
 val dlParams = new DeepLearningParameters()
 dlParams._train = train
 dlParams._valid = valid
 dlParams._response_column = response
 dlParams._epochs = epochs
 dlParams._l1 = l1
 dlParams._l2 = l2
 dlParams._hidden = hidden
 // Create a job
 val dl = new DeepLearning(dlParams, Key.make("dlModel.hex"))
 dl.trainModel.get
}


// Note: the response column name is C5 here, so we pass 'C5:
val dlModel = buildDLModel(train, valid, 'C5)(hc)

// Collect model metrics and evaluate model quality
val trainMetrics = ModelMetricsSupport.modelMetrics[ModelMetricsMultinomial](dlModel, train)
val validMetrics = ModelMetricsSupport.modelMetrics[ModelMetricsMultinomial](dlModel, valid)
println(trainMetrics.rmse)
println(validMetrics.rmse)
println(trainMetrics.mse)
println(validMetrics.mse)
println(trainMetrics.r2)
println(validMetrics.r2)
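You can also score a frame with the trained model and inspect the first few rows of the prediction frame; a minimal sketch reusing the valid split from above (toString(0L, 10) prints the first 10 rows):

val predictions = dlModel.score(valid)
println(predictions.toString(0L, 10))  // predicted class plus per-class probabilities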

That's it, enjoy!!

Using One-Hot Encoding and Stratified Sampling in Sparkling Water & H2O

Here is example Scala code which shows how to use the new Sparkling Water ETL package to set up one-hot encoding and stratified sampling:

import org.apache.spark.h2o._
import water.Key
import water.etl.prims.advmath.AdvMath.StratifiedSplit
import water.etl.prims.mungers.Mungers.OneHotEncoder
import water.etl.prims.operators.Operators.Eq
import water.fvec.Frame
import water.fvec.H2OFrame
import java.net.URI

// Use the following lines if you decided to set up an external Sparkling Water cluster:
// val conf = new H2OConf(sc).setExternalClusterMode().useManualClusterStart().setCloudName("test")
//val hc = H2OContext.getOrCreate(sc, conf)

// OR use the following for basic configuration
val hc = H2OContext.getOrCreate(sc)
val fr = new H2OFrame(new java.io.File("/Users/avkashchauhan/src/github.com/h2oai/h2o-3/smalldata/airlines/AirlinesTest.csv.zip"))
val frOH = OneHotEncoder(fr, "Origin")
fr.add(frOH)   // append the one-hot encoded columns to the original frame
val trainTestCol = StratifiedSplit(fr, "IsDepDelayed", 0.2, 123)

// deepSlice keeps the subset of rows where the boolean one-column frame is true
val idx = Eq(trainTestCol, "train")
val train = fr.deepSlice(idx, null)
val idx2 = Eq(trainTestCol, "test")
val test = fr.deepSlice(idx2, null)
println(train.toString(0L,10))
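As a quick sanity check of the stratified split (assuming the 0.2 argument to StratifiedSplit above is the test fraction), compare the row counts of the two slices:

println("train rows: " + train.numRows() + ", test rows: " + test.numRows())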
That's it, enjoy!!

Running python and pysparkling with Zeppelin and YARN on Hadoop

Apache Zeppelin is very useful for working with cell-based notebooks (similar to Jupyter) across various applications, e.g. Spark, Python, Hive, HBase, etc., via its interpreters.
With H2O and Sparkling Water you can use Zeppelin on a Hadoop cluster with YARN, and then use Python or PySparkling to submit jobs.
Here are the steps for using PySparkling with YARN on a Hadoop cluster.
1. Get the latest build of Sparkling Water from here.
2. Download and unzip the Sparkling Water version compatible with your Spark version onto one of the edge nodes in your Hadoop cluster.
3. Set the following environment variables to the right paths before running services:
export MASTER="yarn-client"  # submit to the YARN cluster
export SPARK_HOME="path_to_the_directory_where_spark_unzipped"
export HADOOP_CONF_DIR="path_to_the_hadoop_installation"
export SPARK_SUBMIT_OPTIONS="--packages ai.h2o:sparkling-water-examples_2.11:2.1.0"
export PYTHONPATH="_path_to_where_python_installed"
export SPARKLING_EGG=$(ls -t /sparkling-water-2.1.0/py/build/dist/h2o_pysparkling*.egg | head -1)
# update the path to the Sparkling Water egg file above
Please make sure the version values above reflect your setup:
  • 2.11 refers to the Scala version.
  • 2.1.0 refers to the Spark version.
4. Set "spark.executor.memory 4g" in Zeppelin, either in the configuration file or in the Zeppelin UI, if Error 143 is seen while starting the Zeppelin server.
Note: To configure it in the Zeppelin UI, go to the dropdown next to the user name at the top-right corner, select Interpreters, and in the Spark section either edit or add the configuration.
5. Start the Zeppelin server using the command below. This starts Zeppelin in a YARN container.
bin/zeppelin.sh -Pspark-2.1
6. In a Zeppelin notebook, create a new note with the paragraph below, filling in the path to the egg file. This adds the PySparkling dependency and its classes.
%pyspark
sc.addPyFile("_path_to_the_egg_file_on_disk/h2o_pysparkling_2.1-2.1.99999-py2.7.egg")
7. Now you can start calling PySparkling APIs as below:
%pyspark
sc.addPyFile("_path_to_the_egg_file_on_disk/h2o_pysparkling_2.1-2.1.99999-py2.7.egg")
from pysparkling import *
from pyspark import SparkContext
from pyspark.sql import SQLContext
import h2o
hc = H2OContext.getOrCreate(sc)
8. To use Scala Sparkling Water, you do not need to add the dependency explicitly in the Zeppelin note. A sample script looks like this:

%spark

import org.apache.spark.h2o._
sc.version
val rdd = sc.parallelize(1 to 1000, 100).map( v => IntHolder(Some(v)))
val h2oContext = H2OContext.getOrCreate(sc)
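
To verify the wiring end to end, you can convert the RDD into an H2O frame using the standard asH2OFrame conversion on the H2OContext (a minimal sketch):

%spark
val h2oFrame = h2oContext.asH2OFrame(rdd)
println(h2oFrame.numRows())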
That's all, enjoy!!