Using Cross-validation in Scala with H2O and getting each cross-validated model

Here is Scala code for binomial classification with GLM:

https://aichamp.wordpress.com/2017/04/23/binomial-classification-example-in-scala-and-gbm-with-h2o/

To add cross validation you can do the following:

def buildGLMModel(train: Frame, valid: Frame, response: String)
 (implicit h2oContext: H2OContext): GLMModel = {
 import _root_.hex.glm.GLMModel.GLMParameters.Family
 import _root_.hex.glm.GLM
 import _root_.hex.glm.GLMModel.GLMParameters
 val glmParams = new GLMParameters(Family.binomial)
 glmParams._train = train
 glmParams._valid = valid
 glmParams._nfolds = 3  ###### Here is cross-validation ###
 glmParams._response_column = response
 glmParams._alpha = Array[Double](0.5)
 val glm = new GLM(glmParams, Key.make("glmModel.hex"))
 glm.trainModel().get()
}

To look cross-validated model try this:

scala> glmModel._output._cross_validation_models
res12: Array[water.Key[_ <: water.Keyed[_ <: AnyRef]]] = 
    Array(glmModel.hex_cv_1, glmModel.hex_cv_2, glmModel.hex_cv_3)

Now to get each model do the following:

scala> val m1 = DKV.getGet("glmModel.hex_cv_1").asInstanceOf[GLMModel]

And you will see the following:

scala> val m1 = DKV.getGet("glmModel.hex_cv_1").asInstanceOf[GLMModel]
m1: hex.glm.GLMModel =
Model Metrics Type: BinomialGLM
 Description: N/A
 model id: glmModel.hex_cv_1
 frame id: glmModel.hex_cv_1_train
 MSE: 0.14714406
 RMSE: 0.38359362
 AUC: 0.7167627
 logloss: 0.4703465
 mean_per_class_error: 0.31526923
 default threshold: 0.27434438467025757
 CM: Confusion Matrix (vertical: actual; across: predicted):
 0 1 Error Rate
 0 10704 1651 0.1336 1,651 / 12,355
 1 1768 1790 0.4969 1,768 / 3,558
Totals 12472 3441 0.2149 3,419 / 15,913
Gains/Lift Table (Avg response rate: 22.36 %):
 Group Cumulative Data Fraction Lower Threshold Lift Cumulative Lift Response Rate Cumulative Response Rate Capture Rate Cumulative Capture Rate Gain Cumulative Gain
 1 0.01005467 0....
scala> val m2 = DKV.getGet("glmModel.hex_cv_2").asInstanceOf[GLMModel]
m2: hex.glm.GLMModel =
Model Metrics Type: BinomialGLM
 Description: N/A
 model id: glmModel.hex_cv_2
 frame id: glmModel.hex_cv_2_train
 MSE: 0.14598908
 RMSE: 0.38208517
 AUC: 0.7231473
 logloss: 0.46717605
 mean_per_class_error: 0.31456697
 default threshold: 0.29637953639030457
 CM: Confusion Matrix (vertical: actual; across: predicted):
 0 1 Error Rate
 0 11038 1395 0.1122 1,395 / 12,433
 1 1847 1726 0.5169 1,847 / 3,573
Totals 12885 3121 0.2025 3,242 / 16,006
Gains/Lift Table (Avg response rate: 22.32 %):
 Group Cumulative Data Fraction Lower Threshold Lift Cumulative Lift Response Rate Cumulative Response Rate Capture Rate Cumulative Capture Rate Gain Cumulative Gain
 1 0.01005873 0...
scala> val m3 = DKV.getGet("glmModel.hex_cv_3").asInstanceOf[GLMModel]
m3: hex.glm.GLMModel =
Model Metrics Type: BinomialGLM
 Description: N/A
 model id: glmModel.hex_cv_3
 frame id: glmModel.hex_cv_3_train
 MSE: 0.14626761
 RMSE: 0.38244948
 AUC: 0.7239823
 logloss: 0.46873763
 mean_per_class_error: 0.31437498
 default threshold: 0.28522220253944397
 CM: Confusion Matrix (vertical: actual; across: predicted):
 0 1 Error Rate
 0 10982 1490 0.1195 1,490 / 12,472
 1 1838 1771 0.5093 1,838 / 3,609
Totals 12820 3261 0.2070 3,328 / 16,081
Gains/Lift Table (Avg response rate: 22.44 %):
 Group Cumulative Data Fraction Lower Threshold Lift Cumulative Lift Response Rate Cumulative Response Rate Capture Rate Cumulative Capture Rate Gain Cumulative Gain
 1 0.01001182 0...
scala>

Thats it, enjoy!!

 

Generating ROC curve in SCALA from H2O binary classification models

You can use the following blog to built a binomial classification  GLM model:
To collect model metrics  for training use the following:
val trainMetrics = ModelMetricsSupport.modelMetrics[ModelMetricsBinomial](glmModel, train)
Now you can access model AUC (_auc object) as below:
Note: _auc object has array of thresholds, and then for each threshold it has fps and tps
(use tab completion to list them all)
scala> trainMetrics._auc.
_auc   _gini      _n       _p     _tps      buildCM   defaultCM    defaultThreshold   forCriterion   frozenType   pr_auc   readExternal   reloadFromBytes   tn             tp      writeExternal   
_fps   _max_idx   _nBins   _ths   asBytes   clone     defaultErr   fn                 fp             maxF1        read     readJSON       threshold         toJsonString   write   writeJSON
In the above AUC object:
_fps  =  false positives
_tps  =  true positives
_ths  =  threshold values
_p    =  actual trues
_n    =  actual false
Now you can use individual ROC specific values as below to recreate ROC:
trainMetrics._auc._fps
trainMetrics._auc._tps
trainMetrics._auc._ths
To print the whole array in the terminal for inspection, you just need the following:
val dd = trainMetrics._auc._fps
println(dd.mkString(" "))
You can access true positives and true negatives as below where actual trues and actual false are defined as below:
_p    =  actual trues

_n    =  actual false
scala> trainMetrics._auc._n
res42: Double = 2979.0

scala> trainMetrics._auc._p
res43: Double = 1711.0
Thats it, enjoy!!

Binomial classification example in Scala and GBM with H2O

Here is a sample for binomial classification problem using H2O GBM algorithm using Credit Card data set in Scala language.

The following sample is for multinomial classification problem. This sample is created using Spark 2.1.0 with Sparkling Water 2.1.4.

import org.apache.spark.h2o._
import water.support.SparkContextSupport.addFiles
import org.apache.spark.SparkFiles
import java.io.File
import water.support.{H2OFrameSupport, SparkContextSupport, ModelMetricsSupport}
import water.Key
import _root_.hex.glm.GLMModel
import _root_.hex.ModelMetricsBinomial


val hc = H2OContext.getOrCreate(sc)
import hc._
import hc.implicits._

addFiles(sc, "/Users/avkashchauhan/learn/deepwater/credit_card_clients.csv")
val creditCardData = new H2OFrame(new File(SparkFiles.get("credit_card_clients.csv")))

val ratios = Array[Double](0.8)
val keys = Array[String]("train.hex", "valid.hex")
val frs = H2OFrameSupport.split(creditCardData, keys, ratios)
val (train, valid) = (frs(0), frs(1))

def buildGLMModel(train: Frame, valid: Frame, response: String)
 (implicit h2oContext: H2OContext): GLMModel = {
 import _root_.hex.glm.GLMModel.GLMParameters.Family
 import _root_.hex.glm.GLM
 import _root_.hex.glm.GLMModel.GLMParameters
 val glmParams = new GLMParameters(Family.binomial)
 glmParams._train = train
 glmParams._valid = valid
 glmParams._response_column = response
 glmParams._alpha = Array[Double](0.5)
 val glm = new GLM(glmParams, Key.make("glmModel.hex"))
 glm.trainModel().get()
 //val glmModel = glm.trainModel().get()
}

val glmModel = buildGLMModel(train, valid, 'default_payment_next_month)(hc)

// Collect model metrics and evaluate model quality
val trainMetrics = ModelMetricsSupport.modelMetrics[ModelMetricsBinomial](glmModel, train)
val validMetrics = ModelMetricsSupport.modelMetrics[ModelMetricsBinomial](glmModel, valid)
println(trainMetrics.rmse)
println(validMetrics.rmse)
println(trainMetrics.mse)
println(validMetrics.mse)
println(trainMetrics.r2)
println(validMetrics.r2)
println(trainMetrics.auc)
println(validMetrics.auc)

// Preduction
addFiles(sc, "/Users/avkashchauhan/learn/deepwater/credit_card_predict.csv")
val creditPredictData = new H2OFrame(new File(SparkFiles.get("credit_card_predict.csv")))

val predictionFrame = glmModel.score(creditPredictData)
var predictonResults = asRDD[DoubleHolder](predictionFrame).collect.map(_.result.getOrElse(Double.NaN))

Thats it, enjoy!!

Binomial classification example in Scala and GLM with H2O

Here is a sample for binomial classification problem using H2O GLM algorithm using Credit Card data set in Scala language.

The following sample is for multinomial classification problem. This sample is created using Spark 2.1.0 with Sparkling Water 2.1.4.

import org.apache.spark.h2o._
import water.support.SparkContextSupport.addFiles
import org.apache.spark.SparkFiles
import java.io.File
import water.support.{H2OFrameSupport, SparkContextSupport, ModelMetricsSupport}
import water.Key
import _root_.hex.glm.GLMModel
import _root_.hex.ModelMetricsBinomial


val hc = H2OContext.getOrCreate(sc)
import hc._
import hc.implicits._

addFiles(sc, "/Users/avkashchauhan/learn/deepwater/credit_card_clients.csv")
val creditCardData = new H2OFrame(new File(SparkFiles.get("credit_card_clients.csv")))

val ratios = Array[Double](0.8)
val keys = Array[String]("train.hex", "valid.hex")
val frs = H2OFrameSupport.split(creditCardData, keys, ratios)
val (train, valid) = (frs(0), frs(1))

def buildGLMModel(train: Frame, valid: Frame, response: String)
 (implicit h2oContext: H2OContext): GLMModel = {
 import _root_.hex.glm.GLMModel.GLMParameters.Family
 import _root_.hex.glm.GLM
 import _root_.hex.glm.GLMModel.GLMParameters
 val glmParams = new GLMParameters(Family.binomial)
 glmParams._train = train
 glmParams._valid = valid
 glmParams._response_column = response
 glmParams._alpha = Array[Double](0.5)
 val glm = new GLM(glmParams, Key.make("glmModel.hex"))
 glm.trainModel().get()
 //val glmModel = glm.trainModel().get()
}

val glmModel = buildGLMModel(train, valid, 'default_payment_next_month)(hc)

// Collect model metrics and evaluate model quality
val trainMetrics = ModelMetricsSupport.modelMetrics[ModelMetricsBinomial](glmModel, train)
val validMetrics = ModelMetricsSupport.modelMetrics[ModelMetricsBinomial](glmModel, valid)
println(trainMetrics.rmse)
println(validMetrics.rmse)
println(trainMetrics.mse)
println(validMetrics.mse)
println(trainMetrics.r2)
println(validMetrics.r2)
println(trainMetrics.auc)
println(validMetrics.auc)

// Prediction
addFiles(sc, "/Users/avkashchauhan/learn/deepwater/credit_card_predict.csv")
val creditPredictData = new H2OFrame(new File(SparkFiles.get("credit_card_predict.csv")))

val predictionFrame = glmModel.score(creditPredictData)
var predictonResults = asRDD[DoubleHolder](predictionFrame).collect.map(_.result.getOrElse(Double.NaN))

 

Thats it, enjoy!!

Multinomial classification example in Scala and Deep Learning with H2O

Here is a sample for multinomial classification problem using H2O Deep Learning algorithm and iris data set in Scala language.

The following sample is for multinomial classification problem. This sample is created using Spark 2.1.0 with Sparkling Water 2.1.4.

import org.apache.spark.h2o._
import water.support.SparkContextSupport.addFiles
import org.apache.spark.SparkFiles
import java.io.File
import water.support.{H2OFrameSupport, SparkContextSupport, ModelMetricsSupport}
import water.Key
import _root_.hex.deeplearning.DeepLearningModel
import _root_.hex.ModelMetricsMultinomial


val hc = H2OContext.getOrCreate(sc)
import hc._
import hc.implicits._

addFiles(sc, "/Users/avkashchauhan/smalldata/iris/iris.csv")
val irisData = new H2OFrame(new File(SparkFiles.get("iris.csv")))

val ratios = Array[Double](0.8)
val keys = Array[String]("train.hex", "valid.hex")
val frs = H2OFrameSupport.split(irisData, keys, ratios)
val (train, valid) = (frs(0), frs(1))

def buildDLModel(train: Frame, valid: Frame, response: String,
 epochs: Int = 10, l1: Double = 0.001, l2: Double = 0.0,
 hidden: Array[Int] = Array[Int](200, 200))
 (implicit h2oContext: H2OContext): DeepLearningModel = {
 import h2oContext.implicits._
 // Build a model
 import _root_.hex.deeplearning.DeepLearning
 import _root_.hex.deeplearning.DeepLearningModel.DeepLearningParameters
 val dlParams = new DeepLearningParameters()
 dlParams._train = train
 dlParams._valid = valid
 dlParams._response_column = response
 dlParams._epochs = epochs
 dlParams._l1 = l1
 dlParams._hidden = hidden
 // Create a job
 val dl = new DeepLearning(dlParams, Key.make("dlModel.hex"))
 dl.trainModel.get
}


// Note: The response column name is C5 here so passing:
val dlModel = buildDLModel(train, valid, 'C5)(hc)

// Collect model metrics and evaluate model quality
val trainMetrics = ModelMetricsSupport.modelMetrics[ModelMetricsMultinomial](dlModel, train)
val validMetrics = ModelMetricsSupport.modelMetrics[ModelMetricsMultinomial](dlModel, valid)
println(trainMetrics.rmse)
println(validMetrics.rmse)
println(trainMetrics.mse)
println(validMetrics.mse)
println(trainMetrics.r2)
println(validMetrics.r2)

Thats it, enjoy!!

Using one Hot Encoding and Stratified Sampling in Sparkling Water & H2O

Here is an example Scala code which shows how to use new Sparkling Water package to setup OneHot encoding and Stratified Sampling:

import org.apache.spark.h2o._
import water.Key
import water.etl.prims.advmath.AdvMath.StratifiedSplit
import water.etl.prims.mungers.Mungers.OneHotEncoder
import water.etl.prims.operators.Operators.Eq
import water.fvec.Frame
import water.fvec.H2OFrame
import java.net.URI

// Use the following line if you decided to setup external SW Kluster
// val conf = new H2OConf(sc).setExternalClusterMode().useManualClusterStart().setCloudName("test")
//val hc = H2OContext.getOrCreate(sc, conf)

// OR use the following for basic configuration
val hc = H2OContext.getOrCreate(sc)
val fr = new H2OFrame(new java.io.File("/Users/avkashchauhan/src/github.com/h2oai/h2o-3/smalldata/airlines/AirlinesTest.csv.zip"))
val frOH = OneHotEncoder(fr, "Origin")
fr.add(frOH)   // Combine the pivoted result to the original frame
val trainTestCol = StratifiedSplit(fr,"IsDepDelayed",0.2,123);
val idx = Eq(trainTestCol,"train")
val train = fr.deepSlice(idx,null)   

// get subset of the Frame according to True/False of boolean 1 column Frame "idx"
val idx2 = Eq(trainTestCol,"test")
val test = fr.deepSlice(idx2,null)
println(train.toString(0L,10))
Thats it, enjoy!!

Creating, Adding and managing H2O frame in Scala

Creating a new H2O Frame:

To create a new frame in H2O you will call as below:

val df = new Frame()

Adding a frame to another H2O Frame:

To add an H2O from to another H2O frame you will do the following:
frame1.add(frame2)
When h2oDataFrame.add() method is called, it mutates the calling frame. It doesn’t create a new Frame and the Frame keeps the same Key.  Its the same object in memory.
What happens is that frame1 now depends on frame2.  Frame “frame1” has the new columns but they are actually the data from “frame2”. Looking into this operation, it looks like data has been duplicated because there are 2 keys in the DKV, but actually there has been no memory copy at all.  If you delete “frame2” you will run into an error , because the Frame “frame1” now depends on “frame2”.
In general managing memory in H2O DKV, there is no automated way of deleting old Frames during your program execution, you just need to manually call Frame.delete() on the Frames you no longer need.

Difference of using val vs var in Scala with new frame:

While looking from Scala point of view val dataframeNew = new Frame() doesn’t stop you from changing the dataframeNew frame with dataframeNew.add,  this does however stop you from reassigning dataframeNew to a different instance of a Frame.
Note: If you had var dataframeNew = new Frame(), then this df can be set to a completely different Frame. The reason for this difference is mainly because how Scala treats the val vs var in variable assignment.
Thats it, enjoy