Building Regression and Classification GBM models in Scala with H2O

The code below shows how to build H2O GBM models (regression and binomial classification) in Scala.

Let's first import all the classes we need for this project:

import org.apache.spark.SparkFiles
import org.apache.spark.h2o._
import org.apache.spark.examples.h2o._
import org.apache.spark.sql.{DataFrame, SQLContext}
import water.Key
import java.io.File

import water.support.SparkContextSupport.addFiles
import water.support.H2OFrameSupport._

// Create SQL support
implicit val sqlContext = spark.sqlContext
import sqlContext.implicits._

Next, we need to start the H2O cluster so we can use the H2O APIs:

// Start H2O services
val h2oContext = H2OContext.getOrCreate(sc)
import h2oContext._
import h2oContext.implicits._

Now we need to ingest the data we will use for modeling:

// Import prostate data into H2O
val prostateData = new H2OFrame(new File("/Users/avkashchauhan/src/github.com/h2oai/sparkling-water/examples/smalldata/prostate.csv"))

// Understanding our input data
prostateData.names
prostateData.numCols
prostateData.numRows
prostateData.keys
prostateData.key

Now we will import the H2O-specific classes we need for GBM:

import h2oContext.implicits._
import _root_.hex.tree.gbm.GBM
import _root_.hex.tree.gbm.GBMModel.GBMParameters

Let's set up the GBM parameters that will shape our modeling process:

val gbmParams = new GBMParameters()
gbmParams._train = prostateData
gbmParams._response_column = 'CAPSULE

In the response column setting above, the column “CAPSULE” is numeric, so by default GBM will build a regression model. Let's build the GBM model now:

val gbm = new GBM(gbmParams, Key.make("gbmRegModel.hex"))
val gbmRegModel = gbm.trainModel.get
// Equivalent form with explicit parentheses (use one or the other, not both):
// val gbmRegModel = gbm.trainModel().get()

Let's inspect our GBM model; we will see that its type is “regression”:

gbmRegModel

Let's run predictions with the GBM regression model:

val predH2OFrame = gbmRegModel.score(prostateData)('predict)
val predFromModel = asRDD[DoubleHolder](predH2OFrame).collect.map(_.result.getOrElse(Double.NaN))

Now we will prepare the input data set for a GBM classification model. Below we convert the response column to a categorical type, so all the values in this column become enum levels instead of numbers. This ensures that the GBM model we build will be a classification model:

prostateData.names()
//
// >>> res6: Array[String] = Array(ID, CAPSULE, AGE, RACE, DPROS, DCAPS, PSA, VOL, GLEASON)
// Based on the above, CAPSULE is at column index 1
// Note: If the response variable is not set to categorical, we will see the following exception
//        - water.exceptions.H2OModelBuilderIllegalArgumentException: 
//             - Illegal argument(s) for GBM model: gbmModel.hex.  Details: ERRR on field: _distribution: Binomial requires the response to be a 2-class categorical

withLockAndUpdate(prostateData){ fr => fr.replace(1, fr.vec("CAPSULE").toCategoricalVec)}

gbmParams._response_column = 'CAPSULE

We can also set the distribution explicitly. In the code below we set the distribution to Bernoulli:

import _root_.hex.genmodel.utils.DistributionFamily
gbmParams._distribution = DistributionFamily.bernoulli

Now let's build our GBM model:

val gbm = new GBM(gbmParams, Key.make("gbmBinModel.hex"))
val gbmBinModel = gbm.trainModel.get
// Equivalent form with explicit parentheses (use one or the other, not both):
// val gbmBinModel = gbm.trainModel().get()

Let's check the new model. We will find that it is a classification model, and specifically a binomial classification model, because its response column has only 2 classes:

gbmBinModel

Now let's run predictions with our GBM binomial classification model:

val predH2OFrame = gbmBinModel.score(prostateData)('predict)
val predFromModel = asRDD[DoubleHolder](predH2OFrame).collect.map(_.result.getOrElse(Double.NaN))

That's all, enjoy!!


Gains and Lift tables for a classification model in H2O

A gains and lift table evaluates a classification model by comparing the results obtained on a sample when the model is used against the results obtained when it is not used. The Gains/Lift Table page uses predicted data to evaluate model performance. More details on the gains and lift table can be found in the H2O documentation.

These model metrics can also be retrieved from an H2O model in Python.

When you look at the gains and lift table for a classification model in H2O Flow, you will see results like the following:

[Screenshot: gains and lift table in H2O Flow]

The gains and lift table groups the results into 16 buckets, and the buckets are fixed at the following cumulative data fractions:

[0.01, 0.02, 0.03, 0.04, 0.05, 0.10, 0.15, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
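
To make the buckets more concrete, here is a minimal plain-Python sketch of how cumulative capture rate and lift could be computed at each of the fractions above. It is only an illustration under stated assumptions, not H2O's implementation: the function name `gains_lift_table`, its inputs, and the row keys are hypothetical, and the rounding used to pick the rows in each bucket is a simplification.

```python
# Illustrative sketch only: a plain-Python approximation, not H2O's implementation.
FRACTIONS = [0.01, 0.02, 0.03, 0.04, 0.05, 0.10, 0.15, 0.2,
             0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]


def gains_lift_table(scores, labels, fractions=FRACTIONS):
    """Return one row per cumulative fraction of the data, ranked by score.

    scores: predicted probabilities of the positive class (hypothetical input)
    labels: actual outcomes, 1 for positive and 0 for negative
    """
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and the same length")
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("at least one positive label is required")

    # Rank the actual outcomes from the highest predicted probability to the lowest.
    pairs = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    ranked = [label for _, label in pairs]
    n = len(ranked)
    overall_rate = total_pos / n  # positive rate when no model is used

    rows = []
    for frac in fractions:
        k = max(1, round(frac * n))  # rows in the top `frac` of the data, at least one
        hits = sum(ranked[:k])       # positives found among those rows
        rows.append({
            "cumulative_data_fraction": frac,
            "cumulative_capture_rate": hits / total_pos,
            "cumulative_lift": (hits / k) / overall_rate,
        })
    return rows
```

A lift above 1 in an early bucket means the model finds positives faster than picking rows at random, and the last bucket (fraction 1) always has a lift of 1 because it covers the whole data set.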

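Here are some Python snippets you can use to try it out:

import h2o
# Assumes a connection to a running H2O cluster that holds a binomial model with the ID "gbm_pojo_test"
mymodel = h2o.get_model("gbm_pojo_test")
mymodel.gains_lift()           # full gains and lift table
mymodel.gains_lift()['group']  # only the 'group' column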

That's all, enjoy!!