Building Regression and Classification GBM models in Scala with H2O

In the full code below you will learn to build H2O GBM model (Regression and binomial classification) in Scala.

Lets first import all the classes we need for this project:

import org.apache.spark.SparkFiles
import org.apache.spark.h2o._
import org.apache.spark.examples.h2o._
import org.apache.spark.sql.{DataFrame, SQLContext}
import water.Key


// Create SQL support
implicit val sqlContext = spark.sqlContext
import sqlContext.implicits._

Next we need to start H2O cluster so we can start using H2O APIs:

// Start H2O services
val h2oContext = H2OContext.getOrCreate(sc)
import h2oContext._
import h2oContext.implicits._

Now we need to ingest the data which we can use to perform modeling:

// Import prostate data into H2O
val prostateData = new H2OFrame(new File("/Users/avkashchauhan/src/"))

// Understanding our input data

Now we will import some H2O specific classes we need to perform our actions:

import h2oContext.implicits._
import _root_.hex.tree.gbm.GBM
import _root_.hex.tree.gbm.GBMModel.GBMParameters

Lets setup GBM Parameters which will shape our GBM modeling process:

val gbmParams = new GBMParameters()
gbmParams._train = prostateData
gbmParams._response_column = 'CAPSULE

In above response column setting the column “CAPSULE” is numeric so by default the GBML model will build a regression model. Lets start building GBM Model now:

val gbm = new GBM(gbmParams,Key.make("gbmRegModel.hex"))
val gbmRegModel = gbm.trainModel.get
// Same as above
val gbmRegModel = gbm.trainModel().get()

Lets get to know our GBM Model and we will see that the type of this model is “regression”:


Lets perform prediction using GBM Regression Model:

val predH2OFrame = gbmRegModel.score(prostateData)('predict)
val predFromModel = asRDD[DoubleHolder](predH2OFrame)

Now we will set the input data set to perform GBM classification model. Below we are setting the response column to be a categorical type so all the values in this column becomes enumerator instead of number, this way we can make sure that the GBM model we will build will be a classification model:

// >>> res6: Array[String] = Array(ID, CAPSULE, AGE, RACE, DPROS, DCAPS, PSA, VOL, GLEASON)
// Based on above the CAPSULE is the id = 1
// Note: If we will not set categorical for response variable we will see the following exception
//        - water.exceptions.H2OModelBuilderIllegalArgumentException: 
//             - Illegal argument(s) for GBM model: gbmModel.hex.  Details: ERRR on field: _distribution: Binomial requires the response to be a 2-class categorical

withLockAndUpdate(prostateData){ fr => fr.replace(1, fr.vec("CAPSULE").toCategoricalVec)}

gbmParams._response_column = 'CAPSULE

We can also set the distribution to have a specific method. In the code below we are setting distribution to have Bernoulli method:

import _root_.hex.genmodel.utils.DistributionFamily
gbmParams._distribution = DistributionFamily.bernoulli

Now lets build our GBM  model now:

val gbm = new GBM(gbmParams,Key.make("gbmBinModel.hex"))
val gbmBinModel = gbm.trainModel.get
// Same as above
val gbmBinModel = gbm.trainModel().get()

Lets check the new model and we will find that it is a classification model and specially binomial classification because it has only 2 classes in its response classes :


Now lets perform the prediction using our GBM Binomial Classification Model as below:

val predH2OFrame = gbmBinModel.score(prostateData)('predict)
val predFromModel = asRDD[DoubleHolder](predH2OFrame)

Thats all, enjoy!!





Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s