The full code below shows how to build H2O GBM models (regression and binomial classification) in Scala.
Let's first import all the classes we need for this project:
import org.apache.spark.SparkFiles
import org.apache.spark.h2o._
import org.apache.spark.examples.h2o._
import org.apache.spark.sql.{DataFrame, SQLContext}
import water.Key
import java.io.File
import water.support.SparkContextSupport.addFiles
import water.support.H2OFrameSupport._
// Create SQL support
implicit val sqlContext = spark.sqlContext
import sqlContext.implicits._
Next, we need to start the H2O cluster so that we can use the H2O APIs:
// Start H2O services
val h2oContext = H2OContext.getOrCreate(sc)
import h2oContext._
import h2oContext.implicits._
Now we need to ingest the data that we will use for modeling:
// Import prostate data into H2O (adjust the path to your own sparkling-water checkout)
val prostateData = new H2OFrame(new File("/Users/avkashchauhan/src/github.com/h2oai/sparkling-water/examples/smalldata/prostate.csv"))
// Understanding our input data
prostateData.names
prostateData.numCols
prostateData.numRows
prostateData.keys
prostateData.key
Now we import the H2O-specific classes we need to build the model:
import h2oContext.implicits._
import _root_.hex.tree.gbm.GBM
import _root_.hex.tree.gbm.GBMModel.GBMParameters
Let's set up the GBM parameters that will shape our GBM modeling process:
val gbmParams = new GBMParameters()
gbmParams._train = prostateData
gbmParams._response_column = 'CAPSULE
In the response column setting above, the column “CAPSULE” is numeric, so by default GBM will build a regression model. Let's build the GBM model now:
val gbm = new GBM(gbmParams, Key.make("gbmRegModel.hex"))
val gbmRegModel = gbm.trainModel.get
// Equivalent call with explicit parentheses (use one form or the other, not both):
// val gbmRegModel = gbm.trainModel().get()
Let's inspect our GBM model; the output shows that the type of this model is “regression”:
gbmRegModel
Let's generate predictions using the GBM regression model:
val predH2OFrame = gbmRegModel.score(prostateData)('predict)
val predFromModel = asRDD[DoubleHolder](predH2OFrame).collect.map(_.result.getOrElse(Double.NaN))
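Once the predictions are collected into a plain array, they can be compared against the actual response values. The following is a minimal sketch in plain Scala that computes the root mean squared error; the object name RegressionCheck and the idea of passing the actual CAPSULE values as a second array are assumptions made for illustration, not part of the H2O API.

```scala
// Hypothetical helper for illustration; it does not depend on H2O or Spark.
object RegressionCheck {
  // Root mean squared error between predictions and actual values.
  def rmse(predicted: Array[Double], actual: Array[Double]): Double = {
    require(predicted.length == actual.length, "predicted and actual must have the same length")
    // Skip rows where the prediction is missing (NaN), matching getOrElse(Double.NaN) above.
    val pairs = predicted.zip(actual).filterNot { case (p, _) => p.isNaN }
    require(pairs.nonEmpty, "no non-missing predictions to evaluate")
    math.sqrt(pairs.map { case (p, a) => (p - a) * (p - a) }.sum / pairs.length)
  }
}
```

Note that the model here is scored on the same data it was trained on, so an error computed this way will tend to look better than it would on unseen data.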
Now we will prepare the input data set for a GBM classification model. Below we convert the response column to a categorical type, so that its values become enum levels instead of numbers. This ensures that the GBM model we build is a classification model:
prostateData.names()
//
// >>> res6: Array[String] = Array(ID, CAPSULE, AGE, RACE, DPROS, DCAPS, PSA, VOL, GLEASON)
// Based on the output above, CAPSULE is at column index 1
// Note: if the response variable is not categorical, a binomial (Bernoulli) model fails with the following exception
// - water.exceptions.H2OModelBuilderIllegalArgumentException:
// - Illegal argument(s) for GBM model: gbmModel.hex. Details: ERRR on field: _distribution: Binomial requires the response to be a 2-class categorical
withLockAndUpdate(prostateData){ fr => fr.replace(1, fr.vec("CAPSULE").toCategoricalVec)}
gbmParams._response_column = 'CAPSULE
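To make the conversion more concrete, here is a small, H2O-independent sketch of what turning a numeric column into a categorical one amounts to: the distinct values become an ordered list of labels (the domain), and each row stores an index into that list. The names CategoricalSketch and toCategorical are hypothetical, and this only illustrates the idea; it is not how toCategoricalVec is implemented.

```scala
// Conceptual illustration only; CategoricalSketch is a hypothetical helper, not an H2O API.
object CategoricalSketch {
  // Returns (domain, codes): the sorted distinct labels and, for each row, the index of its label.
  def toCategorical(values: Array[Int]): (Array[String], Array[Int]) = {
    val levels = values.distinct.sorted
    val indexOf = levels.zipWithIndex.toMap
    (levels.map(_.toString), values.map(indexOf))
  }
}
```

For a 0/1 column such as CAPSULE, this yields exactly two levels, which is what a binomial model requires.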
We can also set the distribution explicitly. In the code below we set the distribution to Bernoulli:
import _root_.hex.genmodel.utils.DistributionFamily
gbmParams._distribution = DistributionFamily.bernoulli
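As background, a Bernoulli model is usually described as fitting a score on the log-odds scale, turning that score into a probability with the logistic function, and measuring the fit with log loss. The sketch below shows that arithmetic in plain Scala; the names are hypothetical and the code does not call H2O, so treat it as an illustration of the idea rather than of H2O's internals.

```scala
// Conceptual illustration only; BernoulliSketch is hypothetical and does not use H2O.
object BernoulliSketch {
  // Logistic (inverse logit) function: maps a log-odds score to a probability in (0, 1).
  def sigmoid(score: Double): Double = 1.0 / (1.0 + math.exp(-score))

  // Negative log-likelihood of one 0/1 label, given the predicted probability of class 1.
  def logLoss(label: Int, p1: Double): Double =
    if (label == 1) -math.log(p1) else -math.log(1.0 - p1)
}
```

A score of 0 corresponds to a probability of 0.5, and the loss grows as the predicted probability moves away from the true label.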
Now let's build our GBM model:
val gbm = new GBM(gbmParams, Key.make("gbmBinModel.hex"))
val gbmBinModel = gbm.trainModel.get
// Equivalent call with explicit parentheses (use one form or the other, not both):
// val gbmBinModel = gbm.trainModel().get()
Let's check the new model. It is a classification model, and specifically a binomial classification model, because its response has only 2 classes:
gbmBinModel
Now let's generate predictions using our GBM binomial classification model:
val predH2OFrame = gbmBinModel.score(prostateData)('predict)
val predFromModel = asRDD[DoubleHolder](predH2OFrame).collect.map(_.result.getOrElse(Double.NaN))
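A quick sanity check on the collected predictions is to compare them with the actual class of each row. The sketch below computes plain accuracy in Scala. It assumes the predictions arrive as class indices (0.0 or 1.0), which depends on how the categorical predict column is converted, and the name ClassificationCheck is hypothetical.

```scala
// Hypothetical helper for illustration; it does not depend on H2O or Spark.
object ClassificationCheck {
  // Fraction of rows where the predicted class index equals the actual class index.
  def accuracy(predicted: Array[Double], actual: Array[Double]): Double = {
    require(predicted.length == actual.length, "predicted and actual must have the same length")
    require(predicted.nonEmpty, "need at least one prediction")
    predicted.zip(actual).count { case (p, a) => p == a }.toDouble / predicted.length
  }
}
```

As with the regression model, the predictions here are made on the training data, so accuracy measured this way is likely to be optimistic.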
That's all, enjoy!!