H2O AutoML examples in python and Scala

AutoML is included in H2O version 3.14.0.1 and above. You can learn more about AutoML in the H2O blog here.

H2O’s AutoML can be used to automate a large part of the machine learning workflow, including automatic training and tuning of many models within a user-specified time limit. The user can also use a performance-metric-based stopping criterion for the AutoML process rather than a specific time constraint. Stacked Ensembles will be automatically trained on the collection of individual models to produce a highly predictive ensemble model which, in most cases, will be the top-performing model on the AutoML Leaderboard.

Here is the full working Python code, taken from here:

import h2o
from h2o.automl import H2OAutoML

h2o.init()
df = h2o.import_file("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate.csv")
train, test = df.split_frame(ratios=[.9])
# Identify predictors and response
x = train.columns
y = "CAPSULE"
x.remove(y)

# For binary classification, response should be a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()

# Run AutoML for 60 seconds
aml = H2OAutoML(max_runtime_secs = 60)
aml.train(x = x, y = y, training_frame = train, leaderboard_frame = test)

# View the AutoML Leaderboard
aml.leaderboard
aml.leader

# To generate predictions on a test set, use the `H2OAutoML` object directly, or call predict on the leader model, as below:
preds = aml.predict(test)
# or
preds = aml.leader.predict(test)

Here is the full working Scala code:

import ai.h2o.automl.AutoML;
import ai.h2o.automl.AutoMLBuildSpec
import org.apache.spark.h2o._
val h2oContext = H2OContext.getOrCreate(sc)
import h2oContext._
import java.io.File
import h2oContext.implicits._
import water.Key
val prostateData = new H2OFrame(new File("/Users/avkashchauhan/src/github.com/h2oai/sparkling-water/examples/smalldata/prostate.csv"))
val autoMLBuildSpec = new AutoMLBuildSpec()
autoMLBuildSpec.input_spec.training_frame = prostateData
autoMLBuildSpec.input_spec.response_column = "CAPSULE";
autoMLBuildSpec.build_control.loss = "AUTO"
autoMLBuildSpec.build_control.stopping_criteria.set_max_runtime_secs(5)
import java.util.Date;
val aml = AutoML.makeAutoML(Key.make(), new Date(), autoMLBuildSpec)
AutoML.startAutoML(aml)
// Note: In some cases the above call is non-blocking,
// so the following alternative will block until the AutoML run completes:
AutoML.startAutoML(autoMLBuildSpec).get()  // This is a forced blocking call
aml.leader
aml.leaderboard

If you want to see the full code execution, visit here.

That’s it, enjoy!!


Building Regression and Classification GBM models in Scala with H2O

In the full code below you will learn to build H2O GBM models (regression and binomial classification) in Scala.

Let’s first import all the classes we need for this project:

import org.apache.spark.SparkFiles
import org.apache.spark.h2o._
import org.apache.spark.examples.h2o._
import org.apache.spark.sql.{DataFrame, SQLContext}
import water.Key
import java.io.File

import water.support.SparkContextSupport.addFiles
import water.support.H2OFrameSupport._

// Create SQL support
implicit val sqlContext = spark.sqlContext
import sqlContext.implicits._

Next we need to start the H2O cluster so we can start using H2O APIs:

// Start H2O services
val h2oContext = H2OContext.getOrCreate(sc)
import h2oContext._
import h2oContext.implicits._

Now we need to ingest the data which we will use for modeling:

// Import prostate data into H2O
val prostateData = new H2OFrame(new File("/Users/avkashchauhan/src/github.com/h2oai/sparkling-water/examples/smalldata/prostate.csv"))

// Understanding our input data
prostateData.names
prostateData.numCols
prostateData.numRows
prostateData.keys
prostateData.key

Now we will import some H2O specific classes we need to perform our actions:

import h2oContext.implicits._
import _root_.hex.tree.gbm.GBM
import _root_.hex.tree.gbm.GBMModel.GBMParameters

Let’s set up the GBM parameters which will shape our GBM modeling process:

val gbmParams = new GBMParameters()
gbmParams._train = prostateData
gbmParams._response_column = 'CAPSULE

In the response column setting above, the column “CAPSULE” is numeric, so by default the GBM model will be a regression model. Let’s start building the GBM model now:

val gbm = new GBM(gbmParams,Key.make("gbmRegModel.hex"))
val gbmRegModel = gbm.trainModel.get
// Same as above
val gbmRegModel = gbm.trainModel().get()

Let’s get to know our GBM model; we will see that the type of this model is “regression”:

gbmRegModel

Let’s perform prediction using the GBM regression model:

val predH2OFrame = gbmRegModel.score(prostateData)('predict)
val predFromModel = asRDD[DoubleHolder](predH2OFrame).collect.map(_.result.getOrElse(Double.NaN))

Now we will set up the input data set for a GBM classification model. Below we set the response column to a categorical type, so all the values in this column become enum levels instead of numbers; this way we can make sure that the GBM model we build will be a classification model:

prostateData.names()
//
// >>> res6: Array[String] = Array(ID, CAPSULE, AGE, RACE, DPROS, DCAPS, PSA, VOL, GLEASON)
// Based on the above, CAPSULE is at column index 1
// Note: If we do not set the response variable to categorical we will see the following exception:
//        - water.exceptions.H2OModelBuilderIllegalArgumentException:
//             - Illegal argument(s) for GBM model: gbmModel.hex.  Details: ERRR on field: _distribution: Binomial requires the response to be a 2-class categorical

withLockAndUpdate(prostateData){ fr => fr.replace(1, fr.vec("CAPSULE").toCategoricalVec)}

gbmParams._response_column = 'CAPSULE

We can also set the distribution explicitly. In the code below we set the distribution to Bernoulli:

import _root_.hex.genmodel.utils.DistributionFamily
gbmParams._distribution = DistributionFamily.bernoulli

Now let’s build our GBM model:

val gbm = new GBM(gbmParams,Key.make("gbmBinModel.hex"))
val gbmBinModel = gbm.trainModel.get
// Same as above
val gbmBinModel = gbm.trainModel().get()

Let’s check the new model; we will find that it is a classification model, specifically binomial classification, because its response has only 2 classes:

gbmBinModel

Now let’s perform prediction using our GBM binomial classification model as below:

val predH2OFrame = gbmBinModel.score(prostateData)('predict)
val predFromModel = asRDD[DoubleHolder](predH2OFrame).collect.map(_.result.getOrElse(Double.NaN))

That’s all, enjoy!!


Scala Example with Grid Search and Hyperparameters for GBM in H2O

Here is the full source code for the GBM Scala example performing grid search and hyperparameter optimization with H2O (here is the GitHub code as well):

import org.apache.spark.SparkFiles
import org.apache.spark.h2o._
import org.apache.spark.examples.h2o._
import org.apache.spark.sql.{DataFrame, SQLContext}
import water.Key
import java.io.File

import water.support.SparkContextSupport.addFiles
import water.support.H2OFrameSupport._

// Create SQL support
implicit val sqlContext = spark.sqlContext
import sqlContext.implicits._

// Start H2O services
val h2oContext = H2OContext.getOrCreate(sc)
import h2oContext._
import h2oContext.implicits._

// Register files to SparkContext
addFiles(sc,
 "/Users/avkashchauhan/src/github.com/h2oai/sparkling-water/examples/smalldata/year2005.csv.gz",
 "/Users/avkashchauhan/src/github.com/h2oai/sparkling-water/examples/smalldata/Chicago_Ohare_International_Airport.csv")

// Import all year airlines data into H2O
val airlinesData = new H2OFrame(new File(SparkFiles.get("year2005.csv.gz")))

// Import weather data into Spark
val wrawdata = sc.textFile(SparkFiles.get("Chicago_Ohare_International_Airport.csv"),8).cache()
val weatherTable = wrawdata.map(_.split(",")).map(row => WeatherParse(row)).filter(!_.isWrongRow())

// Transfer data from H2O to Spark DataFrame
val airlinesTable = h2oContext.asDataFrame(airlinesData).map(row => AirlinesParse(row))
val flightsToORD = airlinesTable.filter(f => f.Dest==Some("ORD"))

// Use Spark SQL to join flight and weather data in spark
flightsToORD.toDF.createOrReplaceTempView("FlightsToORD")
weatherTable.toDF.createOrReplaceTempView("WeatherORD")

// Perform SQL Join on both tables
val bigTable = sqlContext.sql(
 """SELECT
 |f.Year,f.Month,f.DayofMonth,
 |f.CRSDepTime,f.CRSArrTime,f.CRSElapsedTime,
 |f.UniqueCarrier,f.FlightNum,f.TailNum,
 |f.Origin,f.Distance,
 |w.TmaxF,w.TminF,w.TmeanF,w.PrcpIn,w.SnowIn,w.CDD,w.HDD,w.GDD,
 |f.IsDepDelayed
 |FROM FlightsToORD f
 |JOIN WeatherORD w
 |ON f.Year=w.Year AND f.Month=w.Month AND f.DayofMonth=w.Day""".stripMargin)
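The Spark SQL statement above is a plain multi-key inner join on Year, Month, and day-of-month. A minimal pure-Python sketch of the same idea, using a few hypothetical toy rows (not the real airline/weather data), shows how each flight row picks up the weather columns for its date:

```python
# Toy rows standing in for the flights and weather tables (hypothetical values)
flights = [
    {"Year": 2005, "Month": 1, "DayofMonth": 3, "FlightNum": 101, "IsDepDelayed": "YES"},
    {"Year": 2005, "Month": 1, "DayofMonth": 4, "FlightNum": 102, "IsDepDelayed": "NO"},
]
weather = [
    {"Year": 2005, "Month": 1, "Day": 3, "TmaxF": 31, "PrcpIn": 0.2},
]

# Index the weather rows by the join key for O(1) lookup
widx = {(w["Year"], w["Month"], w["Day"]): w for w in weather}

# Inner join: keep only flights that have a matching weather row
joined = []
for f in flights:
    w = widx.get((f["Year"], f["Month"], f["DayofMonth"]))
    if w is not None:
        row = dict(f)
        row.update({k: v for k, v in w.items() if k not in ("Year", "Month", "Day")})
        joined.append(row)

print(joined)  # flight 101 merged with its weather record; flight 102 dropped
```

Spark performs the same matching in a distributed fashion; the point here is only the join semantics.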

val trainFrame:H2OFrame = bigTable
withLockAndUpdate(trainFrame){ fr => fr.replace(19, fr.vec("IsDepDelayed").toCategoricalVec)}

bigTable.numCols
bigTable.numRows

import h2oContext.implicits._
import _root_.hex.tree.gbm.GBM
import _root_.hex.tree.gbm.GBMModel.GBMParameters

val gbmParams = new GBMParameters()

gbmParams._train = trainFrame
gbmParams._response_column = 'IsDepDelayed

import _root_.hex.genmodel.utils.DistributionFamily

gbmParams._distribution = DistributionFamily.bernoulli

val gbm = new GBM(gbmParams,Key.make("gbmModel.hex"))
val gbmModel = gbm.trainModel.get
// Same as above
val gbmModel = gbm.trainModel().get()

// Use model to estimate delay on training data
val predGBMH2OFrame = gbmModel.score(trainFrame)('predict)
val predGBMFromModel = asRDD[DoubleHolder](predGBMH2OFrame).collect.map(_.result.getOrElse(Double.NaN))

def let[A](in: A)(body: A => Unit) = {
 body(in)
 in
}

import _root_.hex.grid.GridSearch
import _root_.hex.ScoreKeeper

import water.Key
import scala.collection.JavaConversions._

val gbmHyperSpace: java.util.Map[String, Array[Object]] = Map[String, Array[AnyRef]](
 "_ntrees" -> (1 to 10).map(v => Int.box(100*v)).toArray,
 "_max_depth" -> (2 to 7).map(Int.box).toArray,
 "_learn_rate" -> Array(0.1, 0.01).map(Double.box),
 "_col_sample_rate" -> Array(0.3, 0.7, 1.0).map(Double.box),
 "_learn_rate_annealing" -> Array(0.8, 0.9, 0.95, 1.0).map(Double.box)
)
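To see why the random discrete search criteria with a time budget are used below rather than an exhaustive search, it helps to count the full Cartesian product of the hyperparameter space above. A quick back-of-the-envelope count in plain Python (the axis sizes are read off the Scala map above):

```python
from math import prod

# Number of values on each hyperparameter axis, from the grid definition above
axis_sizes = {
    "_ntrees": 10,                 # 100, 200, ..., 1000
    "_max_depth": 6,               # 2 through 7
    "_learn_rate": 2,              # 0.1, 0.01
    "_col_sample_rate": 3,         # 0.3, 0.7, 1.0
    "_learn_rate_annealing": 4,    # 0.8, 0.9, 0.95, 1.0
}

total = prod(axis_sizes.values())
print(total)  # 1440 candidate models in the full grid
```

Training 1440 GBM models exhaustively would take far longer than the 4-minute budget set below, which is exactly what RandomDiscreteValueSearchCriteria is for.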

// @Snippet
import _root_.hex.grid.HyperSpaceSearchCriteria.RandomDiscreteValueSearchCriteria

val gbmHyperSpaceCriteria = let(new RandomDiscreteValueSearchCriteria) { c =>
 c.set_stopping_metric(ScoreKeeper.StoppingMetric.RMSE)
 c.set_stopping_tolerance(0.1)
 c.set_stopping_rounds(1)
 c.set_max_runtime_secs(4 * 60 /* seconds */)
}

//
// Note: the single-argument call below would also start a grid search,
// but it can run for a very long time:
// val gs = GridSearch.startGridSearch(null, gbmParams, gbmHyperSpace);
// 
val gbmGrid = GridSearch.startGridSearch(Key.make("gbmGridModel"),
 gbmParams,
 gbmHyperSpace,
 new GridSearch.SimpleParametersBuilderFactory[GBMParameters],
 gbmHyperSpaceCriteria).get()

// Training Frame Info
gbmGrid.getTrainingFrame

//
// Looking at grid models by Keys
//
val mKeys = gbmGrid.getModelKeys()
gbmGrid.createSummaryTable(mKeys, "mse", true);
gbmGrid.createSummaryTable(mKeys, "rmse", true);

// Model Count
gbmGrid.getModelCount

// All Models
gbmGrid.getModels
val ms = gbmGrid.getModels()
val gbm =ms(0)
val gbm =ms(1)
val gbm =ms(2)

// All hyper parameters
gbmGrid.getHyperNames

That’s it, enjoy!!


H2O backend and API processing through Rapids

An H2O cluster supports various frontends, i.e. Python, R, FLOW, etc., and all the functions at these frontends are handled by the H2O cluster backend through its API. Frontend actions are translated into API calls, and the H2O backend handles these APIs through Rapids expressions. Here we will look at how these APIs are handled on the backend.

Let’s start H2O from the command line, directly from h2o.jar:

$ java -jar h2o.jar

Now use Python to connect to H2O:

> import h2o

> h2o.init()

> h2o.ls()

Note: You will see that there are no keys in the result of h2o.ls().

> df = h2o.create_frame(cols=2, rows=5,integer_range=1,time_fraction=1)

> h2o.ls()

Note: Now you will see a new key shown as below:

key
0     py_32_sid_9613

Note: py_32_sid_9613 above is the ID, in H2O memory, of the frame we just created using the create_frame API.
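In this walkthrough the client-side keys all have the shape py_&lt;n&gt;_sid_&lt;session&gt;, where the _sid_ suffix matches the session_id seen later in the Rapids logs. The exact naming scheme is an H2O internal, so treat this as an observation, not a contract; a small hedged sketch that picks the pieces apart:

```python
import re

# Hedged assumption: keys observed here look like "py_<n>_sid_<session>";
# the "_sid_<session>" part matches the session_id in the Rapids logs.
def parse_key(key):
    m = re.fullmatch(r"(py)_(\d+)_sid_(\d+)", key)
    if not m:
        return None
    client, seq, session = m.groups()
    return {"client": client, "seq": int(seq), "session_id": "_sid_" + session}

print(parse_key("py_32_sid_9613"))
```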

> df

C1                     C2
2013-09-26 19:47:37    1995-01-01 16:14:34
1983-12-04 04:05:07    1974-09-08 23:06:41
2015-03-03 01:56:36    1982-11-03 19:21:53
1979-10-20 08:35:22    1987-10-09 14:24:59
1990-09-26 11:56:17    1981-08-16 04:23:02

> df.sort(['C1','C2'])

C1                     C2
1979-10-20 08:35:22    1987-10-09 14:24:59
1983-12-04 04:05:07    1974-09-08 23:06:41
1990-09-26 11:56:17    1981-08-16 04:23:02
2013-09-26 19:47:37    1995-01-01 16:14:34
2015-03-03 01:56:36    1982-11-03 19:21:53

> h2o.ls()

key
0     py_32_sid_9613
1     py_34_sid_9613

Note: As we ran the sort operation on the frame df, another temporary frame, py_34_sid_9613, was created. If you had assigned the sorted records to a new data frame ndf, as below, a frame holding its results would have been created as well:

> ndf = df.sort(['C1','C2'])

Now if you look at the H2O logs, you will see how the Rapids expressions are recorded:

09-08 11:10:33.204 10.0.0.46:54321 20753 #02927-14 INFO: 
    POST /99/Rapids, parms: {ast=(tmp= py_34_sid_9613 
        (sort py_32_sid_9613 ['C1' 'C2'])), session_id=_sid_9613}

Looking into the above logs we can understand the following:

The function sort was applied to frame py_32_sid_9613 with columns ['C1','C2'] as parameters, and the result of this operation is frame py_34_sid_9613.
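That decoding can be done mechanically. A minimal sketch that parses the Rapids AST from the log line above; the "(tmp= &lt;result&gt; (&lt;op&gt; &lt;source&gt; [cols]))" shape is taken from this one logged expression, and other Rapids expressions have different shapes:

```python
import re

# The Rapids AST exactly as it appeared in the POST /99/Rapids log entry above
ast = "(tmp= py_34_sid_9613 (sort py_32_sid_9613 ['C1' 'C2']))"

# Pull out: result frame, operation, source frame, and the column list
m = re.match(r"\(tmp= (\S+) \((\w+) (\S+) \[(.*)\]\)\)", ast)
result_frame, op, source_frame, cols = m.groups()
columns = re.findall(r"'([^']+)'", cols)

print(op, source_frame, "->", result_frame, columns)
```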

This is how you can decipher the H2O Rapids expression for any H2O API call you try.

That’s all, enjoy!!

Getting all categorical values for predictors in H2O POJO and MOJO models

Here is the Java/Scala code snippet which shows how you can get the categorical values for each enum/factor predictor from H2O POJO and MOJO Models:

To get the list of all column names in your POJO/MOJO model, you can try the following:

Imports:

import java.io.*;
import hex.genmodel.easy.RowData;
import hex.genmodel.easy.EasyPredictModelWrapper;
import hex.genmodel.easy.prediction.*;
import hex.genmodel.MojoModel;
import java.util.Arrays;

POJO:

// First use the POJO model class as below:
private static String modelClassName = "gbm_prostate_binomial";

// Then you can use the GenModel class to get the info you are looking for:
hex.genmodel.GenModel rawModel;
rawModel = (hex.genmodel.GenModel) Class.forName(modelClassName).newInstance();

// Now you can get the results as below:
System.out.println("isSupervised : " + rawModel.isSupervised());
System.out.println("Column Names : " + Arrays.toString(rawModel.getNames()));
System.out.println("Response ID : " + rawModel.getResponseIdx());
System.out.println("Number of columns : " + rawModel.getNumCols());
System.out.println("Response Name : " + rawModel.getResponseName());

// Printing all categorical values for each predictor
for (int i = 0; i < rawModel.getNumCols(); i++) 
{
 String[] domainValues = rawModel.getDomainValues(i);
 System.out.println(Arrays.toString(domainValues));
}
Output Results:
isSupervised : true
Column Names : [ID, AGE, RACE, DPROS, DCAPS, PSA, VOL, GLEASON]
Response ID : 8
Number of columns : 8
null
null
[0, 1, 2]
null
null
null
null
null
Note: A null value means the predictor is numeric; the categorical values are listed for each enum/factor predictor.
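The rule above is easy to apply programmatically. A toy Python sketch mirroring the output shown (with None standing in for Java's null) picks out just the categorical predictors and their levels:

```python
# Toy domains copied from the output above: getDomainValues(i) gave null
# for numeric predictors and the factor levels for categorical ones.
domains = [None, None, ["0", "1", "2"], None, None, None, None, None]
names = ["ID", "AGE", "RACE", "DPROS", "DCAPS", "PSA", "VOL", "GLEASON"]

# Keep only the predictors that have a domain, i.e. the enum/factor columns
categorical = {n: d for n, d in zip(names, domains) if d is not None}
print(categorical)  # {'RACE': ['0', '1', '2']}
```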

MOJO:

// Let's assume you have a MOJO model as gbm_prostate_binomial.zip.
// You would need to load your model as below:
hex.genmodel.GenModel mojo = MojoModel.load("gbm_prostate_binomial.zip");

// Now you can get the list of predictors as below:
System.out.println("isSupervised : " + mojo.isSupervised());
System.out.println("Column Names : " + Arrays.toString(mojo.getNames()));
System.out.println("Number of columns : " + mojo.getNumCols());
System.out.println("Response ID : " + mojo.getResponseIdx());
System.out.println("Response Name : " + mojo.getResponseName());

// Printing all categorical values for each predictor
for (int i = 0; i < mojo.getNumCols(); i++) {
 String[] domainValues = mojo.getDomainValues(i);
 System.out.println(Arrays.toString(domainValues));
 }
Output Results:
isSupervised : true
Column Names : [ID, AGE, RACE, DPROS, DCAPS, PSA, VOL, GLEASON]
Response ID : 8
Number of columns : 8
null
null
[0, 1, 2]
null
null
null
null
null
Note: A null value means the predictor is numeric; the categorical values are listed for each enum/factor predictor.

To get help on using MOJO and POJO models, visit the following:

That’s it, enjoy!!

Scoring with H2O MOJO model at command line with Java

If you have an H2O MOJO model, you can use it for scoring from Python or any other language just by using the Java runtime. This is a quick, hack-style way to do scoring on the command line or from Python. Here are a few examples.

What you will have:

  • H2O MOJO model (e.g. gbm_prostate_new.zip)
  • The H2O supporting class file for scoring, i.e. h2o-genmodel.jar
  • Your data set in JSON format to score, e.g. '{"AGE":"68","RACE":"2","DCAPS":"2","VOL":"0","GLEASON":"6"}'
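Hand-escaping the JSON quotes for the shell, as in the command below, is error-prone. A small sketch using Python's json module builds the properly quoted argument from a plain dict (the field names are the prostate columns used throughout this post):

```python
import json

# Build the scoring payload programmatically instead of hand-escaping quotes
row = {"AGE": "68", "RACE": "2", "DCAPS": "2", "VOL": "0", "GLEASON": "6"}
json_arg = json.dumps(row)
print(json_arg)
```

The resulting string can be passed as a single argument to the Java scorer without any backslash escaping.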


Here is command line way to perform scoring:

$ java -Xmx4g -cp .:/Users/avkashchauhan/src/github.com/h2oai/h2o-tutorials/tutorials/python_mojo_scoring/h2o-genmodel.jar:/Users/avkashchauhan/src/github.com/h2oai/h2o-tutorials/tutorials/python_mojo_scoring:genmodel.jar:/ water.util.H2OPredictor /Users/avkashchauhan/src/github.com/h2oai/h2o-tutorials/tutorials/python_mojo_scoring/gbm_prostate_new.zip '{\"AGE\":\"68\", \"RACE\":\"2\", \"DCAPS\":\"2\", \"VOL\":\"0\",\"GLEASON\":\"6\" }'

Here is the results of above command:

{"labelIndex":1,"label":"1","classProbabilities":[0.44056667027822005,0.55943332972178]}

Here is Python code to score by launching an external Java process:

> import subprocess

> gen_model_arg = '.:/Users/avkashchauhan/src/github.com/h2oai/h2o-tutorials/tutorials/python_mojo_scoring/h2o-genmodel.jar:/Users/avkashchauhan/src/github.com/h2oai/h2o-tutorials/tutorials/python_mojo_scoring:genmodel.jar:/'
> h2o_predictor_class = 'water.util.H2OPredictor'
> mojo_model_args = '/Users/avkashchauhan/src/github.com/h2oai/h2o-tutorials/tutorials/python_mojo_scoring/gbm_prostate_new.zip'
> json_data = '{"AGE":"68","RACE":"2","DCAPS":"2","VOL":"0","GLEASON":"6"}'

Calling the subprocess module (note that json_data must be a string, since every element of the argument list is passed to the command line as-is):

> output = subprocess.check_output(["java", "-Xmx4g", "-cp", gen_model_arg, h2o_predictor_class,
mojo_model_args, json_data], shell=False).decode()

## Generating output

> output

u'[ {"labelIndex":0,"label":"0",
    "classProbabilities":[0.8378244965684887,0.1621755034315113]} 
  ]\n'
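The scorer's output is itself JSON (an array with one result object per input row), so it can be parsed straight back into Python objects. A minimal sketch using the output string shown above:

```python
import json

# Raw scorer output, as returned by subprocess.check_output(...).decode() above
output = ('[ {"labelIndex":0,"label":"0",'
          '"classProbabilities":[0.8378244965684887,0.1621755034315113]} ]\n')

results = json.loads(output)          # trailing newline is ignored by the parser
probs = results[0]["classProbabilities"]
print(results[0]["label"], probs)

# Sanity check: the two class probabilities sum to 1
assert abs(sum(probs) - 1.0) < 1e-9
```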

That’s it, enjoy!!


Getting predictors from H2O POJO and MOJO models in Java and Scala

Here is the Java/Scala code snippet which shows how you can get the predictors and response details from H2O POJO and MOJO Models:

To get the list of all column names in your POJO/MOJO model, you can try the following:

Imports:

import java.io.*;
import hex.genmodel.easy.RowData;
import hex.genmodel.easy.EasyPredictModelWrapper;
import hex.genmodel.easy.prediction.*;
import hex.genmodel.MojoModel;
import java.util.Arrays;

POJO:

// First use the POJO model class as below:
private static String modelClassName = "gbm_prostate_binomial";

// Then you can use the GenModel class to get the info you are looking for:
hex.genmodel.GenModel rawModel;
rawModel = (hex.genmodel.GenModel) Class.forName(modelClassName).newInstance();

// Now you can get the results as below:
System.out.println("isSupervised : " + rawModel.isSupervised());
System.out.println("Column Names : " + Arrays.toString(rawModel.getNames()));

MOJO:

// Let's assume you have a MOJO model as gbm_prostate_binomial.zip.
// You would need to load your model as below:
hex.genmodel.GenModel mojo = MojoModel.load("gbm_prostate_binomial.zip");

// Now you can get the list of predictors as below:
System.out.println("isSupervised : " + mojo.isSupervised());
System.out.println("Column Names : " + Arrays.toString(mojo.getNames()));

To get help on using MOJO and POJO models, visit the following:

That’s it, enjoy!!