Using H2O models in Java for scoring or prediction

This sample generates a GBM model using the H2O library in R, and then consumes the model in Java for prediction.

Here is the R script to generate a sample model using H2O:

setwd("/tmp/resources/")
library(h2o)
h2o.init()
df = iris
h2o_df = as.h2o(df)
y = "Species"
x = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
model = h2o.gbm(y = y, x = x, training_frame = h2o_df)
model
h2o.download_mojo(model, get_genmodel_jar = TRUE)

Here is the Java code to use the model for prediction:

import hex.genmodel.easy.RowData;
import hex.genmodel.easy.EasyPredictModelWrapper;
import hex.genmodel.easy.prediction.*;
import hex.genmodel.MojoModel;

public class main {
    static void printIt(String message, MultinomialModelPrediction p) {
        System.out.println("");
        System.out.println(message);
        for (int i = 0; i < p.classProbabilities.length; i++) {
            if (i > 0) {
                System.out.print(",");
            }
            System.out.print(p.classProbabilities[i]);
        }
        System.out.println("");
    }
    public static void main(String[] args) throws Exception {
        EasyPredictModelWrapper model_orig = new EasyPredictModelWrapper(MojoModel.load("unzipped_orig"));
        {
            RowData row = new RowData();
            row.put("Sepal.Length", "1");
            row.put("Sepal.Width", "1");
            row.put("Petal.Length", "1");
            row.put("Petal.Width", "1");
            MultinomialModelPrediction p = model_orig.predictMultinomial(row);
            printIt("All 1s, orig", p);
        }
        {
            RowData row = new RowData();
            MultinomialModelPrediction p = model_orig.predictMultinomial(row);
            printIt("All NAs, orig", p);
        }
        {
            RowData row = new RowData();
            row.put("Sepal.Length", "1");
            row.put("sepwid", "1");
            row.put("Petal.Length", "1");
            row.put("Petal.Width", "1");

            MultinomialModelPrediction p = model_orig.predictMultinomial(row);
            printIt("Sepal width NA, orig", p);
        }
        // -------------------
        EasyPredictModelWrapper model_modified = new EasyPredictModelWrapper(MojoModel.load("unzipped_modified"));
        {
            RowData row = new RowData();
            row.put("Sepal.Length", "1");
            row.put("sepwid", "1");
            row.put("Petal.Length", "1");
            row.put("Petal.Width", "1");
            MultinomialModelPrediction p = model_modified.predictMultinomial(row);
            printIt("All 1s (with sepwid instead of Sepal.Width), modified", p);
        }
        {
            RowData row = new RowData();
            MultinomialModelPrediction p = model_modified.predictMultinomial(row);
            printIt("All NAs, modified", p);
        }
        {
            RowData row = new RowData();
            row.put("Sepal.Length", "1");
            row.put("Sepal.Width", "1");
            row.put("Petal.Length", "1");
            row.put("Petal.Width", "1");
            MultinomialModelPrediction p = model_modified.predictMultinomial(row);
            printIt("Sepal width NA (with Sepal.Width instead of sepwid), modified", p);
        }
    }
}

After the MOJO is downloaded, you can unzip it and see the model.ini file as below:

[info]
h2o_version = 3.10.4.8
mojo_version = 1.20
license = Apache License Version 2.0
algo = gbm
algorithm = Gradient Boosting Machine
endianness = LITTLE_ENDIAN
category = Multinomial
uuid = 7712689150025610456
supervised = true
n_features = 4
n_classes = 3
n_columns = 5
n_domains = 1
balance_classes = false
default_threshold = 0.5
prior_class_distrib = [0.3333333333333333, 0.3333333333333333, 0.3333333333333333]
model_class_distrib = [0.3333333333333333, 0.3333333333333333, 0.3333333333333333]
timestamp = 2017-05-23T08:19:42.961-07:00
n_trees = 50
n_trees_per_class = 3
distribution = multinomial
init_f = 0.0
offset_column = null

[columns]
Sepal.Length
Sepal.Width
Petal.Length
Petal.Width
Species

[domains]
4: 3 d000.txt

If you decide to modify model.ini by renaming a column (e.g., Sepal.Width to SepWid), the result looks as below:

[info]
h2o_version = 3.10.4.8
mojo_version = 1.20
license = Apache License Version 2.0
algo = gbm
algorithm = Gradient Boosting Machine
endianness = LITTLE_ENDIAN
category = Multinomial
uuid = 7712689150025610456
supervised = true
n_features = 4
n_classes = 3
n_columns = 5
n_domains = 1
balance_classes = false
default_threshold = 0.5
prior_class_distrib = [0.3333333333333333, 0.3333333333333333, 0.3333333333333333]
model_class_distrib = [0.3333333333333333, 0.3333333333333333, 0.3333333333333333]
timestamp = 2017-05-23T08:19:42.961-07:00
n_trees = 50
n_trees_per_class = 3
distribution = multinomial
init_f = 0.0
offset_column = null

[columns]
Sepal.Length
SepWid
Petal.Length
Petal.Width
Species

[domains]
4: 3 d000.txt
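Editing model.ini by hand is error-prone, so the rename can also be scripted. Below is a minimal pure-Python sketch (this is not part of the H2O API; the helper name is made up) that rewrites a column name inside the [columns] section of an unzipped MOJO's model.ini text:

```python
def rename_mojo_column(ini_text, old_name, new_name):
    """Rename a column in the [columns] section of model.ini text."""
    out, in_columns = [], False
    for line in ini_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            # Track whether we are inside the [columns] section
            in_columns = (stripped == "[columns]")
            out.append(line)
        elif in_columns and stripped == old_name:
            out.append(new_name)
        else:
            out.append(line)
    return "\n".join(out)

ini = ("[columns]\nSepal.Length\nSepal.Width\nPetal.Length\n"
       "Petal.Width\nSpecies\n\n[domains]\n4: 3 d000.txt")
patched = rename_mojo_column(ini, "Sepal.Width", "SepWid")
```

After patching, write the text back to model.ini in the unzipped MOJO directory and load that directory with MojoModel.load, as in the Java example above.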

Now we can run the Java code to test it, as below:

$ java -cp .:h2o-genmodel.jar main

All 1s, orig
0.7998234476072545,0.15127335891610785,0.04890319347663747

All NAs, orig
0.009344361534466918,0.9813250958541073,0.009330542611425827

Sepal width NA, orig
0.7704658301004306,0.19829292017147707,0.03124124972809238

All 1s (with sepwid instead of Sepal.Width), modified
0.7998234476072545,0.15127335891610785,0.04890319347663747

All NAs, modified
0.009344361534466918,0.9813250958541073,0.009330542611425827

Sepal width NA (with Sepal.Width instead of sepwid), modified
0.7704658301004306,0.19829292017147707,0.03124124972809238
That's it, enjoy!!

How to regularize the intercept in GLM

Sometimes you may want to emulate hierarchical modeling to achieve your objective; you can use beta_constraints as below:
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

h2o.init()
iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
bc = h2o.H2OFrame([("Intercept",-1000,1000,3,30)], column_names=["names","lower_bounds","upper_bounds","beta_given","rho"])
glm = H2OGeneralizedLinearEstimator(family = "gaussian", 
                                    beta_constraints=bc,
                                    standardize=False)
# Train with sepal_len as the response (matches the coefficient names below)
glm.train(x=["sepal_wid", "petal_len", "petal_wid", "class"], y="sepal_len", training_frame=iris)
glm.coef()
The output will look as below:
{u'Intercept': 3.000933645168297,
 u'class.Iris-setosa': 0.0,
 u'class.Iris-versicolor': 0.0,
 u'class.Iris-virginica': 0.0,
 u'petal_len': 0.4423526962303227,
 u'petal_wid': 0.0,
 u'sepal_wid': 0.37712042938039897}
There's more information in the GLM booklet linked below, but the short version is to create a new constraints frame with the columns names, lower_bounds, upper_bounds, beta_given, and rho, with one row entry per feature you want to constrain. You can use "Intercept" as a keyword to constrain the intercept.
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/GLMBooklet.pdf
names: (mandatory) coefficient names
lower_bounds: (optional) coefficient lower bounds, must be less than or equal to upper bounds
upper_bounds: (optional) coefficient upper bounds, must be greater than or equal to lower bounds
beta_given: (optional) specifies the given solution in the proximal operator interface
rho: (mandatory if beta_given is specified, otherwise ignored) specifies per-column L2 penalties on the distance from the given solution
If you want to go deeper into how these L1/L2 parameters work, here are more details:
What's happening is that an L2 penalty is applied between the coefficient and the given value. The proximal penalty is computed as Σ rho·(x − x′)². You can confirm this by setting rho to whatever lambda would be and setting lambda to 0; this gives the same result as setting lambda to that value.
You can use beta constraints to assign per-feature regularization strength, but only for the L2 penalty. The math is explained here:
sum_i rho[i] * L2norm2(beta[i] - beta_given[i])
So if you set beta_given to zero and set all rho (except for the intercept) to 1e-5, it is equivalent to running without beta constraints, just with alpha = 0 and lambda = 1e-5.
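As a quick sanity check of that equivalence, here is a tiny pure-Python sketch (illustration only, not H2O code) of the proximal term sum_i rho[i] * (beta[i] - beta_given[i])^2, showing that with beta_given = 0 and a constant rho it reduces to a plain ridge penalty:

```python
def proximal_penalty(beta, beta_given, rho):
    """Per-coefficient L2 distance penalty: sum_i rho[i] * (beta[i] - beta_given[i])**2."""
    return sum(r * (b - g) ** 2 for b, g, r in zip(beta, beta_given, rho))

beta = [0.44, 0.38, 0.0]
lam = 1e-5

# With beta_given = 0 and rho = lam for every coefficient, the proximal
# penalty equals the ridge penalty lam * ||beta||^2 (i.e. alpha=0, lambda=lam).
ridge = lam * sum(b * b for b in beta)
prox = proximal_penalty(beta, [0.0] * 3, [lam] * 3)
```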
That's it, enjoy!!

Multinomial classification example in Scala and Deep Learning with H2O

Here is a sample for a multinomial classification problem using the H2O Deep Learning algorithm and the iris dataset, written in Scala. The sample was created using Spark 2.1.0 with Sparkling Water 2.1.4.

import org.apache.spark.h2o._
import water.support.SparkContextSupport.addFiles
import org.apache.spark.SparkFiles
import java.io.File
import water.support.{H2OFrameSupport, SparkContextSupport, ModelMetricsSupport}
import water.Key
import _root_.hex.deeplearning.DeepLearningModel
import _root_.hex.ModelMetricsMultinomial


val hc = H2OContext.getOrCreate(sc)
import hc._
import hc.implicits._

addFiles(sc, "/Users/avkashchauhan/smalldata/iris/iris.csv")
val irisData = new H2OFrame(new File(SparkFiles.get("iris.csv")))

val ratios = Array[Double](0.8)
val keys = Array[String]("train.hex", "valid.hex")
val frs = H2OFrameSupport.split(irisData, keys, ratios)
val (train, valid) = (frs(0), frs(1))

def buildDLModel(train: Frame, valid: Frame, response: String,
 epochs: Int = 10, l1: Double = 0.001, l2: Double = 0.0,
 hidden: Array[Int] = Array[Int](200, 200))
 (implicit h2oContext: H2OContext): DeepLearningModel = {
 import h2oContext.implicits._
 // Build a model
 import _root_.hex.deeplearning.DeepLearning
 import _root_.hex.deeplearning.DeepLearningModel.DeepLearningParameters
 val dlParams = new DeepLearningParameters()
 dlParams._train = train
 dlParams._valid = valid
 dlParams._response_column = response
 dlParams._epochs = epochs
 dlParams._l1 = l1
 dlParams._l2 = l2
 dlParams._hidden = hidden
 // Create a job
 val dl = new DeepLearning(dlParams, Key.make("dlModel.hex"))
 dl.trainModel.get
}


// Note: The response column name is C5 here, so we pass it as the response:
val dlModel = buildDLModel(train, valid, 'C5)(hc)

// Collect model metrics and evaluate model quality
val trainMetrics = ModelMetricsSupport.modelMetrics[ModelMetricsMultinomial](dlModel, train)
val validMetrics = ModelMetricsSupport.modelMetrics[ModelMetricsMultinomial](dlModel, valid)
println(trainMetrics.rmse)
println(validMetrics.rmse)
println(trainMetrics.mse)
println(validMetrics.mse)
println(trainMetrics.r2)
println(validMetrics.r2)

That's it, enjoy!!

Tips for building H2O and Deep Water source code

Get source code:

Building the source code without tests:

Build the source code without tests (for both the H2O-3 and Deep Water sources):

$ ./gradlew build -x test

Build the Java developer version of the source code without tests (for both the H2O-3 and Deep Water sources):

$ ./gradlew build -x test

H2O tests use various small and large data files. You can download them separately, depending on the available space on your machine; the large datasets will take a good amount of disk space.

To download all the large test data files:

$ ./gradlew syncBigdataLaptop

To download all the small test data files:

$ ./gradlew syncSmalldata

Using with IntelliJ:

Pull the source code, import it as a project, and use Gradle as the build system. Once the project is loaded, if you want to do a test run, select the following:

h2o-app > src > main > java > water > H2OApp


That's it, enjoy!!

Using One-Hot Encoding and Stratified Sampling in Sparkling Water & H2O

Here is example Scala code which shows how to use the new Sparkling Water package to set up one-hot encoding and stratified sampling:

import org.apache.spark.h2o._
import water.Key
import water.etl.prims.advmath.AdvMath.StratifiedSplit
import water.etl.prims.mungers.Mungers.OneHotEncoder
import water.etl.prims.operators.Operators.Eq
import water.fvec.Frame
import water.fvec.H2OFrame
import java.net.URI

// Use the following lines if you decided to set up an external Sparkling Water cluster:
// val conf = new H2OConf(sc).setExternalClusterMode().useManualClusterStart().setCloudName("test")
//val hc = H2OContext.getOrCreate(sc, conf)

// OR use the following for basic configuration
val hc = H2OContext.getOrCreate(sc)
val fr = new H2OFrame(new java.io.File("/Users/avkashchauhan/src/github.com/h2oai/h2o-3/smalldata/airlines/AirlinesTest.csv.zip"))
val frOH = OneHotEncoder(fr, "Origin")
fr.add(frOH)   // Combine the one-hot encoded result with the original frame
val trainTestCol = StratifiedSplit(fr,"IsDepDelayed",0.2,123);
val idx = Eq(trainTestCol, "train")
// Get the subset of the frame according to the true/false values of the one-column boolean frame "idx"
val train = fr.deepSlice(idx, null)

val idx2 = Eq(trainTestCol, "test")
val test = fr.deepSlice(idx2, null)
println(train.toString(0L,10))
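To make the StratifiedSplit call above concrete, here is a minimal pure-Python sketch (illustration only, not the Sparkling Water ETL API) of a seeded stratified split that marks roughly test_ratio of each class as "test", preserving class proportions:

```python
import random

def stratified_split(labels, test_ratio=0.2, seed=123):
    """Return a 'train'/'test' marker per row, preserving class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    split = ["train"] * len(labels)
    for idxs in by_class.values():
        idxs = list(idxs)
        rng.shuffle(idxs)                          # seeded, reproducible shuffle
        n_test = int(round(len(idxs) * test_ratio))
        for i in idxs[:n_test]:                    # take test_ratio of each class
            split[i] = "test"
    return split

labels = ["YES"] * 80 + ["NO"] * 20
split = stratified_split(labels, test_ratio=0.2, seed=123)
```

Each class contributes its own 20% to the test partition, so an imbalanced response stays imbalanced in the same proportions on both sides of the split.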
That's it, enjoy!!

Programmatically deleting data frames from H2O memory

Sometimes when using H2O programmatically (mostly from Java or Scala), you may see that several intermediate data frames are persisted in memory, and you may need to free memory for better results. To delete frames, just list them out and, for each frame you want to delete, call its delete() or remove() method. There is no DKV update needed for that.

frame.delete()
frame.remove()
That’s it, enjoy!

Managing timestamps values with n-fold cross validation

When using cross-validation with n folds, the user can choose a specific column as the fold column. More details on fold columns are described here:

http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/fold_column.html

With a fold column, the splits are created as custom groupings based on the numerical or categorical values in the fold column.


Question: Is it wise to use a timestamp-based column as the fold column?

Answer: It is not advisable to feed a timestamp-based column directly to the fold_column argument. The main reason is that such a column has too many unique values, which will cause problems.

The best way to handle this scenario is to first create a new column based on the year/month/day of the given timestamp values, and then feed this newly created column as the fold column to your cross-validation process.
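In H2O itself you would typically derive such a column with the frame's date-part methods; as a language-neutral illustration, here is a minimal pure-Python sketch (the helper name is made up) that maps raw Unix timestamps to coarse "YYYY-MM" fold keys, collapsing many unique timestamps into a small number of groups:

```python
from datetime import datetime, timezone

def month_fold_key(ts_seconds):
    """Map a Unix timestamp (seconds) to a coarse 'YYYY-MM' fold key."""
    dt = datetime.fromtimestamp(ts_seconds, tz=timezone.utc)
    return "%04d-%02d" % (dt.year, dt.month)

# Three raw timestamps (seconds since epoch) reduce to month-level groups
timestamps = [0, 86400 * 40, 86400 * 400]
fold_keys = [month_fold_key(t) for t in timestamps]
```

The resulting column has one value per month rather than one per row, which is the kind of low-cardinality grouping a fold column needs.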

That's it, enjoy!!