Building a GBM model in R and exporting POJO and MOJO models

Get the dataset:

Training:

http://h2o-training.s3.amazonaws.com/pums2013/adult_2013_train.csv.gz

Test:

http://h2o-training.s3.amazonaws.com/pums2013/adult_2013_test.csv.gz

Here is the script to build a GBM model and export it as both POJO and MOJO:

library(h2o)
h2o.init()

# Importing Dataset
trainfile <- file.path("/Users/avkashchauhan/learn/adult_2013_train.csv.gz")
adult_2013_train <- h2o.importFile(trainfile, destination_frame = "adult_2013_train")
testfile <- file.path("/Users/avkashchauhan/learn/adult_2013_test.csv.gz")
adult_2013_test <- h2o.importFile(testfile, destination_frame = "adult_2013_test")

# Display Dataset
adult_2013_train
adult_2013_test

# Feature Engineering
actual_log_wagp <- h2o.assign(adult_2013_test[, "LOG_WAGP"], key = "actual_log_wagp")

for (j in c("COW", "SCHL", "MAR", "INDP", "RELP", "RAC1P", "SEX", "POBP")) {
 adult_2013_train[[j]] <- as.factor(adult_2013_train[[j]])
 adult_2013_test[[j]] <- as.factor(adult_2013_test[[j]])
}
predset <- c("RELP", "SCHL", "COW", "MAR", "INDP", "RAC1P", "SEX", "POBP", "AGEP", "WKHP", "LOG_CAPGAIN", "LOG_CAPLOSS")

# Building GBM Model:
log_wagp_gbm_grid <- h2o.gbm(x = predset,
 y = "LOG_WAGP",
 training_frame = adult_2013_train,
 model_id = "GBMModel",
 distribution = "gaussian",
 max_depth = 5,
 ntrees = 110,
 validation_frame = adult_2013_test)

log_wagp_gbm_grid

# Prediction 
h2o.predict(log_wagp_gbm_grid, adult_2013_test)

# Download POJO Model:
h2o.download_pojo(log_wagp_gbm_grid, "/Users/avkashchauhan/learn", get_genmodel_jar = TRUE)

# Download MOJO model:
h2o.download_mojo(log_wagp_gbm_grid, "/Users/avkashchauhan/learn", get_genmodel_jar = TRUE)

You will see GBM_model.java (the POJO model) and GBM_model.zip (the MOJO model) at the location where you chose to save them.

That's it, enjoy!

 

Using Cross-validation in Scala with H2O and getting each cross-validated model

Here is Scala code for binomial classification with GLM:

https://aichamp.wordpress.com/2017/04/23/binomial-classification-example-in-scala-and-gbm-with-h2o/

To add cross-validation, you can do the following:

def buildGLMModel(train: Frame, valid: Frame, response: String)
 (implicit h2oContext: H2OContext): GLMModel = {
 import _root_.hex.glm.GLMModel.GLMParameters.Family
 import _root_.hex.glm.GLM
 import _root_.hex.glm.GLMModel.GLMParameters
 val glmParams = new GLMParameters(Family.binomial)
 glmParams._train = train
 glmParams._valid = valid
 glmParams._nfolds = 3  // enable 3-fold cross-validation here
 glmParams._response_column = response
 glmParams._alpha = Array[Double](0.5)
 val glm = new GLM(glmParams, Key.make("glmModel.hex"))
 glm.trainModel().get()
}

To list the cross-validated models, try this:

scala> glmModel._output._cross_validation_models
res12: Array[water.Key[_ <: water.Keyed[_ <: AnyRef]]] = 
    Array(glmModel.hex_cv_1, glmModel.hex_cv_2, glmModel.hex_cv_3)

Now, to fetch each model from the distributed key-value store (DKV), do the following:

scala> val m1 = DKV.getGet("glmModel.hex_cv_1").asInstanceOf[GLMModel]

And you will see the following:

scala> val m1 = DKV.getGet("glmModel.hex_cv_1").asInstanceOf[GLMModel]
m1: hex.glm.GLMModel =
Model Metrics Type: BinomialGLM
 Description: N/A
 model id: glmModel.hex_cv_1
 frame id: glmModel.hex_cv_1_train
 MSE: 0.14714406
 RMSE: 0.38359362
 AUC: 0.7167627
 logloss: 0.4703465
 mean_per_class_error: 0.31526923
 default threshold: 0.27434438467025757
 CM: Confusion Matrix (vertical: actual; across: predicted):
 0 1 Error Rate
 0 10704 1651 0.1336 1,651 / 12,355
 1 1768 1790 0.4969 1,768 / 3,558
Totals 12472 3441 0.2149 3,419 / 15,913
Gains/Lift Table (Avg response rate: 22.36 %):
 Group Cumulative Data Fraction Lower Threshold Lift Cumulative Lift Response Rate Cumulative Response Rate Capture Rate Cumulative Capture Rate Gain Cumulative Gain
 1 0.01005467 0....
scala> val m2 = DKV.getGet("glmModel.hex_cv_2").asInstanceOf[GLMModel]
m2: hex.glm.GLMModel =
Model Metrics Type: BinomialGLM
 Description: N/A
 model id: glmModel.hex_cv_2
 frame id: glmModel.hex_cv_2_train
 MSE: 0.14598908
 RMSE: 0.38208517
 AUC: 0.7231473
 logloss: 0.46717605
 mean_per_class_error: 0.31456697
 default threshold: 0.29637953639030457
 CM: Confusion Matrix (vertical: actual; across: predicted):
 0 1 Error Rate
 0 11038 1395 0.1122 1,395 / 12,433
 1 1847 1726 0.5169 1,847 / 3,573
Totals 12885 3121 0.2025 3,242 / 16,006
Gains/Lift Table (Avg response rate: 22.32 %):
 Group Cumulative Data Fraction Lower Threshold Lift Cumulative Lift Response Rate Cumulative Response Rate Capture Rate Cumulative Capture Rate Gain Cumulative Gain
 1 0.01005873 0...
scala> val m3 = DKV.getGet("glmModel.hex_cv_3").asInstanceOf[GLMModel]
m3: hex.glm.GLMModel =
Model Metrics Type: BinomialGLM
 Description: N/A
 model id: glmModel.hex_cv_3
 frame id: glmModel.hex_cv_3_train
 MSE: 0.14626761
 RMSE: 0.38244948
 AUC: 0.7239823
 logloss: 0.46873763
 mean_per_class_error: 0.31437498
 default threshold: 0.28522220253944397
 CM: Confusion Matrix (vertical: actual; across: predicted):
 0 1 Error Rate
 0 10982 1490 0.1195 1,490 / 12,472
 1 1838 1771 0.5093 1,838 / 3,609
Totals 12820 3261 0.2070 3,328 / 16,081
Gains/Lift Table (Avg response rate: 22.44 %):
 Group Cumulative Data Fraction Lower Threshold Lift Cumulative Lift Response Rate Cumulative Response Rate Capture Rate Cumulative Capture Rate Gain Cumulative Gain
 1 0.01001182 0...
scala>

That's it, enjoy!

 

Generating a ROC curve in Scala from H2O binary classification models

You can use the blog post linked above to build a binomial classification GLM model.
To collect model metrics on the training frame, use the following:
val trainMetrics = ModelMetricsSupport.modelMetrics[ModelMetricsBinomial](glmModel, train)
Now you can access the model's AUC object (_auc) as below. Note: the _auc object holds an array of thresholds and, for each threshold, the corresponding false positive and true positive counts (use tab completion to list all of its members):
scala> trainMetrics._auc.
_auc   _gini      _n       _p     _tps      buildCM   defaultCM    defaultThreshold   forCriterion   frozenType   pr_auc   readExternal   reloadFromBytes   tn             tp      writeExternal   
_fps   _max_idx   _nBins   _ths   asBytes   clone     defaultErr   fn                 fp             maxF1        read     readJSON       threshold         toJsonString   write   writeJSON
In the above AUC object:
_fps  =  false positive counts, one per threshold
_tps  =  true positive counts, one per threshold
_ths  =  threshold values
_p    =  number of actual positives (trues)
_n    =  number of actual negatives (falses)
Now you can use these individual ROC-specific values as below to recreate the ROC curve:
trainMetrics._auc._fps
trainMetrics._auc._tps
trainMetrics._auc._ths
To print the whole array in the terminal for inspection, you just need the following:
val dd = trainMetrics._auc._fps
println(dd.mkString(" "))
You can also read the totals of actual positives (_p) and actual negatives (_n) directly:
scala> trainMetrics._auc._n
res42: Double = 2979.0

scala> trainMetrics._auc._p
res43: Double = 1711.0
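From these pieces, the ROC curve is simply the set of (FPR, TPR) points computed at every threshold: FPR = _fps/_n and TPR = _tps/_p. A minimal sketch of that arithmetic, shown in Python for brevity and using made-up counts (the real arrays come from trainMetrics._auc):

```python
# Recreate ROC points from the AUC object's arrays (illustrative values,
# not actual model output): tps[i]/fps[i] are the true/false positive
# counts at threshold ths[i]; p and n are total actual positives/negatives.
tps = [0, 400, 900, 1400, 1711]
fps = [0, 100, 500, 1500, 2979]
p, n = 1711.0, 2979.0

# Each threshold yields one ROC point: (FPR, TPR) = (fps/n, tps/p)
roc_points = [(fp / n, tp / p) for fp, tp in zip(fps, tps)]

for fpr, tpr in roc_points:
    print("FPR=%.4f TPR=%.4f" % (fpr, tpr))
```

Plotting these pairs (FPR on x, TPR on y) reproduces the ROC curve H2O shows in Flow.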
That's it, enjoy!

Using H2O models in Java for scoring or prediction

This sample generates a GBM model with the H2O R library and then consumes the model in Java for prediction.

Here is the R script to generate the sample model using H2O:

setwd("/tmp/resources/")
library(h2o)
h2o.init()
df = iris
h2o_df = as.h2o(df)
y = "Species"
x = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
model = h2o.gbm(y = y, x = x, training_frame = h2o_df)
model
h2o.download_mojo(model, get_genmodel_jar = TRUE)

Here is the Java code that uses the model for prediction:

import hex.genmodel.easy.RowData;
import hex.genmodel.easy.EasyPredictModelWrapper;
import hex.genmodel.easy.prediction.*;
import hex.genmodel.MojoModel;

public class main {
    static void printIt(String message, MultinomialModelPrediction p) {
        System.out.println("");
        System.out.println(message);
        for (int i = 0; i < p.classProbabilities.length; i++) {
            if (i > 0) {
                System.out.print(",");
            }
            System.out.print(p.classProbabilities[i]);
        }
        System.out.println("");
    }
    public static void main(String[] args) throws Exception {
        EasyPredictModelWrapper model_orig = new EasyPredictModelWrapper(MojoModel.load("unzipped_orig"));
        {
            RowData row = new RowData();
            row.put("Sepal.Length", "1");
            row.put("Sepal.Width", "1");
            row.put("Petal.Length", "1");
            row.put("Petal.Width", "1");
            MultinomialModelPrediction p = model_orig.predictMultinomial(row);
            printIt("All 1s, orig", p);
        }
        {
            RowData row = new RowData();
            MultinomialModelPrediction p = model_orig.predictMultinomial(row);
            printIt("All NAs, orig", p);
        }
        {
            RowData row = new RowData();
            row.put("Sepal.Length", "1");
            row.put("sepwid", "1");
            row.put("Petal.Length", "1");
            row.put("Petal.Width", "1");

            MultinomialModelPrediction p = model_orig.predictMultinomial(row);
            printIt("Sepal width NA, orig", p);
        }
        // -------------------
        EasyPredictModelWrapper model_modified = new EasyPredictModelWrapper(MojoModel.load("unzipped_modified"));
        {
            RowData row = new RowData();
            row.put("Sepal.Length", "1");
            row.put("sepwid", "1");
            row.put("Petal.Length", "1");
            row.put("Petal.Width", "1");
            MultinomialModelPrediction p = model_modified.predictMultinomial(row);
            printIt("All 1s (with sepwid instead of Sepal.Width), modified", p);
        }
        {
            RowData row = new RowData();
            MultinomialModelPrediction p = model_modified.predictMultinomial(row);
            printIt("All NAs, modified", p);
        }
        {
            RowData row = new RowData();
            row.put("Sepal.Length", "1");
            row.put("Sepal.Width", "1");
            row.put("Petal.Length", "1");
            row.put("Petal.Width", "1");
            MultinomialModelPrediction p = model_modified.predictMultinomial(row);
            printIt("Sepal width NA (with Sepal.Width instead of sepwid), modified", p);
        }
    }
}

After the MOJO is downloaded, unzip it (the Java code above loads the unzipped folders unzipped_orig and unzipped_modified); inside you will see the model.ini as below:

[info]
h2o_version = 3.10.4.8
mojo_version = 1.20
license = Apache License Version 2.0
algo = gbm
algorithm = Gradient Boosting Machine
endianness = LITTLE_ENDIAN
category = Multinomial
uuid = 7712689150025610456
supervised = true
n_features = 4
n_classes = 3
n_columns = 5
n_domains = 1
balance_classes = false
default_threshold = 0.5
prior_class_distrib = [0.3333333333333333, 0.3333333333333333, 0.3333333333333333]
model_class_distrib = [0.3333333333333333, 0.3333333333333333, 0.3333333333333333]
timestamp = 2017-05-23T08:19:42.961-07:00
n_trees = 50
n_trees_per_class = 3
distribution = multinomial
init_f = 0.0
offset_column = null

[columns]
Sepal.Length
Sepal.Width
Petal.Length
Petal.Width
Species

[domains]
4: 3 d000.txt

If you decide to modify model.ini by renaming a column (e.g. Sepal.Width to sepwid), you can do so as below:

[info]
h2o_version = 3.10.4.8
mojo_version = 1.20
license = Apache License Version 2.0
algo = gbm
algorithm = Gradient Boosting Machine
endianness = LITTLE_ENDIAN
category = Multinomial
uuid = 7712689150025610456
supervised = true
n_features = 4
n_classes = 3
n_columns = 5
n_domains = 1
balance_classes = false
default_threshold = 0.5
prior_class_distrib = [0.3333333333333333, 0.3333333333333333, 0.3333333333333333]
model_class_distrib = [0.3333333333333333, 0.3333333333333333, 0.3333333333333333]
timestamp = 2017-05-23T08:19:42.961-07:00
n_trees = 50
n_trees_per_class = 3
distribution = multinomial
init_f = 0.0
offset_column = null

[columns]
Sepal.Length
sepwid
Petal.Length
Petal.Width
Species

[domains]
4: 3 d000.txt

Now compile and run the Java code to test it:

$ javac -cp h2o-genmodel.jar main.java
$ java -cp .:h2o-genmodel.jar main

All 1s, orig
0.7998234476072545,0.15127335891610785,0.04890319347663747

All NAs, orig
0.009344361534466918,0.9813250958541073,0.009330542611425827

Sepal width NA, orig
0.7704658301004306,0.19829292017147707,0.03124124972809238

All 1s (with sepwid instead of Sepal.Width), modified
0.7998234476072545,0.15127335891610785,0.04890319347663747

All NAs, modified
0.009344361534466918,0.9813250958541073,0.009330542611425827

Sepal width NA (with Sepal.Width instead of sepwid), modified
0.7704658301004306,0.19829292017147707,0.03124124972809238
That's it, enjoy!

Binomial classification example in Scala and GBM with H2O

Here is a sample binomial classification problem using the H2O GLM algorithm on the Credit Card Clients dataset, written in Scala. This sample was created using Spark 2.1.0 with Sparkling Water 2.1.4.

import org.apache.spark.h2o._
import water.support.SparkContextSupport.addFiles
import org.apache.spark.SparkFiles
import java.io.File
import water.support.{H2OFrameSupport, SparkContextSupport, ModelMetricsSupport}
import water.Key
import _root_.hex.glm.GLMModel
import _root_.hex.ModelMetricsBinomial


val hc = H2OContext.getOrCreate(sc)
import hc._
import hc.implicits._

addFiles(sc, "/Users/avkashchauhan/learn/deepwater/credit_card_clients.csv")
val creditCardData = new H2OFrame(new File(SparkFiles.get("credit_card_clients.csv")))

val ratios = Array[Double](0.8)
val keys = Array[String]("train.hex", "valid.hex")
val frs = H2OFrameSupport.split(creditCardData, keys, ratios)
val (train, valid) = (frs(0), frs(1))

def buildGLMModel(train: Frame, valid: Frame, response: String)
 (implicit h2oContext: H2OContext): GLMModel = {
 import _root_.hex.glm.GLMModel.GLMParameters.Family
 import _root_.hex.glm.GLM
 import _root_.hex.glm.GLMModel.GLMParameters
 val glmParams = new GLMParameters(Family.binomial)
 glmParams._train = train
 glmParams._valid = valid
 glmParams._response_column = response
 glmParams._alpha = Array[Double](0.5)
 val glm = new GLM(glmParams, Key.make("glmModel.hex"))
 glm.trainModel().get()
 //val glmModel = glm.trainModel().get()
}

val glmModel = buildGLMModel(train, valid, "default_payment_next_month")(hc)

// Collect model metrics and evaluate model quality
val trainMetrics = ModelMetricsSupport.modelMetrics[ModelMetricsBinomial](glmModel, train)
val validMetrics = ModelMetricsSupport.modelMetrics[ModelMetricsBinomial](glmModel, valid)
println(trainMetrics.rmse)
println(validMetrics.rmse)
println(trainMetrics.mse)
println(validMetrics.mse)
println(trainMetrics.r2)
println(validMetrics.r2)
println(trainMetrics.auc)
println(validMetrics.auc)

// Prediction
addFiles(sc, "/Users/avkashchauhan/learn/deepwater/credit_card_predict.csv")
val creditPredictData = new H2OFrame(new File(SparkFiles.get("credit_card_predict.csv")))

val predictionFrame = glmModel.score(creditPredictData)
val predictionResults = asRDD[DoubleHolder](predictionFrame).collect.map(_.result.getOrElse(Double.NaN))

That's it, enjoy!

Cross-validation example with time-series data in R and H2O

What is cross-validation? In k-fold cross-validation, the original sample is randomly partitioned into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. Learn more on Wikipedia.
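The random partitioning just described is pure index bookkeeping; this Python snippet is only an illustration of the k-fold split logic, not H2O's implementation:

```python
# Sketch of k-fold partitioning: shuffle row indices, split them into k
# equal-sized folds; each fold serves once as validation while the
# remaining k-1 folds form the training set.
import random

def kfold_indices(n_rows, k, seed=42):
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    fold_size = n_rows // k
    folds = [idx[i * fold_size:(i + 1) * fold_size] for i in range(k)]
    splits = []
    for i in range(k):
        valid = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        splits.append((train, valid))
    return splits

splits = kfold_indices(9, 3)
for train, valid in splits:
    print(len(train), len(valid))  # prints "6 3" for each of the 3 folds
```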

When you have time-series data, splitting it into random rows does not work because the temporal ordering of your data would be mangled, so cross-validation with a time-series dataset is done differently: each validation fold is a consecutive window of time that follows the training data.
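This window-based scheme (often called rolling-origin evaluation) can be sketched the same way; the Python snippet below is illustrative only, showing the index logic that the R script below applies with a 14-day window on the Date column:

```python
# Rolling-origin splits for time series: never shuffle; train on rows
# [0, end), validate on the next `window` rows, then fold the window
# into the training set and repeat until the data is exhausted.
def rolling_origin_splits(n_rows, initial_train, window):
    splits = []
    end = initial_train
    while end < n_rows:
        valid_end = min(end + window, n_rows)
        splits.append((list(range(0, end)), list(range(end, valid_end))))
        end = valid_end
    return splits

for train, valid in rolling_origin_splits(10, 5, 2):
    print(train[-1], valid)
```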

The following R script shows how the data is split sequentially and then passed as a validation frame to the algorithm in H2O.

library(h2o)
h2o.init(strict_version_check = FALSE)

# show general information on the airquality dataset
colnames(airquality)
dim(airquality)
print(paste("number of months: ", length(unique(airquality$Month)), sep=""))

# add a year column, so you can create a month/day/year date stamp
airquality$Year <- rep(2017, nrow(airquality))
airquality$Date <- as.Date(with(airquality, paste(Year, Month, Day, sep="-")), "%Y-%m-%d")

# sort the dataset by date
airquality <- airquality[order(airquality$Date),]

# convert the date to unix time before converting to an H2OFrame
airquality$Date <- as.numeric(as.POSIXct(airquality$Date, origin="1970-01-01", tz = "GMT"))

# convert to an h2o dataframe
air_h2o <- as.h2o(airquality)

# specify the features and the target column
target <- "Ozone"
features <- c("Solar.R", "Wind", "Temp", "Month", "Day", "Date")

# split the dataset roughly in half: ceiling(153/2) = 77 rows
# (train on the first half of the dataset)
train_1 <- air_h2o[1:ceiling(dim(air_h2o)[1]/2),]

# calculate 14 days in unix time: one day is 86400 seconds
# in unix time (aka posix time, epoch time);
# use this variable to iterate forward 14 days at a time
add_14_days <- 86400*14

# initialize a counter for the while loop so you can keep track
# of which fold corresponds to which RMSE
counter <- 0

# iterate over the process of testing on the next two weeks,
# folding the test window into the training set after each loop
while (dim(train_1)[1] < dim(air_h2o)[1]){
    # take the last date in the Date column and add 14 days to it
    last_current_date <- train_1[nrow(train_1),]$Date
    new_end_date <- last_current_date + add_14_days

    # slice the next two-week window with a boolean mask
    mask <- air_h2o[,"Date"] > last_current_date
    mask_2 <- air_h2o[,"Date"] < new_end_date

    # multiply the masks to get their intersection
    final_mask <- mask*mask_2
    test_1 <- air_h2o[final_mask,]

    # build a basic gbm using the default parameters
    gbm_model <- h2o.gbm(x = features, y = target, training_frame = train_1, validation_frame = test_1, seed = 1234)

    # print the number of rows used for each dataset
    print(paste("number of rows used in test set: ", dim(test_1)[1], sep=""))
    print(paste("number of rows used in train set: ", dim(train_1)[1], sep=""))

    # print the validation metrics
    rmse_valid <- h2o.rmse(gbm_model, valid = TRUE)
    print(paste("your new rmse value on the validation set is: ", rmse_valid, " for fold #: ", counter, sep=""))

    # fold the test window into the training frame for the next iteration
    train_1 <- h2o.rbind(train_1, test_1)
    print(paste("shape of new training dataset: ", dim(train_1)[1], sep=""))

    counter <- counter + 1
}

That's all, enjoy!