Building GBM model in R and exporting POJO and MOJO model

Get the dataset:

Training:

http://h2o-training.s3.amazonaws.com/pums2013/adult_2013_train.csv.gz

Test:

http://h2o-training.s3.amazonaws.com/pums2013/adult_2013_test.csv.gz

Here is the script to build the GBM model and export it as both POJO and MOJO:

library(h2o)
h2o.init()

# Importing Dataset
trainfile <- file.path("/Users/avkashchauhan/learn/adult_2013_train.csv.gz")
adult_2013_train <- h2o.importFile(trainfile, destination_frame = "adult_2013_train")
testfile <- file.path("/Users/avkashchauhan/learn/adult_2013_test.csv.gz")
adult_2013_test <- h2o.importFile(testfile, destination_frame = "adult_2013_test")

# Display Dataset
adult_2013_train
adult_2013_test

# Feature Engineering
actual_log_wagp <- h2o.assign(adult_2013_test[, "LOG_WAGP"], key = "actual_log_wagp")

for (j in c("COW", "SCHL", "MAR", "INDP", "RELP", "RAC1P", "SEX", "POBP")) {
 adult_2013_train[[j]] <- as.factor(adult_2013_train[[j]])
 adult_2013_test[[j]] <- as.factor(adult_2013_test[[j]])
}
predset <- c("RELP", "SCHL", "COW", "MAR", "INDP", "RAC1P", "SEX", "POBP", "AGEP", "WKHP", "LOG_CAPGAIN", "LOG_CAPLOSS")

# Building GBM Model:
log_wagp_gbm_grid <- h2o.gbm(x = predset,
 y = "LOG_WAGP",
 training_frame = adult_2013_train,
 model_id = "GBMModel",
 distribution = "gaussian",
 max_depth = 5,
 ntrees = 110,
 validation_frame = adult_2013_test)

log_wagp_gbm_grid

# Prediction 
h2o.predict(log_wagp_gbm_grid, adult_2013_test)

# Download POJO Model:
h2o.download_pojo(log_wagp_gbm_grid, "/Users/avkashchauhan/learn", get_genmodel_jar = TRUE)

# Download MOJO model:
h2o.download_mojo(log_wagp_gbm_grid, "/Users/avkashchauhan/learn", get_genmodel_jar = TRUE)

You will see GBMModel.java (the POJO) and GBMModel.zip (the MOJO), both named after the model_id, in the folder you passed to the download functions.
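As a quick sanity check after the export, the expected file names can be derived from the model ID. This is a minimal base-R sketch, assuming the POJO and MOJO files are named after model_id and that the helper jar keeps the name h2o-genmodel.jar; expected_artifacts is a hypothetical helper, not part of the h2o package:

```r
# Hypothetical helper: build the paths we expect h2o.download_pojo() and
# h2o.download_mojo() to have written (assumes files are named after model_id).
expected_artifacts <- function(model_id, out_dir) {
  c(
    pojo     = file.path(out_dir, paste0(model_id, ".java")),
    mojo     = file.path(out_dir, paste0(model_id, ".zip")),
    genmodel = file.path(out_dir, "h2o-genmodel.jar")  # assumed jar name
  )
}

artifacts <- expected_artifacts("GBMModel", "/Users/avkashchauhan/learn")

# Named logical vector: TRUE for every file that is actually on disk
print(setNames(file.exists(artifacts), names(artifacts)))
```

Any FALSE entry points at a file that was not written to the folder you expected.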

That's it, enjoy!

 


Saving H2O models from R/Python API in Hadoop Environment

When you run H2O in a clustered environment such as Hadoop, h2o.saveModel() writes the model on a machine in the H2O cluster, which can be different from the machine where your R or Python client runs. That is why you may see the error “No such file or directory”. If you pass a plain path such as /tmp, the model is stored on the H2O node that your R session is connected to, not on the client machine.
Here is a good example to understand it better:
Step [1]: Start the H2O Hadoop driver in the EC2 environment as below:
[ec2-user@ip-10-0-104-179 ~]$ hadoop jar h2o-3.10.4.8-hdp2.6/h2odriver.jar -nodes 2 -mapperXmx 2g -output /usr/ec2-user/005
....
....
....
Open H2O Flow in your web browser: http://10.0.65.248:54323  <=== H2O is started.
Note: The hadoop command was run on IP address 10.0.104.179, but the H2O server is reported at 10.0.65.248.
Step [2]: Connect the R client to H2O:
> h2o.init(ip = "10.0.65.248", port = 54323, strict_version_check = FALSE)
Note: I used the IP address shown above to connect to the existing H2O cluster. The machine running the R client is a different one; its IP address is 34.208.200.16.
Step [3]: Saving H2O model:
h2o.saveModel(my.glm, path = "/tmp", force = TRUE)
So the model is saved on the 10.0.65.248 machine, even though the R client was running on 34.208.200.16.
[ec2-user@ip-10-0-65-248 ~]$ ll /tmp/GLM*
-rw-r--r-- 1 yarn hadoop 90391 Jun 2 20:02 /tmp/GLM_model_R_1496447892009_1
So make sure you have access to a folder on the machine where the H2O service is running, or save the model to HDFS, similar to the following:
h2o.saveModel(my.glm, path = "hdfs://ip-10-0-104-179.us-west-2.compute.internal/user/achauhan", force = TRUE)
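
The rule described in this example can be written down as a small helper that tells you where a given path ends up. This is only a base-R sketch; describe_save_target is a hypothetical name, and it assumes nothing more than what the example shows: paths starting with hdfs:// go to HDFS, and any other path is written on the local disk of the H2O node.

```r
# Hypothetical helper: report where h2o.saveModel() would write for a given
# path (assumption: "hdfs://" paths go to HDFS, others to the H2O node's disk).
describe_save_target <- function(path, h2o_ip) {
  if (grepl("^hdfs://", path)) {
    sprintf("HDFS: %s", path)
  } else {
    sprintf("local disk of H2O node %s: %s", h2o_ip, path)
  }
}

print(describe_save_target("/tmp", "10.0.65.248"))
print(describe_save_target("hdfs://ip-10-0-104-179.us-west-2.compute.internal/user/achauhan", "10.0.65.248"))
```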

That's it, enjoy!!

Using RESTful API to get POJO and MOJO models in H2O

 

REST endpoint for listing all models:

http://<hostname>:<port>/3/Models/

REST endpoint for the details of a specific model (JSON):

http://<hostname>:<port>/3/Models/model_name

REST endpoint for the POJO of a specific model:

http://<hostname>:<port>/3/Models.java/model_name

REST endpoint for the MOJO of a specific model:

http://<hostname>:<port>/3/Models/glm_model/mojo

Here is an example:

curl -X GET "http://localhost:54323/3/Models"
curl -X GET "http://localhost:54323/3/Models.java/deeplearning_model" > dl_model.java
curl -X GET "http://localhost:54323/3/Models/glm_model/mojo" > myglm_mojo.zip
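
The same URLs can be assembled from R before handing them to a download call. This is a sketch in base R; h2o_model_url is a hypothetical helper, and the Models.java path used for the POJO is an assumption about the REST API that you should check against your H2O version:

```r
# Hypothetical helper that builds the REST URLs for a model.
h2o_model_url <- function(host, port, model_id = NULL,
                          kind = c("json", "pojo", "mojo")) {
  kind <- match.arg(kind)
  base <- sprintf("http://%s:%d/3", host, as.integer(port))
  if (is.null(model_id)) {
    return(paste0(base, "/Models"))  # list all models
  }
  switch(kind,
    json = sprintf("%s/Models/%s", base, model_id),
    pojo = sprintf("%s/Models.java/%s", base, model_id),  # assumed POJO endpoint
    mojo = sprintf("%s/Models/%s/mojo", base, model_id)
  )
}

mojo_url <- h2o_model_url("localhost", 54323, "glm_model", "mojo")
print(mojo_url)

# With a running cluster, the MOJO could then be fetched in binary mode:
# download.file(mojo_url, destfile = "myglm_mojo.zip", mode = "wb")
```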

That's it, enjoy!!

Creating Partial Dependency Plot (PDP) in H2O

Starting with H2O 3.10.0.8, H2O provides partial dependency plots backed by Java code that does the multi-scoring of the dataset with the model. This makes creating a PDP much faster.

To get a PDP in H2O you need the model and the original dataset that was used to build the model. Here are a few ways to create a PDP:

If you want to generate PDP on a single column:

response = h2o.predict(model, data.pdp[, column_name])
To generate PDP on the original data set:
response = h2o.predict(model, data.pdp)
If you want to build a PDP directly from the model and dataset without using the PDP API, you can use the following code:
model = prostate.gbm
column_name = "AGE"
data.pdp = data.hex
bins = unique(h2o.quantile(data.hex[, column_name], probs = seq(0.05,1,0.05)) )
mean_responses = c()

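for (bin in bins) {
  # set the column to one bin value for every row
  data.pdp[, column_name] = bin
  # score the full frame so the other predictors keep their real values
  response = h2o.predict(model, data.pdp)
  # average the last prediction column over all rows
  mean_response = mean(response[, ncol(response)])
  mean_responses = c(mean_responses, mean_response)
}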

pdp_manual = data.frame(AGE = bins, mean_response = mean_responses)
plot(pdp_manual, type = "l")
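
The same loop can be tried without a running H2O cluster. In this base-R sketch the built-in mtcars data and an lm() fit stand in for the H2O frame and model; both are assumptions made purely for illustration, and the steps (fix one column at a bin value, score every row, average the predictions) are the ones described above:

```r
# mtcars and lm() stand in for the H2O frame and model (illustration only)
model <- lm(mpg ~ wt + hp, data = mtcars)
column_name <- "wt"
bins <- unique(quantile(mtcars[[column_name]], probs = seq(0.05, 1, 0.05)))

mean_responses <- sapply(bins, function(bin) {
  data_pdp <- mtcars                        # copy of the original data
  data_pdp[[column_name]] <- bin            # overwrite one column with the bin value
  mean(predict(model, newdata = data_pdp))  # average prediction over all rows
})

pdp_manual <- data.frame(wt = bins, mean_response = mean_responses)
print(pdp_manual)
# plot(pdp_manual, type = "l")
```

Because the stand-in model is linear, the resulting curve is a straight line; with a GBM the same procedure would show the non-linear shape learned by the model.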

That's it, enjoy!!

Grid Search for Naive Bayes in R using H2O

Here is sample R code that shows how to perform a grid search for the Naive Bayes algorithm using the H2O machine learning platform:

# H2O
library(h2o)
library(ggplot2)
library(data.table)
 
# initialize the cluster with all the threads available and 2 GB of memory
h2o.init(nthreads = -1, max_mem_size = "2g")
 
# Required variables (training and testing are assumed to be existing R data frames)
train.h2o <- as.h2o(training)
test.h2o <- as.h2o(testing)
names(train.h2o)
str(train.h2o)
 
y <-4
x <-c(5:16)
 
# specify the list of parameters
hyper_params <- list(
 laplace = c(0,0.5,1,2,3)
)
 
threshold =c(0.001,0.00001,0.0000001)
 
# performs the grid search
grid_id <-"dl_grid"
model_bayes_grid <- h2o.grid(
 algorithm = "naivebayes", # name of the algorithm
 grid_id = grid_id,
 training_frame = train.h2o,
 validation_frame = test.h2o,
 x = x,
 y = y,
 hyper_params = hyper_params
)
 
# find the best model and evaluate its performance
stopping_metric <- 'accuracy'
sorted_models <- h2o.getGrid(
 grid_id = grid_id,
 sort_by = stopping_metric,
 decreasing = TRUE
)
 
best_model<-h2o.getModel(sorted_models@model_ids[[1]])
best_model
 
h2o.confusionMatrix(best_model, valid = TRUE, metrics = 'accuracy')
 

auc <- h2o.auc(best_model, valid = TRUE)
fpr <- h2o.fpr( h2o.performance(best_model, valid = TRUE) )[['fpr']]
tpr <- h2o.tpr( h2o.performance(best_model, valid = TRUE) )[['tpr']]
ggplot( data.table(fpr = fpr, tpr = tpr), aes(fpr, tpr) ) +
 geom_line() + theme_bw()+ggtitle( sprintf('AUC: %f', auc) )
 

# To obtain the regularization parameter (laplace) of the best model, do the following:
best_model@parameters
best_model@parameters$laplace
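
The script defines a threshold vector but never passes it to the grid, so only the five laplace values are searched. If a second dimension is wanted, the size of the resulting grid can be checked in base R first. This is a sketch; mapping the threshold values to the min_sdev parameter of Naive Bayes is an assumption, not something the script above does:

```r
# Candidate hyperparameters; min_sdev reuses the values of the unused
# `threshold` vector (assumed mapping, for illustration only)
hyper_params <- list(
  laplace  = c(0, 0.5, 1, 2, 3),
  min_sdev = c(0.001, 0.00001, 0.0000001)
)

# A Cartesian search trains one model per row of this table
combos <- expand.grid(hyper_params)
print(nrow(combos))
print(head(combos))
```

Passing a list like this as hyper_params to h2o.grid would then search every combination instead of the laplace values alone.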

That's it, enjoy!!

Tips building H2O and Deep Water source code

Get source code:

Building Source code without test:

Build the source code without tests (For both H2O-3 and DeepWater source)

$ ./gradlew build -x test

Build the Java developer version of source code  without tests (For both H2O-3 and DeepWater source)

$ ./gradlew build -x test

H2O tests use various small and large data files, which you can download separately depending on the disk space available on your working machine. If you decide to download the large datasets, they will take up a good amount of disk space.

To download all the large test data files:

$ ./gradlew syncBigdataLaptop

To download all the small test data files:

$ ./gradlew syncSmalldata

Using with IntelliJ:

Pull the source code, import it as a project, and use Gradle as the build system. Once the project is loaded, if you just want to do a test run, select the following:

h2o-app > src > main > java > water > H2OApp


That's it, enjoy!!

Using H2O with Microsoft R Open on Linux Machine

Installation:

Microsoft R Open Page: https://mran.microsoft.com/open/

Ubuntu Download link: https://mran.microsoft.com/install/mro/3.3.3/microsoft-r-open-3.3.3.tar.gz

$ wget https://mran.microsoft.com/install/mro/3.3.3/microsoft-r-open-3.3.3.tar.gz
$ tar -xvf microsoft-r-open-3.3.3.tar.gz
$ cd microsoft-r-open
$ sudo bash install.sh

Installation will be done into the following folder:

$ ll /usr/lib64/microsoft-r/3.3/lib64/R/bin/

drwxr-xr-x 11 root root 4096 Apr 20 15:28 ./
drwxr-xr-x 4 root root 4096 Apr 20 15:28 ../
drwxr-xr-x 3 root root 4096 Apr 20 15:28 backup/
drwxr-xr-x 3 root root 4096 Apr 20 15:28 bin/
-rw-r--r-- 1 root root 18011 Mar 28 13:35 COPYING
drwxr-xr-x 4 root root 4096 Apr 20 15:28 doc/
drwxr-xr-x 2 root root 4096 Apr 20 15:28 etc/
drwxr-xr-x 3 root root 4096 Apr 20 15:28 include/
drwxr-xr-x 2 root root 4096 Apr 20 15:28 lib/
drwxr-xr-x 47 root root 4096 Apr 20 15:28 library/
drwxr-xr-x 2 root root 4096 Apr 20 15:28 modules/
drwxr-xr-x 13 root root 4096 Apr 20 15:28 share/
-rw-r--r-- 1 root root 46 Mar 28 13:35 SVN-REVISION

Note: If R is already installed on the machine, the Microsoft R link may not be created and the previous R will still be the one at /usr/bin/R. In that case, create a symbolic link as below.

Creating symbolic link:

$ sudo ln -s /usr/lib64/microsoft-r/3.3/lib64/R/bin/R /usr/bin/MSR

Launching R:

You just need to do the following:

$ R

If you have created the symbolic link then use the following

$ MSR

Installing RCurl, which is required by H2O:

> install.packages("RCurl")

Now install the latest H2O from the H2O download link (https://www.h2o.ai/download/):

> install.packages("h2o", type = "source", repos = (c("http://h2o-release.s3.amazonaws.com/h2o/rel-ueno/5/R")))
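
Before loading H2O it can help to confirm that both packages actually landed in the library of the R you launched (for example MSR rather than the older R). This is a small base-R sketch; missing_packages is a hypothetical helper name:

```r
# Hypothetical helper: return the names of packages that cannot be loaded
# from the current R library paths.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# An empty result means both packages are available
print(missing_packages(c("RCurl", "h2o")))
```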

Once H2O is installed you can use it. Here is the full execution log:

 

$ MSR

R version 3.3.3 (2017-03-06) -- "Another Canoe"
Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

Microsoft R Open 3.3.3
The enhanced R distribution from Microsoft
Microsoft packages Copyright (C) 2017 Microsoft Corporation
Using the Intel MKL for parallel mathematical computing(using 16 cores).
Default CRAN mirror snapshot taken on 2017-03-15.
See: https://mran.microsoft.com/.

> library(h2o)
----------------------------------------------------------------------
Your next step is to start H2O:
    > h2o.init()
For H2O package documentation, ask for help:
    > ??h2o
After starting H2O, you can use the Web UI at http://localhost:54321
For more information visit http://docs.h2o.ai
----------------------------------------------------------------------
Attaching package: ‘h2o’
The following objects are masked from ‘package:stats’:
    cor, sd, var
The following objects are masked from ‘package:base’:
    &&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames,
    colnames<-, ifelse, is.character, is.factor, is.numeric, log,
    log10, log1p, log2, round, signif, trunc

> h2o.init()
H2O is not running yet, starting it now...
Note: In case of errors look at the following log files:
    /tmp/Rtmpi229cI/h2o_avkash_started_from_r.out
    /tmp/Rtmpi229cI/h2o_avkash_started_from_r.err
openjdk version "1.8.0_121"
OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13)
OpenJDK 64-Bit Server VM (build 25.121-b13, mixed mode)

Starting H2O JVM and connecting: .. Connection successful!

R is connected to the H2O cluster:
    H2O cluster uptime: 2 seconds 536 milliseconds
    H2O cluster version: 3.10.4.
    H2O cluster version age: 22 hours and 35 minutes
    H2O cluster name: H2O_started_from_R_avkash_tco537
    H2O cluster total nodes: 1
    H2O cluster total memory: 26.67 GB
    H2O cluster total cores: 32
    H2O cluster allowed cores: 2
    H2O cluster healthy: TRUE
    H2O Connection ip: localhost
    H2O Connection port: 54321
    H2O Connection proxy: NA
    H2O Internal Security: FALSE
    R Version: R version 3.3.3 (2017-03-06)

Note: As started, H2O is limited to the CRAN default of 2 CPUs.
       Shut down and restart H2O as shown below to use all your CPUs.
           > h2o.shutdown()
           > h2o.init(nthreads = -1)

> h2o.clusterStatus()

Version: 3.10.4.5
Cluster name: H2O_started_from_R_avkash_tco537
Cluster size: 1
Cluster is locked

h2o healthy last_ping num_cpus sys_load
1 localhost/127.0.0.1:54321 TRUE 1.492729e+12 32 0.88
  mem_value_size free_mem pojo_mem swap_mem free_disk max_disk pid

1 0 28537698304 93668352 0 47189065728 235825790976 25530
  num_keys tcps_active open_fds rpcs_active