Setting H2O FLOW directory path in Sparkling Water

Sometimes you may want to back up H2O FLOW files to a source code repo or to a backup location. For that reason, you may want to change the default FLOW directory.

In H2O, the flag -flow_dir is used to set the local folder for FLOW files.

Note: You can always specify any H2O property by using system properties on the Spark driver/executors.

So to change the H2O FLOW directory used for saving, you can append the following to your Sparkling Water command line:

--conf spark.driver.extraJavaOptions="-Dai.h2o.flow_dir=/your/backup/location"
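For example, a full launch command (a sketch; bin/sparkling-shell and the backup path are placeholders for your own launcher and location) that sets the property on both the driver and the executors, per the note above:

bin/sparkling-shell \
  --conf spark.driver.extraJavaOptions="-Dai.h2o.flow_dir=/your/backup/location" \
  --conf spark.executor.extraJavaOptions="-Dai.h2o.flow_dir=/your/backup/location"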

 

That's it, thanks.

Generating ROC curve in SCALA from H2O binary classification models

You can use the following blog post to build a binomial classification GLM model:
To collect model metrics for the training data, use the following (the import paths assume the Sparkling Water support classes):
import water.support.ModelMetricsSupport
import _root_.hex.ModelMetricsBinomial
val trainMetrics = ModelMetricsSupport.modelMetrics[ModelMetricsBinomial](glmModel, train)
Now you can access the model AUC object (_auc) as below.
Note: the _auc object has an array of thresholds, and for each threshold it has the corresponding fps and tps counts
(use tab completion to list them all):
scala> trainMetrics._auc.
_auc   _gini      _n       _p     _tps      buildCM   defaultCM    defaultThreshold   forCriterion   frozenType   pr_auc   readExternal   reloadFromBytes   tn             tp      writeExternal   
_fps   _max_idx   _nBins   _ths   asBytes   clone     defaultErr   fn                 fp             maxF1        read     readJSON       threshold         toJsonString   write   writeJSON
In the above AUC object:
_fps  =  false positive counts, one per threshold
_tps  =  true positive counts, one per threshold
_ths  =  threshold values
_p    =  number of actual positives
_n    =  number of actual negatives
Now you can use the individual ROC-specific values below to recreate the ROC curve; for each threshold index i, the ROC point is (FPR, TPR) = (_fps[i]/_n, _tps[i]/_p):
trainMetrics._auc._fps
trainMetrics._auc._tps
trainMetrics._auc._ths
To print the whole array in the terminal for inspection, you just need the following:
val dd = trainMetrics._auc._fps
println(dd.mkString(" "))
You can also read the counts of actual positives (_p) and actual negatives (_n) directly:
scala> trainMetrics._auc._n
res42: Double = 2979.0

scala> trainMetrics._auc._p
res43: Double = 1711.0
That's it, enjoy!!

How to regularize intercept in GLM

Sometimes you may want to emulate hierarchical modeling to achieve your objective; for that you can use beta_constraints as below:
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
h2o.init()
iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
bc = h2o.H2OFrame([("Intercept", -1000, 1000, 3, 30)],
                  column_names=["names", "lower_bounds", "upper_bounds", "beta_given", "rho"])
glm = H2OGeneralizedLinearEstimator(family="gaussian",
                                    beta_constraints=bc,
                                    standardize=False)
glm.train(x=["sepal_wid", "petal_len", "petal_wid", "class"], y="sepal_len", training_frame=iris)
glm.coef()
The output will look like this:
{u'Intercept': 3.000933645168297,
 u'class.Iris-setosa': 0.0,
 u'class.Iris-versicolor': 0.0,
 u'class.Iris-virginica': 0.0,
 u'petal_len': 0.4423526962303227,
 u'petal_wid': 0.0,
 u'sepal_wid': 0.37712042938039897}
There's more information in the GLM booklet linked below, but the short version is to create a new constraints frame with the columns names, lower_bounds, upper_bounds, beta_given, and rho, and have one row entry per feature you want to constrain. You can use "Intercept" as a keyword to constrain the intercept.
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/GLMBooklet.pdf
names: (mandatory) coefficient names
lower_bounds: (optional) coefficient lower bounds, must be less than or equal to upper bounds
upper_bounds: (optional) coefficient upper bounds, must be greater than or equal to lower bounds
beta_given: (optional) specifies the given solution in the proximal operator interface
rho: (mandatory if beta_given is specified, otherwise ignored) specifies per-column L2 penalties on the distance from the given solution
If you want to go deeper and learn how these L1/L2 parameters work, here are more details:
What's happening is that an L2 penalty is being applied between the coefficient and the given value. The proximal penalty is computed as sum_i rho[i] * (beta[i] - beta_given[i])^2. You can confirm this by setting rho to whatever lambda would be and setting lambda to 0; this will give the same result as having set lambda to that value.
You can use beta constraints to assign per-feature regularization strength, but only for the L2 penalty. The math is:
sum_i rho[i] * (beta[i] - beta_given[i])^2
So if you set beta_given to zero and set all rho values (except for the intercept) to 1e-5, it is equivalent to running without beta constraints, just with alpha = 0 and lambda = 1e-5.
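A minimal sketch of that equivalence check, reusing the iris frame from above (the bounds here are just wide placeholders, and the dummy-expanded names for the class column are taken from the coef() output shown earlier):

# constraints frame: beta_given = 0 and rho = 1e-5 for every non-intercept coefficient
names = ["sepal_wid", "petal_len", "petal_wid",
         "class.Iris-setosa", "class.Iris-versicolor", "class.Iris-virginica"]
bc = h2o.H2OFrame([(n, -1e5, 1e5, 0, 1e-5) for n in names],
                  column_names=["names", "lower_bounds", "upper_bounds", "beta_given", "rho"])

# GLM with the proximal penalty only (lambda = 0)
glm_bc = H2OGeneralizedLinearEstimator(family="gaussian", standardize=False,
                                       lambda_=0, beta_constraints=bc)
glm_bc.train(x=["sepal_wid", "petal_len", "petal_wid", "class"], y="sepal_len", training_frame=iris)

# plain ridge GLM with alpha = 0, lambda = 1e-5
glm_l2 = H2OGeneralizedLinearEstimator(family="gaussian", standardize=False,
                                       alpha=0, lambda_=1e-5)
glm_l2.train(x=["sepal_wid", "petal_len", "petal_wid", "class"], y="sepal_len", training_frame=iris)

# the two coefficient sets should closely agree
print(glm_bc.coef())
print(glm_l2.coef())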
That's it, enjoy!!

Creating Partial Dependency Plot (PDP) in H2O

Starting from H2O 3.10.0.8, H2O added a partial dependency plot API, which has a Java backend to do the multi-scoring of the dataset with the model. This makes creating PDPs much faster.

To get a PDP in H2O, you need the model and the original data set used to generate the model. Here are a few ways to create a PDP:

If you want to generate a PDP for a single column, use the R PDP API, h2o.partialPlot:

h2o.partialPlot(object = model, data = data.hex, cols = column_name)

To generate PDPs for several columns of the original data set at once, pass a vector of column names:

h2o.partialPlot(object = model, data = data.hex, cols = c("AGE", "RACE"))
If you want to build a PDP directly from the model and dataset, without using the PDP API, you can use the following code:
# assuming prostate.gbm is a trained model and data.hex is the original training frame
model = prostate.gbm
column_name = "AGE"
data.pdp = data.hex
# use the unique 5%, 10%, ..., 100% quantiles of the column as grid points
bins = unique(h2o.quantile(data.hex[, column_name], probs = seq(0.05, 1, 0.05)))
mean_responses = c()

for (bin in bins) {
  # fix the column of interest to a constant value, then score the whole frame
  data.pdp[, column_name] = bin
  response = h2o.predict(model, data.pdp)
  mean_response = mean(response[, ncol(response)])
  mean_responses = c(mean_responses, mean_response)
}

pdp_manual = data.frame(AGE = bins, mean_response = mean_responses)
plot(pdp_manual, type = "l")
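For Python users, a roughly equivalent call is the Python client's partial_plot method on the model (a sketch; `model` and `data` are assumed to be a trained model and the original H2OFrame):

# returns one table per requested column when plot = False
pdp_tables = model.partial_plot(data = data, cols = ["AGE"], nbins = 20, plot = False)
print(pdp_tables[0])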
That's it, enjoy!!

Grid Search for Naive Bayes in R using H2O

Here is an R code sample showing how to perform a grid search for the Naive Bayes algorithm using the H2O machine learning platform:

# H2O
library(h2o)
library(ggplot2)
library(data.table)
 
# initialize the cluster with all the threads available and 2 GB of memory
h2o.init(nthreads = -1, max_mem_size = "2g")
 
# required variables ('training' and 'testing' are assumed to be existing R data frames)
train.h2o <- as.h2o(training)
test.h2o  <- as.h2o(testing)
names(train.h2o)
str(train.h2o)
 
# response column index and predictor column indices
y <- 4
x <- c(5:16)
 
# specify the list of hyperparameters
hyper_params <- list(
 laplace = c(0, 0.5, 1, 2, 3)
)

# (defined here but not used by the grid below)
threshold <- c(0.001, 0.00001, 0.0000001)
 
# perform the grid search
grid_id <- "nb_grid"
model_bayes_grid <- h2o.grid(
 algorithm = "naivebayes", # name of the algorithm
 grid_id = grid_id,
 training_frame = train.h2o,
 validation_frame = test.h2o,
 x = x,
 y = y,
 hyper_params = hyper_params
)
 
# find the best model and evaluate its performance
sort_metric <- 'accuracy'
sorted_models <- h2o.getGrid(
 grid_id = grid_id,
 sort_by = sort_metric,
 decreasing = TRUE
)
 
best_model<-h2o.getModel(sorted_models@model_ids[[1]])
best_model
 
h2o.confusionMatrix(best_model, valid = TRUE, metrics = 'accuracy')
 

auc <- h2o.auc(best_model, valid = TRUE)
fpr <- h2o.fpr( h2o.performance(best_model, valid = TRUE) )[['fpr']]
tpr <- h2o.tpr( h2o.performance(best_model, valid = TRUE) )[['tpr']]
ggplot( data.table(fpr = fpr, tpr = tpr), aes(fpr, tpr) ) +
 geom_line() + theme_bw()+ggtitle( sprintf('AUC: %f', auc) )
 

# to obtain the tuned laplace smoothing parameter, do the following:
best_model@parameters
best_model@parameters$laplace
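For comparison, here is a minimal Python sketch of the same grid search (hypothetical: it assumes H2OFrames train and test with a categorical response column named "y"):

import h2o
from h2o.estimators.naive_bayes import H2ONaiveBayesEstimator
from h2o.grid.grid_search import H2OGridSearch

h2o.init()
# train and test are assumed to be existing H2OFrames; "y" is the categorical response
x_cols = [c for c in train.columns if c != "y"]
grid = H2OGridSearch(model = H2ONaiveBayesEstimator,
                     grid_id = "nb_grid_py",
                     hyper_params = {"laplace": [0, 0.5, 1, 2, 3]})
grid.train(x = x_cols, y = "y", training_frame = train, validation_frame = test)

# sort by validation accuracy and pick the best model
sorted_grid = grid.get_grid(sort_by = "accuracy", decreasing = True)
best = sorted_grid.models[0]
print(best.actual_params["laplace"])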

That's it, enjoy!!

Filtering H2O data frame on multiple fields of date and int type

Let's create an H2O frame using the h2o.create_frame API (after importing and initializing h2o):

import h2o

h2o.init()
df = h2o.create_frame(time_fraction = .1, rows = 10, cols = 10)

The above will create a frame of 10 rows and 10 columns; based on the time_fraction value of 0.1, one out of the 10 provided columns will be a date/time column.


Here are a few example filtering scripts (in this run, C4 and C7 are numeric columns and C9 is the date/time column):

import datetime

df1 = df[(df['C4'] > 0) & (df['C7'] < 10)]
df2 = df[(df['C4'] > 0) & (df['C7'] < 10) & (df['C9'] > datetime.datetime(2000, 1, 1))]
df3 = df[((df['C4'] > 0) | (df['C7'] < 10)) & (df['C9'] > datetime.datetime(2000, 1, 1))]
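To verify that each filter did what you expect, you can check the remaining row counts (nrow is the H2OFrame row-count property):

print(df1.nrow, df2.nrow, df3.nrow)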


That's it, enjoy!!

Building high order polynomials with GLM for higher accuracy

Sometimes when building GLM models, you would like to configure GLM to search over higher-order polynomial terms of the features.

The reason is that you may have strong predictors for a model, and by using higher-order polynomial terms of those predictors you can get higher accuracy.

With H2O, you can create higher order polynomials as below:

  • Look for the ‘interactions’ parameter in the GLM model.
  • In the interactions parameter, add the list of predictor columns to interact.
When the model is built, all pairwise combinations from this list will be computed. The following is a working sample:
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

h2o.init()
boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
predictors = boston.columns[:-1]
response = "medv"

# pairwise interaction terms will be created among the listed columns
interactions_list = ['crim', 'dis']
boston_glm = H2OGeneralizedLinearEstimator(interactions = interactions_list)
boston_glm.train(x = predictors, y = response, training_frame = boston)
boston_glm.coef()
To explore interactions among categorical variables, use the h2o.interaction API:
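A minimal usage sketch (hypothetical frame and column names; factors lists the categorical columns to combine):

# create pairwise interaction columns between two categorical features
pairs = h2o.interaction(data = my_frame,
                        factors = ["cat_col_1", "cat_col_2"],
                        pairwise = True,
                        max_factors = 100,
                        min_occurrence = 1)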
That's all, enjoy!!