Setting stopping criteria in H2O K-means

Sometimes you may be looking for a k-means stopping criterion based on the number of reassigned observations within a cluster.

The H2O K-means implementation has the following two stopping criteria:

  1. The outer loop for estimate_k – stops when the relative reduction of the sum of within-centroid sums of squares is small enough
  2. Lloyd's iterations – stop when the relative fraction of reassigned points is small enough
In the H2O machine learning library you just need to set estimate_k to True and _max_iterations to a very high number, e.g. 100.
Using this combination, the algorithm will keep searching for the best suitable K until it converges or hits the iteration maximum. There are no other fine-tuning parameters available.

In R here is what you can do:

h2o.kmeans(x = predictors, k = 100, estimate_k = TRUE, standardize = FALSE,
           training_frame = train, validation_frame = valid, seed = 1234)

In Python here is what you can do:

from h2o.estimators.kmeans import H2OKMeansEstimator

iris_kmeans = H2OKMeansEstimator(k = 100, estimate_k = True, standardize = False, seed = 1234)
iris_kmeans.train(x = predictors, training_frame = train, validation_frame = valid)

In Java/Scala:

import hex.kmeans.KMeansModel;
KMeansModel.KMeansParameters params = new KMeansModel.KMeansParameters();
params._estimate_k = true;
params._max_iterations = 100;  // or a larger number

That’s it, enjoy!!

Scoring H2O model with TIBCO StreamBase

If you are using H2O models with StreamBase for scoring, here is what you have to do:

  1. Get the model as Java code (a POJO). If you train from Python, see the sketch after this list for one way to export it.
  2. Get h2o-genmodel.jar (download it from the H2O cluster).
    1. Alternatively, you can use the REST API (it works in every H2O version) to download h2o-genmodel.jar:
      curl http://localhost:54321/3/h2o-genmodel.jar > h2o-genmodel.jar
  3. Create the project in StreamBase and add the H2O model's Java source (the POJO) to it.
  4. Configure the H2O operator in StreamBase to use the POJO.
  5. Add h2o-genmodel.jar to the project's Java Build Path > Libraries.
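If you train your models from Python, here is a minimal sketch of steps 1 and 2; it assumes you are exporting from a live H2O session, and h2o.download_pojo writes the POJO .java source plus (with get_jar=True) h2o-genmodel.jar into the target directory:

import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator

h2o.init()
iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
model = H2OGradientBoostingEstimator()
model.train(x = iris.names[:-1], y = "class", training_frame = iris)

# Export the scoring artifacts for StreamBase: the POJO source and h2o-genmodel.jar
h2o.download_pojo(model, path = "/tmp/streambase_model", get_jar = True)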

After that you can use the H2O model in StreamBase.

That’s it, enjoy!!

Setting H2O FLOW directory path in Sparkling Water

Sometimes you may want to back up H2O FLOW files to some source code repo or to a backup location. For that reason you may want to change the default FLOW directory.

In H2O, the -flow_dir flag is used to set the local folder for FLOW files.

Note: You can always specify any H2O property by using system properties on the Spark driver/executors.

So to change the H2O FLOW directory, append the property to the Sparkling Water command line, for example when launching sparkling-shell:

bin/sparkling-shell --conf spark.driver.extraJavaOptions="-Dai.h2o.flow_dir=/your/backup/location"
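If you start Sparkling Water from a plain Python script (pysparkling) instead, you can pass the same system property through the Spark session builder. This is a rough sketch, assuming pysparkling is installed; driver JVM options only take effect if set before the driver JVM starts, so under spark-submit pass --conf on the command line instead:

from pyspark.sql import SparkSession
from pysparkling import H2OContext

spark = (SparkSession.builder
         .appName("h2o-flow-dir-example")
         .config("spark.driver.extraJavaOptions", "-Dai.h2o.flow_dir=/your/backup/location")
         .getOrCreate())

hc = H2OContext.getOrCreate()  # older Sparkling Water versions take the SparkSession as an argument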


That's it, thanks.

Generating a ROC curve in Scala from H2O binary classification models

Start from a binomial classification GLM model (built, for example, as described in an earlier blog post). To collect model metrics on the training data, use the following:
import water.support.ModelMetricsSupport
import hex.ModelMetricsBinomial

val trainMetrics = ModelMetricsSupport.modelMetrics[ModelMetricsBinomial](glmModel, train)
Now you can access the model's AUC object (_auc) as below. Note: the _auc object holds an array of thresholds, and for each threshold the corresponding fps and tps counts (use tab completion to list all of its members):
scala> trainMetrics._auc.
_auc   _gini      _n       _p     _tps      buildCM   defaultCM    defaultThreshold   forCriterion   frozenType   pr_auc   readExternal   reloadFromBytes   tn             tp      writeExternal   
_fps   _max_idx   _nBins   _ths   asBytes   clone     defaultErr   fn                 fp             maxF1        read     readJSON       threshold         toJsonString   write   writeJSON
In the above AUC object:

_fps  =  false positives at each threshold
_tps  =  true positives at each threshold
_ths  =  threshold values
_p    =  actual positives
_n    =  actual negatives
Now you can use the individual ROC-specific arrays as below to recreate the ROC curve:
trainMetrics._auc._fps
trainMetrics._auc._tps
trainMetrics._auc._ths
To print the whole array in the terminal for inspection, you just need the following:
val dd = trainMetrics._auc._fps
println(dd.mkString(" "))
You can also access the totals of actual positives (_p) and actual negatives (_n) directly:
scala> trainMetrics._auc._n
res42: Double = 2979.0

scala> trainMetrics._auc._p
res43: Double = 1711.0
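To make the arithmetic explicit: TPR = tps / p and FPR = fps / n at each threshold. Here is a small Python/numpy sketch of the reconstruction; the tps/fps arrays below are hypothetical placeholders, and in practice you would copy them out of trainMetrics._auc as shown above:

import numpy as np
import matplotlib.pyplot as plt

p, n = 1711.0, 2979.0                         # actual positives/negatives (values from above)
tps = np.array([1711.0, 1500.0, 900.0, 0.0])  # hypothetical true-positive counts per threshold
fps = np.array([2979.0, 1200.0, 300.0, 0.0])  # hypothetical false-positive counts per threshold

tpr = tps / p  # true positive rate at each threshold
fpr = fps / n  # false positive rate at each threshold

plt.plot(fpr, tpr)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.title("Reconstructed ROC")
plt.show()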
That's it, enjoy!!

How to regularize the intercept in GLM

Sometimes you may want to emulate hierarchical modeling to achieve your objective; you can use beta_constraints for that, as below:
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

h2o.init()
iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
bc = h2o.H2OFrame([("Intercept", -1000, 1000, 3, 30)],
                  column_names=["names", "lower_bounds", "upper_bounds", "beta_given", "rho"])
glm = H2OGeneralizedLinearEstimator(family = "gaussian",
                                    beta_constraints = bc,
                                    standardize = False)
glm.train(x = ["class", "sepal_wid", "petal_len", "petal_wid"], y = "sepal_len", training_frame = iris)
glm.coef()
The output will look like this:
{u'Intercept': 3.000933645168297,
 u'class.Iris-setosa': 0.0,
 u'class.Iris-versicolor': 0.0,
 u'class.Iris-virginica': 0.0,
 u'petal_len': 0.4423526962303227,
 u'petal_wid': 0.0,
 u'sepal_wid': 0.37712042938039897}
There's more information in the GLM booklet linked below, but the short version is to create a new constraints frame with the columns names, lower_bounds, upper_bounds, beta_given, and rho, and have a row entry per feature you want to constrain. You can use "Intercept" as a keyword to constrain the intercept.
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/GLMBooklet.pdf
names: (mandatory) coefficient names
lower_bounds: (optional) coefficient lower bounds; must be less than or equal to upper_bounds
upper_bounds: (optional) coefficient upper bounds; must be greater than or equal to lower_bounds
beta_given: (optional) specifies the given solution in the proximal operator interface
rho: (mandatory if beta_given is specified, otherwise ignored) specifies per-column L2 penalties on the distance from the given solution
If you want to go deeper into how these L1/L2 parameters work, here are more details. What is happening is that an L2 penalty is applied between each coefficient and its given value. The proximal penalty is computed as sum_i rho[i] * (beta[i] - beta_given[i])^2. You can confirm this by setting rho to whatever lambda would be and setting lambda to 0: this gives the same result as having set lambda to that value.
You can use beta constraints to assign per-feature regularization strength, but only for the L2 penalty. The math is:

sum_i rho[i] * (beta[i] - beta_given[i])^2

So if you set beta_given to zero and set rho to, say, 1e-5 for everything except the intercept, it is equivalent to running without beta constraints but with alpha = 0 and lambda = 1e-5.
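Here is a quick Python sketch of that equivalence check. It is a sketch under assumptions: lambda_ is the Python spelling of lambda, and exact coefficient agreement can depend on the H2O version and solver settings:

import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

h2o.init()
iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
x = ["sepal_wid", "petal_len", "petal_wid"]
y = "sepal_len"

# Plain ridge penalty: alpha = 0 (pure L2), lambda = 1e-5
ridge = H2OGeneralizedLinearEstimator(family = "gaussian", alpha = 0, lambda_ = 1e-5, standardize = False)
ridge.train(x = x, y = y, training_frame = iris)

# The same penalty expressed through beta_constraints: beta_given = 0 and rho = 1e-5
# for every feature, with lambda = 0 so only the proximal penalty is active
bc = h2o.H2OFrame([(name, 0, 1e-5) for name in x],
                  column_names = ["names", "beta_given", "rho"])
bc_glm = H2OGeneralizedLinearEstimator(family = "gaussian", lambda_ = 0,
                                       beta_constraints = bc, standardize = False)
bc_glm.train(x = x, y = y, training_frame = iris)

print(ridge.coef())
print(bc_glm.coef())  # should closely match the ridge coefficients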
That's it, enjoy!!

Creating a Partial Dependence Plot (PDP) in H2O

Starting with H2O 3.10.0.8, H2O includes a partial dependence plot API with a Java backend that multi-scores the dataset with the model. This makes creating PDPs much faster.

To get a PDP in H2O you need the model and the original data set used to build it. The core scoring step is an ordinary prediction over a copy of the original data (data.pdp below) in which the column of interest has been overwritten with a constant value:

response = h2o.predict(model, data.pdp)

Averaging those predictions for each constant value of the column gives the partial dependence for that column.
If you want to build the PDP manually from the model and data set, without using the PDP API, you can use the following code:
model = prostate.gbm
column_name = "AGE"
data.pdp = data.hex   # working copy of the original data
# use the 5%, 10%, ..., 100% quantiles of the column as grid points
bins = unique(h2o.quantile(data.hex[, column_name], probs = seq(0.05, 1, 0.05)))
mean_responses = c()

for (bin in bins) {
  data.pdp[, column_name] = bin             # hold the column of interest constant
  response = h2o.predict(model, data.pdp)   # score the whole modified frame
  mean_response = mean(response[, ncol(response)])
  mean_responses = c(mean_responses, mean_response)
}

pdp_manual = data.frame(AGE = bins, mean_response = mean_responses)
plot(pdp_manual, type = "l")
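If you would rather use the built-in PDP API mentioned at the top of this section, here is a rough Python sketch; in R the analogous call is h2o.partialPlot, and the method name and signature can vary across H2O versions:

import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator

h2o.init()
prostate = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/prostate/prostate.csv")
prostate["CAPSULE"] = prostate["CAPSULE"].asfactor()  # make it a binary classification target

gbm = H2OGradientBoostingEstimator()
gbm.train(x = ["AGE", "RACE", "PSA", "GLEASON"], y = "CAPSULE", training_frame = prostate)

# The Java backend multi-scores the frame with AGE swept over a grid of bins
pdp = gbm.partial_plot(data = prostate, cols = ["AGE"], plot = False)
print(pdp)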
That's it, enjoy!!

Grid Search for Naive Bayes in R using H2O

Here is R sample code showing how to perform a grid search for the Naive Bayes algorithm using the H2O machine learning platform:

# H2O
library(h2o)
library(ggplot2)
library(data.table)
 
# initialize the cluster with all available threads and 2 GB of memory
h2o.init(nthreads = -1, max_mem_size = "2g")
 
# Necessary variables (assumes R data frames `training` and `testing` already exist)
train.h2o <- as.h2o(training)
test.h2o  <- as.h2o(testing)
names(train.h2o)
str(train.h2o)
 
y <- 4        # index of the response column
x <- c(5:16)  # indices of the predictor columns
 
# specify the list of parameters
hyper_params <- list(
 laplace = c(0,0.5,1,2,3)
)
 
threshold = c(0.001, 0.00001, 0.0000001)  # defined here but not used in this grid
 
# perform the grid search
grid_id <- "nb_grid"
model_bayes_grid <- h2o.grid(
 algorithm = "naivebayes", # name of the algorithm
 grid_id = grid_id,
 training_frame = train.h2o,
 validation_frame = test.h2o,
 x = x,
 y = y,
 hyper_params = hyper_params
)
 
# find the best model and evaluate its performance
sort_metric <- 'accuracy'
sorted_models <- h2o.getGrid(
 grid_id = grid_id,
 sort_by = sort_metric,
 decreasing = TRUE
)
 
best_model <- h2o.getModel(sorted_models@model_ids[[1]])
best_model
 
h2o.confusionMatrix(best_model, valid = TRUE, metrics = 'accuracy')
 

auc <- h2o.auc(best_model, valid = TRUE)
fpr <- h2o.fpr( h2o.performance(best_model, valid = TRUE) )[['fpr']]
tpr <- h2o.tpr( h2o.performance(best_model, valid = TRUE) )[['tpr']]
ggplot( data.table(fpr = fpr, tpr = tpr), aes(fpr, tpr) ) +
 geom_line() + theme_bw()+ggtitle( sprintf('AUC: %f', auc) )
 

# To obtain the chosen regularization (laplace) value:
best_model@parameters
best_model@parameters$laplace

That's it, enjoy!!