Setting stopping criteria in H2O K-means

Sometimes you may be looking for a K-means stopping criterion based on the number of reassigned observations within a cluster.

The H2O K-means implementation has the following two stopping criteria:

  1. Outer loop for estimate_k – stops when the relative reduction of the sum of within-centroid sums of squares is small enough
  2. Lloyd's iteration – stops when the relative fraction of reassigned points is small enough

In the H2O machine learning library you just need to set estimate_k to True and set max_iterations to a fairly high number, e.g., 100. With this combination, the algorithm will keep searching for the best suitable K until it hits the iteration cap. There are no other fine-tuning parameters available.
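To make the second criterion concrete, here is a toy plain-Python sketch of Lloyd's iteration on 1-D points that stops once the fraction of reassigned points drops below a tolerance. This is illustrative only; the tolerance value and the data are assumptions, not H2O's actual implementation:

```python
import random

def lloyd_kmeans(points, k, tol=1e-3, max_iterations=100, seed=1234):
    """Toy Lloyd's iteration on 1-D points that stops when the
    relative fraction of reassigned points falls below `tol`."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignment = [None] * len(points)
    for _ in range(max_iterations):
        reassigned = 0
        for i, p in enumerate(points):
            nearest = min(range(k), key=lambda c: (p - centroids[c]) ** 2)
            if nearest != assignment[i]:
                reassigned += 1
                assignment[i] = nearest
        # Stopping criterion: relative fraction of reassigned points
        if reassigned / len(points) < tol:
            break
        # Recompute each centroid as the mean of its members
        for c in range(k):
            members = [p for i, p in enumerate(points) if assignment[i] == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, assignment
```

The outer estimate_k loop works analogously, but on the relative reduction of the total within-centroid sum of squares as K grows.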

In R here is what you can do:

h2o.kmeans(x = predictors, k = 100, estimate_k = TRUE, standardize = FALSE,
           training_frame = train, validation_frame = valid, seed = 1234)

In Python here is what you can do:

iris_kmeans = H2OKMeansEstimator(k = 100, estimate_k = True, standardize = False, seed = 1234)
iris_kmeans.train(x = predictors, training_frame = train, validation_frame=valid)

In Java/Scala:

_estimate_k = true
_max_iterations = 100 (or a larger number)

That’s it, enjoy!!

Calculating Standard Deviation using custom UDF and group by in H2O

Here is the full code to calculate standard deviation using the H2O group-by method as well as a custom UDF:

irisPath <- system.file("extdata", "iris_wheader.csv", package = "h2o")
iris.hex <- h2o.uploadFile(path = irisPath, destination_frame = "iris.hex")

# Calculating Standard Deviation using h2o group by
SdValue <- h2o.group_by(data = iris.hex, by = "class", sd("sepal_len"))

# Printing result

# Alternative defining a UDF for Standard Deviation
mySDUdf <- function(df) { sd(df[,1],na.rm = T) }

# Using h2o ddply with UDF
SdValue <- h2o.ddply(iris.hex, "class", mySDUdf)

# Printing result
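To see what the group-by is doing under the hood, here is a minimal plain-Python equivalent that computes the sample standard deviation (n-1 denominator, matching R's sd()) per group; the rows here are made up, not the iris frame:

```python
from collections import defaultdict
from math import sqrt

def sd_by_group(rows, group_col, value_col):
    """Sample standard deviation of `value_col` for each
    level of `group_col` (ddof = 1, like R's sd())."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_col]].append(row[value_col])
    result = {}
    for key, values in groups.items():
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
        result[key] = sqrt(var)
    return result
```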

That's it, enjoy!!

Calculate mean using UDF in H2O

Here is the full code to write a UDF that calculates the mean of a given data frame using the H2O machine learning platform:


ausPath <- system.file("extdata", "australia.csv", package="h2o")
australia.hex <- h2o.uploadFile(path = ausPath)

# Writing the UDF
myMeanUDF = function(Fr) { mean(Fr[, 1]) }

# Applying UDF using ddply
MeanValue = h2o.ddply(australia.hex[, c("premax", "salmax", "Max_czcs")], c("premax", "salmax"), myMeanUDF)

# Printing Results
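The h2o.ddply call above groups by two columns (premax, salmax) and applies the UDF to each group. Here is a minimal plain-Python sketch of that two-key grouping logic, using made-up rows rather than the australia data:

```python
from collections import defaultdict

def mean_by_groups(rows, group_cols, value_col):
    """Average `value_col` within each combination of `group_cols`."""
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, count]
    for row in rows:
        key = tuple(row[c] for c in group_cols)
        acc = sums[key]
        acc[0] += row[value_col]
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}
```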

That's it, enjoy!!

Anomaly Detection with Deep Learning in R with H2O

The following R script downloads the ECG dataset (training and validation) from the internet and performs deep-learning-based anomaly detection on it.

# Import ECG train and test data into the H2O cluster
train_ecg <- h2o.importFile(
 path = "", 
 header = FALSE, 
 sep = ",")
test_ecg <- h2o.importFile(
 path = "", 
 header = FALSE, 
 sep = ",")

# Train deep autoencoder learning model on "normal" 
# training data, y ignored 
anomaly_model <- h2o.deeplearning(
 x = names(train_ecg), 
 training_frame = train_ecg, 
 activation = "Tanh", 
 autoencoder = TRUE, 
 hidden = c(50,20,50), 
 sparse = TRUE,
 l1 = 1e-4, 
 epochs = 100)

# Compute reconstruction error with the Anomaly 
# detection app (MSE between output and input layers)
recon_error <- h2o.anomaly(anomaly_model, test_ecg)

# Pull reconstruction error data into R and 
# plot to find outliers (last 3 heartbeats)
recon_error <- as.data.frame(recon_error)
plot.ts(recon_error)

# Note: Testing = Reconstructing the test dataset
test_recon <- h2o.predict(anomaly_model, test_ecg) 

h2o.saveModel(anomaly_model, "/Users/avkashchauhan/learn/tmp/anomaly_model.bin")
h2o.download_pojo(anomaly_model, "/Users/avkashchauhan/learn/tmp/", get_jar = TRUE)

h2o.shutdown(prompt= FALSE)
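The reconstruction error that h2o.anomaly() reports is the per-row mean squared error between the autoencoder's input and its output. As a reference for what that number means, here is the same computation in plain Python on made-up vectors (not the ECG data):

```python
def reconstruction_mse(original, reconstructed):
    """Per-row mean squared error between each input row
    and its autoencoder reconstruction."""
    return [
        sum((a - b) ** 2 for a, b in zip(row_in, row_out)) / len(row_in)
        for row_in, row_out in zip(original, reconstructed)
    ]
```

Rows with unusually large error (e.g., above a high quantile of the training-set error) are the candidate anomalies.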



That's it, enjoy!!

Spark with H2O using rsparkling and sparklyr in R

You must have installed:

  • sparklyr
  • rsparkling


Here is the working script:

> options(rsparkling.sparklingwater.version = "2.1.6")
> Sys.setenv(SPARK_HOME = '/Users/avkashchauhan/tools/spark-2.1.0-bin-hadoop2.6')
> library(rsparkling)
> library(sparklyr)
> spark_disconnect(sc)
> sc <- spark_connect(master = "local", version = "2.1.0")

Inspecting the Spark context (sc) confirms the connection details: master "local[8]", connection method "shell", app name "sparklyr", 8 local cores, the Spark home at the spark-2.1.0-bin-hadoop2.6 install, the sparklyr config template under the sparklyr package directory, open backend and gateway socket connections to localhost, a Spark log file under the R temp directory, and handles to the underlying org.apache.spark.SparkContext and org.apache.spark.sql.SparkSession. The connection object has classes "spark_connection", "spark_shell_connection", and "DBIConnection".
Creating the H2O context (the first attempt below fails because of a stray st argument; the second call is correct):

> h2o_context(sc, st)
Error in is.H2OFrame(x) : object 'st' not found
> h2o_context(sc, strict_version_check = FALSE)
class org.apache.spark.h2o.H2OContext

Sparkling Water Context:
* H2O name: sparkling-water-avkashchauhan_1672148412
* cluster size: 1
* list of used nodes:
(executorId, host, port)

Open H2O Flow in browser: (CMD + click in Mac OSX)

You can use the following command to launch H2O FLOW:

h2o_flow(sc, strict_version_check = FALSE)

That's it, enjoy!!


Building a GBM model in R and exporting POJO and MOJO models

Get the dataset:



Here is the script to build the GBM model and export the POJO and MOJO models:


# Importing Dataset
trainfile <- file.path("/Users/avkashchauhan/learn/adult_2013_train.csv.gz")
adult_2013_train <- h2o.importFile(trainfile, destination_frame = "adult_2013_train")
testfile <- file.path("/Users/avkashchauhan/learn/adult_2013_test.csv.gz")
adult_2013_test <- h2o.importFile(testfile, destination_frame = "adult_2013_test")

# Display Dataset

# Feature Engineering
actual_log_wagp <- h2o.assign(adult_2013_test[, "LOG_WAGP"], key = "actual_log_wagp")

for (j in c("COW", "SCHL", "MAR", "INDP", "RELP", "RAC1P", "SEX", "POBP")) {
  adult_2013_train[[j]] <- as.factor(adult_2013_train[[j]])
  adult_2013_test[[j]] <- as.factor(adult_2013_test[[j]])
}

predset <- c("RELP", "SCHL", "COW", "MAR", "INDP", "RAC1P", "SEX", "POBP", "AGEP", "WKHP", "LOG_CAPGAIN", "LOG_CAPLOSS")

# Building GBM Model:
log_wagp_gbm_grid <- h2o.gbm(x = predset,
 y = "LOG_WAGP",
 training_frame = adult_2013_train,
 model_id = "GBMModel",
 distribution = "gaussian",
 max_depth = 5,
 ntrees = 110,
 validation_frame = adult_2013_test)


# Prediction 
h2o.predict(log_wagp_gbm_grid, adult_2013_test)

# Download POJO Model:
h2o.download_pojo(log_wagp_gbm_grid, "/Users/avkashchauhan/learn", get_genmodel_jar = TRUE)

# Download MOJO model:
h2o.download_mojo(log_wagp_gbm_grid, "/Users/avkashchauhan/learn", get_genmodel_jar = TRUE)

You will see the POJO and MOJO model files at the location where you saved them.

That's it, enjoy!


Saving H2O models from R/Python API in Hadoop Environment

When you are using H2O in a clustered environment (i.e., Hadoop), the machine where h2o.saveModel() is trying to write the model may be different from the one you expect, which is why you see the error "No such file or directory". If you just give a path, e.g., /tmp, and visit the machine where the H2O connection was initiated from R, you will see the model stored there.
Here is a good example to understand it better:
Step [1]: Starting the Hadoop driver in an EC2 environment:
[ec2-user@ip-10-0-104-179 ~]$ hadoop jar h2o- -nodes 2 -mapperXmx 2g -output /usr/ec2-user/005
Open H2O Flow in your web browser:  <=== H2O is started.
Note: the hadoop command above was run on one IP address; however, the node where the H2O server came up is a different one.
Step [2]: Connecting the R client to H2O
> h2o.init(ip = "", port = 54323, strict_version_check = FALSE)
Note: I used the IP address shown above to connect to the existing H2O cluster; however, the machine where I am running the R client is a different one, with its own IP address.
Step [3]: Saving H2O model:
h2o.saveModel(my.glm, path = "/tmp", force = TRUE)
So when I save the model, it is saved on the machine running H2O, even though the R client was running elsewhere:
[ec2-user@ip-10-0-65-248 ~]$ ll /tmp/GLM*
-rw-r--r-- 1 yarn hadoop 90391 Jun 2 20:02 /tmp/GLM_model_R_1496447892009_1
So you need to make sure you have access to the folder on the machine where the H2O service is running, or you can save the model to HDFS with something similar to the following:
h2o.saveModel(my.glm, path = "hdfs://", force = TRUE)

That's it, enjoy!!