Setting stopping criteria in H2O K-means

Sometimes you may be looking for a k-means stopping criterion based on the “number of reassigned observations within cluster”.

The H2O K-means implementation has the following two stopping criteria:

  1. Outer loop for estimate_k – stops when the relative reduction of the sum-of-within-centroid-sum-of-squares is small enough
  2. Lloyd's iterations – stop when the relative fraction of reassigned points is small enough

In the H2O machine learning library you just need to set _estimate_k to True and set _max_iterations to a reasonably high number, e.g. 100.
Using this combination, the algorithm will keep searching for the best suitable K until it hits that maximum. There are no other fine-tuning parameters available.

In R here is what you can do:

h2o.kmeans(x = predictors, k = 100, estimate_k = T, standardize = F,
                          training_frame = train, validation_frame=valid, seed = 1234)

In Python here is what you can do:

iris_kmeans = H2OKMeansEstimator(k = 100, estimate_k = True, standardize = False, seed = 1234)
iris_kmeans.train(x = predictors, training_frame = train, validation_frame=valid)

In Java/Scala:

_estimate_k = true
_max_iterations = 100 (or a larger number)
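
For completeness, here is a Python sketch that sets both options explicitly on the estimator (the parameter names estimate_k and max_iterations follow the H2O Python API; the predictors, train, and valid frames are assumed from the examples above):

kmeans_model = H2OKMeansEstimator(k = 100,               # upper bound on the number of clusters
                                  estimate_k = True,     # let H2O search for the best K up to k
                                  max_iterations = 100,  # cap on the Lloyd's iterations
                                  standardize = False, seed = 1234)
kmeans_model.train(x = predictors, training_frame = train, validation_frame = valid)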

That’s it, enjoy!!

Handling a YARN resource manager issue with decommissioned nodes

If you hit the following exception with your YARN resource manager:

ERROR/Exception:

17/07/31 15:06:13 WARN retry.RetryInvocationHandler: Exception while invoking class org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterNodes over rm1. Not retrying because try once and fail.
java.lang.ClassCastException: org.apache.hadoop.yarn.server.resourcemanager.NodesListManager$UnknownNodeId cannot be cast to org.apache.hadoop.yarn.api.records.impl.pb.NodeIdPBImpl

Troubleshooting:

Please try running the following command and you will see the exact same exception:

$ yarn node -list -all

Root Cause:

This problem happens when your YARN cluster has decommissioned nodes, and it can cause dependent applications such as H2O to fail to start.

Solution:

Please make sure all decommissioned nodes are either removed from the node list or added back as fully serviced nodes (typically by updating the YARN include/exclude node files and running yarn rmadmin -refreshNodes).
That’s it, enjoy!!

Scoring H2O model with TIBCO StreamBase

If you are using H2O models with StreamBase for scoring, this is what you have to do:

  1. Get the model as Java code (POJO model)
  2. Get the h2o-genmodel.jar (download it from the H2O cluster)
    1. Alternatively, you can use the REST API (works in every H2O version) to download h2o-genmodel.jar, or the Python API (see the sketch after this list):
      curl http://localhost:54321/3/h2o-genmodel.jar > h2o-genmodel.jar
  3. Create the project in StreamBase and add the H2O model Java code (POJO) to the project
  4. Change the H2O operator to use the POJO in StreamBase
  5. Add h2o-genmodel.jar to the project's Java Build Path > Libraries

After that you can use the H2O model in StreamBase.
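
If you prefer to fetch both artifacts programmatically, here is a Python sketch (it assumes a trained H2O model handle named model from the current session and uses h2o.download_pojo, which saves the POJO .java file and, with get_jar = True, the matching h2o-genmodel.jar):

import h2o
# `model` is assumed to be a model trained earlier in this Python session
h2o.download_pojo(model, path = ".", get_jar = True)  # writes the POJO .java file plus h2o-genmodel.jar

Both files then go into the StreamBase project as described in steps 3-5 above.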

That’s it, enjoy!!

Calculating standard deviation using a custom UDF and group_by in H2O

Here is the full code to calculate standard deviation using the H2O group_by method as well as a custom UDF:

library(h2o)
h2o.init()
irisPath <- system.file("extdata", "iris_wheader.csv", package = "h2o")
iris.hex <- h2o.uploadFile(path = irisPath, destination_frame = "iris.hex")

# Calculating Standard Deviation using h2o group by
SdValue <- h2o.group_by(data = iris.hex, by = "class", sd("sepal_len"))

# Printing result
SdValue

# Alternative: defining a UDF for standard deviation
mySDUdf <- function(df) { sd(df[,1],na.rm = T) }

# Using h2o ddply with UDF
SdValue <- h2o.ddply(iris.hex, "class", mySDUdf)

# Printing result
SdValue

That's it, enjoy!!

Calculating the mean using a UDF in H2O

Here is the full code for a UDF that calculates the mean of a given data frame using the H2O machine learning platform:


library(h2o)
h2o.init()
ausPath <- system.file("extdata", "australia.csv", package="h2o")
australia.hex <- h2o.uploadFile(path = ausPath)

# Writing the UDF
myMeanUDF = function(Fr) { mean(Fr[, 1]) }

# Applying UDF using ddply
MeanValue = h2o.ddply(australia.hex[, c("premax", "salmax", "Max_czcs")], c("premax", "salmax"), myMeanUDF)

# Printing Results
MeanValue

That's it, enjoy!!

Getting individual metrics from H2O model in Python

You can get some of the individual model metrics for your model based on training and/or validation data. Here is the code snippet:

Note: I am creating a test data frame to run the H2O Deep Learning algorithm, and then showing below how to collect individual model metrics based on training and/or validation data.

import h2o
h2o.init(strict_version_check = False, port = 54345)
from h2o.estimators.deeplearning import H2ODeepLearningEstimator
model = H2ODeepLearningEstimator()
rows = [[1,2,3,4,0], [2,1,2,4,1], [2,1,4,2,1], [0,1,2,34,1], [2,3,4,1,0]] * 50
fr = h2o.H2OFrame(rows)
X = fr.col_names[0:4]

## Classification Model
fr[4] = fr[4].asfactor()
model.train(x=X, y="C5", training_frame=fr)
print('Model Type:', model.type)
print('logloss', model.logloss(valid = False))
print('Accuracy', model.accuracy(valid = False))
print('AUC', model.auc(valid = False))
print('R2', model.r2(valid = False))
print('RMSE', model.rmse(valid = False))
print('Error', model.error(valid = False))
print('MCC', model.mcc(valid = False))

## Regression Model
fr = h2o.H2OFrame(rows)
model.train(x=X, y="C5", training_frame=fr)
print('Model Type:', model.type)
print('R2', model.r2(valid = False))
print('RMSE', model.rmse(valid = False))

Note: Because I did not pass a validation frame, I set valid = False to get training metrics. If you pass a validation frame, you can set valid = True to get validation metrics as well.
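
For example, here is a sketch of the validation case (split_frame, validation_frame, and valid = True are standard H2O Python API usage; the 80/20 split and seed are just for illustration):

# Split the frame so the model is trained with a validation_frame
train_fr, valid_fr = fr.split_frame(ratios = [0.8], seed = 1234)
model = H2ODeepLearningEstimator()
model.train(x = X, y = "C5", training_frame = train_fr, validation_frame = valid_fr)

# With a validation frame present, valid = True returns validation metrics
print('RMSE (train)', model.rmse(valid = False))
print('RMSE (valid)', model.rmse(valid = True))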

If you want to see what is inside the model object, you can look at its parameters as below:

model.get_params()

That's it, enjoy!!


Adding a hyperparameter to the Deep Learning algorithm in H2O with Scala

The hidden-layer configuration is a hyperparameter of the Deep Learning algorithm in H2O. To tune it in an H2O-based deep learning grid search, use the “_hidden” parameter to specify the hidden layers as a hyperparameter, as below:

val hyperParms = collection.immutable.HashMap("_hidden" -> hidden_layers)

Here is the Scala code snippet to add hidden layers as a hyperparameter to an H2O Deep Learning grid search:

// Assuming the standard H2O-3 classes; adjust package paths to match your H2O version
import _root_.hex.deeplearning.DeepLearningModel
import _root_.hex.deeplearning.DeepLearningModel.DeepLearningParameters
import _root_.water.Key

val dlParams = new DeepLearningParameters()
dlParams._train = airlinesData
dlParams._activation =  DeepLearningModel.DeepLearningParameters.Activation.Tanh
dlParams._epochs = 1
dlParams._autoencoder = true

dlParams._ignore_const_cols = false
dlParams._stopping_rounds = 0

dlParams._score_training_samples = 0
dlParams._replicate_training_data = false
dlParams._standardize = true

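// Build the hyperparameter space: each inner Array is one candidate value for _hidden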
import collection.JavaConversions._
val hidden_layers = Array(Array(1, 5, 1), Array(1, 6, 1), Array(1, 7, 1)).map(_.asInstanceOf[Object])
val hyperParms = collection.immutable.HashMap("_hidden" -> hidden_layers)

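// Small helper that applies a configuration block to an object and then returns that object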
def let[A](in: A)(body: A => Unit) = {
  body(in)
  in
}

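// Random discrete search criteria: stop after 3 models, 40 minutes, or 3 rounds without RMSE improvement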
import _root_.hex.grid.GridSearch
import _root_.hex.grid.HyperSpaceSearchCriteria.RandomDiscreteValueSearchCriteria
import _root_.hex.ScoreKeeper
val intRateHyperSpaceCriteria = let(new RandomDiscreteValueSearchCriteria) { c =>
  c.set_stopping_metric(ScoreKeeper.StoppingMetric.RMSE)
  c.set_stopping_tolerance(0.001)
  c.set_stopping_rounds(3)
  c.set_max_runtime_secs(40 * 60 /* seconds */)
  c.set_max_models(3)
}

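// Launch the grid search over the _hidden hyperparameter space and block until it finishes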
val intRateGrid = GridSearch.startGridSearch(Key.make("intRateGridModel"),
                                                 dlParams,
                                                 hyperParms,
                                                 new GridSearch.SimpleParametersBuilderFactory[DeepLearningParameters],
                                                 intRateHyperSpaceCriteria).get()
val count = intRateGrid.getModelCount()

That's it, enjoy!!