Setting stopping criteria in H2O K-means

Sometimes you may be looking for a k-means stopping criterion based on the number of reassigned observations within a cluster.

The H2O K-means implementation has the following two stopping criteria:

  1. Outer loop for estimate_k – stops when the relative reduction of the sum-of-within-centroid-sum-of-squares is small enough
  2. Lloyd's iteration – stops when the relative fraction of reassigned points is small enough
In the H2O machine learning library you just need to set _estimate_k to True and then set _max_iterations to a very high number, e.g., 100.
With this combination, the algorithm will search for the best suitable k until it hits the maximum. There are no other fine-tuning parameters available.
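To make the second criterion concrete, here is a minimal pure-Python sketch of Lloyd's iteration (this is not H2O's implementation; `lloyd_kmeans` and its signature are illustrative only) that stops when the fraction of reassigned points drops to a tolerance:

```python
import random

def lloyd_kmeans(points, k, max_iterations=100, reassign_tol=0.0, seed=1234):
    """Plain Lloyd's iteration that stops when the fraction of
    reassigned points drops to reassign_tol (criterion 2 above)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignments = [None] * len(points)
    for _ in range(max_iterations):
        reassigned = 0
        # Assignment step: move each point to its nearest centroid.
        for i, p in enumerate(points):
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            if nearest != assignments[i]:
                reassigned += 1
                assignments[i] = nearest
        # Stopping criterion: relative fraction of reassigned points.
        if reassigned / len(points) <= reassign_tol:
            break
        # Update step: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignments
```

H2O additionally wraps an outer loop around this when estimate_k is enabled, re-running the iteration for growing k until criterion 1 is met.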

In R here is what you can do:

h2o.kmeans(x = predictors, k = 100, estimate_k = T, standardize = F,
                          training_frame = train, validation_frame=valid, seed = 1234)

In Python here is what you can do:

from h2o.estimators import H2OKMeansEstimator

iris_kmeans = H2OKMeansEstimator(k = 100, estimate_k = True, standardize = False, seed = 1234)
iris_kmeans.train(x = predictors, training_frame = train, validation_frame = valid)

In Java/Scala:

_estimate_k = true
_max_iterations = 100 (or a larger number)

That’s it, enjoy!!

Handling a YARN resource manager issue with decommissioned nodes

If you hit the following exception with your YARN resource manager:

ERROR/Exception:

17/07/31 15:06:13 WARN retry.RetryInvocationHandler: Exception while invoking class org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterNodes over rm1. Not retrying because try once and fail.
java.lang.ClassCastException: org.apache.hadoop.yarn.server.resourcemanager.NodesListManager$UnknownNodeId cannot be cast to org.apache.hadoop.yarn.api.records.impl.pb.NodeIdPBImpl

Troubleshooting:

Try running the following command and you should see the exact same exception:

$ yarn node -list -all

Root Cause:

This problem happens when your YARN cluster has decommissioned nodes, which can prevent dependent applications such as H2O from starting.

Solution:

Please make sure all the decommissioned nodes are either not listed or added back as full service nodes.
That’s it, enjoy!!

Scoring H2O model with TIBCO StreamBase

If you are using H2O models with StreamBase for scoring, here is what you have to do:

  1. Get the model as Java code (a POJO model).
  2. Get h2o-genmodel.jar (download it from the H2O cluster).
    1. Alternatively, you can use the REST API (works in every H2O version) to download h2o-genmodel.jar:
      curl http://localhost:54321/3/h2o-genmodel.jar > h2o-genmodel.jar
  3. Create the project in StreamBase and add the H2O model Java code (POJO) to the project.
  4. Change the H2O operator to use the POJO in StreamBase.
  5. Add h2o-genmodel.jar to the project's Java Build Path > Libraries.

After that you can use the H2O model in StreamBase.

That’s it, enjoy!!

Adding hyperparameters to the Deep Learning algorithm in H2O with Scala

The hidden layer configuration is a hyperparameter of H2O's Deep Learning algorithm. To tune it, use the "_hidden" parameter to specify the hidden layers as a hyperparameter, as below:
val hyperParms = collection.immutable.HashMap("_hidden" -> hidden_layers)

Here is the Scala code snippet to add hidden layers as a hyperparameter to an H2O Deep Learning grid search:

import _root_.hex.deeplearning.DeepLearningModel
import _root_.hex.deeplearning.DeepLearningModel.DeepLearningParameters
import water.Key

val dlParams = new DeepLearningParameters()
dlParams._train = airlinesData  // an H2OFrame loaded earlier
dlParams._activation =  DeepLearningModel.DeepLearningParameters.Activation.Tanh
dlParams._epochs = 1
dlParams._autoencoder = true

dlParams._ignore_const_cols = false
dlParams._stopping_rounds = 0

dlParams._score_training_samples = 0
dlParams._replicate_training_data = false
dlParams._standardize = true

import collection.JavaConversions._
val hidden_layers = Array(Array(1, 5, 1), Array(1, 6, 1), Array(1, 7, 1)).map(_.asInstanceOf[Object])
val hyperParms = collection.immutable.HashMap("_hidden" -> hidden_layers)

def let[A](in: A)(body: A => Unit) = {
    body(in)
    in
  }

import _root_.hex.grid.GridSearch
import _root_.hex.grid.HyperSpaceSearchCriteria.RandomDiscreteValueSearchCriteria
import _root_.hex.ScoreKeeper
val intRateHyperSpaceCriteria = let(new RandomDiscreteValueSearchCriteria) { c =>
  c.set_stopping_metric(ScoreKeeper.StoppingMetric.RMSE)
  c.set_stopping_tolerance(0.001)
  c.set_stopping_rounds(3)
  c.set_max_runtime_secs(40 * 60 /* seconds */)
  c.set_max_models(3)
}

val intRateGrid = GridSearch.startGridSearch(Key.make("intRateGridModel"),
                                                 dlParams,
                                                 hyperParms,
                                                 new GridSearch.SimpleParametersBuilderFactory[DeepLearningParameters],
                                                 intRateHyperSpaceCriteria).get()
val count = intRateGrid.getModelCount()
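The RandomDiscreteValueSearchCriteria used above samples hyperparameter combinations at random from the discrete grid and stops once max_models models have been built (or a runtime/metric limit is hit). As a rough illustration only, here is a pure-Python sketch of that sampling logic; `random_discrete_search` and `train_and_score` are hypothetical names, not part of H2O:

```python
import random

def random_discrete_search(hyper_space, train_and_score, max_models, seed=1234):
    """Randomly walk a discrete hyperparameter grid without replacement,
    stopping after max_models models have been built."""
    # Enumerate every combination in the grid (fine for small spaces).
    combos = [{}]
    for name in sorted(hyper_space):
        combos = [dict(c, **{name: v}) for c in combos for v in hyper_space[name]]
    random.Random(seed).shuffle(combos)  # random order, no repeats
    # Build and score at most max_models models.
    return [(params, train_and_score(params)) for params in combos[:max_models]]
```

With the three hidden-layer arrays above and max_models = 3, at most three of the candidate configurations would actually be trained.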
That's it, enjoy!!

Using H2O models in Java for scoring or prediction

This sample generates a GBM model from the R H2O library and then consumes the model in Java for prediction.

Here is the R script to generate a sample model using H2O:

setwd("/tmp/resources/")
library(h2o)
h2o.init()
df = iris
h2o_df = as.h2o(df)
y = "Species"
x = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
model = h2o.gbm(y = y, x = x, training_frame = h2o_df)
model
h2o.download_mojo(model, get_genmodel_jar = TRUE)

Here is the Java code to use the model for prediction:

import hex.genmodel.easy.RowData;
import hex.genmodel.easy.EasyPredictModelWrapper;
import hex.genmodel.easy.prediction.*;
import hex.genmodel.MojoModel;

public class main {
    static void printIt(String message, MultinomialModelPrediction p) {
        System.out.println("");
        System.out.println(message);
        for (int i = 0; i < p.classProbabilities.length; i++) {
            if (i > 0) {
                System.out.print(",");
            }
            System.out.print(p.classProbabilities[i]);
        }
        System.out.println("");
    }
    public static void main(String[] args) throws Exception {
        // Load the original MOJO, unzipped into the directory "unzipped_orig"
        EasyPredictModelWrapper model_orig = new EasyPredictModelWrapper(MojoModel.load("unzipped_orig"));
        {
            RowData row = new RowData();
            row.put("Sepal.Length", "1");
            row.put("Sepal.Width", "1");
            row.put("Petal.Length", "1");
            row.put("Petal.Width", "1");
            MultinomialModelPrediction p = model_orig.predictMultinomial(row);
            printIt("All 1s, orig", p);
        }
        {
            RowData row = new RowData();
            MultinomialModelPrediction p = model_orig.predictMultinomial(row);
            printIt("All NAs, orig", p);
        }
        {
            RowData row = new RowData();
            row.put("Sepal.Length", "1");
            row.put("sepwid", "1");
            row.put("Petal.Length", "1");
            row.put("Petal.Width", "1");

            MultinomialModelPrediction p = model_orig.predictMultinomial(row);
            printIt("Sepal width NA, orig", p);
        }
        // -------------------
        // Load the modified MOJO, unzipped into the directory "unzipped_modified"
        EasyPredictModelWrapper model_modified = new EasyPredictModelWrapper(MojoModel.load("unzipped_modified"));
        {
            RowData row = new RowData();
            row.put("Sepal.Length", "1");
            row.put("sepwid", "1");
            row.put("Petal.Length", "1");
            row.put("Petal.Width", "1");
            MultinomialModelPrediction p = model_modified.predictMultinomial(row);
            printIt("All 1s (with sepwid instead of Sepal.Width), modified", p);
        }
        {
            RowData row = new RowData();
            MultinomialModelPrediction p = model_modified.predictMultinomial(row);
            printIt("All NAs, modified", p);
        }
        {
            RowData row = new RowData();
            row.put("Sepal.Length", "1");
            row.put("Sepal.Width", "1");
            row.put("Petal.Length", "1");
            row.put("Petal.Width", "1");
            MultinomialModelPrediction p = model_modified.predictMultinomial(row);
            printIt("Sepal width NA (with Sepal.Width instead of sepwid), modified", p);
        }
    }
}

After the MOJO is downloaded and unzipped, you can see the model.ini as below:

[info]
h2o_version = 3.10.4.8
mojo_version = 1.20
license = Apache License Version 2.0
algo = gbm
algorithm = Gradient Boosting Machine
endianness = LITTLE_ENDIAN
category = Multinomial
uuid = 7712689150025610456
supervised = true
n_features = 4
n_classes = 3
n_columns = 5
n_domains = 1
balance_classes = false
default_threshold = 0.5
prior_class_distrib = [0.3333333333333333, 0.3333333333333333, 0.3333333333333333]
model_class_distrib = [0.3333333333333333, 0.3333333333333333, 0.3333333333333333]
timestamp = 2017-05-23T08:19:42.961-07:00
n_trees = 50
n_trees_per_class = 3
distribution = multinomial
init_f = 0.0
offset_column = null

[columns]
Sepal.Length
Sepal.Width
Petal.Length
Petal.Width
Species

[domains]
4: 3 d000.txt

If you decide to modify model.ini by renaming a column (e.g., Sepal.Width to SepWid), it looks as below:

[info]
h2o_version = 3.10.4.8
mojo_version = 1.20
license = Apache License Version 2.0
algo = gbm
algorithm = Gradient Boosting Machine
endianness = LITTLE_ENDIAN
category = Multinomial
uuid = 7712689150025610456
supervised = true
n_features = 4
n_classes = 3
n_columns = 5
n_domains = 1
balance_classes = false
default_threshold = 0.5
prior_class_distrib = [0.3333333333333333, 0.3333333333333333, 0.3333333333333333]
model_class_distrib = [0.3333333333333333, 0.3333333333333333, 0.3333333333333333]
timestamp = 2017-05-23T08:19:42.961-07:00
n_trees = 50
n_trees_per_class = 3
distribution = multinomial
init_f = 0.0
offset_column = null

[columns]
Sepal.Length
SepWid
Petal.Length
Petal.Width
Species

[domains]
4: 3 d000.txt
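The rename above can also be scripted. The following is a minimal sketch (not an official H2O utility; `rename_mojo_column` is a made-up helper) that rewrites one entry in the [columns] section of an unzipped MOJO's model.ini:

```python
def rename_mojo_column(ini_path, old_name, new_name):
    """Rewrite one entry in the [columns] section of a MOJO's model.ini."""
    with open(ini_path) as f:
        lines = f.read().splitlines()
    in_columns = False
    for i, line in enumerate(lines):
        if line.strip().startswith("["):
            # Track which section we are in; only touch [columns].
            in_columns = line.strip() == "[columns]"
        elif in_columns and line.strip() == old_name:
            lines[i] = new_name
    with open(ini_path, "w") as f:
        f.write("\n".join(lines) + "\n")
```

Only the [columns] section is touched, so identical strings elsewhere in the file (e.g., in [info]) are left alone.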

Now we can compile and run the Java code to test it as below:

$ javac -cp h2o-genmodel.jar main.java
$ java -cp .:h2o-genmodel.jar main

All 1s, orig
0.7998234476072545,0.15127335891610785,0.04890319347663747

All NAs, orig
0.009344361534466918,0.9813250958541073,0.009330542611425827

Sepal width NA, orig
0.7704658301004306,0.19829292017147707,0.03124124972809238

All 1s (with sepwid instead of Sepal.Width), modified
0.7998234476072545,0.15127335891610785,0.04890319347663747

All NAs, modified
0.009344361534466918,0.9813250958541073,0.009330542611425827

Sepal width NA (with Sepal.Width instead of sepwid), modified
0.7704658301004306,0.19829292017147707,0.03124124972809238
That's it, enjoy!!

How to regularize the intercept in GLM

Sometimes you may want to emulate hierarchical modeling to achieve your objective; you can use beta_constraints as below:

import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

h2o.init()
iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
bc = h2o.H2OFrame([("Intercept", -1000, 1000, 3, 30)],
                  column_names=["names", "lower_bounds", "upper_bounds", "beta_given", "rho"])
glm = H2OGeneralizedLinearEstimator(family="gaussian",
                                    beta_constraints=bc,
                                    standardize=False)
glm.train(y="sepal_len", training_frame=iris)
glm.coef()

The output will look like this:
{u'Intercept': 3.000933645168297,
 u'class.Iris-setosa': 0.0,
 u'class.Iris-versicolor': 0.0,
 u'class.Iris-virginica': 0.0,
 u'petal_len': 0.4423526962303227,
 u'petal_wid': 0.0,
 u'sepal_wid': 0.37712042938039897}
There’s more information in the GLM booklet linked below, but the short version is: create a new constraints frame with the columns names, lower_bounds, upper_bounds, beta_given, and rho, with one row per feature you want to constrain. You can use “Intercept” as a keyword to constrain the intercept.
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/GLMBooklet.pdf
- names: (mandatory) coefficient names
- lower_bounds: (optional) coefficient lower bounds; must be less than or equal to upper bounds
- upper_bounds: (optional) coefficient upper bounds; must be greater than or equal to lower bounds
- beta_given: (optional) specifies the given solution in the proximal operator interface
- rho: (mandatory if beta_given is specified, otherwise ignored) specifies per-column L2 penalties on the distance from the given solution
If you want to go deeper into how these L1/L2 parameters work, here are more details:
What’s happening is that an L2 penalty is applied between the coefficient and the given value. The proximal penalty is computed as Σ rho·(β − β_given)². You can confirm this by setting rho to whatever lambda would be and setting lambda to 0; this gives the same result as setting lambda to that value.
You can use beta constraints to assign per-feature regularization strength, but only for the L2 penalty. The math is:
sum_i rho[i] * L2norm2(beta[i] - beta_given[i])
So if you set beta_given to zero and set rho for everything except the intercept to 1e-5,
then it is equivalent to running without beta constraints, just with alpha = 0 and lambda = 1e-5.
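A quick numeric sanity check of that equivalence, in plain Python with no H2O involved: with beta_given = 0 and a constant rho, the proximal penalty reduces to the ordinary ridge penalty with lambda = rho.

```python
def proximal_penalty(beta, beta_given, rho):
    """sum_i rho[i] * (beta[i] - beta_given[i])^2"""
    return sum(r * (b - g) ** 2 for b, g, r in zip(beta, beta_given, rho))

def ridge_penalty(beta, lam):
    """Plain L2 (ridge) penalty: lam * sum_i beta[i]^2"""
    return lam * sum(b ** 2 for b in beta)

beta = [0.44, 0.0, 0.377]
# With beta_given = 0 and constant rho, the two penalties coincide.
assert abs(proximal_penalty(beta, [0.0] * 3, [1e-5] * 3) - ridge_penalty(beta, 1e-5)) < 1e-15
```

Varying rho per entry is exactly what gives you per-feature regularization strength.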
That's it, enjoy!!

Multinomial classification example in Scala and Deep Learning with H2O

Here is a sample multinomial classification problem using the H2O Deep Learning algorithm and the iris dataset, written in Scala.

This sample was created using Spark 2.1.0 with Sparkling Water 2.1.4.

import org.apache.spark.h2o._
import water.support.SparkContextSupport.addFiles
import org.apache.spark.SparkFiles
import java.io.File
import water.support.{H2OFrameSupport, SparkContextSupport, ModelMetricsSupport}
import water.Key
import _root_.hex.deeplearning.DeepLearningModel
import _root_.hex.ModelMetricsMultinomial


val hc = H2OContext.getOrCreate(sc)
import hc._
import hc.implicits._

addFiles(sc, "/Users/avkashchauhan/smalldata/iris/iris.csv")
val irisData = new H2OFrame(new File(SparkFiles.get("iris.csv")))

val ratios = Array[Double](0.8)
val keys = Array[String]("train.hex", "valid.hex")
val frs = H2OFrameSupport.split(irisData, keys, ratios)
val (train, valid) = (frs(0), frs(1))

def buildDLModel(train: Frame, valid: Frame, response: String,
                 epochs: Int = 10, l1: Double = 0.001, l2: Double = 0.0,
                 hidden: Array[Int] = Array[Int](200, 200))
                (implicit h2oContext: H2OContext): DeepLearningModel = {
  import h2oContext.implicits._
  // Build a model
  import _root_.hex.deeplearning.DeepLearning
  import _root_.hex.deeplearning.DeepLearningModel.DeepLearningParameters
  val dlParams = new DeepLearningParameters()
  dlParams._train = train
  dlParams._valid = valid
  dlParams._response_column = response
  dlParams._epochs = epochs
  dlParams._l1 = l1
  dlParams._l2 = l2
  dlParams._hidden = hidden
  // Create a job
  val dl = new DeepLearning(dlParams, Key.make("dlModel.hex"))
  dl.trainModel.get
}


// Note: The response column name is C5 here so passing:
val dlModel = buildDLModel(train, valid, 'C5)(hc)

// Collect model metrics and evaluate model quality
val trainMetrics = ModelMetricsSupport.modelMetrics[ModelMetricsMultinomial](dlModel, train)
val validMetrics = ModelMetricsSupport.modelMetrics[ModelMetricsMultinomial](dlModel, valid)
println(trainMetrics.rmse)
println(validMetrics.rmse)
println(trainMetrics.mse)
println(validMetrics.mse)
println(trainMetrics.r2)
println(validMetrics.r2)
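The RMSE and MSE printed above are directly related: RMSE is simply the square root of MSE. A small pure-Python illustration (not H2O code) of computing both from paired actual and predicted values:

```python
import math

def mse(actual, predicted):
    """Mean squared error over paired observations."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error: the square root of MSE."""
    return math.sqrt(mse(actual, predicted))

actual = [1.0, 0.0, 1.0, 1.0]
predicted = [0.9, 0.2, 0.8, 1.0]
assert abs(rmse(actual, predicted) - math.sqrt(mse(actual, predicted))) < 1e-12
```

So when comparing the train and validation printouts, a gap in MSE and a gap in RMSE always tell the same story, just on different scales.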

That's it, enjoy!!