Getting individual metrics from an H2O model in Python

You can get some of the individual model metrics for your model based on training and/or validation data. Here is the code snippet:

Note: Below I create a small test data frame, train an H2O Deep Learning model on it, and then show how to collect individual model metrics based on training and/or validation data.

import h2o
h2o.init(strict_version_check=False, port=54345)
from h2o.estimators.deeplearning import H2ODeepLearningEstimator
model = H2ODeepLearningEstimator()
rows = [[1,2,3,4,0], [2,1,2,4,1], [2,1,4,2,1], [0,1,2,34,1], [2,3,4,1,0]] * 50
fr = h2o.H2OFrame(rows)
X = fr.col_names[0:4]

## Classification Model
fr[4] = fr[4].asfactor()
model.train(x=X, y="C5", training_frame=fr)
print('Model Type:', model.type)
print('logloss', model.logloss(valid=False))
print('Accuracy', model.accuracy(valid=False))
print('AUC', model.auc(valid=False))
print('R2', model.r2(valid=False))
print('RMSE', model.rmse(valid=False))
print('Error', model.error(valid=False))
print('MCC', model.mcc(valid=False))

## Regression Model
fr = h2o.H2OFrame(rows)
model = H2ODeepLearningEstimator()  # fresh estimator; the response stays numeric, so this trains a regression model
model.train(x=X, y="C5", training_frame=fr)
print('Model Type:', model.type)
print('R2', model.r2(valid=False))
print('RMSE', model.rmse(valid=False))
 

Note: Because I did not pass a validation frame, I set valid=False to get the training metrics. If you pass a validation frame, you can set valid=True to get the validation metrics as well, as in the sketch below.
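For example, here is a minimal sketch (reusing rows and X from above) of training with a validation frame and then pulling the validation metrics:

fr = h2o.H2OFrame(rows)
fr[4] = fr[4].asfactor()                               # classification response again
train, valid = fr.split_frame(ratios=[0.8], seed=1234) # 80/20 train/validation split
model = H2ODeepLearningEstimator()
model.train(x=X, y="C5", training_frame=train, validation_frame=valid)
print('logloss (train)', model.logloss(valid=False))
print('logloss (valid)', model.logloss(valid=True))
print('AUC (valid)', model.auc(valid=True))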

If you want to see what is inside the model object, you can look at its parameters as below:

model.get_params()

That's it, enjoy!

 

Generating ROC curve in Scala from H2O binary classification models

You can use the following blog to build a binomial classification GLM model:
To collect model metrics for training, use the following:
val trainMetrics = ModelMetricsSupport.modelMetrics[ModelMetricsBinomial](glmModel, train)
Now you can access the model's AUC (the _auc object) as below:
Note: the _auc object has an array of thresholds, and for each threshold it has fps and tps
(use tab completion to list them all)
scala> trainMetrics._auc.
_auc   _gini      _n       _p     _tps      buildCM   defaultCM    defaultThreshold   forCriterion   frozenType   pr_auc   readExternal   reloadFromBytes   tn             tp      writeExternal   
_fps   _max_idx   _nBins   _ths   asBytes   clone     defaultErr   fn                 fp             maxF1        read     readJSON       threshold         toJsonString   write   writeJSON
In the above AUC object:
_fps  =  false positives
_tps  =  true positives
_ths  =  threshold values
_p    =  actual trues
_n    =  actual falses
Now you can use these individual ROC-specific values to recreate the ROC curve:
trainMetrics._auc._fps
trainMetrics._auc._tps
trainMetrics._auc._ths
To print the whole array in the terminal for inspection, you just need the following:
val dd = trainMetrics._auc._fps
println(dd.mkString(" "))
You can access the totals of actual trues (_p) and actual falses (_n) as below:
scala> trainMetrics._auc._n
res42: Double = 2979.0

scala> trainMetrics._auc._p
res43: Double = 1711.0
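To turn those raw counts into ROC coordinates, the math is simply TPR = TP / P and FPR = FP / N. Here is a minimal sketch in Python (to match the first section of this post), assuming you have copied the printed arrays into the lists tps and fps and the two totals into p and n; the values below are made up for illustration:

# hypothetical values copied from trainMetrics._auc._tps, ._fps, ._p and ._n
tps = [0.0, 512.0, 1200.0, 1711.0]   # true positives at each threshold
fps = [0.0, 150.0, 900.0, 2979.0]    # false positives at each threshold
p, n = 1711.0, 2979.0                # total actual positives / negatives

tpr = [tp / p for tp in tps]  # true positive rate (y-axis of the ROC curve)
fpr = [fp / n for fp in fps]  # false positive rate (x-axis of the ROC curve)

for x, y in zip(fpr, tpr):
    print(x, y)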
That's it, enjoy!

Multinomial classification example in Scala and Deep Learning with H2O

Here is a sample for a multinomial classification problem using the H2O Deep Learning algorithm and the iris data set, in Scala.

This sample was created using Spark 2.1.0 with Sparkling Water 2.1.4.

import org.apache.spark.h2o._
import water.support.SparkContextSupport.addFiles
import org.apache.spark.SparkFiles
import java.io.File
import water.support.{H2OFrameSupport, SparkContextSupport, ModelMetricsSupport}
import water.Key
import _root_.hex.deeplearning.DeepLearningModel
import _root_.hex.ModelMetricsMultinomial


val hc = H2OContext.getOrCreate(sc)
import hc._
import hc.implicits._

addFiles(sc, "/Users/avkashchauhan/smalldata/iris/iris.csv")
val irisData = new H2OFrame(new File(SparkFiles.get("iris.csv")))

val ratios = Array[Double](0.8)
val keys = Array[String]("train.hex", "valid.hex")
val frs = H2OFrameSupport.split(irisData, keys, ratios)
val (train, valid) = (frs(0), frs(1))

def buildDLModel(train: Frame, valid: Frame, response: String,
                 epochs: Int = 10, l1: Double = 0.001, l2: Double = 0.0,
                 hidden: Array[Int] = Array[Int](200, 200))
                (implicit h2oContext: H2OContext): DeepLearningModel = {
  import h2oContext.implicits._
  // Build a model
  import _root_.hex.deeplearning.DeepLearning
  import _root_.hex.deeplearning.DeepLearningModel.DeepLearningParameters
  val dlParams = new DeepLearningParameters()
  dlParams._train = train
  dlParams._valid = valid
  dlParams._response_column = response
  dlParams._epochs = epochs
  dlParams._l1 = l1
  dlParams._l2 = l2
  dlParams._hidden = hidden
  // Create a job
  val dl = new DeepLearning(dlParams, Key.make("dlModel.hex"))
  dl.trainModel.get
}


// Note: The response column name is C5 here, so we pass it explicitly:
val dlModel = buildDLModel(train, valid, 'C5)(hc)

// Collect model metrics and evaluate model quality
val trainMetrics = ModelMetricsSupport.modelMetrics[ModelMetricsMultinomial](dlModel, train)
val validMetrics = ModelMetricsSupport.modelMetrics[ModelMetricsMultinomial](dlModel, valid)
println(trainMetrics.rmse)
println(validMetrics.rmse)
println(trainMetrics.mse)
println(validMetrics.mse)
println(trainMetrics.r2)
println(validMetrics.r2)

That's it, enjoy!

Just upgraded to TensorFlow 1.0.1 and Keras 2.0.1

$ pip install --upgrade keras --user

Collecting keras
 Downloading Keras-2.0.1.tar.gz (192kB)
 100% |████████████████████████████████| 194kB 2.9MB/s
Requirement already up-to-date: theano in ./.local/lib/python2.7/site-packages (from keras)
Requirement already up-to-date: pyyaml in ./.local/lib/python2.7/site-packages (from keras)
Requirement already up-to-date: six in ./.local/lib/python2.7/site-packages (from keras)
Requirement already up-to-date: numpy>=1.7.1 in /usr/local/lib/python2.7/dist-packages (from theano->keras)
Collecting scipy>=0.11 (from theano->keras)
 Downloading scipy-0.19.0-cp27-cp27mu-manylinux1_x86_64.whl (45.0MB)
 100% |████████████████████████████████| 45.0MB 34kB/s
Building wheels for collected packages: keras
 Running setup.py bdist_wheel for keras ... done
 Stored in directory: /home/avkash/.cache/pip/wheels/fa/15/f9/57473734e407749529bf55e6b5038640dc7279d5718b2c368a
Successfully built keras
Installing collected packages: keras, scipy
 Found existing installation: Keras 1.2.2
 Uninstalling Keras-1.2.2:
 Successfully uninstalled Keras-1.2.2
Successfully installed keras-2.0.1 scipy-0.19.0

$ pip install --upgrade tensorflow-gpu --user

Collecting tensorflow-gpu
 Downloading tensorflow_gpu-1.0.1-cp27-cp27mu-manylinux1_x86_64.whl (94.8MB)
 100% |████████████████████████████████| 94.8MB 16kB/s
Requirement already up-to-date: mock>=2.0.0 in /usr/local/lib/python2.7/dist-packages (from tensorflow-gpu)
Requirement already up-to-date: numpy>=1.11.0 in /usr/local/lib/python2.7/dist-packages (from tensorflow-gpu)
Requirement already up-to-date: protobuf>=3.1.0 in /usr/local/lib/python2.7/dist-packages (from tensorflow-gpu)
Requirement already up-to-date: wheel in /usr/lib/python2.7/dist-packages (from tensorflow-gpu)
Requirement already up-to-date: six>=1.10.0 in ./.local/lib/python2.7/site-packages (from tensorflow-gpu)
Requirement already up-to-date: funcsigs>=1; python_version < "3.3" in /usr/local/lib/python2.7/dist-packages (from mock>=2.0.0->tensorflow-gpu)
Requirement already up-to-date: pbr>=0.11 in ./.local/lib/python2.7/site-packages (from mock>=2.0.0->tensorflow-gpu)
Requirement already up-to-date: setuptools in ./.local/lib/python2.7/site-packages (from protobuf>=3.1.0->tensorflow-gpu)
Requirement already up-to-date: appdirs>=1.4.0 in ./.local/lib/python2.7/site-packages (from setuptools->protobuf>=3.1.0->tensorflow-gpu)
Requirement already up-to-date: packaging>=16.8 in ./.local/lib/python2.7/site-packages (from setuptools->protobuf>=3.1.0->tensorflow-gpu)
Requirement already up-to-date: pyparsing in ./.local/lib/python2.7/site-packages (from packaging>=16.8->setuptools->protobuf>=3.1.0->tensorflow-gpu)
Installing collected packages: tensorflow-gpu
Successfully installed tensorflow-gpu-1.0.1

$ python -c 'import keras as tf; print tf.__version__'

Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
2.0.1

$ python -c 'import tensorflow as tf; print tf.__version__'

I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
1.0.1

Deep Learning session from Google Next 2017

TensorFlow and Deep Learning without a PhD:

With TensorFlow, deep machine learning transitions from an area of research to mainstream software engineering. In this video, Martin Gorner demonstrates how to construct and train a neural network that recognizes handwritten digits. Along the way, he’ll describe some “tricks of the trade” used in neural network design, and finally, he’ll bring the recognition accuracy of his model above 99%.

Part 1:

Part 2:

 

Version mismatch error during H2O initialization

When you initialize H2O in R or Python as below:

> h2o.init()

You may see the following error:

Error in h2o.init(nthreads = -1) : Version mismatch! H2O is running version 3.11.0.99999 but h2o-R package is version 3.10.4.1.
This is a developer build, please contact your developer

You are seeing this error because R is trying to connect to an existing running H2O instance instead of starting a new instance from your R package.

If you look for H2O running in your environment, you will find one. You can also try the following command to see the running h2o process:

$ ps -ef | grep h2o

Now if you shut down the running H2O instance and re-run the above h2o.init() command, you will see that a new instance of H2O is started.

The other option is to disable the strict version check and just connect to the running H2O instance as below (you can disable version checking in both R and Python):

> h2o.init(strict_version_check = FALSE)
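In Python, the same option looks like this:

import h2o
h2o.init(strict_version_check=False)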

Treatment of categorical variables in H2O’s DRF algorithm


In DRF, categorical_encoding is exposed, and the explanation is here:

http://h2o-release.s3.amazonaws.com/h2o/rel-tutte/2/docs-website/h2o-docs/data-science/algo-params/categorical_encoding.html.

Question: What is the meaning of AUTO (let the algorithm decide) in DRF?

Answer: Based on the link from our source: https://github.com/h2oai/h2o-3/blob/405f5639360e1977027a04cc8f99da239c460907/h2o-docs/src/product/data-science/algo-params/categorical_encoding.rst

GBM/DRF/K-Means: auto or AUTO: Allow the algorithm to decide (default). For GBM, DRF, and K-Means, the algorithm will perform Enum encoding when auto option is specified.

Question: Could you explain how eigen encoding works, i.e. have you a good online reference?

Answer: eigen or Eigen: k columns per categorical feature, keeping only the projections of the one-hot-encoded matrix onto the k-dimensional eigen space. Eigen uses k=1 only for now.
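To make the two answers above concrete, here is a minimal sketch in Python of setting the encoding explicitly instead of relying on AUTO; the frame train, the predictor list predictors, and the response column response are hypothetical placeholders:

from h2o.estimators.random_forest import H2ORandomForestEstimator

# "enum" is what AUTO resolves to for DRF; "eigen" keeps k=1 eigen-space projections of the one-hot matrix
drf_enum = H2ORandomForestEstimator(ntrees=50, categorical_encoding="enum")
drf_eigen = H2ORandomForestEstimator(ntrees=50, categorical_encoding="eigen")

# drf_enum.train(x=predictors, y=response, training_frame=train)
# drf_eigen.train(x=predictors, y=response, training_frame=train)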

Question: Are there any recommended techniques for randomising the ordering of the categoricals? Let’s say that the categoricals are US states and that large discriminative power comes from separating Alabama and Alaska, but no discrimination comes from separating {AL, AK} from the rest. With nbins_cats set to 5, say (a compromise across all categoricals), it is likely that the grouping for {AL, AK} vs the other states won’t ever be selected, hence AL will never be considered separately from AK. Obviously in this particular case we can engineer the data, but in general this could be a problem.

Following the link you give, the docs say

enum or Enum: Leave the dataset as is, internally map the strings to integers, and use these integers to make splits – either via ordinal nature when nbins_cats is too small to resolve all levels or via bitsets that do a perfect group split.

I wonder what is meant by ‘bitsets that do a perfect group split’? I have noticed this by examining the model POJO output, though I cannot find this behaviour documented. If the categories are letters of the English alphabet, a=1,…,z=26, then I’ve noticed that groups can be split by appropriate bags of letters (e.g. a split might send a, e, f, x one way with the other letters and NA going the other way). Clearly it cannot be doing an exhaustive search over all possible combinations of letters a-z to form the optimal group. But neither is it only looking at the ordinal values.

Answer: If nbins_cats is 5 for 52 categorical levels, then there won’t be any bitsets used for splitting the categorical levels. Instead, nbins_cats (5) split points will be considered for splitting the levels, for example:

  • {A … D} Left vs {E … Z} Right
  • {A … M} Left vs {N … Z} Right

The 5 split points are uniformly spaced across all levels present in the node (at the root, that’s A … Z) – those are simple “less-than” splits in the integer space of the levels.

If one of the nbins_cats splits ends up being the best split for the given tree node (across all selected columns of the data), then the split decision at the next level of the tree will have fewer categorical levels to split, and so on. For example, the left node might contain only {A … D} (assuming the first split point was chosen above).

This smaller set of levels can now be resolved with nbins_cats = 5, and then a bitset split is created that looks like this:

  • A : Left
  • B : Right
  • C : Right
  • D : Left

Yes, this is optimal (for every level, we know the training data behavior and can choose to send the points left or right), but without doing an exhaustive search.

The point here is that nbins_cats is an important tuning parameter as it will lead to “perfect” splits once it’s big enough to resolve the categorical levels. Otherwise, you have to hope that the “less-than” splits will lead to good-enough separation to eventually get to the perfect bitsets.
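To make that concrete, here is a minimal Python sketch comparing a small and a large nbins_cats on the same data; as before, train, predictors, and response are hypothetical placeholders:

from h2o.estimators.random_forest import H2ORandomForestEstimator

# Small nbins_cats: only a handful of "less-than" split points over the categorical levels
drf_small = H2ORandomForestEstimator(ntrees=50, nbins_cats=5)

# Large nbins_cats: enough bins to resolve every level, enabling the "perfect" bitset splits
drf_large = H2ORandomForestEstimator(ntrees=50, nbins_cats=1024)

# drf_small.train(x=predictors, y=response, training_frame=train)
# drf_large.train(x=predictors, y=response, training_frame=train)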