Exploring & transforming H2O Data Frame in R and Python

Sometimes you need to ingest a dataset for model building, and your first task is to explore all of its features and their types. Once that is done, you may want to change some feature types to the ones you need.

Here is the code snippet in Python:

import h2o
h2o.init()
df = h2o.import_file('https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate.csv')
df.types
{    u'AGE': u'int', u'CAPSULE': u'int', u'DCAPS': u'int', 
     u'DPROS': u'int', u'GLEASON': u'int', u'ID': u'int',
     u'PSA': u'real', u'RACE': u'int', u'VOL': u'real'
}
If you would like to visualize all the features in graphical format you can do the following:
import pylab as pl
df.as_data_frame().hist(figsize=(20,20))
pl.show()
In a Jupyter notebook, the result is a grid of histograms, one per feature:
[Screenshot: histograms of the data frame features]
Note: If you have more than 50 features, you may have to trim your data frame to fewer features to get an effective visualization.
Next, you may need to change some column types. You can use the following function to convert a list of columns to factor/categorical by passing an H2O data frame and a list of column names:
def convert_columns_as_factor(hdf, column_list):
    # hdf is an H2O data frame; column_list holds the names of the columns to convert
    if len(column_list) == 0:
        return "Error: You don't have a list of binary columns."
    if len(hdf.columns) == 0:
        return "Error: You don't have any columns in your data frame."
    local_column_list = hdf.columns
    for column in column_list:
        if column in local_column_list:
            # asfactor() returns a new column, so assign it back to the frame
            hdf[column] = hdf[column].asfactor()
            print('Column ' + column + " is converted into factor/categorical.")
        else:
            print('Error: ' + str(column) + " not found in the data frame.")
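
Before calling a helper like the one above, it can be handy to check which of the requested names actually exist. Here is a minimal sketch in plain Python, assuming the column names are available as a list of strings. The name `split_known_columns` is hypothetical and not part of H2O.

```python
def split_known_columns(frame_columns, requested):
    """Partition the requested names into (present, missing), keeping their order."""
    known = set(frame_columns)
    present = [name for name in requested if name in known]
    missing = [name for name in requested if name not in known]
    return present, missing


# Column names as listed by df.types for the prostate dataset.
prostate_columns = ["ID", "CAPSULE", "AGE", "RACE", "DPROS",
                    "DCAPS", "PSA", "VOL", "GLEASON"]

# "SEX" is a deliberately wrong name, used only to show the missing case.
present, missing = split_known_columns(prostate_columns, ["CAPSULE", "RACE", "SEX"])
print("Will convert:", present)
print("Not found:", missing)
```

Only the `present` list would then be passed on for conversion, so a typo in one name does not interrupt the rest.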

The following R script performs similar tasks:

library(h2o)
h2o.init()
N=100
set.seed(999)
color = sample(c("D","E","I","F","M"),size=N,replace=TRUE)
num = rnorm(N,mean = 12,sd = 21212)
sex = sample(c("male","female"),size=N,replace=TRUE)
sex = as.factor(sex)
color = as.factor(color)
data = sample(c(0,1),size = N,replace = T)
fdata = factor(data)
table(fdata)
dd = data.frame(color,sex,num,fdata)
data = as.h2o(dd)
str(data)
data$sex = h2o.setLevels(x = data$sex ,levels = c("F","M"))
data
That's it, enjoy!!

H2O Word2Vec Tutorial with example in Scala

If you would like to know what word2vec is and why you should use it, there is lots of material available to read. You can learn more about the H2O implementation of Word2Vec here, along with its configuration and interpretation.

In this Scala example we will use the H2O Word2Vec algorithm to build a model from a given text (supplied as a text file or an Array).

The full Scala code for this example is on my GitHub.

Let's start the H2O cluster first:

import org.apache.spark.h2o._
val h2oContext = H2OContext.getOrCreate(spark)

Now we will import the libraries required to get the job done:

import scala.io.Source
import _root_.hex.word2vec.{Word2Vec, Word2VecModel}
import _root_.hex.word2vec.Word2VecModel.Word2VecParameters
import water.fvec.Vec

Now we will create a list of stop words, which are not useful for text mining and will be removed from the word source:

val STOP_WORDS = Set("ourselves", "hers", "between", "yourself", "but", "again", "there", "about", 
    "once", "during", "out", "very", "having", "with", "they", "own", "an", "be", "some", "for", "do", 
    "its", "yours", "such", "into", "of", "most", "itself", "other", "off", "is", "s", "am", "or", "who", "as", 
     "from", "him", "each", "the", "themselves", "until", "below", "are", "we", "these", "your", "his", "through", "don", "nor", "me", "were", "her", 
    "more", "himself", "this", "down", "should", "our", "their", "while", "above", "both", "up", 
    "to", "ours", "had", "she", "all", "no", "when", "at", "any", "before", "them", "same", "and", "been", "have", "in", "will", "on", "does", "yourselves", "then", "that", "because", "what", "over", "why", "so", "can", 
    "did", "not", "now", "under", "he", "you", "herself", "has", "just", "where", "too", "only", "myself", "which", "those", "i", "after", "few", "whom", "t", "being", "if", "theirs", "my", "against", "a", "by", "doing", 
    "it", "how", "further", "was", "here", "than")

Now let's ingest the text data. We will first run the Word2Vec algorithm to vectorize the data, which can then be used in machine learning experiments.

I downloaded a free story, “The Adventures of Sherlock Holmes”, from the Internet and am using it as my source.

val filename = "/Users/avkashchauhan/Downloads/TheAdventuresOfSherlockHolmes.txt"
val lines = Source.fromFile(filename).getLines.toArray
val sparkframe = sc.parallelize(lines)

Now let's define the tokenize function, which will convert our input text to tokens:

def tokenize(line: String) = {
  // Split on non-word characters (such as punctuation) rather than just " "
  line.split("""\W+""")
    .map(_.toLowerCase)
    // Remove the stop words defined above, then append null to mark the end of the line
    .filterNot(word => STOP_WORDS.contains(word)) :+ null
}
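
If you want to prototype the tokenizer outside Spark, the same steps can be sketched in plain Python. This is only an illustration and not part of the Scala pipeline: the function name `tokenize_line` and the shortened stop-word set are assumptions, and unlike the Scala version it also drops empty strings produced by the split.

```python
import re

# Shortened, illustrative stop-word set (the Scala list above is much longer).
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is", "it", "was"}


def tokenize_line(line):
    """Split on non-word characters, lowercase, drop stop words, end with None."""
    words = [w.lower() for w in re.split(r"\W+", line) if w]
    kept = [w for w in words if w not in STOP_WORDS]
    # None plays the role of the trailing null in the Scala version:
    # it marks where one line of text ends and the next begins.
    return kept + [None]


tokens = tokenize_line("To Sherlock Holmes, she is always the woman.")
print(tokens)
```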

Now we will call the tokenize function to create a list of labeled words:

val allLabelledWords = sparkframe.flatMap(d => tokenize(d))

Note: You can also use your own tokenize function or one from a library; you just need to map the function over the data.

Now let's convert the collection of labeled words into an H2O frame:

val h2oFrame = h2oContext.asH2OFrame(allLabelledWords)

Now it is time to use the H2O Word2Vec algorithm, configuring its parameters first:

val w2vParams = new Word2VecParameters
w2vParams._train = h2oFrame._key
w2vParams._epochs = 500
w2vParams._min_word_freq = 0
w2vParams._init_learning_rate = 0.05f
w2vParams._window_size = 20
w2vParams._vec_size = 20
w2vParams._sent_sample_rate = 0.0001f

Now we will perform the real action, building the model:

val w2v = new Word2Vec(w2vParams).trainModel().get()

Now we can use the model:

Let's start with a first test: finding synonyms using this word2vec model. We call the findSynonyms method with a given word and a count; the result is the top ‘count’ synonyms with their distance values:

w2v.findSynonyms("love", 3)
w2v.findSynonyms("help", 2)
w2v.findSynonyms("hate", 1)
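
To get a feel for what a synonym lookup over word vectors does, here is a toy sketch in plain Python. It assumes the ranking is based on cosine similarity between vectors, which is a common choice; the actual scoring inside H2O may differ. The vectors and the function names `cosine_similarity` and `top_synonyms` are made up for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_synonyms(word, vectors, count):
    """Return the `count` words whose vectors are most similar to `word`."""
    target = vectors[word]
    scored = [(other, cosine_similarity(target, vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:count]


# Hypothetical 2-dimensional vectors; a trained model would use _vec_size dimensions.
toy_vectors = {
    "love": [1.0, 0.0],
    "adore": [0.9, 0.1],
    "help": [0.0, 1.0],
    "hate": [-1.0, 0.0],
}
best = top_synonyms("love", toy_vectors, 2)
print(best)
```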

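Let's transform words using the w2v model:

The transform() function takes an H2O Vec as the first parameter, so the vector needs to be extracted from the H2O frame h2oFrame. The second parameter is the aggregate method: NONE, used below, maps each word to its own vector, while AVERAGE averages the word vectors within each sequence of words.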

val newSparkFrame = w2v.transform(h2oFrame.vec(0), Word2VecModel.AggregateMethod.NONE).toTwoDimTable()
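
The idea of averaging word vectors per sequence can be sketched in plain Python. This toy assumes that sequences are delimited by a missing value (None here, like the null appended by the tokenizer) and that unknown words are simply skipped; H2O's own handling of those cases may differ. The function name `average_by_sequence` and the vectors are made up.

```python
def average_by_sequence(tokens, vectors):
    """Average the word vectors inside each sequence; None ends a sequence."""
    results, current = [], []
    # A trailing None is appended so the last sequence is always flushed.
    for token in list(tokens) + [None]:
        if token is None:
            if current:
                dim = len(current[0])
                results.append([sum(vec[i] for vec in current) / len(current)
                                for i in range(dim)])
            current = []
        elif token in vectors:
            current.append(vectors[token])
    return results


# Hypothetical 2-dimensional vectors for three words.
toy_vectors = {"holmes": [1.0, 0.0], "watson": [0.0, 1.0], "london": [2.0, 2.0]}
tokens = ["holmes", "watson", None, "london", None]
averaged = average_by_sequence(tokens, toy_vectors)
print(averaged)
```

With no aggregation you would instead get one vector per input word, which is what the Scala call above requests.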

That's it, enjoy!!

 

Python example of building GLM, GBM and Random Forest Binomial Model with H2O

Here is an example of using the H2O machine learning library to build GLM, GBM and Distributed Random Forest models for a categorical response variable.

Let's import the h2o library and initialize the H2O machine learning cluster:

import h2o
h2o.init()

Importing dataset and getting familiar with it:

df = h2o.import_file("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate.csv")
df.summary()
df.col_names

Let's configure our predictors and response variable from the ingested dataset:

y = 'CAPSULE'
x = df.col_names
x.remove(y)
print("Response = " + y)
print("Predictors = " + str(x))

Now we need to set the response column as categorical or factor:

df['CAPSULE'] = df['CAPSULE'].asfactor()

Now we will look at the levels in our response variable:

 

df['CAPSULE'].levels()
[['0', '1']]

Note: Because the response has only 2 levels (values), the model will be a binomial model.

Now we will split our dataset into training, validation and testing datasets:

train, valid, test = df.split_frame(ratios=[.8, .1])
print(df.shape)
print(train.shape)
print(valid.shape)
print(test.shape)
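
The two ratios passed to split_frame cover the first two frames, and the remainder (here 0.1) goes to the third. As a rough sanity check, the expected sizes can be computed in plain Python. This is a sketch only: `expected_split_sizes` is a hypothetical helper, 380 is used as an illustrative row count, and since split_frame splits rows approximately, the shapes you actually print will usually be close to these numbers rather than equal to them.

```python
def expected_split_sizes(n_rows, ratios):
    """Expected row counts for each split; the last split takes the remainder."""
    remainder = 1.0 - sum(ratios)
    if remainder <= 0:
        raise ValueError("ratios must sum to less than 1")
    fractions = list(ratios) + [remainder]
    return [round(n_rows * fraction) for fraction in fractions]


# 380 rows is an illustrative number; use df.shape[0] for your own frame.
sizes = expected_split_sizes(380, [0.8, 0.1])
print(sizes)
```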

Let's build a Generalized Linear Model first (logistic regression, since the response variable is categorical):

from h2o.estimators.glm import H2OGeneralizedLinearEstimator
glm_logistic = H2OGeneralizedLinearEstimator(family = "binomial")
glm_logistic.train(x=x, y= y, training_frame=train, validation_frame=valid, 
 model_id="glm_logistic")

Now we will take a look at a few model metrics:

glm_logistic.varimp()
Warning: This model doesn't have variable importances

Let's have a look at the model coefficients:

glm_logistic.coef()

Let's perform prediction using the testing dataset:

glm_logistic.predict(test_data=test)

Now we check the model performance metric “rmse” on the testing and other datasets:

print(glm_logistic.model_performance(test_data=test).rmse())
print(glm_logistic.model_performance(test_data=valid).rmse())
print(glm_logistic.model_performance(test_data=train).rmse())
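
To make the number concrete, here is how a root mean squared error can be computed by hand in plain Python. The sketch assumes the error is taken between the 0/1 label and the predicted probability of class 1; the labels and probabilities below are made up, and `rmse` here is a hypothetical helper, not the H2O method.

```python
import math


def rmse(actual, predicted_p1):
    """Root mean squared error between 0/1 labels and predicted probabilities."""
    if len(actual) != len(predicted_p1):
        raise ValueError("inputs must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted_p1)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up labels and class-1 probabilities for four rows.
actual = [0, 1, 1, 0]
predicted_p1 = [0.2, 0.9, 0.6, 0.4]
score = rmse(actual, predicted_p1)
print(round(score, 4))
```

A lower value means the predicted probabilities sit closer to the true labels, which is why comparing the training, validation and testing values is a quick overfitting check.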

Now we check the model performance metric “r2” on the testing and other datasets:

print(glm_logistic.model_performance(test_data=test).r2())
print(glm_logistic.model_performance(test_data=valid).r2())
print(glm_logistic.model_performance(test_data=train).r2())

Let's build a Gradient Boosting Machine model now:

from h2o.estimators.gbm import H2OGradientBoostingEstimator
gbm = H2OGradientBoostingEstimator()
gbm.train(x=x, y =y, training_frame=train, validation_frame=valid)

Now let's get to know our model metrics, starting with the confusion matrix:

gbm.confusion_matrix()
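
A binomial confusion matrix is just four counts taken at a probability threshold. The plain-Python sketch below shows the bookkeeping, assuming a fixed threshold of 0.5 and made-up predictions; H2O picks its own threshold when reporting the matrix, so treat this only as an illustration. `confusion_counts` is a hypothetical helper name.

```python
def confusion_counts(actual, predicted_p1, threshold=0.5):
    """Count true/false positives and negatives at the given threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for label, p1 in zip(actual, predicted_p1):
        predicted = 1 if p1 >= threshold else 0
        if predicted == 1 and label == 1:
            counts["tp"] += 1
        elif predicted == 1 and label == 0:
            counts["fp"] += 1
        elif predicted == 0 and label == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


# Made-up labels and class-1 probabilities for five rows.
actual = [0, 1, 1, 0, 1]
predicted_p1 = [0.2, 0.9, 0.4, 0.6, 0.8]
counts = confusion_counts(actual, predicted_p1)
print(counts)
```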

Now let's have a look at the variable importance plot:

gbm.varimp_plot()

Now let's have a look at the variable importance table:

gbm.varimp()

Let's build a Distributed Random Forest model:

from h2o.estimators.random_forest import H2ORandomForestEstimator
drf = H2ORandomForestEstimator()
drf.train(x=x, y = y, training_frame=train, validation_frame=valid)

Let's understand the random forest model metrics, starting with the confusion matrix:

drf.confusion_matrix()

We can also have a look at the gains and lift table:

drf.gains_lift()

Note:

  • The model metrics shown for one model type can be retrieved for the other model types as well, where they apply.
  • We can also get model performance based on training, validation and testing data for all models.

That's it, enjoy!!

 

Visualizing H2O GBM and Random Forest MOJO Model Trees in Python

In this example we will first build a tree-based model using the H2O machine learning library and then save that model as a MOJO. Using the GraphViz/Dot library we will extract individual trees/cross-validated model trees from the MOJO and visualize them. If you are new to H2O MOJO models, learn here.

You can also get the full working IPython notebook for this example from here.

Let's build the model first using the H2O GBM algorithm. You can also use a Distributed Random Forest model for tree visualization.

Let's first import the key Python modules:

import h2o
import subprocess
from IPython.display import Image

Now we will build a GBM model using the public PROSTATE dataset:

h2o.init()
df = h2o.import_file('https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate.csv')
y = 'CAPSULE'
x = df.col_names
x.remove(y)
df[y] = df[y].asfactor()
train, valid, test = df.split_frame(ratios=[.8,.1])
from h2o.estimators.gbm import H2OGradientBoostingEstimator
gbm_cv3 = H2OGradientBoostingEstimator(nfolds=3)
gbm_cv3.train(x=x, y=y, training_frame=train)

## Getting all cross validated models 
all_models = gbm_cv3.cross_validation_models()
print("Total cross validation models: " + str(len(all_models)))

Now let's set all the default parameters needed to create the tree graph first and then the tree images (in PNG format) on the local disk. Make sure you have a writable path where you can create and save these intermediate files. You also need to provide the path to the latest H2O jar (h2o.jar), which is used to run the PrintMojo tool on the MOJO model.

mojo_file_name = "/Users/avkashchauhan/Downloads/my_gbm_mojo.zip"
h2o_jar_path= '/Users/avkashchauhan/tools/h2o-3/h2o-3.14.0.3/h2o.jar'
mojo_full_path = mojo_file_name
gv_file_path = "/Users/avkashchauhan/Downloads/my_gbm_graph.gv"

Now let's define the image file name, which we will generate from the tree ID. Based on the tree ID, the image file will be named my_gbm_tree_ID.png:

image_file_name = "/Users/avkashchauhan/Downloads/my_gbm_tree"

Now we will download the GBM MOJO model by saving it to disk:

gbm_cv3.download_mojo(mojo_file_name)

Now let's define the function to generate a GraphViz tree from the saved MOJO model:

def generateTree(h2o_jar_path, mojo_full_path, gv_file_path, image_file_path, tree_id = 0):
    image_file_path = image_file_path + "_" + str(tree_id) + ".png"
    result = subprocess.call(["java", "-cp", h2o_jar_path, "hex.genmodel.tools.PrintMojo", "--tree", str(tree_id), "-i", mojo_full_path , "-o", gv_file_path ], shell=False)
    result = subprocess.call(["ls",gv_file_path], shell = False)
    if result == 0:
        print("Success: Graphviz file " + gv_file_path + " is generated.")
    else: 
        print("Error: Graphviz file " + gv_file_path + " could not be generated.")

Now let's define the method to generate the tree image as PNG from the saved GraphViz tree:

def generateTreeImage(gv_file_path, image_file_path, tree_id):
    image_file_path = image_file_path + "_" + str(tree_id) + ".png"
    result = subprocess.call(["dot", "-Tpng", gv_file_path, "-o", image_file_path], shell=False)
    result = subprocess.call(["ls",image_file_path], shell = False)
    if result == 0:
        print("Success: Image File " + image_file_path + " is generated.")
        print("Now you can execute the following line as-is to see the tree graph:")
        print("Image(filename='" + image_file_path + "\')")
    else:
        print("Error: Image file " + image_file_path + " could not be generated.")
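
The two functions above call `ls` to check whether a file was written, which only works where that command exists. A slightly more portable variation is sketched below: it builds the same two command lines as plain lists and checks the output file with os.path. The helper names `print_mojo_command`, `dot_command` and `file_was_created` are my own assumptions, and the paths in the example are placeholders.

```python
import os


def print_mojo_command(h2o_jar_path, mojo_path, gv_path, tree_id=0):
    """Command line that asks PrintMojo to write one tree as a GraphViz file."""
    return ["java", "-cp", h2o_jar_path, "hex.genmodel.tools.PrintMojo",
            "--tree", str(tree_id), "-i", mojo_path, "-o", gv_path]


def dot_command(gv_path, image_base, tree_id):
    """Command line that renders the GraphViz file to PNG, plus the PNG path."""
    image_path = "{}_{}.png".format(image_base, tree_id)
    return ["dot", "-Tpng", gv_path, "-o", image_path], image_path


def file_was_created(path):
    """True if the file exists and is not empty (replaces the `ls` check)."""
    return os.path.isfile(path) and os.path.getsize(path) > 0


# Placeholder paths, for illustration only.
mojo_cmd = print_mojo_command("h2o.jar", "my_gbm_mojo.zip", "my_gbm_graph.gv", 3)
png_cmd, png_path = dot_command("my_gbm_graph.gv", "my_gbm_tree", 3)
print(" ".join(mojo_cmd))
print(" ".join(png_cmd))
```

Each list can be passed to subprocess.call as before. In Python 3, subprocess.call also accepts a timeout argument, which may help if the dot step hangs.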

Note: I had to write this as a two-step process because when I put everything in one step, the process hung after the GraphViz file was created.

Now let's generate the tree by passing all the parameters defined above, with the proper tree ID as the last parameter.

#Just change the tree id in the function below to get which particular tree you want
generateTree(h2o_jar_path, mojo_full_path, gv_file_path, image_file_name, 3)

Now we will be generating PNG Tree Image from the saved GraphViz content.

generateTreeImage(gv_file_path, image_file_name, 3)
# Note: If this step hangs, you can look at "dot" active process in osx and try killing it

Let's visualize the main model tree:

# Just pass the Tree Image file name depending on your tree
Image(filename='/Users/avkashchauhan/Downloads/my_gbm_tree_0.png')

[Image: tree-0]

Let's visualize the first cross-validation tree (Cross Validation ID 1):

# Just pass the Tree Image file name depending on your tree
Image(filename='/Users/avkashchauhan/Downloads/my_gbm_tree_1.png')

[Image: tree-1]

Let's visualize the second cross-validation tree (Cross Validation ID 2):

# Just pass the Tree Image file name depending on your tree
Image(filename='/Users/avkashchauhan/Downloads/my_gbm_tree_2.png')

[Image: tree-2]

Let's visualize the third cross-validation tree (Cross Validation ID 3):

# Just pass the Tree Image file name depending on your tree
Image(filename='/Users/avkashchauhan/Downloads/my_gbm_tree_3.png')

[Image: tree-3]

After looking at these trees, you can visualize how the decisions are made.

That's it, enjoy!!

Stacked Ensemble Model in Scala using H2O GBM and Deep Learning Models

In this full Scala sample we will use the H2O Stacked Ensembles algorithm. Stacking is a process of first building models of various types with cross-validation, keeping the cross-validation predictions of each model, and then building the stacked ensemble model from those cross-validated results. You can learn more about Stacked Ensembles here.

In this example we will use the GBM and Deep Learning algorithms as base models and then build the Stacked Ensemble model from them.
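
The idea behind keeping cross-validation predictions can be shown with a toy sketch in plain Python. Each base learner predicts every row using a model that never saw that row, and those out-of-fold predictions become the inputs of the ensemble. Everything here is made up for illustration: the two "learners" ignore features entirely, and `out_of_fold` is a hypothetical helper, not H2O code.

```python
def out_of_fold(labels, folds, nfolds, fit, predict):
    """One prediction per row, each from a model trained without that row's fold."""
    predictions = [None] * len(labels)
    for k in range(nfolds):
        train_labels = [y for y, fold in zip(labels, folds) if fold != k]
        model = fit(train_labels)
        for i, fold in enumerate(folds):
            if fold == k:
                predictions[i] = predict(model)
    return predictions


# Two toy base learners (hypothetical): the mean label and the majority label.
def fit_mean(ys):
    return sum(ys) / len(ys)


def fit_majority(ys):
    return 1 if 2 * sum(ys) >= len(ys) else 0


def predict_constant(model):
    return model


labels = [0, 1, 1, 0, 1, 1]
folds = [i % 3 for i in range(len(labels))]  # same fold assignment for both learners
oof_mean = out_of_fold(labels, folds, 3, fit_mean, predict_constant)
oof_majority = out_of_fold(labels, folds, 3, fit_majority, predict_constant)

# Each row of level_one holds the base-model predictions the ensemble learns from.
level_one = list(zip(oof_mean, oof_majority))
print(level_one)
```

Note that both learners share the same fold assignment, which mirrors why the Scala code below uses the same number of folds and the same seed for both base models.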

First let's import the key classes specific to H2O:

import org.apache.spark.h2o._
import water.Key
import java.io.File

Now we will create the H2O context so we can call key H2O functions for data ingest and Deep Learning algorithms:

val h2oContext = H2OContext.getOrCreate(sc)
import h2oContext._
import h2oContext.implicits._

Let's import data from the local file system as an H2O data frame:

val prostateData = new H2OFrame(new File("/Users/avkashchauhan/src/github.com/h2oai/sparkling-water/examples/smalldata/prostate.csv"))

In this Stacked Ensemble we will use the GBM and Deep Learning algorithms, so let's first build the Deep Learning model:

import _root_.hex.deeplearning.DeepLearning
import _root_.hex.deeplearning.DeepLearningModel.DeepLearningParameters

val dlParams = new DeepLearningParameters()
dlParams._epochs = 100
dlParams._train = prostateData
dlParams._response_column = 'CAPSULE
dlParams._variable_importances = true
dlParams._nfolds = 5
dlParams._seed = 1111
dlParams._keep_cross_validation_predictions = true;
val dl = new DeepLearning(dlParams, Key.make("dlProstateModel.hex"))
val dlModel = dl.trainModel.get

Now let's build the GBM model:

import _root_.hex.tree.gbm.GBM
import _root_.hex.tree.gbm.GBMModel.GBMParameters

val gbmParams = new GBMParameters()
gbmParams._train = prostateData
gbmParams._response_column = 'CAPSULE
gbmParams._nfolds = 5
gbmParams._seed = 1111
gbmParams._keep_cross_validation_predictions = true;
val gbm = new GBM(gbmParams,Key.make("gbmRegModel.hex"))
val gbmModel = gbm.trainModel().get()

Now let's build the Stacked Ensemble model. First we need the classes required for Stacked Ensembles:

import _root_.hex.Model
import _root_.hex.StackedEnsembleModel
import _root_.hex.ensemble.StackedEnsemble

Now we will define the Stacked Ensemble parameters:

val stackedEnsembleParameters = new StackedEnsembleModel.StackedEnsembleParameters()
stackedEnsembleParameters._train = prostateData._key
stackedEnsembleParameters._response_column = 'CAPSULE

Now we need to pass all the base models we want to use in the Stacked Ensemble by passing their keys:

type T_MODEL_KEY = Key[Model[_, _ <: Model.Parameters, _ <:Model.Output]]

// Option 1
stackedEnsembleParameters._base_models = Array(gbmModel._key.asInstanceOf[T_MODEL_KEY], dlModel._key.asInstanceOf[T_MODEL_KEY])
// Option 2
stackedEnsembleParameters._base_models = Array(gbmModel, dlModel).map(model => model._key.asInstanceOf[T_MODEL_KEY])

// Note: You can choose either of the above options to pass the model keys

Finally, let's define the stacked ensemble job:

val stackedEnsembleJob = new StackedEnsemble(stackedEnsembleParameters)

And as the last step, let's build the stacked ensemble model:

val stackedEnsembleModel = stackedEnsembleJob.trainModel().get();

Now we can take a look at our Stacked Ensemble model:

stackedEnsembleModel

That's it, enjoy!!

Helpful content: https://github.com/h2oai/h2o-3/blob/a554bffabda6770386a31d47e05f00543d7b9ac3/h2o-algos/src/test/java/hex/ensemble/StackedEnsembleTest.java

 

Logistic Regression with H2O Deep Learning in Scala

Here is sample code that shows how to use the feed-forward-network-based Deep Learning algorithm from H2O to perform logistic regression.

First let's import the key classes specific to H2O:

import org.apache.spark.h2o._
import water.Key
import java.io.File

Now we will create the H2O context so we can call key H2O functions for data ingest and Deep Learning algorithms:

val h2oContext = H2OContext.getOrCreate(sc)
import h2oContext._
import h2oContext.implicits._

Let's import data from the local file system as an H2O data frame:

val prostateData = new H2OFrame(new File("/Users/avkashchauhan/src/github.com/h2oai/sparkling-water/examples/smalldata/prostate.csv"))

Now let's import the Deep Learning classes:

import _root_.hex.deeplearning.DeepLearning
import _root_.hex.deeplearning.DeepLearningModel.DeepLearningParameters

Now we will define all the key parameters specific to the H2O Deep Learning algorithm:

val dlParams = new DeepLearningParameters()
dlParams._epochs = 100
dlParams._train = prostateData
dlParams._response_column = 'CAPSULE
dlParams._variable_importances = true
dlParams._nfolds = 5
dlParams._seed = 1111
dlParams._keep_cross_validation_predictions = true;

Now we will create the Deep Learning model key first and then run the Deep Learning algorithm in blocking mode:

val dl = new DeepLearning(dlParams, Key.make("dlProstateModel.hex"))
val dlModel = dl.trainModel.get()

Let's learn more about our model:

dlModel

Now we can perform prediction by passing an H2O data frame. Here I am simply passing the original data frame; however, you can load your test data and pass it as an H2O frame instead:

val predictionH2OFrame = dlModel.score(prostateData)('predict)
val predictionsFromModel = asRDD[DoubleHolder](predictionH2OFrame).collect.map(_.result.getOrElse(Double.NaN))

That's it, enjoy!!

 

 

H2O AutoML examples in Python and Scala

AutoML is included in H2O version 3.14.0.1 and above. You can learn more about AutoML in the H2O blog here.

H2O’s AutoML can be used for automating a large part of the machine learning workflow, which includes automatic training and tuning of many models within a user-specified time limit. The user can also use a performance metric-based stopping criterion for the AutoML process rather than a specific time constraint. Stacked Ensembles will be automatically trained on the collection of individual models to produce a highly predictive ensemble model which, in most cases, will be the top performing model in the AutoML Leaderboard.

Here is the full working Python code, taken from here:

import h2o
from h2o.automl import H2OAutoML

h2o.init()
df = h2o.import_file("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate.csv")
train, test = df.split_frame(ratios=[.9])
# Identify predictors and response
x = train.columns
y = "CAPSULE"
x.remove(y)

# For binary classification, response should be a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()

# Run AutoML for 60 seconds
aml = H2OAutoML(max_runtime_secs = 60)
aml.train(x = x, y = y, training_frame = train, leaderboard_frame = test)

# View the AutoML Leaderboard
aml.leaderboard
aml.leader

# To generate predictions on a test set, use the H2OAutoML object, or the leader model object directly, as below:
preds = aml.predict(test)
# or
preds = aml.leader.predict(test)

Here is the full working Scala code:

import ai.h2o.automl.AutoML;
import ai.h2o.automl.AutoMLBuildSpec
import org.apache.spark.h2o._
val h2oContext = H2OContext.getOrCreate(sc)
import h2oContext._
import java.io.File
import h2oContext.implicits._
import water.Key
val prostateData = new H2OFrame(new File("/Users/avkashchauhan/src/github.com/h2oai/sparkling-water/examples/smalldata/prostate.csv"))
val autoMLBuildSpec = new AutoMLBuildSpec()
autoMLBuildSpec.input_spec.training_frame = prostateData
autoMLBuildSpec.input_spec.response_column = "CAPSULE";
autoMLBuildSpec.build_control.loss = "AUTO"
autoMLBuildSpec.build_control.stopping_criteria.set_max_runtime_secs(5)
import java.util.Date;
val aml = AutoML.makeAutoML(Key.make(), new Date(), autoMLBuildSpec)
AutoML.startAutoML(aml)
// Note: In some cases the above call is non-blocking,
// so the following alternative call blocks the next command until the action has finished executing
AutoML.startAutoML(autoMLBuildSpec).get()  // This is a forced blocking call
aml.leader
aml.leaderboard

If you want to see the full code execution, visit here.

That's it, enjoy!!

Building Regression and Classification GBM models in Scala with H2O

In the full code below you will learn to build H2O GBM models (regression and binomial classification) in Scala.

Let's first import all the classes we need for this project:

import org.apache.spark.SparkFiles
import org.apache.spark.h2o._
import org.apache.spark.examples.h2o._
import org.apache.spark.sql.{DataFrame, SQLContext}
import water.Key
import java.io.File

import water.support.SparkContextSupport.addFiles
import water.support.H2OFrameSupport._

// Create SQL support
implicit val sqlContext = spark.sqlContext
import sqlContext.implicits._

Next we need to start H2O cluster so we can start using H2O APIs:

// Start H2O services
val h2oContext = H2OContext.getOrCreate(sc)
import h2oContext._
import h2oContext.implicits._

Now we need to ingest the data which we can use to perform modeling:

// Import prostate data into H2O
val prostateData = new H2OFrame(new File("/Users/avkashchauhan/src/github.com/h2oai/sparkling-water/examples/smalldata/prostate.csv"))

// Understanding our input data
prostateData.names
prostateData.numCols
prostateData.numRows
prostateData.keys
prostateData.key

Now we will import some H2O specific classes we need to perform our actions:

import h2oContext.implicits._
import _root_.hex.tree.gbm.GBM
import _root_.hex.tree.gbm.GBMModel.GBMParameters

Let's set up the GBM parameters, which will shape our GBM modeling process:

val gbmParams = new GBMParameters()
gbmParams._train = prostateData
gbmParams._response_column = 'CAPSULE

In the response column setting above, the column “CAPSULE” is numeric, so by default GBM will build a regression model. Let's start building the GBM model now:

val gbm = new GBM(gbmParams,Key.make("gbmRegModel.hex"))
val gbmRegModel = gbm.trainModel.get
// Equivalent call with explicit parentheses:
// val gbmRegModel = gbm.trainModel().get()

Let's get to know our GBM model; we will see that the type of this model is “regression”:

gbmRegModel

Let's perform prediction using the GBM regression model:

val predH2OFrame = gbmRegModel.score(prostateData)('predict)
val predFromModel = asRDD[DoubleHolder](predH2OFrame).collect.map(_.result.getOrElse(Double.NaN))

Now we will prepare the input data set for a GBM classification model. Below we set the response column to be a categorical type, so all the values in this column become enum levels instead of numbers; this way we make sure that the GBM model we build will be a classification model:

prostateData.names()
//
// >>> res6: Array[String] = Array(ID, CAPSULE, AGE, RACE, DPROS, DCAPS, PSA, VOL, GLEASON)
// Based on above the CAPSULE is the id = 1
// Note: If we will not set categorical for response variable we will see the following exception
//        - water.exceptions.H2OModelBuilderIllegalArgumentException: 
//             - Illegal argument(s) for GBM model: gbmModel.hex.  Details: ERRR on field: _distribution: Binomial requires the response to be a 2-class categorical

withLockAndUpdate(prostateData){ fr => fr.replace(1, fr.vec("CAPSULE").toCategoricalVec)}

gbmParams._response_column = 'CAPSULE

We can also set the distribution explicitly. In the code below we set the distribution to Bernoulli:

import _root_.hex.genmodel.utils.DistributionFamily
gbmParams._distribution = DistributionFamily.bernoulli

Now let's build our GBM model:

val gbm = new GBM(gbmParams,Key.make("gbmBinModel.hex"))
val gbmBinModel = gbm.trainModel.get
// Equivalent call with explicit parentheses:
// val gbmBinModel = gbm.trainModel().get()

Let's check the new model; we will find that it is a classification model, specifically a binomial classification model, because its response has only 2 classes:

gbmBinModel

Now let's perform prediction using our GBM binomial classification model:

val predH2OFrame = gbmBinModel.score(prostateData)('predict)
val predFromModel = asRDD[DoubleHolder](predH2OFrame).collect.map(_.result.getOrElse(Double.NaN))

That's all, enjoy!!

 

 

 

Scala Example with Grid Search and Hyperparameters for GBM in H2O

Here is the full Scala source code for GBM grid search and hyperparameter optimization using H2O (the code is on GitHub as well):

import org.apache.spark.SparkFiles
import org.apache.spark.h2o._
import org.apache.spark.examples.h2o._
import org.apache.spark.sql.{DataFrame, SQLContext}
import water.Key
import java.io.File

import water.support.SparkContextSupport.addFiles
import water.support.H2OFrameSupport._

// Create SQL support
implicit val sqlContext = spark.sqlContext
import sqlContext.implicits._

// Start H2O services
val h2oContext = H2OContext.getOrCreate(sc)
import h2oContext._
import h2oContext.implicits._

// Register files to SparkContext
addFiles(sc,
 "/Users/avkashchauhan/src/github.com/h2oai/sparkling-water/examples/smalldata/year2005.csv.gz",
 "/Users/avkashchauhan/src/github.com/h2oai/sparkling-water/examples/smalldata/Chicago_Ohare_International_Airport.csv")

// Import all year airlines data into H2O
val airlinesData = new H2OFrame(new File(SparkFiles.get("year2005.csv.gz")))

// Import weather data into Spark
val wrawdata = sc.textFile(SparkFiles.get("Chicago_Ohare_International_Airport.csv"),8).cache()
val weatherTable = wrawdata.map(_.split(",")).map(row => WeatherParse(row)).filter(!_.isWrongRow())

// Transfer data from H2O to Spark DataFrame
val airlinesTable = h2oContext.asDataFrame(airlinesData).map(row => AirlinesParse(row))
val flightsToORD = airlinesTable.filter(f => f.Dest==Some("ORD"))

// Use Spark SQL to join flight and weather data in spark
flightsToORD.toDF.createOrReplaceTempView("FlightsToORD")
weatherTable.toDF.createOrReplaceTempView("WeatherORD")

// Perform SQL Join on both tables
val bigTable = sqlContext.sql(
 """SELECT
 |f.Year,f.Month,f.DayofMonth,
 |f.CRSDepTime,f.CRSArrTime,f.CRSElapsedTime,
 |f.UniqueCarrier,f.FlightNum,f.TailNum,
 |f.Origin,f.Distance,
 |w.TmaxF,w.TminF,w.TmeanF,w.PrcpIn,w.SnowIn,w.CDD,w.HDD,w.GDD,
 |f.IsDepDelayed
 |FROM FlightsToORD f
 |JOIN WeatherORD w
 |ON f.Year=w.Year AND f.Month=w.Month AND f.DayofMonth=w.Day""".stripMargin)




val trainFrame:H2OFrame = bigTable
withLockAndUpdate(trainFrame){ fr => fr.replace(19, fr.vec("IsDepDelayed").toCategoricalVec)}

bigTable.numCols
bigTable.numRows

import h2oContext.implicits._
import _root_.hex.tree.gbm.GBM
import _root_.hex.tree.gbm.GBMModel.GBMParameters

val gbmParams = new GBMParameters()

gbmParams._train = trainFrame
gbmParams._response_column = 'IsDepDelayed

import _root_.hex.genmodel.utils.DistributionFamily

gbmParams._distribution = DistributionFamily.bernoulli

val gbm = new GBM(gbmParams,Key.make("gbmModel.hex"))
val gbmModel = gbm.trainModel.get
// Equivalent call with explicit parentheses:
// val gbmModel = gbm.trainModel().get()

// Use model to estimate delay on training data
val predGBMH2OFrame = gbmModel.score(trainFrame)('predict)
val predGBMFromModel = asRDD[DoubleHolder](predGBMH2OFrame).collect.map(_.result.getOrElse(Double.NaN))

def let[A](in: A)(body: A => Unit) = {
 body(in)
 in
}




import _root_.hex.grid.GridSearch
import _root_.hex.ScoreKeeper

import water.Key
import scala.collection.JavaConversions._

val gbmHyperSpace: java.util.Map[String, Array[Object]] = Map[String, Array[AnyRef]](
 "_ntrees" -> (1 to 10).map(v => Int.box(100*v)).toArray,
 "_max_depth" -> (2 to 7).map(Int.box).toArray,
 "_learn_rate" -> Array(0.1, 0.01).map(Double.box),
 "_col_sample_rate" -> Array(0.3, 0.7, 1.0).map(Double.box),
 "_learn_rate_annealing" -> Array(0.8, 0.9, 0.95, 1.0).map(Double.box)
)

// @Snippet
import _root_.hex.grid.HyperSpaceSearchCriteria.RandomDiscreteValueSearchCriteria




val gbmHyperSpaceCriteria = let(new RandomDiscreteValueSearchCriteria) { c =>
 c.set_stopping_metric(ScoreKeeper.StoppingMetric.RMSE)
 c.set_stopping_tolerance(0.1)
 c.set_stopping_rounds(1)
 c.set_max_runtime_secs(4 * 60 /* seconds */)
}

//
// This step starts the grid search.
// The simpler call below (without search criteria) also runs, but it can take a long time:
// val gs = GridSearch.startGridSearch(null, gbmParams, gbmHyperSpace);
//
val gbmGrid = GridSearch.startGridSearch(Key.make("gbmGridModel"),
 gbmParams,
 gbmHyperSpace,
 new GridSearch.SimpleParametersBuilderFactory[GBMParameters],
 gbmHyperSpaceCriteria).get()




// Training Frame Info
gbmGrid.getTrainingFrame

//
// Looking at grid models by keys
//
val mKeys = gbmGrid.getModelKeys()
gbmGrid.createSummaryTable(mKeys, "mse", true);
gbmGrid.createSummaryTable(mKeys, "rmse", true);

// Model Count
gbmGrid.getModelCount

// All Models
gbmGrid.getModels
val ms = gbmGrid.getModels()
val gbm0 = ms(0)
val gbm1 = ms(1)
val gbm2 = ms(2)

// All hyper parameters
gbmGrid.getHyperNames
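
To see why the search above uses random discrete search criteria with a runtime limit, it helps to count how many combinations the gbmHyperSpace map describes. Here is a small sketch in plain Python that mirrors the same value lists; it only does the arithmetic and does not talk to H2O.

```python
from itertools import product

# Same value lists as gbmHyperSpace in the Scala code above.
hyper_space = {
    "_ntrees": [100 * v for v in range(1, 11)],
    "_max_depth": list(range(2, 8)),
    "_learn_rate": [0.1, 0.01],
    "_col_sample_rate": [0.3, 0.7, 1.0],
    "_learn_rate_annealing": [0.8, 0.9, 0.95, 1.0],
}

# A full Cartesian grid trains one model per combination.
total_combinations = 1
for values in hyper_space.values():
    total_combinations *= len(values)

# The first combination a Cartesian walk over the lists would visit.
first_combination = next(product(*hyper_space.values()))

print(total_combinations)
print(dict(zip(hyper_space.keys(), first_combination)))
```

With that many candidate models, a time-boxed random search samples only a subset of the space, which is the trade-off the search criteria in the Scala code express.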

That's it, enjoy!!