Grid Search for Naive Bayes in R using H2O

Here is a sample R script showing how to perform a grid search with the Naive Bayes algorithm on the H2O machine learning platform:

# H2O
library(h2o)
library(ggplot2)
library(data.table)
 
# initialize the cluster with all available threads and 2 GB of memory
h2o.init(nthreads = -1, max_mem_size = "2g")
 
# Required variables: convert existing R data frames 'training' and 'testing' to H2O frames
train.h2o <- as.h2o(training)
test.h2o  <- as.h2o(testing)
names(train.h2o)
str(train.h2o)
 
y <- 4        # response column index
x <- c(5:16)  # predictor column indices
 
# specify the list of hyperparameters to search over
hyper_params <- list(
 laplace = c(0,0.5,1,2,3)
)
 
threshold <- c(0.001, 0.00001, 0.0000001)  # defined here but not used in this grid
 
# performs the grid search
grid_id <- "nb_grid"
model_bayes_grid <- h2o.grid(
 algorithm = "naivebayes", # name of the algorithm
 grid_id = grid_id,
 training_frame = train.h2o,
 validation_frame = test.h2o,
 x = x,
 y = y,
 hyper_params = hyper_params
)
 
# find the best model and evaluate its performance
sort_metric <- 'accuracy'
sorted_models <- h2o.getGrid(
 grid_id = grid_id,
 sort_by = sort_metric,
 decreasing = TRUE
)
 
best_model <- h2o.getModel(sorted_models@model_ids[[1]])
best_model
 
h2o.confusionMatrix(best_model, valid = TRUE, metrics = 'accuracy')
 

perf <- h2o.performance(best_model, valid = TRUE)
auc <- h2o.auc(best_model, valid = TRUE)
fpr <- h2o.fpr(perf)[['fpr']]
tpr <- h2o.tpr(perf)[['tpr']]
ggplot( data.table(fpr = fpr, tpr = tpr), aes(fpr, tpr) ) +
 geom_line() + theme_bw()+ggtitle( sprintf('AUC: %f', auc) )
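For context, the value h2o.auc returns is the area under this ROC curve. A small self-contained numpy sketch of that computation via the trapezoidal rule, using made-up fpr/tpr values rather than the actual H2O output:

```python
import numpy as np

# hypothetical fpr/tpr points standing in for the h2o.fpr()/h2o.tpr() output above
fpr = np.array([0.0, 0.1, 0.4, 1.0])
tpr = np.array([0.0, 0.6, 0.8, 1.0])

# AUC = area under the ROC curve, approximated by the trapezoidal rule
auc = np.trapz(tpr, fpr)
print(auc)
```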
 

# To obtain the Laplace smoothing parameter, laplace, of the best model:
best_model@parameters
best_model@parameters$laplace
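For context, laplace adds a pseudo-count to every conditional probability estimate, so a feature level never seen with a class still gets a nonzero probability. A minimal pure-Python sketch of the smoothed estimate (the counts and the level count k here are made up for illustration):

```python
def smoothed_prob(count, total, k, laplace):
    """P(feature level | class) with Laplace smoothing:
    (count + laplace) / (total + laplace * k), where k is the
    number of distinct levels of the feature."""
    return (count + laplace) / (total + laplace * k)

# an unseen level (count = 0) gets zero probability without smoothing...
print(smoothed_prob(0, 20, 4, 0))  # 0.0
# ...but a small nonzero probability with laplace = 1 (1/24 ≈ 0.0417)
print(smoothed_prob(0, 20, 4, 1))
```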

That's it, enjoy!


Filtering H2O data frame on multiple fields of date and int type

Let's create an H2O frame using the h2o.create_frame API (Python):

import h2o
h2o.init()

df = h2o.create_frame(rows=10, cols=10, time_fraction=0.1)

The above creates a frame of 10 rows and 10 columns; with time_fraction = 0.1, 1 out of the 10 columns will be a date/time column. The data frame looks as below:

[screenshot of the generated frame]

Here are a few example filtering scripts:

import datetime

df1 = df[(df['C4'] > 0) & (df['C7'] < 10)]
df2 = df[(df['C4'] > 0) & (df['C7'] < 10) & (df['C9'] > datetime.datetime(2000, 1, 1))]
df3 = df[((df['C4'] > 0) | (df['C7'] < 10)) & (df['C9'] > datetime.datetime(2000, 1, 1))]
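Note that the parentheses around each comparison are required: in Python, & and | bind more tightly than comparison operators, so an expression like df['C4'] > 0 & df['C7'] < 10 without parentheses would be parsed incorrectly. The same boolean-mask idiom works on a pandas DataFrame, which makes for a self-contained runnable sketch (made-up data, not the H2O frame above):

```python
import datetime
import pandas as pd

df = pd.DataFrame({
    'C4': [5, -1, 3],
    'C7': [2, 20, 8],
    'C9': pd.to_datetime(['1999-06-01', '2005-03-15', '2010-12-31']),
})

# AND of two numeric conditions and one datetime condition,
# mirroring the H2O expressions above
mask = (df['C4'] > 0) & (df['C7'] < 10) & (df['C9'] > datetime.datetime(2000, 1, 1))
print(df[mask])
```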


That's it, enjoy!

Building high order polynomials with GLM for higher accuracy

Sometimes when building GLM models, you would like to configure GLM to fit higher-order terms of the features.

The reason is that when you have strong predictors for a model, including higher-order combinations of those predictors can give higher accuracy.

With H2O, you can create higher order polynomials as below:

  • Look for the ‘interactions’ parameter in the GLM model.
  • In the interactions parameter, add the list of predictor columns to interact.

When the model is built, all pairwise combinations of this list will be computed. The following is a working sample:
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

h2o.init()

boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
predictors = boston.columns[:-1]
response = "medv"
interactions_list = ['crim', 'dis']
boston_glm = H2OGeneralizedLinearEstimator(interactions = interactions_list)
boston_glm.train(x = predictors, y = response, training_frame = boston)
boston_glm.coef()
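For intuition, a pairwise interaction between two numeric predictors is their element-wise product, added as an extra feature that GLM fits a coefficient for. A tiny numpy sketch (illustration only, with made-up values, not H2O's implementation) of what a crim × dis interaction column contains:

```python
import numpy as np

# made-up values standing in for the crim and dis columns
crim = np.array([0.1, 0.2, 0.4])
dis  = np.array([4.0, 3.0, 2.0])

# the interaction feature is the element-wise product of the two columns
crim_x_dis = crim * dis
print(crim_x_dis)
```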
To explore interactions among categorical variables, use h2o.interaction.
That's all, enjoy!

Applying AND, OR, NOT conditions as filters on a data frame

Question:

How do I add conditions (AND, OR, NOT) to data frame filters? For example, I have two flags:

  1. a myData flag, myData_flag
  2. a myProx flag, is_myProx_t_f

Conditions are defined as below:

  • AND: is it data_myDatamyProx = data[(data['myData_flag'].isin(['1']),:) && (data['is_myProx_t_f'].isin(['1']),:)]?
  • OR: is it data_myDataOrmyProx = data[(data['myData_flag'].isin(['1']),:) || (data['is_myProx_t_f'].isin(['1']),:)]?
  • NOT: is it data_NonemyDatamyProx = data[(data['myData_flag'].isnotin(['1']),:) || (data['is_myProx_t_f'].isnotin(['1']),:)]?

Solution:

For the AND and OR operators, you can already accomplish this as below (using the iris dataset as an example):

df[(df['Sepal.Length'] < 5) & (df['Sepal.Width'] > 3) | (df['Species'].isin(['setosa'])), :]

Above, the AND and OR operators are & and |, and negation (NOT) is the tilde ~.
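The NOT case from the question wraps a condition in ~( ). The same boolean-mask idiom works in pandas, which gives a self-contained runnable illustration (made-up flag data, not the questioner's frame):

```python
import pandas as pd

df = pd.DataFrame({'myData_flag':   ['1', '0', '1', '0'],
                   'is_myProx_t_f': ['1', '1', '0', '0']})

# NOT for both flags: rows where neither flag is '1'
neither = df[~(df['myData_flag'].isin(['1'])) & ~(df['is_myProx_t_f'].isin(['1']))]
print(neither)
```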
That's it, enjoy!

Creating a new column in a data frame from a calculation over the data

Sometimes you may need to apply a function to the full data frame, or to a specific column, and add a new column containing the results. This is how you can do it:

import h2o
import numpy as np

h2o.init()

# Create a test frame
c_names = ['Prediction']
data1 = np.array([[0.12],
                  [0.43],
                  [0.90],
                  [0.002],
                  [0.52]])
df = h2o.H2OFrame(data1, destination_frame='df', column_names=c_names)

# Apply the function to a specific column and add the result as a new column in the same frame:
df['new_prediction'] = df['Prediction'] * 1000
print(df)
That's it, enjoy!

Binomial classification example in Scala and GLM with H2O

Here is a sample for a binomial classification problem using the H2O GLM algorithm with a credit card data set, in Scala.

This sample was created using Spark 2.1.0 with Sparkling Water 2.1.4.

import org.apache.spark.h2o._
import water.support.SparkContextSupport.addFiles
import org.apache.spark.SparkFiles
import java.io.File
import water.support.{H2OFrameSupport, SparkContextSupport, ModelMetricsSupport}
import water.Key
import _root_.hex.glm.GLMModel
import _root_.hex.ModelMetricsBinomial


val hc = H2OContext.getOrCreate(sc)
import hc._
import hc.implicits._

addFiles(sc, "/Users/avkashchauhan/learn/deepwater/credit_card_clients.csv")
val creditCardData = new H2OFrame(new File(SparkFiles.get("credit_card_clients.csv")))

val ratios = Array[Double](0.8)
val keys = Array[String]("train.hex", "valid.hex")
val frs = H2OFrameSupport.split(creditCardData, keys, ratios)
val (train, valid) = (frs(0), frs(1))

def buildGLMModel(train: Frame, valid: Frame, response: String)
 (implicit h2oContext: H2OContext): GLMModel = {
 import _root_.hex.glm.GLMModel.GLMParameters.Family
 import _root_.hex.glm.GLM
 import _root_.hex.glm.GLMModel.GLMParameters
 val glmParams = new GLMParameters(Family.binomial)
 glmParams._train = train
 glmParams._valid = valid
 glmParams._response_column = response
 glmParams._alpha = Array[Double](0.5)
 val glm = new GLM(glmParams, Key.make("glmModel.hex"))
 glm.trainModel().get()  // the trained GLMModel is the function's return value
}

val glmModel = buildGLMModel(train, valid, 'default_payment_next_month)(hc)

// Collect model metrics and evaluate model quality
val trainMetrics = ModelMetricsSupport.modelMetrics[ModelMetricsBinomial](glmModel, train)
val validMetrics = ModelMetricsSupport.modelMetrics[ModelMetricsBinomial](glmModel, valid)
println(trainMetrics.rmse)
println(validMetrics.rmse)
println(trainMetrics.mse)
println(validMetrics.mse)
println(trainMetrics.r2)
println(validMetrics.r2)
println(trainMetrics.auc)
println(validMetrics.auc)

// Prediction
addFiles(sc, "/Users/avkashchauhan/learn/deepwater/credit_card_predict.csv")
val creditPredictData = new H2OFrame(new File(SparkFiles.get("credit_card_predict.csv")))

val predictionFrame = glmModel.score(creditPredictData)
val predictionResults = asRDD[DoubleHolder](predictionFrame).collect.map(_.result.getOrElse(Double.NaN))

 

That's it, enjoy!