Python example of building GLM, GBM and Random Forest Binomial Models with H2O

Here is an example of using the H2O machine learning library to build GLM, GBM and Distributed Random Forest models for a categorical response variable.

Let's import the h2o library and initialize the H2O machine learning cluster:

import h2o
h2o.init()

Importing the dataset and getting familiar with it:

df = h2o.import_file("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate.csv")
df.summary()
df.col_names

Let's configure our predictor and response variables from the ingested dataset:

y = 'CAPSULE'
x = df.col_names
x.remove(y)
print("Response = " + y)
print("Predictors = " + str(x))

Now we need to set the response column as categorical (a factor):

df['CAPSULE'] = df['CAPSULE'].asfactor()

Now we will check the levels in our response variable:

df['CAPSULE'].levels()
[['0', '1']]

Note: Because there are only 2 levels (values), this is called a binomial model.

Now we will split our dataset into training, validation and testing datasets:

train, valid, test = df.split_frame(ratios=[.8, .1])
print(df.shape)
print(train.shape)
print(valid.shape)
print(test.shape)
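The ratios [.8, .1] mean roughly 80% of rows go to training, 10% to validation, and the remaining ~10% to test. H2O's split_frame assigns each row probabilistically, so the splits are approximate rather than exact. A minimal stdlib-only sketch of that idea (split_frame here is a hypothetical stand-in, not H2O's implementation):

```python
import random

def split_frame(rows, ratios, seed=42):
    """Roughly mimic H2O's probabilistic split: each row is assigned to a
    split according to the given ratios; the remainder goes to the last split."""
    random.seed(seed)
    bounds = []
    acc = 0.0
    for r in ratios:
        acc += r
        bounds.append(acc)
    splits = [[] for _ in range(len(ratios) + 1)]
    for row in rows:
        u = random.random()
        for i, b in enumerate(bounds):
            if u < b:
                splits[i].append(row)
                break
        else:
            splits[-1].append(row)  # fell past all bounds -> last split
    return splits

train, valid, test = split_frame(list(range(1000)), [.8, .1])
print(len(train), len(valid), len(test))  # roughly 800 / 100 / 100
```

This is why train.shape, valid.shape and test.shape sum to df.shape in rows but the individual counts vary slightly from run to run unless a seed is set.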

Let's build the Generalized Linear Model first (logistic regression, since the response variable is categorical):

from h2o.estimators.glm import H2OGeneralizedLinearEstimator
glm_logistic = H2OGeneralizedLinearEstimator(family = "binomial")
glm_logistic.train(x=x, y= y, training_frame=train, validation_frame=valid, 
 model_id="glm_logistic")

Now we will take a look at a few model metrics:

glm_logistic.varimp()
Warning: This model doesn't have variable importances

Let's have a look at the model coefficients:

glm_logistic.coef()
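coef() returns a mapping from column name to fitted coefficient. In logistic regression each coefficient is a change in the log-odds of the response per unit increase in that predictor, so exponentiating it gives an odds ratio. A quick sketch using made-up coefficient values (the names and numbers below are illustrative, not the actual fitted output):

```python
import math

# Hypothetical coefficients, shaped like what glm_logistic.coef() returns
coef = {"Intercept": -1.10, "PSA": 0.03, "GLEASON": 0.95}

# exp(coefficient) = multiplicative change in odds per unit increase
odds_ratios = {k: math.exp(v) for k, v in coef.items() if k != "Intercept"}
print(odds_ratios)
```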

Let's perform prediction using the testing dataset:

glm_logistic.predict(test_data=test)

Now we check the model performance metric "rmse" on the testing and other datasets:

print(glm_logistic.model_performance(test_data=test).rmse())
print(glm_logistic.model_performance(test_data=valid).rmse())
print(glm_logistic.model_performance(test_data=train).rmse())
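For a binomial model, RMSE is computed on the predicted class-1 probabilities against the 0/1 labels. A minimal sketch of the formula, with made-up labels and probabilities:

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: sqrt of the mean squared difference
    between labels and predicted probabilities."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

print(rmse([0, 1, 1, 0], [0.1, 0.9, 0.8, 0.3]))  # ~0.1936
```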

Now we check the model performance metric "r2" on the testing and other datasets:

print(glm_logistic.model_performance(test_data=test).r2())
print(glm_logistic.model_performance(test_data=valid).r2())
print(glm_logistic.model_performance(test_data=train).r2())
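R² (the coefficient of determination) measures how much of the variance in the labels the predictions explain: 1 minus the ratio of squared prediction error to total variance. A minimal sketch with the same made-up labels and probabilities as above:

```python
def r2(actual, predicted):
    """Coefficient of determination: 1 - SSE/SST."""
    mean_a = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean_a) ** 2 for a in actual)
    return 1 - sse / sst

print(r2([0, 1, 1, 0], [0.1, 0.9, 0.8, 0.3]))  # 0.85
```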

Let's build a Gradient Boosting Machine (GBM) model now:

from h2o.estimators.gbm import H2OGradientBoostingEstimator
gbm = H2OGradientBoostingEstimator()
gbm.train(x=x, y =y, training_frame=train, validation_frame=valid)

Now let's get to know our model metrics, starting with the confusion matrix:

gbm.confusion_matrix()
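The confusion matrix cross-tabulates actual versus predicted classes; for a binomial model that means counts of true negatives, false positives, false negatives and true positives. A minimal stdlib sketch of the underlying counting, using made-up labels:

```python
def confusion_matrix(actual, predicted):
    """2x2 counts for binary labels: rows = actual class, cols = predicted,
    i.e. [[TN, FP], [FN, TP]]."""
    m = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        m[a][p] += 1
    return m

print(confusion_matrix([0, 0, 1, 1, 1], [0, 1, 1, 1, 0]))  # [[1, 1], [1, 2]]
```

Note that H2O's confusion_matrix() additionally picks a probability threshold (by default the one maximizing F1) before counting.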

Now have a look at the variable importance plot:

gbm.varimp_plot()

Now have a look at the variable importance table:

gbm.varimp()

Let's build a Distributed Random Forest model:

from h2o.estimators.random_forest import H2ORandomForestEstimator
drf = H2ORandomForestEstimator()
drf.train(x=x, y = y, training_frame=train, validation_frame=valid)

Let's understand the random forest model metrics, starting with the confusion matrix:

drf.confusion_matrix()

We can also have a look at the gains and lift table:

drf.gains_lift()
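A gains/lift table ranks rows by predicted score and asks, slice by slice, how concentrated the positive responses are at the top: lift is the response rate in a top slice divided by the overall response rate. A minimal sketch of that computation, with made-up labels and scores (lift_at is a hypothetical helper, not an H2O function):

```python
def lift_at(actual, scores, fraction):
    """Lift in the top `fraction` of rows ranked by predicted score:
    response rate in that slice divided by the overall response rate."""
    ranked = [a for _, a in sorted(zip(scores, actual), reverse=True)]
    k = max(1, int(len(ranked) * fraction))
    top_rate = sum(ranked[:k]) / k
    overall = sum(actual) / len(actual)
    return top_rate / overall

actual = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
print(lift_at(actual, scores, 0.2))  # 0.5 / 0.3 ≈ 1.667
```

A lift above 1 in the top slices means the model is ranking positives ahead of negatives better than random.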

Note:

  • The same model metrics shown above are available for the other model types where applicable.
  • We can also get model performance based on the training, validation and testing data for all models.

That's it, enjoy!
