How R2 error is calculated in Generalized Linear Model

What is R2 (R^2 i.e. R-Squared)?

R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression. … 100% indicates that the model explains all the variability of the response data around its mean. (From here)

You can get the full working jupyter notebook for this article from here directly from my Github.

Even when this article explains how R^2 error is calculated for an H2O GLM (Generalized Linear Model) however same math is use for any other statistical model. So you can use this function anywhere you would want to apply.

Lets build an H2O GLM Model first:

import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator


local_url = ""
df = h2o.import_file(local_url)

feature_names = df.col_names

df_train, df_valid, df_test = df.split_frame(ratios=[0.8,0.1])

prostate_glm = H2OGeneralizedLinearEstimator(model_id = "prostate_glm")

prostate_glm.train(x = feature_names, y = y, training_frame=df_train, validation_frame=df_valid)

Now calculate Model Performance based on training, validation and test data:

train_performance = prostate_glm.model_performance(df_train)
valid_performance = prostate_glm.model_performance(df_valid)
test_performance = prostate_glm.model_performance(df_test)

Now lets check the default R^2 metrics for training, validation and test data:


Now lets get the prediction for the test data which we kept separate:

predictions = prostate_glm.predict(df_test)

Here is the math which is use to calculate the R2 metric for the test data set:

SSE = ((predictions-df_test[y])**2).sum()
y_hat = df_test[y].mean()
SST = ((df_test[y]-y_hat[0])**2).sum()

Now lets get model performance for given test data as below:


Above we can see that both values, one give by model performance for test data and the other we calculated are same.

Thats it, enjoy!!






Python example of building GLM, GBM and Random Forest Binomial Model with H2O

Here is an example of using H2O machine learning library and then building GLM, GBM and Distributed Random Forest models for categorical response variable.

Lets import h2o library and initialize the H2O machine learning cluster:

import h2o

Importing dataset and getting familiar with it:

df = h2o.import_file("")

Lets configure our predictors and response variables from the ingested dataset:

x = df.col_names
print("Response = " + y)
print("Pridictors = " + str(x))

Now we need to set the response column as categorical or factor:

df['CAPSULE'] = df['CAPSULE'].asfactor()

Now we will the levels in our response variable:


[['0', '1']]

Note: Because there are only 2 levels or values, the model will be called Binomial model.

Now we will split our dataset into training, validation and testing datasets:

train, valid, test = df.split_frame(ratios=[.8, .1])

Lets build Generalized Linear Regression (Logistic – response variable is categorical) model first:

from h2o.estimators.glm import H2OGeneralizedLinearEstimator
glm_logistic = H2OGeneralizedLinearEstimator(family = "binomial")
glm_logistic.train(x=x, y= y, training_frame=train, validation_frame=valid, 

Now we will take a look at few model metrics:

Warning: This model doesn't have variable importances

Lets have a look at model coefficients:


Lets perform the prediction using the testing dataset:


Now we are checking the model performance metrics “rmse” based on testing and other datasets:


Now we are checking the model performance metrics “r2” based on testing and other datasets:


Lets build Gradient Boosting Model now:

from h2o.estimators.gbm import H2OGradientBoostingEstimator
gbm = H2OGradientBoostingEstimator()
gbm.train(x=x, y =y, training_frame=train, validation_frame=valid)

Now get to know our model metrics, starting with confusion metrics first:


Now have a look at variable importance plots:


Now have a look at the variable importance table:


Lets build Distributed Random Forest model:

from h2o.estimators.random_forest import H2ORandomForestEstimator
drf = H2ORandomForestEstimator()
drf.train(x=x, y = y, training_frame=train, validation_frame=valid)

lets understand random forest model metrics starting confusion metrics:


We can have a look at gains and lift table also:



  • We can get all model metrics as other model type as applied.
  • We can also get model perform based on training, validation and testing data for all models.

Thats it, enjoy!!


Using Cross-validation in Scala with H2O and getting each cross-validated model

Here is Scala code for binomial classification with GLM:

To add cross validation you can do the following:

def buildGLMModel(train: Frame, valid: Frame, response: String)
 (implicit h2oContext: H2OContext): GLMModel = {
 import _root_.hex.glm.GLMModel.GLMParameters.Family
 import _root_.hex.glm.GLM
 import _root_.hex.glm.GLMModel.GLMParameters
 val glmParams = new GLMParameters(Family.binomial)
 glmParams._train = train
 glmParams._valid = valid
 glmParams._nfolds = 3  ###### Here is cross-validation ###
 glmParams._response_column = response
 glmParams._alpha = Array[Double](0.5)
 val glm = new GLM(glmParams, Key.make("glmModel.hex"))

To look cross-validated model try this:

scala> glmModel._output._cross_validation_models
res12: Array[water.Key[_ <: water.Keyed[_ <: AnyRef]]] = 
    Array(glmModel.hex_cv_1, glmModel.hex_cv_2, glmModel.hex_cv_3)

Now to get each model do the following:

scala> val m1 = DKV.getGet("glmModel.hex_cv_1").asInstanceOf[GLMModel]

And you will see the following:

scala> val m1 = DKV.getGet("glmModel.hex_cv_1").asInstanceOf[GLMModel]
m1: hex.glm.GLMModel =
Model Metrics Type: BinomialGLM
 Description: N/A
 model id: glmModel.hex_cv_1
 frame id: glmModel.hex_cv_1_train
 MSE: 0.14714406
 RMSE: 0.38359362
 AUC: 0.7167627
 logloss: 0.4703465
 mean_per_class_error: 0.31526923
 default threshold: 0.27434438467025757
 CM: Confusion Matrix (vertical: actual; across: predicted):
 0 1 Error Rate
 0 10704 1651 0.1336 1,651 / 12,355
 1 1768 1790 0.4969 1,768 / 3,558
Totals 12472 3441 0.2149 3,419 / 15,913
Gains/Lift Table (Avg response rate: 22.36 %):
 Group Cumulative Data Fraction Lower Threshold Lift Cumulative Lift Response Rate Cumulative Response Rate Capture Rate Cumulative Capture Rate Gain Cumulative Gain
 1 0.01005467 0....
scala> val m2 = DKV.getGet("glmModel.hex_cv_2").asInstanceOf[GLMModel]
m2: hex.glm.GLMModel =
Model Metrics Type: BinomialGLM
 Description: N/A
 model id: glmModel.hex_cv_2
 frame id: glmModel.hex_cv_2_train
 MSE: 0.14598908
 RMSE: 0.38208517
 AUC: 0.7231473
 logloss: 0.46717605
 mean_per_class_error: 0.31456697
 default threshold: 0.29637953639030457
 CM: Confusion Matrix (vertical: actual; across: predicted):
 0 1 Error Rate
 0 11038 1395 0.1122 1,395 / 12,433
 1 1847 1726 0.5169 1,847 / 3,573
Totals 12885 3121 0.2025 3,242 / 16,006
Gains/Lift Table (Avg response rate: 22.32 %):
 Group Cumulative Data Fraction Lower Threshold Lift Cumulative Lift Response Rate Cumulative Response Rate Capture Rate Cumulative Capture Rate Gain Cumulative Gain
 1 0.01005873 0...
scala> val m3 = DKV.getGet("glmModel.hex_cv_3").asInstanceOf[GLMModel]
m3: hex.glm.GLMModel =
Model Metrics Type: BinomialGLM
 Description: N/A
 model id: glmModel.hex_cv_3
 frame id: glmModel.hex_cv_3_train
 MSE: 0.14626761
 RMSE: 0.38244948
 AUC: 0.7239823
 logloss: 0.46873763
 mean_per_class_error: 0.31437498
 default threshold: 0.28522220253944397
 CM: Confusion Matrix (vertical: actual; across: predicted):
 0 1 Error Rate
 0 10982 1490 0.1195 1,490 / 12,472
 1 1838 1771 0.5093 1,838 / 3,609
Totals 12820 3261 0.2070 3,328 / 16,081
Gains/Lift Table (Avg response rate: 22.44 %):
 Group Cumulative Data Fraction Lower Threshold Lift Cumulative Lift Response Rate Cumulative Response Rate Capture Rate Cumulative Capture Rate Gain Cumulative Gain
 1 0.01001182 0...

Thats it, enjoy!!


How to regularize intercept in GLM

Sometime you may want to emulate hierarchical modeling to achieve your objective you can use beta_constraints as below:
iris = h2o.import_file("")
bc = h2o.H2OFrame([("Intercept",-1000,1000,3,30)], column_names=["names","lower_bounds","upper_bounds","beta_given","rho"])
glm = H2OGeneralizedLinearEstimator(family = "gaussian", 
The output will look like as below:
{u'Intercept': 3.000933645168297,
 u'class.Iris-setosa': 0.0,
 u'class.Iris-versicolor': 0.0,
 u'class.Iris-virginica': 0.0,
 u'petal_len': 0.4423526962303227,
 u'petal_wid': 0.0,
 u'sepal_wid': 0.37712042938039897}
There’s more information in the GLM booklet linked below, but the short version is to create a new constraints frame with the columns: names, lower_bounds, upper_bounds, beta_given, & rho, and have a row entry per feature you want to constrain. You can use “Intercept” as a keyword to constraint the intercept.
names: (mandatory) coefficient names
ˆ lower bounds: (optional) coefficient lower bounds , must be less thanor equal to upper bounds
ˆ upper bounds: (optional) coefficient upper bounds , must be greaterthan or equal to lower bounds
ˆ beta given: (optional) specifies the given solution in proximal operatorinterface
ˆ rho (mandatory if beta given is specified, otherwise ignored): specifiesper-column L2 penalties on the distance from the given solution
If you want to go deeper to learn how these L1/L2 parameters works, here are more details:
What’s happening is an L2 penalty is being applied between the coeffecient & given. The proximal penalty is computed: Σ(x-x’)*rho. You can confirm this by setting rho to be whatever lambda may be, and set let lambda to 0. This will give the same result as having set lambda to that value.
You can use beta constraints to assign per-feature regularization strength
but only for l2 penalty. The math is explained here:
sum_i rho[i] * L2norm2(beta[i]-betagiven[i])
So if you set beta given to zero, and say all rho except for the intercept to 1e-5
then it is equivalent to running without BC, just  with alpha = 0, lambda = 1e-5
Thats it, enjoy!!

Building high order polynomials with GLM for higher accuracy

Sometimes when building GLM models, you would like to configure GLM to search for higher order polynomial of the features .

The reason you may have to do is that, you may have strong predictors for a model and going for high order polynomial of predictors you will get higher accuracy.

With H2O, you can create higher order polynomials as below:

  • Look for  ‘interactions’ parameter in GLM model.
  • In the interaction parameters add  list of predictor columns to interact.
When model will be build, all pairwise combinations will be computed for this list. Following is a working sample:
boston = h2o.import_file("")
predictors = boston.columns[:-1]
response = "medv"
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
interactions_list = ['crim', 'dis']
boston_glm = H2OGeneralizedLinearEstimator(interactions = interactions_list)
boston_glm.train(x = predictors, y = response,training_frame = boston)
To explore interactions among categorical variables please do the following:
Thats all, enjoy!!

Binomial classification example in Scala and GLM with H2O

Here is a sample for binomial classification problem using H2O GLM algorithm using Credit Card data set in Scala language.

The following sample is for multinomial classification problem. This sample is created using Spark 2.1.0 with Sparkling Water 2.1.4.

import org.apache.spark.h2o._
import org.apache.spark.SparkFiles
import{H2OFrameSupport, SparkContextSupport, ModelMetricsSupport}
import water.Key
import _root_.hex.glm.GLMModel
import _root_.hex.ModelMetricsBinomial

val hc = H2OContext.getOrCreate(sc)
import hc._
import hc.implicits._

addFiles(sc, "/Users/avkashchauhan/learn/deepwater/credit_card_clients.csv")
val creditCardData = new H2OFrame(new File(SparkFiles.get("credit_card_clients.csv")))

val ratios = Array[Double](0.8)
val keys = Array[String]("train.hex", "valid.hex")
val frs = H2OFrameSupport.split(creditCardData, keys, ratios)
val (train, valid) = (frs(0), frs(1))

def buildGLMModel(train: Frame, valid: Frame, response: String)
 (implicit h2oContext: H2OContext): GLMModel = {
 import _root_.hex.glm.GLMModel.GLMParameters.Family
 import _root_.hex.glm.GLM
 import _root_.hex.glm.GLMModel.GLMParameters
 val glmParams = new GLMParameters(Family.binomial)
 glmParams._train = train
 glmParams._valid = valid
 glmParams._response_column = response
 glmParams._alpha = Array[Double](0.5)
 val glm = new GLM(glmParams, Key.make("glmModel.hex"))
 //val glmModel = glm.trainModel().get()

val glmModel = buildGLMModel(train, valid, 'default_payment_next_month)(hc)

// Collect model metrics and evaluate model quality
val trainMetrics = ModelMetricsSupport.modelMetrics[ModelMetricsBinomial](glmModel, train)
val validMetrics = ModelMetricsSupport.modelMetrics[ModelMetricsBinomial](glmModel, valid)

// Prediction
addFiles(sc, "/Users/avkashchauhan/learn/deepwater/credit_card_predict.csv")
val creditPredictData = new H2OFrame(new File(SparkFiles.get("credit_card_predict.csv")))

val predictionFrame = glmModel.score(creditPredictData)
var predictonResults = asRDD[DoubleHolder](predictionFrame)


Thats it, enjoy!!

Cross-validation example with time-series data in R and H2O

What is Cross-validation: In k-fold crossvalidation, the original sample is randomly partitioned into k equal sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. learn more at wiki..

When you have time-series data splitting data randomly from random rows does not work because the time part of your data will be mangled so doing cross-validation with time series dataset is done differently.

The following R code script show how it is split first and the passed as validation frame into different algorithms in H2O.


h2o.init(strict_version_check = FALSE)

# show general information on the airquality dataset



print(paste(‘number of months:’,length(unique(airquality$Month)), sep=“”))

# add a year column, so you can create a month, day, year date stamp

airquality$Year <- rep(2017,nrow(airquality))

airquality$Date <- as.Date(with(airquality, paste(Year, Month, Day,sep=“-“)), “%Y-%m-%d”)

# sort the dataset

airquality <- airquality[order(as.Date(airquality$Date, format=“%m/%d/%Y”)),]

# convert the dataset to unix time before converting to an H2OFrame

airquality$Date <- as.numeric(as.POSIXct(airquality$Date, origin=“1970-01-01”, tz = “GMT”))

# convert to an h2o dataframe

air_h2o <- as.h2o(airquality)

# specify the features and the target column

target <- ‘Ozone’

features <- c(“Solar.R”, “Wind”, “Temp”,  “Month”, “Day”, “Date”)

# split dataset in ~half which if you round up is 77 rows (train on the first half of the dataset)

train_1 <- air_h2o[1:ceiling(dim(air_h2o)[1]/2),]

# calculate 14 days in unix time: one day is 86400 seconds in unix time (aka posix time, epoch time)

# use this variable to iterate forward 12 days

add_14_days <- 86400*14

# initialize a counter for the while loop so you can keep track of which fold corresponds to which rmse

counter <- 0

# iterate over the process of testing on the next two weeks

# combine the train_1 and test_1 datasets after each loop

while (dim(train_1)[1] < dim(air_h2o)[1]){

    # get new dataset two weeks out

    # take the last date in Date column and add 14 days to i

    new_end_date <- train_1[nrow(train_1),]$Date + add_14_days

    last_current_date <- train_1[nrow(train_1),]$Date


    # slice with a boolean mask

    mask <- air_h2o[,“Date”] > last_current_date

    temp_df <- air_h2o[mask,]

    mask_2 <- air_h2o[,“Date”] < new_end_date


    # multiply the mask dataframes to get the intersection

    final_mask <- mask*mask_2

    test_1 <- air_h2o[final_mask,]


    # build a basic gbm using the default parameters

    gbm_model <- h2o.gbm(x = features, y = target, training_frame = train_1, validation_frame = test_1, seed = 1234)


    # print the number of rows used for the test_1 dataset

    print(paste(‘number of rows used in test set: ‘, dim(test_1)[1], sep=” “))

    print(paste(‘number of rows used in train set: ‘, dim(train_1)[1], sep=” “))

    # print the validation metrics

    rmse_valid <- h2o.rmse(gbm_model, valid=T)

    print(paste(‘your new rmse value on the validation set is: ‘, rmse_valid,‘ for fold #: ‘, counter, sep=“”))


    # create new training frame

    train_1 <- h2o.rbind(train_1,test_1)

    print(paste(‘shape of new training dataset: ‘,dim(train_1)[1],sep=” “))

    counter <<- counter + 1


Thats all, enjoy!!

Building H2O GLM model using Postgresql database and JDBC driver

Note: Before we jump down, make sure you have postgresql is up and running and database is ready to respond your queries. Check you queries return results as records and are not null.

Download JDBC Driver 42.0.0 JDBC 4:

Note: I have tested H2O with above JDBC driver 4.0 (Build 42.0.0) and Postgresql 9.2.x

In the following test I am connection to DVD Rental DB which is available into Postgresql. Need help to get it working.. visit Here and Here.

Test R (RStudio) for the postgresql connection working:

# Install package if you don't have it
> install.packages("RPostgreSQL")

# User package RPostgreSQL 
> library(RPostgreSQL)

# Code to test database and table:
> drv <- dbDriver("PostgreSQL")
> con <- dbConnect(drv, dbname = "dvdrentaldb", host = "localhost", port = 5432,
> user = "avkash", password = "avkash")
> dbExistsTable(con, "actor")

Start H2O with JDBC driver:

$ java -cp postgresql-42.0.0.jre6.jar:h2o.jar water.H2OApp


  • You must have h2o.jar and postgresql-42.0.0.jre6.jar in the same folder as above.
  • You must start h2o first and then connect to running instance of H2O from R as below.
  • I am connecting to a table name payment below
  • I am using table payment to run H2O GLM model

Connecting H2O from R:

> library(h2o)
> h2o.init()
> h2o.init(strict_version_check = FALSE)
> payment = h2o.import_sql_table(connection_url = “jdbc:postgresql://localhost:5432/h2odb?&useSSL=false”, table= “payment”, username = “avkash”, password = “avkash”)
> aa = names(payment)[-5]
> payment_glm = h2o.glm(x = aa, y = “amount”, training_frame = payment)
> payment_glm

Here is the full code snippet in working:


payment = h2o.import_sql_table(connection_url = “jdbc:postgresql://localhost:5432/h2odb?&useSSL=false”, table= “payment”, username = “avkash”, password = “avkash”)
|=============================================| 100%
> payment
payment_id customer_id staff_id rental_id amount payment_date
1 17503 341 2 1520 7.99 1.171607e+12
2 17504 341 1 1778 1.99 1.171675e+12
3 17505 341 1 1849 7.99 1.171695e+12
4 17506 341 2 2829 2.99 1.171943e+12
5 17507 341 2 3130 7.99 1.172022e+12
6 17508 341 1 3382 5.99 1.172090e+12

[14596 rows x 6 columns]
> aa = names(payment)[-5]
> payment_glm = h2o.glm(x = aa, y = “amount”, training_frame = payment)
|=============================================| 100%
> payment_glm
Model Details:

H2ORegressionModel: glm
Model ID: GLM_model_R_1490053774745_2
GLM Model: summary
family link regularization number_of_predictors_total number_of_active_predictors
1 gaussian identity Elastic Net (alpha = 0.5, lambda = 1.038E-4 ) 5 5
number_of_iterations training_frame
1 0 payment_sql_to_hex

Coefficients: glm coefficients
names coefficients standardized_coefficients
1 Intercept -10.739680 4.200606
2 payment_id -0.000009 -0.038040
3 customer_id 0.000139 0.024262
4 staff_id 0.103740 0.051872
5 rental_id 0.000001 0.003172
6 payment_date 0.000000 0.026343

H2ORegressionMetrics: glm
** Reported on training data. **

MSE: 5.607411
RMSE: 2.367997
MAE: 1.950123
RMSLE: 0.5182649
Mean Residual Deviance : 5.607411
R^2 : 0.0007319098
Null Deviance :81905.72
Null D.o.F. :14595
Residual Deviance :81845.77
Residual D.o.F. :14590
AIC :66600.46


Thats all, enjoy!!


Getting p-values from GLM model in python

Currently there is no way to get p-value from GLM fitted model in Python, it does work in R.

>>> import numpy as np
>>> df1 = h2o.H2OFrame.from_python(np.random.randn(100,4).tolist(), column_names=list('ABCD'))

Now try the following:

>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> glmfitter3 = H2OGeneralizedLinearEstimator(family="gaussian", solver = "IRLSM", alpha=0, lambda_=0,... compute_p_values=True )
>>> glmfitter3.train(x=['A','B'],y="C",training_frame=df1 )glm Model Build progress: |██ | 100%

Now lets get Model Details:

>> print(glmfitter3)

Model Details

H2OGeneralizedLinearEstimator : Generalized Linear Modeling
Model Key: GLM_model_python_1473895693010_1
GLM Model: summary
family link regularization number_of_predictors_total number_of_active_predictors number_of_iterations training_frame
– -------- -------- ---------------- ---------------------------- ----------------------------- ---------------------- ------------------------------------------------------
gaussian identity None 2 2 0 Key_Frame__upload_bc7ed024599c9c807ffee0ab12f8457a.hex
ModelMetricsRegressionGLM: glm
Reported on train data. **
MSE: 0.965343913566
RMSE: 0.982519167022
MAE: 0.76322016906
R^2: 0.00763659012861
Mean Residual Deviance: 0.965343913566
Null degrees of freedom: 99
Residual degrees of freedom: 97
Null deviance: 97.2772579041
Residual deviance: 96.5343913566
AIC: 288.260621239
Scoring History:
timestamp duration iteration negative_log_likelihood objective
– ------------------- ---------- ----------- ------------------------- -----------
2016-09-14 16:32:24 0.000 sec 0 97.2773 0.972773


Now we can print the coefficient as below:

>>> print(glmfitter3.coef())

{u'A': -0.08288242123163249, u'B': 0.027858667912495073, u'Intercept': 0.012225954789000987}>>>

Parsing the JSON does get p-values however it would be great to have p-values from some function.

There is not direct method to get p-values from GLM however you can access model JSON to get those values as below:

Once you have you GLM fitted model i.e. glmfitter3, you can get the model JSON as below:

>>> glmfitter3._model_json

Within the model JSON you can look for ‘output’ values as below:

>>> glmfitter3._model_json['output']

Now if you look for ‘coefficients_table’ you will get your p-values as below:

>>> glmfitter3._model_json['output']['coefficients_table']

Coefficients: glm coefficients
names coefficients std_error z_value p_value standardized_coefficients

Intercept 0.012226 0.0820524 0.149002 0.881862 0.0148644
A -0.0828824 0.0970972 -0.853603 0.395428 -0.0868894
B 0.0278587 0.0999033 0.278856 0.780949 0.0283852