How R2 error is calculated in Generalized Linear Model

What is R2 (R^2 i.e. R-Squared)?

R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression. … 100% indicates that the model explains all the variability of the response data around its mean. (From here)

You can get the full working jupyter notebook for this article from here directly from my Github.

Even when this article explains how R^2 error is calculated for an H2O GLM (Generalized Linear Model) however same math is use for any other statistical model. So you can use this function anywhere you would want to apply.

Lets build an H2O GLM Model first:

import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

h2o.init()

local_url = "https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate.csv"
df = h2o.import_file(local_url)

y = "CAPSULE"
feature_names = df.col_names
feature_names.remove(y)

df_train, df_valid, df_test = df.split_frame(ratios=[0.8,0.1])
print(df_train.shape)
print(df_valid.shape)
print(df_test.shape)

prostate_glm = H2OGeneralizedLinearEstimator(model_id = "prostate_glm")

prostate_glm.train(x = feature_names, y = y, training_frame=df_train, validation_frame=df_valid)
prostate_glm

Now calculate Model Performance based on training, validation and test data:

train_performance = prostate_glm.model_performance(df_train)
valid_performance = prostate_glm.model_performance(df_valid)
test_performance = prostate_glm.model_performance(df_test)

Now lets check the default R^2 metrics for training, validation and test data:

print(train_performance.r2())
print(valid_performance.r2())
print(test_performance.r2())
print(prostate_glm.r2())

Now lets get the prediction for the test data which we kept separate:

predictions = prostate_glm.predict(df_test)

Here is the math which is use to calculate the R2 metric for the test data set:

SSE = ((predictions-df_test[y])**2).sum()
y_hat = df_test[y].mean()
SST = ((df_test[y]-y_hat[0])**2).sum()
1-SSE/SST

Now lets get model performance for given test data as below:

print(test_performance.r2())

Above we can see that both values, one give by model performance for test data and the other we calculated are same.

Thats it, enjoy!!

 

 

 

 

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s