# Calculating AUC and GINI model metrics for logistic classification

For logistic classification problems we use the AUC metric to check the model performance. Higher is better; as a rule of thumb, any value above 0.8 (80%) is considered good and a value over 0.9 (90%) means the model is performing very well.

AUC is an abbreviation for Area Under the Curve. It is used in classification analysis to determine which of the candidate models predicts the classes best. It is most commonly applied to ROC curves, where the true positive rate is plotted against the false positive rate. You can learn more about AUC in this QUORA discussion.
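One way to read the AUC value is as the share of (positive, negative) pairs in which the positive example receives the higher score. The block below is a minimal sketch of that interpretation in plain Python; the function name `pairwise_auc` and the toy labels and scores are hypothetical and are not part of H2O or scikit-learn.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: 1 = positive class, 0 = negative class
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(labels, scores))
```

This loop is quadratic in the number of rows, so it is only meant to illustrate the idea on small inputs.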

We will also look at the GINI metric, which you can learn about from the wiki. In this example we will learn how the AUC and GINI model metrics are calculated using the True Positive Rate (TPR) and False Positive Rate (FPR) values from a given test dataset.

You can get the full working Jupyter Notebook here from my Github.

Let's build a logistic classification model in H2O using the prostate data set:

### Preparation of H2O environment and dataset:

```## Importing required libraries
import h2o
import sys
import pandas as pd

## Starting H2O machine learning cluster
h2o.init()

## Importing dataset
local_url = "https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate/prostate.csv"
df = h2o.import_file(local_url)

## defining features and response column
y = "CAPSULE"
feature_names = df.col_names
feature_names.remove(y)

## setting our response column to categorical so the model treats this as a classification problem
df[y] = df[y].asfactor()```

### Now we will be splitting the dataset into 3 sets for training, validation and test:

```df_train, df_valid, df_test = df.split_frame(ratios=[0.8,0.1])
print(df_train.shape)
print(df_valid.shape)
print(df_test.shape)```

### Setting  H2O GBM Estimator and building GBM Model:

```from h2o.estimators.gbm import H2OGradientBoostingEstimator

prostate_gbm = H2OGradientBoostingEstimator(model_id = "prostate_gbm",
    ntrees=500,
    learn_rate=0.001,
    max_depth=10,
    score_each_iteration=True)

## Building H2O GBM Model:
prostate_gbm.train(x = feature_names, y = y, training_frame=df_train, validation_frame=df_valid)

## Understand the H2O GBM Model
prostate_gbm```

### Generating model performance with training, validation & test datasets:

```train_performance = prostate_gbm.model_performance(df_train)
valid_performance = prostate_gbm.model_performance(df_valid)
test_performance = prostate_gbm.model_performance(df_test)```

### Let’s take a look at the AUC metrics provided by Model performance:

```print(train_performance.auc())
print(valid_performance.auc())
print(test_performance.auc())
print(prostate_gbm.auc())```

### Let’s take a look at the GINI metrics provided by Model performance:

```print(train_performance.gini())
print(valid_performance.gini())
print(test_performance.gini())
print(prostate_gbm.gini())```

### Let's generate the predictions using the test dataset:

```predictions = prostate_gbm.predict(df_test)
## Here we will get the probability for the 'p1' values from the prediction frame:
predict_probability = predictions['p1']```

Now we will import required scikit-learn libraries to generate AUC manually:

```from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
import random```

Let's get the actual response values from the test data frame:

```actual = df_test[y].as_data_frame()
actual_list = actual['CAPSULE'].tolist()
print(actual_list)```

Now let's get the predicted probabilities from the prediction frame:

```predictions_temp = predict_probability['p1'].as_data_frame()
predictions_list = predictions_temp['p1'].tolist()
print(predictions_list)```

### Calculating False Positive Rate and True Positive Rate:

Let's calculate the following metrics from the predictions and the original data frame:
– False Positive Rate (fpr)
– True Positive Rate (tpr)
– Threshold

```fpr, tpr, thresholds = roc_curve(actual_list, predictions_list)
roc_auc = auc(fpr, tpr)
print(roc_auc)
print(test_performance.auc())```

Note: Above you will see that our calculated AUC value is exactly the same as the one given by the model performance for the test dataset.
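To make the calculation less of a black box, here is a minimal plain-Python sketch of the same idea: lower the threshold one distinct score at a time, record the resulting FPR and TPR, and integrate the curve with the trapezoid rule. The helper names `roc_points` and `trapezoid_auc` and the toy data are hypothetical. The sketch assumes 0/1 integer labels with at least one example of each class, and it is not how scikit-learn or H2O implement it internally.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) lists obtained by lowering the threshold step by step."""
    total_pos = sum(1 for y in labels if y == 1)
    total_neg = len(labels) - total_pos
    # Highest scores first: they are predicted positive at the strictest thresholds
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every row that shares this score before emitting a point
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        tpr.append(tp / total_pos)
        fpr.append(fp / total_neg)
    return fpr, tpr


def trapezoid_auc(fpr, tpr):
    """Area under the piecewise-linear curve through the (fpr, tpr) points."""
    area = 0.0
    for k in range(1, len(fpr)):
        area += (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2.0
    return area


# Hypothetical toy data
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_fpr, toy_tpr = roc_points(toy_labels, toy_scores)
print(list(zip(toy_fpr, toy_tpr)))
print(trapezoid_auc(toy_fpr, toy_tpr))
```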

### Let's plot the ROC curve using matplotlib:

```plt.title('ROC (Receiver Operating Characteristic)')
plt.plot(fpr, tpr, 'b',
label='AUC = %0.4f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.2])
plt.ylim([-0.1,1.2])
plt.ylabel('True Positive Rate (TPR)')
plt.xlabel('False Positive Rate (FPR)')
plt.show()```

### This is how GINI metric is calculated from AUC:

```GINI = (2 * roc_auc) - 1
print(GINI)
print(test_performance.gini())```

Note: Above you will see that our calculated GINI value is exactly the same as the one given by the model performance for the test dataset.

That's it, enjoy!!

# Saving H2O model object as text locally

Sometimes you may want to store the H2O model object as text on the local file system. In this example I will show you how you can save the H2O model object to local disk as simple text content. You can get the full working Jupyter notebook for this example here from my Github.

Based on my experience, the following example works fine with Python 2.7.12 and Python 3.4. I also found that the H2O model object tables were not saved to the text file from the Jupyter notebook; however, when I ran the same code from the command line in a Python shell, all the content was written perfectly.

Let's build an H2O GBM model using the public PROSTATE dataset (the following is a full working script which will generate the GBM binomial model):

```import h2o
h2o.init()

local_url = "https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate.csv"
df = h2o.import_file(local_url)

y = "CAPSULE"
feature_names = df.col_names
feature_names.remove(y)
df[y] = df[y].asfactor()

df_train, df_valid = df.split_frame(ratios=[0.9])
print(df_train.shape)
print(df_valid.shape)

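from h2o.estimators.gbm import H2OGradientBoostingEstimator

prostate_gbm = H2OGradientBoostingEstimator(model_id = "prostate_gbm",
    ntrees=1000,
    learn_rate=0.5,
    max_depth=20,
    stopping_tolerance=0.001,
    stopping_rounds=2,
    score_each_iteration=True)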

prostate_gbm.train(x = feature_names, y = y, training_frame=df_train, validation_frame=df_valid)
prostate_gbm```

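Now we will create the local text file and redirect the standard output to it as below:

```import sys
f = open('/Users/avkashchauhan/Downloads/model_output.txt', 'w')
old_target = sys.stdout
sys.stdout = f```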

Lets see the content of the local file we have just created in the above step (It is empty):

`!cat /Users/avkashchauhan/Downloads/model_output.txt`

Now we will launch the following commands which will fill the standard output buffer with the model details as text:

```print("Model summary>>> model_object.show()")
prostate_gbm.show()```

Now we will restore the standard output and close the file, which pushes the buffered output to the text file created locally:

`sys.stdout = old_target; f.close()`

Now we will check back the local file contents and this time you will see that the output of above command is written into the file:

`!cat /Users/avkashchauhan/Downloads/model_output.txt`

You will see the command output stored into the local text file as below:

```Model summary>>> model_object.show()
Model Details
=============
Model Key:  prostate_gbm

ModelMetricsBinomial: gbm
** Reported on train data. **

MSE: 0.036289343297
RMSE: 0.190497620187
LogLoss: 0.170007804527
Mean Per-Class Error: 0.0160045361428
AUC: 0.998865964296
Gini: 0.997731928592
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.487417363665:
Maximum Metrics: Maximum metrics at their respective thresholds

Gains/Lift Table: Avg response rate: 40.36 %

ModelMetricsBinomial: gbm
** Reported on validation data. **

MSE: 0.161786079676
RMSE: 0.402226403505
LogLoss: 0.483923658542
Mean Per-Class Error: 0.174208144796
AUC: 0.871040723982
Gini: 0.742081447964
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.205076283533:
Maximum Metrics: Maximum metrics at their respective thresholds

Gains/Lift Table: Avg response rate: 39.53 %

Scoring History:
Variable Importances:```
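The AUC and Gini values in the output above are tied together by Gini = 2 × AUC − 1. The short check below copies the reported training and validation numbers from that output and confirms the relationship; `gini_from_auc` is a hypothetical helper name, not an H2O function.

```python
def gini_from_auc(auc_value):
    """Gini coefficient derived from AUC: Gini = 2 * AUC - 1."""
    return 2.0 * auc_value - 1.0


# (AUC, Gini) pairs copied from the model summary above
reported = {
    "train": (0.998865964296, 0.997731928592),
    "validation": (0.871040723982, 0.742081447964),
}

for name, (auc_value, gini_value) in reported.items():
    print(name, gini_from_auc(auc_value), gini_value)
```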

Note: If you are wondering what the “!” sign does here, it is used to run a Linux shell command (in this case “cat”) inside a Jupyter cell.
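On Python 3.4 and later, the same redirect-and-restore pattern can also be written with `contextlib.redirect_stdout`, which restores the standard output automatically. The following is a minimal sketch rather than the method used in this post: `save_output_as_text` and `fake_show` are hypothetical names, `fake_show` stands in for `prostate_gbm.show`, and the output goes to a temporary file so that the example runs without H2O.

```python
import contextlib
import os
import tempfile


def save_output_as_text(show_function, file_path):
    """Call show_function() and write everything it prints to file_path."""
    with open(file_path, "w") as f:
        # sys.stdout points at f only inside this block and is restored afterwards
        with contextlib.redirect_stdout(f):
            show_function()


def fake_show():
    # Stand-in for a method that prints a model summary, e.g. prostate_gbm.show
    print("Model Details")
    print("=============")


output_path = os.path.join(tempfile.mkdtemp(), "model_output_sketch.txt")
save_output_as_text(fake_show, output_path)

with open(output_path) as f:
    saved_text = f.read()
print(saved_text)
```

Whether every H2O table is captured this way inside a Jupyter notebook is not something this sketch verifies.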

That's it, enjoy!!

# Python example of building GLM, GBM and Random Forest Binomial Model with H2O

Here is an example of using the H2O machine learning library to build GLM, GBM and Distributed Random Forest models for a categorical response variable.

Let's import the h2o library and initialize the H2O machine learning cluster:

```import h2o
h2o.init()```

Importing dataset and getting familiar with it:

```df = h2o.import_file("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate.csv")
df.summary()
df.col_names```

Lets configure our predictors and response variables from the ingested dataset:

```y = 'CAPSULE'
x = df.col_names
x.remove(y)
print("Response = " + y)
print("Predictors = " + str(x))```

Now we need to set the response column as categorical or factor:

`df['CAPSULE'] = df['CAPSULE'].asfactor()`

Now we will look at the levels in our response variable:

```df['CAPSULE'].levels()
[['0', '1']]```

Note: Because there are only 2 levels or values, the model will be called Binomial model.

Now we will split our dataset into training, validation and testing datasets:

```train, valid, test = df.split_frame(ratios=[.8, .1])
print(df.shape)
print(train.shape)
print(valid.shape)
print(test.shape)```
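With `ratios=[.8, .1]` there are three resulting frames, and the last one receives the remaining share of the rows (about 10% here). To my knowledge H2O's splitting is probabilistic, so the printed shapes are usually close to, but not exactly, 80/10/10. The block below is a plain-Python sketch of that kind of ratio-based random split; `split_rows` is a hypothetical helper and does not reproduce H2O's actual implementation.

```python
import random


def split_rows(rows, ratios, seed=None):
    """Randomly assign each row to one of len(ratios) + 1 splits.

    Every row is drawn independently, so split sizes are only approximately
    proportional to the ratios; the last split receives the remainder.
    """
    if sum(ratios) >= 1.0:
        raise ValueError("ratios must sum to less than 1.0")
    rng = random.Random(seed)
    # Cumulative boundaries, e.g. roughly [0.8, 0.9] for ratios [0.8, 0.1]
    boundaries = []
    running = 0.0
    for r in ratios:
        running += r
        boundaries.append(running)
    splits = [[] for _ in range(len(ratios) + 1)]
    for row in rows:
        draw = rng.random()
        index = len(ratios)  # remainder split unless a boundary is hit first
        for k, boundary in enumerate(boundaries):
            if draw < boundary:
                index = k
                break
        splits[index].append(row)
    return splits


# Hypothetical row ids standing in for a small dataset
all_rows = list(range(380))
train_rows, valid_rows, test_rows = split_rows(all_rows, [0.8, 0.1], seed=42)
print(len(train_rows), len(valid_rows), len(test_rows))
```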

Let's build a Generalized Linear Model first (logistic regression, since the response variable is categorical):

```from h2o.estimators.glm import H2OGeneralizedLinearEstimator
glm_logistic = H2OGeneralizedLinearEstimator(family = "binomial")
glm_logistic.train(x=x, y= y, training_frame=train, validation_frame=valid,
model_id="glm_logistic")```

Now we will take a look at few model metrics:

```glm_logistic.varimp()
Warning: This model doesn't have variable importances```

Lets have a look at model coefficients:

`glm_logistic.coef()`

Lets perform the prediction using the testing dataset:

`glm_logistic.predict(test_data=test)`

Now we are checking the model performance metrics “rmse” based on testing and other datasets:

```print(glm_logistic.model_performance(test_data=test).rmse())
print(glm_logistic.model_performance(test_data=valid).rmse())
print(glm_logistic.model_performance(test_data=train).rmse())```

Now we are checking the model performance metric “r2” based on the testing and other datasets:

```print(glm_logistic.model_performance(test_data=test).r2())
print(glm_logistic.model_performance(test_data=valid).r2())
print(glm_logistic.model_performance(test_data=train).r2())```

Let's build a Gradient Boosting Model now:

```from h2o.estimators.gbm import H2OGradientBoostingEstimator
gbm = H2OGradientBoostingEstimator()
gbm.train(x=x, y =y, training_frame=train, validation_frame=valid)```

Now let's get to know our model metrics, starting with the confusion matrix:

`gbm.confusion_matrix()`
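H2O's binomial model summaries label their confusion matrix as "for max f1 @ threshold", meaning the probabilities are cut at the threshold that gives the best F1 score. The block below is a plain-Python sketch of that idea under simple assumptions (0/1 labels, candidate thresholds taken from the observed probabilities). The helper names and the toy data are hypothetical, and H2O's own threshold search may differ in detail.

```python
def confusion_counts(labels, probabilities, threshold):
    """Return (tp, fp, tn, fn) when predicting 1 for probability >= threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(labels, probabilities):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0 when there are no positives at all."""
    denominator = 2 * tp + fp + fn
    return 2.0 * tp / denominator if denominator else 0.0


def max_f1_threshold(labels, probabilities):
    """Try every observed probability as a threshold and keep the best F1."""
    best_threshold, best_f1 = None, -1.0
    for t in sorted(set(probabilities)):
        tp, fp, tn, fn = confusion_counts(labels, probabilities, t)
        score = f1_score(tp, fp, fn)
        if score > best_f1:
            best_threshold, best_f1 = t, score
    return best_threshold, best_f1


# Hypothetical toy data
cm_labels = [0, 0, 1, 1, 1, 0]
cm_probs = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]
best_t, best_f1 = max_f1_threshold(cm_labels, cm_probs)
print(best_t, best_f1)
print(confusion_counts(cm_labels, cm_probs, best_t))
```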

Now have a look at variable importance plots:

`gbm.varimp_plot()`

Now have a look at the variable importance table:

`gbm.varimp()`

Lets build Distributed Random Forest model:

```from h2o.estimators.random_forest import H2ORandomForestEstimator
drf = H2ORandomForestEstimator()
drf.train(x=x, y = y, training_frame=train, validation_frame=valid)```

Let's understand the random forest model metrics, starting with the confusion matrix:

`drf.confusion_matrix()`

We can have a look at gains and lift table also:

`drf.gains_lift()`

Note:

• The other model metrics shown above can be retrieved in the same way for every model type where they apply.
• We can also get the model performance based on the training, validation and testing data for all models.

That's it, enjoy!!

# Visualizing H2O GBM and Random Forest MOJO Models Trees in python

In this example we will first build a tree-based model using the H2O machine learning library and then save that model as a MOJO. Using the GraphViz/Dot library we will extract individual trees and cross-validated model trees from the MOJO and visualize them. If you are new to H2O MOJO models, learn here.

You can also get full working Ipython Notebook for this example from here.

Lets build the model first using H2O GBM algorithm. You can also use Distributed Random Forest Model as well for tree visualization.

Let’s first import the key Python modules:

```import h2o
import subprocess
from IPython.display import Image```

Now we will be building GBM Model using a public PROSTATE dataset:

```h2o.init()
df = h2o.import_file('https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate.csv')
y = 'CAPSULE'
x = df.col_names
x.remove(y)
df[y] = df[y].asfactor()
train, valid, test = df.split_frame(ratios=[.8,.1])
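
## Building the GBM model with cross validation (nfolds=3 is an assumption based on the model name gbm_cv3)
from h2o.estimators.gbm import H2OGradientBoostingEstimator
gbm_cv3 = H2OGradientBoostingEstimator(nfolds=3)
gbm_cv3.train(x=x, y=y, training_frame=train)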

## Getting all cross validated models
all_models = gbm_cv3.cross_validation_models()
print("Total cross validation models: " + str(len(all_models)))```

Now let's set all the default parameters needed to create the tree graph first and then the tree images (in PNG format) on the local disk. Make sure you have a writable path where you can create and save these intermediate files. You also need to provide the path to the latest H2O jar (h2o.jar), which is used to read the MOJO model and print its trees.

```mojo_file_name = "/Users/avkashchauhan/Downloads/my_gbm_mojo.zip"
h2o_jar_path= '/Users/avkashchauhan/tools/h2o-3/h2o-3.14.0.3/h2o.jar'
mojo_full_path = mojo_file_name
## gv_file_path is used by the functions below; this file name is an assumed placeholder
gv_file_path = "/Users/avkashchauhan/Downloads/my_gbm_graph.gv"```

Now let's define the image file name which we will generate from the Tree ID. Based on the Tree ID, the image file will be named my_gbm_tree_ID.png:

`image_file_name = "/Users/avkashchauhan/Downloads/my_gbm_tree"`

Now we will download the GBM MOJO model and save it to disk (for example with `gbm_cv3.download_mojo(path=mojo_full_path)`).

Now let's define the function to generate the GraphViz tree from the saved MOJO model:

```def generateTree(h2o_jar_path, mojo_full_path, gv_file_path, image_file_path, tree_id = 0):
    image_file_path = image_file_path + "_" + str(tree_id) + ".png"
    ## Run the PrintMojo tool from h2o.jar to write the selected tree as a GraphViz file
    result = subprocess.call(["java", "-cp", h2o_jar_path, "hex.genmodel.tools.PrintMojo", "--tree", str(tree_id), "-i", mojo_full_path , "-o", gv_file_path ], shell=False)
    ## Check that the GraphViz file exists
    result = subprocess.call(["ls",gv_file_path], shell = False)
    if result == 0:
        print("Success: Graphviz file " + gv_file_path + " is generated.")
    else:
        print("Error: Graphviz file " + gv_file_path + " could not be generated.")```

Now let's define the method to generate the tree image as PNG from the saved GraphViz tree:

```def generateTreeImage(gv_file_path, image_file_path, tree_id):
    image_file_path = image_file_path + "_" + str(tree_id) + ".png"
    ## Convert the GraphViz file to a PNG image with the "dot" tool
    result = subprocess.call(["dot", "-Tpng", gv_file_path, "-o", image_file_path], shell=False)
    ## Check that the image file exists
    result = subprocess.call(["ls",image_file_path], shell = False)
    if result == 0:
        print("Success: Image File " + image_file_path + " is generated.")
        print("Now you can execute the follow line as-it-is to see the tree graph:")
        print("Image(filename='" + image_file_path + "\')")
    else:
        print("Error: Image file " + image_file_path + " could not be generated.")```

Note: I had to split the process into 2 steps above because when I put everything into 1 step, the process hung after the GraphViz file was created.
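Both functions follow the same pattern: run an external command, then check that the expected file exists. The block below is a hedged sketch of a reusable helper for that pattern, assuming Python 3.5+ for `subprocess.run`. It checks the command's exit code and uses `os.path.isfile` instead of calling `ls`. The name `run_and_check` is hypothetical, and a small Python one-liner stands in for the `java` and `dot` commands so the example runs without them.

```python
import os
import subprocess
import sys
import tempfile


def run_and_check(command, expected_file):
    """Run command (a list of arguments); return True if it succeeded and created expected_file."""
    completed = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    if completed.returncode != 0:
        print("Error: command failed with exit code " + str(completed.returncode))
        return False
    if not os.path.isfile(expected_file):
        print("Error: " + expected_file + " could not be generated.")
        return False
    print("Success: " + expected_file + " is generated.")
    return True


# Stand-in command so the sketch runs without Java or Graphviz installed
work_dir = tempfile.mkdtemp()
gv_file = os.path.join(work_dir, "tree_0.gv")
stand_in = [sys.executable, "-c",
            "import sys; open(sys.argv[1], 'w').write('digraph {}')", gv_file]
created = run_and_check(stand_in, gv_file)
```

With real tools, the call would look like `run_and_check(["dot", "-Tpng", gv_file_path, "-o", image_file_path], image_file_path)`, using the same arguments as the functions above.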

Now lets generate tree by passing all parameters defined above and proper TREE ID as the last parameter.

```#Just change the tree id in the function below to get which particular tree you want
generateTree(h2o_jar_path, mojo_full_path, gv_file_path, image_file_name, 3)```

Now we will be generating PNG Tree Image from the saved GraphViz content.

```generateTreeImage(gv_file_path, image_file_name, 3)
# Note: If this step hangs, you can look at "dot" active process in osx and try killing it```

Let's visualize the main model tree:

```# Just pass the Tree Image file name depending on your tree
Image(filename=image_file_name + "_3.png")```

Let's visualize the first cross-validation tree (Cross Validation ID 1):

`# Just pass the Tree Image file name depending on your tree`

Let's visualize the second cross-validation tree (Cross Validation ID 2):

`# Just pass the Tree Image file name depending on your tree`

Let's visualize the third cross-validation tree (Cross Validation ID 3):

`# Just pass the Tree Image file name depending on your tree`

After looking at these trees, you can visualize how the decisions are made.

That's it, enjoy!!

# Building Regression and Classification GBM models in Scala with H2O

In the full code below you will learn to build H2O GBM model (Regression and binomial classification) in Scala.

Lets first import all the classes we need for this project:

``````import org.apache.spark.SparkFiles
import org.apache.spark.h2o._
import org.apache.spark.examples.h2o._
import org.apache.spark.sql.{DataFrame, SQLContext}
import water.Key
import java.io.File

import water.support.H2OFrameSupport._

// Create SQL support
implicit val sqlContext = spark.sqlContext
import sqlContext.implicits._
``````

Next we need to start H2O cluster so we can start using H2O APIs:

``````// Start H2O services
val h2oContext = H2OContext.getOrCreate(sc)
import h2oContext._
import h2oContext.implicits._``````

Now we need to ingest the data which we can use to perform modeling:

``````// Import prostate data into H2O
val prostateData = new H2OFrame(new File("/Users/avkashchauhan/src/github.com/h2oai/sparkling-water/examples/smalldata/prostate.csv"))

// Understanding our input data
prostateData.names
prostateData.numCols
prostateData.numRows
prostateData.keys
prostateData.key
``````

Now we will import some H2O specific classes we need to perform our actions:

``````import h2oContext.implicits._
import _root_.hex.tree.gbm.GBM
import _root_.hex.tree.gbm.GBMModel.GBMParameters
``````

Lets setup GBM Parameters which will shape our GBM modeling process:

``````val gbmParams = new GBMParameters()
gbmParams._train = prostateData
gbmParams._response_column = 'CAPSULE``````

In the response column setting above, the column “CAPSULE” is numeric, so by default GBM will build a regression model. Let's start building the GBM model now:

``````val gbm = new GBM(gbmParams,Key.make("gbmRegModel.hex"))
val gbmRegModel = gbm.trainModel.get
// Same as above
val gbmRegModel = gbm.trainModel().get()``````

Lets get to know our GBM Model and we will see that the type of this model is “regression”:

``gbmRegModel``

Lets perform prediction using GBM Regression Model:

``````val predH2OFrame = gbmRegModel.score(prostateData)('predict)
val predFromModel = asRDD[DoubleHolder](predH2OFrame).collect.map(_.result.getOrElse(Double.NaN))
``````

Now we will prepare the input data set for a GBM classification model. Below we set the response column to a categorical type so that all the values in this column become enum levels instead of numbers; this way we make sure that the GBM model we build will be a classification model:

``````prostateData.names()
//
// >>> res6: Array[String] = Array(ID, CAPSULE, AGE, RACE, DPROS, DCAPS, PSA, VOL, GLEASON)
// Based on above the CAPSULE is the id = 1
// Note: If we will not set categorical for response variable we will see the following exception
//        - water.exceptions.H2OModelBuilderIllegalArgumentException:
//             - Illegal argument(s) for GBM model: gbmModel.hex.  Details: ERRR on field: _distribution: Binomial requires the response to be a 2-class categorical

withLockAndUpdate(prostateData){ fr => fr.replace(1, fr.vec("CAPSULE").toCategoricalVec)}

gbmParams._response_column = 'CAPSULE``````

We can also set the distribution to have a specific method. In the code below we are setting distribution to have Bernoulli method:

``````import _root_.hex.genmodel.utils.DistributionFamily
gbmParams._distribution = DistributionFamily.bernoulli
``````

Now let's build our GBM model:

``````val gbm = new GBM(gbmParams,Key.make("gbmBinModel.hex"))
val gbmBinModel = gbm.trainModel.get
// Same as above
val gbmBinModel = gbm.trainModel().get()``````

Let's check the new model and we will find that it is a classification model, specifically a binomial classification model, because its response column has only 2 classes:

``gbmBinModel``

Now lets perform the prediction using our GBM Binomial Classification Model as below:

``````val predH2OFrame = gbmBinModel.score(prostateData)('predict)
val predFromModel = asRDD[DoubleHolder](predH2OFrame).collect.map(_.result.getOrElse(Double.NaN))
``````

That's all, enjoy!!

# Ranking GBM tree based on scoring metrics

Here is the full python code:

```import h2o
import pandas as pd
h2o.init()

## Import data
df = h2o.import_file('/Users/avkashchauhan/airlines_train.csv')
df.shape
df.col_names
y = "IsDepDelayed"
x = df.col_names
x.remove(y)
print(x)

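## Building GBM model (default parameters are assumed here)
from h2o.estimators.gbm import H2OGradientBoostingEstimator
gbm_model = H2OGradientBoostingEstimator()
gbm_model.train(x = x, y = y, training_frame=df)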

## Understanding model
print(gbm_model)
print("Total trees in the model : " + str(gbm_model.default_params['ntrees']))
scoring_hist = gbm_model.scoring_history()
print(scoring_hist.shape)

## Looking scoring history
scoring_hist

## logloss metric in scoring history:
scoring_hist['training_logloss']
### Difference  in logloss metric from scoring for each tree
diff_df = scoring_hist['training_logloss'].diff()
### Ranking Each Tree
diff_df.rank()

## AUC metric in scoring history:
scoring_hist['training_auc']
### Difference in AUC metric from scoring for each tree
diff_df = scoring_hist['training_auc'].diff()
### Ranking Each Tree
diff_df.rank()```
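To see what `diff()` followed by `rank()` produces, here is a plain-Python sketch on a short, made-up logloss history. The helper names `differences` and `average_ranks` are hypothetical. The ranking mimics the pandas default (ascending order, ties share the average rank), so for logloss rank 1 marks the scoring step with the largest drop. Unlike pandas, the sketch drops the first entry instead of keeping it as NaN.

```python
def differences(values):
    """Change from one scoring step to the next (one element shorter than the input)."""
    return [values[i] - values[i - 1] for i in range(1, len(values))]


def average_ranks(values):
    """Ascending ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next value equals the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = shared_rank
        pos = end + 1
    return ranks


# Hypothetical training logloss after each scoring step
training_logloss = [0.69, 0.60, 0.55, 0.53, 0.52]
deltas = differences(training_logloss)
tree_ranks = average_ranks(deltas)
print(deltas)
print(tree_ranks)
```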

Here is the link to ipython notebook with example:

https://github.com/Avkash/mldl/blob/master/notebook/h2o/GBM_Tree_Ranking_based_on_metrics.ipynb

That’s it, enjoy!!

# Building GBM model in R and exporting POJO and MOJO model

Get the dataset:

Training:

Test:

Here is the script to build GBM grid model and export MOJO model:

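```library(h2o)
h2o.init()

# Importing Dataset
# (the import calls are not shown; the code below assumes the training and test
# data are loaded as H2O frames named adult_2013_train and adult_2013_test)

# Display Dataset

# Feature Engineering
actual_log_wagp <- h2o.assign(adult_2013_test[, "LOG_WAGP"], key = "actual_log_wagp")

# Convert the categorical columns to factors (assumed loop body)
for (j in c("COW", "SCHL", "MAR", "INDP", "RELP", "RAC1P", "SEX", "POBP")) {
  adult_2013_train[[j]] <- as.factor(adult_2013_train[[j]])
  adult_2013_test[[j]] <- as.factor(adult_2013_test[[j]])
}
predset <- c("RELP", "SCHL", "COW", "MAR", "INDP", "RAC1P", "SEX", "POBP", "AGEP", "WKHP", "LOG_CAPGAIN", "LOG_CAPLOSS")

# Building GBM Model:
log_wagp_gbm_grid <- h2o.gbm(x = predset,
                             y = "LOG_WAGP",
                             training_frame = adult_2013_train,  # assumed frame name
                             model_id = "GBMModel",
                             distribution = "gaussian",
                             max_depth = 5,
                             ntrees = 110)

log_wagp_gbm_grid

# Prediction```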

You will see GBM_model.java (as POJO Model) and GBM_model.zip (MOJO model) at the location where you will save these models.

That's it, enjoy!

# Using Cross-validation in Scala with H2O and getting each cross-validated model

Here is Scala code for binomial classification with GLM:

https://aichamp.wordpress.com/2017/04/23/binomial-classification-example-in-scala-and-gbm-with-h2o/

To add cross validation you can do the following:

```def buildGLMModel(train: Frame, valid: Frame, response: String)
                 (implicit h2oContext: H2OContext): GLMModel = {
  import _root_.hex.glm.GLMModel.GLMParameters.Family
  import _root_.hex.glm.GLM
  import _root_.hex.glm.GLMModel.GLMParameters
  val glmParams = new GLMParameters(Family.binomial)
  glmParams._train = train
  glmParams._valid = valid
  glmParams._nfolds = 3  // Here is cross-validation
  glmParams._response_column = response
  glmParams._alpha = Array[Double](0.5)
  val glm = new GLM(glmParams, Key.make("glmModel.hex"))
  glm.trainModel().get()
}```

To look at the cross-validated models, try this:

```scala> glmModel._output._cross_validation_models
res12: Array[water.Key[_ <: water.Keyed[_ <: AnyRef]]] =
Array(glmModel.hex_cv_1, glmModel.hex_cv_2, glmModel.hex_cv_3)```

Now to get each model do the following:

`scala> val m1 = DKV.getGet("glmModel.hex_cv_1").asInstanceOf[GLMModel]`

And you will see the following:

```scala> val m1 = DKV.getGet("glmModel.hex_cv_1").asInstanceOf[GLMModel]
m1: hex.glm.GLMModel =
Model Metrics Type: BinomialGLM
Description: N/A
model id: glmModel.hex_cv_1
frame id: glmModel.hex_cv_1_train
MSE: 0.14714406
RMSE: 0.38359362
AUC: 0.7167627
logloss: 0.4703465
mean_per_class_error: 0.31526923
default threshold: 0.27434438467025757
CM: Confusion Matrix (vertical: actual; across: predicted):
0 1 Error Rate
0 10704 1651 0.1336 1,651 / 12,355
1 1768 1790 0.4969 1,768 / 3,558
Totals 12472 3441 0.2149 3,419 / 15,913
Gains/Lift Table (Avg response rate: 22.36 %):
Group Cumulative Data Fraction Lower Threshold Lift Cumulative Lift Response Rate Cumulative Response Rate Capture Rate Cumulative Capture Rate Gain Cumulative Gain
1 0.01005467 0....
scala> val m2 = DKV.getGet("glmModel.hex_cv_2").asInstanceOf[GLMModel]
m2: hex.glm.GLMModel =
Model Metrics Type: BinomialGLM
Description: N/A
model id: glmModel.hex_cv_2
frame id: glmModel.hex_cv_2_train
MSE: 0.14598908
RMSE: 0.38208517
AUC: 0.7231473
logloss: 0.46717605
mean_per_class_error: 0.31456697
default threshold: 0.29637953639030457
CM: Confusion Matrix (vertical: actual; across: predicted):
0 1 Error Rate
0 11038 1395 0.1122 1,395 / 12,433
1 1847 1726 0.5169 1,847 / 3,573
Totals 12885 3121 0.2025 3,242 / 16,006
Gains/Lift Table (Avg response rate: 22.32 %):
Group Cumulative Data Fraction Lower Threshold Lift Cumulative Lift Response Rate Cumulative Response Rate Capture Rate Cumulative Capture Rate Gain Cumulative Gain
1 0.01005873 0...
scala> val m3 = DKV.getGet("glmModel.hex_cv_3").asInstanceOf[GLMModel]
m3: hex.glm.GLMModel =
Model Metrics Type: BinomialGLM
Description: N/A
model id: glmModel.hex_cv_3
frame id: glmModel.hex_cv_3_train
MSE: 0.14626761
RMSE: 0.38244948
AUC: 0.7239823
logloss: 0.46873763
mean_per_class_error: 0.31437498
default threshold: 0.28522220253944397
CM: Confusion Matrix (vertical: actual; across: predicted):
0 1 Error Rate
0 10982 1490 0.1195 1,490 / 12,472
1 1838 1771 0.5093 1,838 / 3,609
Totals 12820 3261 0.2070 3,328 / 16,081
Gains/Lift Table (Avg response rate: 22.44 %):
Group Cumulative Data Fraction Lower Threshold Lift Cumulative Lift Response Rate Cumulative Response Rate Capture Rate Cumulative Capture Rate Gain Cumulative Gain
1 0.01001182 0...
scala>```
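The confusion-matrix totals above (15,913, 16,006 and 16,081 rows) add up to 48,000, which is what you would expect if 24,000 rows were divided into 3 folds and each cross-validated model was trained on the two folds it does not hold out. The Python sketch below illustrates that bookkeeping. It assumes a purely random fold assignment, which may differ from H2O's actual scheme, and the helper names are hypothetical.

```python
import random


def assign_folds(n_rows, n_folds, seed=None):
    """Give every row a random fold id in the range 0 .. n_folds - 1."""
    rng = random.Random(seed)
    return [rng.randrange(n_folds) for _ in range(n_rows)]


def fold_training_sizes(fold_ids, n_folds):
    """For each fold k, count the rows NOT in fold k (the training rows of CV model k)."""
    return [sum(1 for f in fold_ids if f != k) for k in range(n_folds)]


fold_ids = assign_folds(24000, 3, seed=1)
sizes = fold_training_sizes(fold_ids, 3)
print(sizes)
print(sum(sizes))  # every row is used for training by exactly n_folds - 1 models
```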

That's it, enjoy!!

# Generating ROC curve in SCALA from H2O binary classification models

You can use the following blog to build a binomial classification GLM model:
To collect the model metrics for training, use the following:
`val trainMetrics = ModelMetricsSupport.modelMetrics[ModelMetricsBinomial](glmModel, train)`
Now you can access the model AUC (the _auc object) as below:
Note: the _auc object has an array of thresholds, and for each threshold it has the false positive (fps) and true positive (tps) counts
(use tab completion to list them all)
```scala> trainMetrics._auc.
_auc   _gini      _n       _p     _tps      buildCM   defaultCM    defaultThreshold   forCriterion   frozenType   pr_auc   readExternal   reloadFromBytes   tn             tp      writeExternal
_fps   _max_idx   _nBins   _ths   asBytes   clone     defaultErr   fn                 fp             maxF1        read     readJSON       threshold         toJsonString   write   writeJSON```
In the above AUC object:
```_fps  =  false positives
_tps  =  true positives
_ths  =  threshold values
_p    =  actual trues
_n    =  actual false```
Now you can use individual ROC specific values as below to recreate ROC:
```trainMetrics._auc._fps
trainMetrics._auc._tps
trainMetrics._auc._ths```
To print the whole array in the terminal for inspection, you just need the following:
```val dd = trainMetrics._auc._fps
println(dd.mkString(" "))```
You can access the number of actual positives and actual negatives as below, where they are defined as:
```_p    =  actual trues

_n    =  actual false```
```scala> trainMetrics._auc._n
res42: Double = 2979.0

scala> trainMetrics._auc._p
res43: Double = 1711.0```
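To turn these values into ROC points, divide each true positive count by `_p` and each false positive count by `_n`. The sketch below shows the arithmetic in Python, using the `_p` and `_n` values printed above. The short `tps` and `fps` lists are made-up placeholders for the real arrays, and the helper names are hypothetical.

```python
def roc_from_counts(tps, fps, p, n):
    """Convert per-threshold TP/FP counts into (FPR, TPR) points sorted by FPR."""
    return sorted((fp / n, tp / p) for tp, fp in zip(tps, fps))


def area_under(points):
    """Trapezoid-rule area under the sorted (FPR, TPR) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


p, n = 1711.0, 2979.0                  # actual positives / negatives from the output above
tps = [0.0, 900.0, 1400.0, 1711.0]     # hypothetical true positive counts per threshold
fps = [0.0, 300.0, 1200.0, 2979.0]     # hypothetical false positive counts per threshold

points = roc_from_counts(tps, fps, p, n)
print(points)
print(area_under(points))
```

Sorting the points makes the sketch independent of whether the thresholds are stored in ascending or descending order.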
That's it, enjoy!!