# Calculating AUC and GINI model metrics for logistic classification

For logistics classification problem we use AUC metrics to check the model performance. The higher is better however any value above 80% is considered good and over 90% means the model is behaving great.

AUC is an abbreviation for Area Under the Curve. It is used in classification analysis in order to determine which of the used models predicts the classes best. An example of its application are ROC curves. Here, the true positive rates are plotted against false positive rates. You can learn more about AUC in this QUORA discussion.

We will also look for GINI metric which you can learn from wiki.  In this example we will learn how AUC and GINI model metric is calculated using True Positive Results (TPR) and False Positive Results (FPR) values from a given test dataset.

You can get the full working Jupyter Notebook here from my Github.

Lets build a logistic classification model in H2O using the prostate data set:

### Preparation of H2O environment and dataset:

```## Importing required libraries
import h2o
import sys
import pandas as pd

## Starting H2O machine learning cluster
h2o.init()

## Importing dataset
local_url = "https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate/prostate.csv"
df = h2o.import_file(local_url)

## defining feaures and response column
y = "CAPSULE"
feature_names = df.col_names
feature_names.remove(y)

## setting our response column to catagorical so our model classify the problem
df[y] = df[y].asfactor()```

### Now we will be splitting the dataset into 3 sets for training, validation and test:

```df_train, df_valid, df_test = df.split_frame(ratios=[0.8,0.1])
print(df_train.shape)
print(df_valid.shape)
print(df_test.shape)```

### Setting  H2O GBM Estimator and building GBM Model:

```prostate_gbm = H2OGradientBoostingEstimator(model_id = "prostate_gbm",
ntrees=500,
learn_rate=0.001,
max_depth=10,
score_each_iteration=True)

## Building H2O GBM Model:
prostate_gbm.train(x = feature_names, y = y, training_frame=df_train, validation_frame=df_valid)

## Understand the H2O GBM Model
prostate_gbm```

### Generating model performance with training, validation & test datasets:

```train_performance = prostate_gbm.model_performance(df_train)
valid_performance = prostate_gbm.model_performance(df_valid)
test_performance = prostate_gbm.model_performance(df_test)```

### Let’s take a look at the AUC metrics provided by Model performance:

```print(train_performance.auc())
print(valid_performance.auc())
print(test_performance.auc())
print(prostate_gbm.auc())```

### Let’s take a look at the GINI metrics provided by Model performance:

```print(train_performance.gini())
print(valid_performance.gini())
print(test_performance.gini())
print(prostate_gbm.gini())```

### Let generate the predictions using test dataset:

```predictions = prostate_gbm.predict(df_test)
## Here we will get the probability for the 'p1' values from the prediction frame:
predict_probability = predictions['p1']```

Now we will import required scikit-learn libraries to generate AUC manually:

```from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
import random```

Lets get the real response results from the test data frame:

```actual = df_test[y].as_data_frame()
actual_list = actual['CAPSULE'].tolist()
print(actual_list)```

Now lets get the results probabilities from the prediction frame:

```predictions_temp = predict_probability_x['p1'].as_data_frame()
predictions_list = predictions_temp['p1'].tolist()
print(predictions_list)```

### Calculating False Positive Rate and True Positive Rate:

Lets calculate TPR, FPR and Threshold metrics from the predictions and original data frame
– False Positive Rate (fpr)
– True Positive Rate (tpr)
– Threashold

```fpr, tpr, thresholds = roc_curve(actual_list, predictions_list)
roc_auc = auc(fpr, tpr)
print(roc_auc)
print(test_performance.auc())```

Note: Above you will see that our calculated ROC values is exactly same as given by model performance for test dataset.

### Lets plot the AUC Curve using matplotlib:

```plt.title('ROC (Receiver Operating Characteristic)')
plt.plot(fpr, tpr, 'b',
label='AUC = %0.4f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.2])
plt.ylim([-0.1,1.2])
plt.ylabel('True Positive Rate (TPR)')
plt.xlabel('False Positive Rate (FPR)')
plt.show()```

### This is how GINI metric is calculated from AUC:

```GINI = (2 * roc_auc) - 1
print(GINI)
print(test_performance.gini())```

Note: Above you will see that our calculated GINI values is exactly same as given by model performance for test dataset.

Thats it, enjoy!!