Two must-watch, very informative tutorials on Driverless AI

1. Automatic Feature Engineering with Driverless AI:

Dmitry Larko, Kaggle Grandmaster and Senior Data Scientist at H2O.ai, will showcase what he is doing with feature engineering, how he is doing it, and why it is important in the machine learning realm. He will delve into the workings of H2O.ai’s new product, Driverless AI, whose automatic feature engineering increases the accuracy of models and frees up approximately 80% of data practitioners’ time, enabling them to draw actionable insights from the models built by Driverless AI. You will see:

  • Overview of feature engineering
  • Real-time demonstration of feature engineering examples
  • Interpretation and reason codes of final models

2. Machine Learning Interpretability with Driverless AI:

In this video, Patrick showcases several approaches beyond the error measures and assessment plots typically used to interpret deep learning and machine learning models and results. Wherever possible, interpretability approaches are deconstructed into more basic components suitable for human storytelling: complexity, scope, understanding, and trust. You will see:

  • Data visualization techniques for representing high-degree interactions and nuanced data structures.
  • Contemporary linear model variants that incorporate machine learning and are appropriate for use in regulated industry.
  • Cutting edge approaches for explaining extremely complex deep learning and machine learning models.

That's it, enjoy!!

 


Getting all categorical values for predictors in H2O POJO and MOJO models

Here is a Java/Scala code snippet showing how to get the categorical values for each enum/factor predictor from H2O POJO and MOJO models.

To get the list of all column names in your POJO/MOJO model, you can try the following:

Imports:

import java.io.*;
import hex.genmodel.easy.RowData;
import hex.genmodel.easy.EasyPredictModelWrapper;
import hex.genmodel.easy.prediction.*;
import hex.genmodel.MojoModel;
import java.util.Arrays;

POJO:

// First, use the POJO model class as below:
private static String modelClassName = "gbm_prostate_binomial";

// Then you can use the GenModel class to get the info you are looking for:
hex.genmodel.GenModel rawModel;
rawModel = (hex.genmodel.GenModel) Class.forName(modelClassName).newInstance();

// Now you can get the results as below:
System.out.println("isSupervised : " + rawModel.isSupervised());
System.out.println("Column Names : " + Arrays.toString(rawModel.getNames()));
System.out.println("Response ID : " + rawModel.getResponseIdx());
System.out.println("Number of columns : " + rawModel.getNumCols());
System.out.println("Response Name : " + rawModel.getResponseName());

// Printing all categorical values for each predictor
for (int i = 0; i < rawModel.getNumCols(); i++) {
    String[] domainValues = rawModel.getDomainValues(i);
    System.out.println(Arrays.toString(domainValues));
}
Output Results:
isSupervised : true
Column Names : [ID, AGE, RACE, DPROS, DCAPS, PSA, VOL, GLEASON]
Response ID : 8
Number of columns : 8
null
null
[0, 1, 2]
null
null
null
null
null
Note: a null value means the predictor is numeric; the categorical values are listed only for enum/factor predictors.
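
To summarize output like the above, you can pair each column name with its reported domain and keep only the categorical ones. Here is a small Python sketch (the names and domains are copied from the output above, where only RACE is categorical):

```python
# Column names and domain values as reported by getNames()/getDomainValues()
names = ["ID", "AGE", "RACE", "DPROS", "DCAPS", "PSA", "VOL", "GLEASON"]
domains = [None, None, ["0", "1", "2"], None, None, None, None, None]

# A None domain means the predictor is numeric; keep only enum/factor predictors
categoricals = {name: domain for name, domain in zip(names, domains) if domain is not None}
print(categoricals)  # {'RACE': ['0', '1', '2']}
```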

MOJO:

// Let's assume you have a MOJO model named gbm_prostate_binomial.zip.
// You would need to load your model as below:
hex.genmodel.GenModel mojo = MojoModel.load("gbm_prostate_binomial.zip");

// Now you can get the list of predictors as below:
System.out.println("isSupervised : " + mojo.isSupervised());
System.out.println("Column Names : " + Arrays.toString(mojo.getNames()));
System.out.println("Number of columns : " + mojo.getNumCols());
System.out.println("Response ID : " + mojo.getResponseIdx());
System.out.println("Response Name : " + mojo.getResponseName());

// Printing all categorical values for each predictor
for (int i = 0; i < mojo.getNumCols(); i++) {
    String[] domainValues = mojo.getDomainValues(i);
    System.out.println(Arrays.toString(domainValues));
}
Output Results:
isSupervised : true
Column Names : [ID, AGE, RACE, DPROS, DCAPS, PSA, VOL, GLEASON]
Response ID : 8
Number of columns : 8
null
null
[0, 1, 2]
null
null
null
null
null
Note: a null value means the predictor is numeric; the categorical values are listed only for enum/factor predictors.

To get help on using MOJO and POJO models, visit the H2O documentation.

That’s it, enjoy!!

Scoring with H2O MOJO model at command line with Java

If you have an H2O MOJO model, you can use it for scoring from Python or any other language simply by using the Java runtime. This is a quick way to do the scoring on the command line or from Python. Here are a few examples:

What you will have:

  • H2O MOJO model (e.g. gbm_prostate_new.zip)
  • The H2O supporting class file for scoring, i.e. h2o-genmodel.jar
  • Your data set to score, in JSON format, e.g. '{"AGE":"68","RACE":"2","DCAPS":"2","VOL":"0","GLEASON":"6"}'

 

Here is command line way to perform scoring:

$ java -Xmx4g -cp .:/Users/avkashchauhan/src/github.com/h2oai/h2o-tutorials/tutorials/python_mojo_scoring/h2o-genmodel.jar:/Users/avkashchauhan/src/github.com/h2oai/h2o-tutorials/tutorials/python_mojo_scoring:genmodel.jar:/ water.util.H2OPredictor /Users/avkashchauhan/src/github.com/h2oai/h2o-tutorials/tutorials/python_mojo_scoring/gbm_prostate_new.zip '{"AGE":"68", "RACE":"2", "DCAPS":"2", "VOL":"0", "GLEASON":"6"}'

Here is the result of the above command:

{"labelIndex":1,"label":"1","classProbabilities":[0.44056667027822005,0.55943332972178]}

Here is Python code that scores by launching an external Java process:

> import subprocess

> gen_model_arg = '.:/Users/avkashchauhan/src/github.com/h2oai/h2o-tutorials/tutorials/python_mojo_scoring/h2o-genmodel.jar:/Users/avkashchauhan/src/github.com/h2oai/h2o-tutorials/tutorials/python_mojo_scoring:genmodel.jar:/'
> h2o_predictor_class = 'water.util.H2OPredictor'
> mojo_model_args = '/Users/avkashchauhan/src/github.com/h2oai/h2o-tutorials/tutorials/python_mojo_scoring/gbm_prostate_new.zip'
> json_data = '{"AGE":"68","RACE":"2","DCAPS":"2","VOL":"0","GLEASON":"6"}'

Calling the subprocess module (note that the JSON is passed as a string argument):

> output = subprocess.check_output(["java", "-Xmx4g", "-cp", gen_model_arg, h2o_predictor_class,
mojo_model_args, json_data], shell=False).decode()

## Generating output

> output

u'[ {"labelIndex":0,"label":"0",
    "classProbabilities":[0.8378244965684887,0.1621755034315113]} 
  ]\n'
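
If you build the JSON by hand, quoting mistakes are easy to make; a more robust variant is to construct the payload with json.dumps and assemble the argument list programmatically. A minimal sketch (the classpath and model path here are placeholders, not the real paths above):

```python
import json

# Build the JSON payload from a dict so quoting/escaping is handled for us
row = {"AGE": "68", "RACE": "2", "DCAPS": "2", "VOL": "0", "GLEASON": "6"}
json_data = json.dumps(row)

# Assemble the java command as a list; no shell quoting is needed with shell=False
classpath = ".:h2o-genmodel.jar"        # placeholder classpath
model_path = "gbm_prostate_new.zip"     # placeholder model path
cmd = ["java", "-Xmx4g", "-cp", classpath, "water.util.H2OPredictor",
       model_path, json_data]
print(cmd)
```

Passing `cmd` to subprocess.check_output as a list avoids the escaped-quote gymnastics needed on the shell command line.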

That's it, enjoy!!

 

Getting predictors from H2O POJO and MOJO models in Java and Scala

Here is a Java/Scala code snippet showing how to get the predictors and response details from H2O POJO and MOJO models.

To get the list of all column names in your POJO/MOJO model, you can try the following:

Imports:

import java.io.*;
import hex.genmodel.easy.RowData;
import hex.genmodel.easy.EasyPredictModelWrapper;
import hex.genmodel.easy.prediction.*;
import hex.genmodel.MojoModel;
import java.util.Arrays;

POJO:

// First, use the POJO model class as below:
private static String modelClassName = "gbm_prostate_binomial";

// Then you can use the GenModel class to get the info you are looking for:
hex.genmodel.GenModel rawModel;
rawModel = (hex.genmodel.GenModel) Class.forName(modelClassName).newInstance();

// Now you can get the results as below:
System.out.println("isSupervised : " + rawModel.isSupervised());
System.out.println("Column Names : " + Arrays.toString(rawModel.getNames()));

MOJO:

// Let's assume you have a MOJO model named gbm_prostate_binomial.zip.
// You would need to load your model as below:
hex.genmodel.GenModel mojo = MojoModel.load("gbm_prostate_binomial.zip");

// Now you can get the list of predictors as below:
System.out.println("isSupervised : " + mojo.isSupervised());
System.out.println("Column Names : " + Arrays.toString(mojo.getNames()));
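
Given the names array and response index these calls return, separating the predictors from the response column is a one-liner. A Python sketch with hypothetical values (using CAPSULE as the response is an assumption for illustration, not taken from the output above):

```python
# Hypothetical getNames()/getResponseIdx() results for a prostate model
names = ["ID", "AGE", "RACE", "DPROS", "DCAPS", "PSA", "VOL", "GLEASON", "CAPSULE"]
response_idx = 8

# The response sits at response_idx; everything else is a predictor
response = names[response_idx]
predictors = [name for i, name in enumerate(names) if i != response_idx]
print(response)    # CAPSULE
print(predictors)  # the eight predictor names
```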

To get help on using MOJO and POJO models, visit the H2O documentation.

That’s it, enjoy!!

Ranking GBM trees based on scoring metrics

Here is the full python code:

import h2o
import pandas as pd
h2o.init()

## Import data
df = h2o.import_file('/Users/avkashchauhan/airlines_train.csv')
df.shape
df.col_names
y = "IsDepDelayed"
x = df.col_names
x.remove(y)
print(x)

## Building GBM model
from h2o.estimators.gbm import H2OGradientBoostingEstimator
gbm_model = H2OGradientBoostingEstimator()
gbm_model.train(x = x, y = y, training_frame=df)

## Understanding model
print(gbm_model)
print("Total trees in the model : " + str(gbm_model.default_params['ntrees']))
scoring_hist = gbm_model.scoring_history()
print(scoring_hist.shape)

## Looking at the scoring history
scoring_hist

## logloss metric in scoring history:
scoring_hist['training_logloss']
### Difference  in logloss metric from scoring for each tree
diff_df = scoring_hist['training_logloss'].diff()
### Ranking Each Tree
diff_df.rank()

## AUC metric in scoring history:
scoring_hist['training_auc']
### Difference in AUC metric from scoring for each tree
diff_df = scoring_hist['training_auc'].diff()
### Ranking Each Tree
diff_df.rank()
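
The diff/rank idea above can be checked on a toy series with plain pandas (the logloss numbers below are made up): diff() gives the change each tree contributed, and rank() orders the trees so that the biggest drop gets rank 1.

```python
import math
import pandas as pd

# Toy training_logloss values for four trees (hypothetical numbers)
logloss = pd.Series([0.69, 0.60, 0.55, 0.52])

# Per-tree change in logloss; the first entry is NaN since there is no prior tree
diff = logloss.diff()

# Ascending rank: the most negative difference (largest improvement) ranks first
ranks = diff.rank()
print(ranks.tolist())  # [nan, 1.0, 2.0, 3.0]
```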

Here is the link to ipython notebook with example:

https://github.com/Avkash/mldl/blob/master/notebook/h2o/GBM_Tree_Ranking_based_on_metrics.ipynb

That’s it, enjoy!!

Python groupBy example with H2O

Here is a code snippet showing how to perform a function on values grouped by a particular column:

> df = h2o.import_file("/Users/avkashchauhan/prostate.csv")

> df.col_names

[u'ID', u'CAPSULE', u'AGE', u'RACE', u'DPROS', u'DCAPS', u'PSA', u'VOL', u'GLEASON']

> df

 ID CAPSULE AGE RACE DPROS DCAPS PSA VOL GLEASON
 1 0 65 1 2 1 1.4 0 6
 2 0 72 1 3 2 6.7 0 7
 3 0 70 1 1 2 4.9 0 6
 4 0 76 2 2 1 51.2 20 7
 5 0 69 1 1 1 12.3 55.9 6
 6 1 71 1 3 2 3.3 0 8
 7 0 68 2 4 2 31.9 0 7
 8 0 61 2 4 2 66.7 27.2 7
 9 0 69 1 1 1 3.9 24 7
 10 0 68 2 1 2 13 0 6

> print(df['GLEASON'].unique().shape)

(7, 1)

> df['GLEASON'].unique()

C1
8
0
6
9
7
4
5
> x = df.group_by(by=['GLEASON'])
> y = x.sum(col="DCAPS", na="all").get_frame()
> y.shape
(7, 2)
> y
GLEASON sum_DCAPS
0 2
4 1
5 67
6 147
7 146
8 40
9 18
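
For comparison, here is the same aggregation in plain pandas, using only the ten prostate rows shown earlier as a toy frame:

```python
import pandas as pd

# GLEASON and DCAPS values from the ten rows displayed above
df = pd.DataFrame({
    "GLEASON": [6, 7, 6, 7, 6, 8, 7, 7, 7, 6],
    "DCAPS":   [1, 2, 2, 1, 1, 2, 2, 2, 1, 2],
})

# Sum DCAPS within each GLEASON group
sums = df.groupby("GLEASON")["DCAPS"].sum()
print(sums)  # per-group sums: GLEASON 6 -> 6, 7 -> 8, 8 -> 2
```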

 

That’s it, enjoy!!

Renaming H2O data frame column name in R

Following is a code snippet showing how you can rename a column in an H2O data frame in R:

> train.hex <- h2o.importFile("https://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
  |======================================================| 100%

> train.hex
  sepal_len sepal_wid petal_len petal_wid class
1 5.1 3.5 1.4 0.2 Iris-setosa
2 4.9 3.0 1.4 0.2 Iris-setosa
3 4.7 3.2 1.3 0.2 Iris-setosa
4 4.6 3.1 1.5 0.2 Iris-setosa
5 5.0 3.6 1.4 0.2 Iris-setosa
6 5.4 3.9 1.7 0.4 Iris-setosa
[150 rows x 5 columns] 

> h2o.names(train.hex)
[1] "sepal_len" "sepal_wid" "petal_len" "petal_wid" "class"    

> h2o.colnames(train.hex)
[1] "sepal_len" "sepal_wid" "petal_len" "petal_wid" "class" 

## Now use the column index (starting from 1) to change the column name as below
## Changing "class" to "new_class"

> names(train.hex)[5] = c("new_class")

# Checking Result:
> h2o.colnames(train.hex)
[1] "sepal_len" "sepal_wid" "petal_len" "petal_wid" "new_class"
> h2o.names(train.hex)
[1] "sepal_len" "sepal_wid" "petal_len" "petal_wid" "new_class"  

## Now changing "sepal_len" to "sepal_len_new"
> names(train.hex)[1] = c("sepal_len_new")

> h2o.names(train.hex)
[1] "sepal_len_new" "sepal_wid" "petal_len" "petal_wid" "new_class"    

> h2o.colnames(train.hex)
[1] "sepal_len_new" "sepal_wid" "petal_len" "petal_wid" "new_class"    

> train.hex
  sepal_len_new sepal_wid petal_len petal_wid new_class
1 5.1 3.5 1.4 0.2 Iris-setosa
2 4.9 3.0 1.4 0.2 Iris-setosa
3 4.7 3.2 1.3 0.2 Iris-setosa
4 4.6 3.1 1.5 0.2 Iris-setosa
5 5.0 3.6 1.4 0.2 Iris-setosa
6 5.4 3.9 1.7 0.4 Iris-setosa
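
The same index-based rename works on a pandas DataFrame too; note that pandas columns are 0-indexed while R's names() starts at 1. A minimal sketch:

```python
import pandas as pd

# An empty frame with the iris column names used above
df = pd.DataFrame(columns=["sepal_len", "sepal_wid", "petal_len", "petal_wid", "class"])

# Rename the fifth column ("class") to "new_class"; index 4 because pandas is 0-indexed
cols = list(df.columns)
cols[4] = "new_class"
df.columns = cols
print(list(df.columns))  # ['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'new_class']
```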

 

That’s it, enjoy!!