Machine Learning adoption for any organization

At this point there is no doubt that any organization can take advantage of machine learning by applying it to its business processes. How much machine learning matters will depend on how it is applied and what kind of problem you, as an organization, are trying to solve with it. The results also depend on the experience of your data scientists and software engineers, along with how well the technology is adopted.

In this article we will learn what the machine learning development life cycle really looks like and how any organization can build a team to solve its business problems with machine learning. Let's get started with the following image in mind:

Screen Shot 2018-02-18 at 1.15.52 PM

As you can see above, machine learning is a continuous process of extracting data from a variety of sources and feeding it into machine learning engines, which generate models. These models are plugged into business processes to produce results, and those results are fed back into the process to solve business problems. Depending on their usage, the models can also produce results independently, for example at the edge.

At this point the critical question is what a machine learning development life cycle really looks like. What kind of talent is required to pull it off? What do these teams actually do while building and applying machine learning?

We will answer these questions as we progress. If we look at the machine learning development life cycle image below, we will see the following stages:

  1. Collecting data from various sources
  2. After collecting the data, making it machine learning ready
  3. Feeding the machine-learning-ready data into the “building machine learning” process, where a data-science-heavy team works on the data to produce results.

Screen Shot 2018-02-18 at 1.16.01 PM

Above you can see that building machine learning is very data-science-heavy work, whereas applying machine learning is mainly a software engineering process. You can use this understanding to figure out the technical resources needed to implement an end-to-end machine learning pipeline for your organization.

The next question that comes to mind is the separation between building machine learning and applying machine learning. How are these two processes different? What is the end result of the machine learning process, and how can software engineering apply its output?

Looking at the image below, we can see that the product of the “building machine learning” process is the final or leader model, which an enterprise or business can use as the final product. This model is ready to produce results as needed.

Screen Shot 2018-02-18 at 1.16.12 PM

The model can be applied to various consumer, enterprise and industrial use cases to provide edge level intelligence, or in process intelligence where model results are fed into another process. Sometimes the model is fed into another machine learning process to generate further results.
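
As a minimal sketch of this hand-off between building and applying, assume a toy "model" that is nothing more than a learned threshold. The function names build_model and apply_model and the JSON artifact format are illustrative only and do not come from any particular library. The building step emits a serialized artifact, and the applying step loads it and scores new values without retraining:

```python
import json

def build_model(rows):
    """'Building' step: learn a threshold from labelled (value, label) rows."""
    positives = [v for v, label in rows if label == 1]
    negatives = [v for v, label in rows if label == 0]
    # Toy learning rule: midpoint between the two class means.
    threshold = (sum(positives) / len(positives) + sum(negatives) / len(negatives)) / 2
    return json.dumps({"type": "threshold", "threshold": threshold})

def apply_model(artifact, value):
    """'Applying' step: load the artifact and score one new value."""
    model = json.loads(artifact)
    return 1 if value >= model["threshold"] else 0

training_rows = [(1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1)]
artifact = build_model(training_rows)   # produced by the data science side
print(apply_model(artifact, 7.5))       # consumed by the software engineering side
```

The point of the sketch is only that the artifact is the contract between the two teams: the applying side needs the artifact and a scoring function, not the training data or the training code.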

Once we have understood the significance of the key individuals in the end-to-end machine learning process, the next question is what these individuals do day to day. How do they really engage in the process of building machine learning? What kind of tools and technology do they adopt or create to solve the organization's business problems?

To understand the kind of work data scientists do while building machine learning, note that their main focus is to use and apply as many machine learning engines and algorithms as needed to solve the specific problem. Sometimes they create something brand new because nothing suitable is available, and sometimes they just need to improve an existing solution.

Screen Shot 2018-02-18 at 1.16.21 PM
The above image puts together a conceptual view of the various engines that a team of data scientists in any organization could use to accomplish its tasks.

The role of software engineering is critical in the overall machine learning pipeline. Software engineers help speed up and refine the data science process, generating faster results by applying software engineering methods on top of data science.

The image below explains how software engineers can expedite the work of data scientists by creating a fully automated machine learning system that performs the data scientists' repetitive tasks. Data scientists are then free to spend their time on newer problems and only need to keep an eye on the automated system to make sure it is working as expected.

Screen Shot 2018-02-18 at 1.16.31 PM


Various organizations, e.g. Google (CloudML) and H2O (AutoML), have created automated machine learning software that any organization can use. Open source packages are also available, e.g. Auto-SKLearn and TPOT.

Any organization can follow the above approach to adopt machine learning and generate the expected results.
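
To make the idea of automating repetitive model selection concrete, here is a deliberately tiny sketch. It is not how CloudML, H2O AutoML, Auto-SKLearn or TPOT work internally; the candidate "models" are hypothetical threshold rules and the helper names are made up for illustration. It only shows the loop that such systems automate: try several candidates, score each one on validation data, and keep a leaderboard whose first entry is the leader model:

```python
def accuracy(predict, rows):
    """Fraction of (value, label) rows that the predict function gets right."""
    return sum(1 for v, label in rows if predict(v) == label) / len(rows)

def make_threshold_model(threshold):
    """Hypothetical candidate model: predict 1 when value >= threshold."""
    return lambda value: 1 if value >= threshold else 0

def auto_select(candidates, validation_rows):
    """Score every candidate and return a leaderboard sorted best-first."""
    board = [(accuracy(model, validation_rows), name) for name, model in candidates.items()]
    return sorted(board, reverse=True)

candidates = {f"threshold_{t}": make_threshold_model(t) for t in (2, 5, 8)}
validation_rows = [(1, 0), (3, 0), (4, 0), (6, 1), (7, 1), (9, 1)]
leaderboard = auto_select(candidates, validation_rows)
leader_score, leader_name = leaderboard[0]
print(leader_name, leader_score)
```

Real automated systems search over algorithms, hyperparameters and ensembles rather than three fixed rules, but the leaderboard-and-leader pattern is the same.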

Thank you, all the very best!





RSparkling > The best of R + H2O + Spark

What do you get from R + H2O + Spark?

R is great for statistical computing, graphics and small-scale data preparation; H2O is an amazing distributed machine learning platform designed for scale and speed; and Spark is great for super-fast data processing at mega scale. Combining all three gives you the best of data science, machine learning and data processing, all in one.

rsparkling: The rsparkling R package is an extension package for sparklyr that creates an R front-end for the Sparkling Water Spark package from H2O. This provides an interface to H2O’s high performance, distributed machine learning algorithms on Spark, using R.

SparkR is an R package that provides a light-weight frontend to use Apache Spark from R. In Spark 2.2.0, SparkR provides a distributed data frame implementation that supports operations like selection, filtering, aggregation etc. (similar to R data frames, dplyr) but on large datasets. SparkR also supports distributed machine learning using MLlib.

H2O is an in-memory platform for distributed, scalable machine learning. H2O uses familiar interfaces like R, Python, Scala, Java, JSON and the Flow notebook/web interface, and works seamlessly with big data technologies like Hadoop and Spark.

Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.

Sparkling Water integrates H2O’s fast, scalable machine learning engine with Spark. With Sparkling Water you can publish Spark data structures (RDDs, DataFrames, Datasets) as H2O frames and vice versa, use a DSL to pass Spark data structures as input to H2O’s algorithms, create ML applications using both the Spark and H2O APIs, and use Sparkling Water directly from PySpark through its Python interface.

Installation Packages:

Quick Start Script:

library(sparklyr)
# Set the Sparkling Water options before loading rsparkling
options(rsparkling.sparklingwater.version = "2.1.14")
options(rsparkling.sparklingwater.location = "/Users/avkashchauhan/tools/sw2/sparkling-water-2.1.14/assembly/build/libs/sparkling-water-assembly_2.11-2.1.14-all.jar")
library(rsparkling)
sc <- spark_connect(master = "local", version = "2.1.0")
h2o_context(sc, strict_version_check = FALSE)

Important Settings for your environment:

  • master = "local" > Starts a local Spark cluster
  • master = "yarn-client" > Starts a cluster managed by YARN
  • To get a list of supported Sparkling Water versions, call h2o_release_table()
  • When you call spark_connect(), a new “tab” appears in RStudio
    • The “Spark” tab is used to launch “SparkUI”
    • The “Log” tab is used to collect Spark logs
  • If there is a version mismatch between sparklyr and Spark, pass the exact Spark version as above; otherwise you don't need to pass version.

Startup Script with config parameters to set executor settings:

These are the settings you will use to get your rsparkling/Spark session up and running in RStudio:

library(sparklyr)
options(rsparkling.sparklingwater.version = "2.1.14")
options(rsparkling.sparklingwater.location = "/Users/avkashchauhan/tools/sw2/sparkling-water-2.1.14/assembly/build/libs/sparkling-water-assembly_2.11-2.1.14-all.jar")
library(rsparkling)
config <- spark_config()
config$spark.executor.cores <- 4
config$spark.executor.memory <- "4G"
config$spark.executor.instances <- 3  # This will create 3 node instances
sc <- spark_connect(master = "local", config = config, version = "2.1.0")
h2o_context(sc, strict_version_check = FALSE)

Accessing SparkUI:

You can access the Spark UI by clicking the SparkUI button on the Spark tab, as shown below:

Screen Shot 2017-10-28 at 9.54.48 AM

Accessing H2O FLOW UI:

To open the H2O Flow UI in your browser, call rsparkling's h2o_flow() with the Spark connection, for example h2o_flow(sc, strict_version_check = FALSE).

Screen Shot 2017-10-28 at 9.55.03 AM

Building H2O GLM model using rsparkling + sparklyr + H2O:

In this example we ingest the famous mtcars (“CARS & MPG”) dataset and build a GLM (Generalized Linear Model) to predict miles per gallon from the car's specifications:

library(sparklyr)
options(rsparkling.sparklingwater.location = "/tmp/sparkling-water-assembly_2.11-2.1.7-all.jar")
library(rsparkling)
library(h2o)
sc <- spark_connect(master = "local", version = "2.1.0")
# Copy the built-in mtcars data frame into Spark
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)
# Convert the Spark DataFrame into an H2O frame
mtcars_h2o <- as_h2o_frame(sc, mtcars_tbl, strict_version_check = FALSE)
# Fit a GLM that predicts mpg from weight and number of cylinders
mtcars_glm <- h2o.glm(x = c("wt", "cyl"), y = "mpg", training_frame = mtcars_h2o, lambda_search = TRUE)

That’s all, enjoy!!

Scoring H2O MOJO models with spark UDF and Scala

With H2O, the best part is that your machine learning models can be exported for use from Java, so you can score with them on any platform that supports Java. H2O algorithms generate POJO and MOJO models, which do not require the H2O runtime for scoring, which is great for any enterprise. You can learn more about H2O POJO and MOJO models here.

Here is the Spark Scala code that shows how to score an H2O MOJO model by loading it from disk and passing a RowData object as the row to H2O's EasyPredictModelWrapper class:

import _root_.hex.genmodel.MojoModel
import _root_.hex.genmodel.easy.{EasyPredictModelWrapper, RowData}

// Load Mojo
val mojo = MojoModel.load("/Users/avkashchauhan/learn/customers/mojo_bin/")
val easyModel = new EasyPredictModelWrapper(mojo)

// Get Mojo Details
val features = mojo.getNames.toBuffer

// Creating the row
val r = new RowData
r.put("AGE", "68")
r.put("RACE", "2")
r.put("DCAPS", "2")
r.put("VOL", "0")
r.put("GLEASON", "6")

// Performing the Prediction
val prediction = easyModel.predictBinomial(r).classProbabilities

Above, the MOJO model is stored on the local file system and loaded from there inside the Scala code. The full execution of the above code is available here.

The following simple Java code shows how you could write a Java application that performs the same scoring with an H2O MOJO model:

import hex.genmodel.easy.RowData;
import hex.genmodel.easy.EasyPredictModelWrapper;
import hex.genmodel.easy.prediction.*;
import hex.genmodel.MojoModel;
import java.util.Arrays;

public class main {
  public static void main(String[] args) throws Exception {
    // Load the MOJO once (the path is left blank, as in the original) and reuse it for inspection and scoring
    hex.genmodel.GenModel mojo = MojoModel.load("");
    EasyPredictModelWrapper model = new EasyPredictModelWrapper(mojo);

    System.out.println("isSupervised : " + mojo.isSupervised());
    System.out.println("Columns Names : " + Arrays.toString(mojo.getNames()));
    System.out.println("Number of columns : " + mojo.getNumCols());
    System.out.println("Response ID : " + mojo.getResponseIdx());
    System.out.println("Response Name : " + mojo.getResponseName());

    // Print the categorical levels of each column (null for numeric columns)
    for (int i = 0; i < mojo.getNumCols(); i++) {
      String[] domainValues = mojo.getDomainValues(i);
      System.out.println(Arrays.toString(domainValues));
    }

    // Build one input row; values are passed as strings keyed by column name
    RowData row = new RowData();
    row.put("AGE", "68");
    row.put("RACE", "2");
    row.put("DCAPS", "2");
    row.put("VOL", "0");
    row.put("GLEASON", "6");

    BinomialModelPrediction p = model.predictBinomial(row);
    System.out.println("Has penetrated the prostatic capsule (1=yes; 0=no): " + p.label);
    System.out.print("Class probabilities: ");
    // Print the class probabilities as a comma-separated list
    for (int i = 0; i < p.classProbabilities.length; i++) {
      if (i > 0) {
        System.out.print(",");
      }
      System.out.print(p.classProbabilities[i]);
    }
    System.out.println("");
  }
}

That's it, enjoy!!

Calculating AUC and GINI model metrics for logistic classification

For logistic classification problems we use the AUC metric to check model performance. Higher is better; as a rule of thumb, a value above 0.80 (80%) is considered good and a value above 0.90 (90%) means the model is performing very well.

AUC is an abbreviation for Area Under the Curve. It is used in classification analysis to determine which of the candidate models predicts the classes best. The curve in question is usually the ROC curve, in which the true positive rate is plotted against the false positive rate at different thresholds. You can learn more about AUC in this QUORA discussion.

We will also look at the GINI metric, which you can learn about from wiki. In this example we will learn how the AUC and GINI model metrics are calculated from the True Positive Rate (TPR) and False Positive Rate (FPR) values for a given test dataset.

You can get the full working Jupyter Notebook here from my Github.

Let's build a logistic classification model in H2O using the prostate dataset:

Preparation of H2O environment and dataset:

## Importing required libraries
import h2o
import sys
import pandas as pd
from h2o.estimators.gbm import H2OGradientBoostingEstimator

## Starting H2O machine learning cluster
h2o.init()

## Importing dataset (the path or URL of the prostate CSV is left blank, as in the original)
local_url = ""
df = h2o.import_file(local_url)

## Defining features and response column
y = "CAPSULE"
feature_names = df.col_names
feature_names.remove(y)

## Setting our response column to categorical so the model treats this as a classification problem
df[y] = df[y].asfactor()

Now we will be splitting the dataset into 3 sets for training, validation and test:

df_train, df_valid, df_test = df.split_frame(ratios=[0.8,0.1])
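
Note that two ratios produce three frames: the remaining 0.1 becomes the test set. As far as I know, H2O's split_frame assigns rows probabilistically, so the resulting sizes are close to the requested ratios rather than exact. The following is a plain-Python sketch of that idea only; split_rows is a hypothetical helper written for illustration and is not H2O's implementation:

```python
import random

def split_rows(rows, ratios, seed=42):
    """Randomly assign each row to one of len(ratios) + 1 parts."""
    rng = random.Random(seed)
    parts = [[] for _ in range(len(ratios) + 1)]
    for row in rows:
        draw = rng.random()
        cumulative = 0.0
        index = len(ratios)          # default: the remainder part
        for i, ratio in enumerate(ratios):
            cumulative += ratio
            if draw < cumulative:
                index = i
                break
        parts[index].append(row)
    return parts

train, valid, test = split_rows(list(range(1000)), ratios=[0.8, 0.1])
print(len(train), len(valid), len(test))
```

Every row lands in exactly one part, and the part sizes come out roughly, not exactly, in the ratio 80/10/10.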

Setting the H2O GBM Estimator and building the GBM model:

prostate_gbm = H2OGradientBoostingEstimator(model_id = "prostate_gbm")  # the original's other hyperparameters are not shown

## Building H2O GBM Model:
prostate_gbm.train(x = feature_names, y = y, training_frame=df_train, validation_frame=df_valid)

## Understand the H2O GBM Model

Generating model performance with training, validation & test datasets:

train_performance = prostate_gbm.model_performance(df_train)
valid_performance = prostate_gbm.model_performance(df_valid)
test_performance = prostate_gbm.model_performance(df_test)

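Let’s take a look at the AUC metric provided by model performance. Each performance object exposes it through its auc() method, e.g. test_performance.auc().

Let’s take a look at the GINI metric provided by model performance. It is available in the same way through gini(), e.g. test_performance.gini().

Let's generate the predictions using the test dataset: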

predictions = prostate_gbm.predict(df_test)
## Here we will get the probability for the 'p1' values from the prediction frame:
predict_probability = predictions['p1']

Now we will import required scikit-learn libraries to generate AUC manually:

from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
import random

Let's get the actual response values from the test data frame:

actual = df_test[y].as_data_frame()
actual_list = actual['CAPSULE'].tolist()

Now let's get the predicted probabilities from the prediction frame:

predictions_temp = predict_probability.as_data_frame()
predictions_list = predictions_temp['p1'].tolist()

Calculating False Positive Rate and True Positive Rate:

Let's calculate the TPR, FPR and threshold values from the predictions and the original data frame:
– False Positive Rate (fpr)
– True Positive Rate (tpr)
– Threshold

fpr, tpr, thresholds = roc_curve(actual_list, predictions_list)
roc_auc = auc(fpr, tpr)

Note: Above you will see that our calculated AUC value is exactly the same as the one given by model performance for the test dataset.

Let's plot the ROC curve using matplotlib:
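
To see what roc_curve and auc are doing, here is a dependency-free sketch of the same calculation. It assumes binary labels coded as 0/1 and uses the trapezoidal rule for the area; roc_points and trapezoid_auc are illustrative helper names, not scikit-learn functions, and the four-row dataset is made up:

```python
def roc_points(actual, scores):
    """Return (fpr, tpr) lists, one point per distinct score threshold."""
    positives = sum(actual)
    negatives = len(actual) - positives
    # Sort by score, highest first, so lowering the threshold admits rows in order.
    ranked = sorted(zip(scores, actual), reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a point only after the last row of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            fpr.append(fp / negatives)
            tpr.append(tp / positives)
    return fpr, tpr

def trapezoid_auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))

toy_actual = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_fpr, toy_tpr = roc_points(toy_actual, toy_scores)
toy_auc = trapezoid_auc(toy_fpr, toy_tpr)
print(toy_fpr, toy_tpr, toy_auc)
```

Each threshold contributes one (FPR, TPR) point, and the AUC is simply the area under the piecewise-linear curve through those points.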

plt.title('ROC (Receiver Operating Characteristic)')
plt.plot(fpr, tpr, 'b',
label='AUC = %0.4f'% roc_auc)
plt.legend(loc='lower right')
plt.ylabel('True Positive Rate (TPR)')
plt.xlabel('False Positive Rate (FPR)')

Screen Shot 2017-10-19 at 10.30.21 PM

This is how GINI metric is calculated from AUC:

GINI = (2 * roc_auc) - 1

Note: Above you will see that our calculated GINI value is exactly the same as the one given by model performance for the test dataset.

That's it, enjoy!!
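
As a quick check of this relationship, the sketch below applies the formula to the AUC and Gini pairs that appear in the GBM model output quoted in the "Saving H2O model object as text locally" article in this collection. The helper name gini_from_auc is made up for illustration:

```python
import math

def gini_from_auc(auc_value):
    """Gini as defined above: 2 * AUC - 1."""
    return 2 * auc_value - 1

# (AUC, Gini) pairs copied from the GBM model output quoted in this collection.
reported = [
    (0.998865964296, 0.997731928592),   # reported on train data
    (0.871040723982, 0.742081447964),   # reported on validation data
]
for auc_value, gini_value in reported:
    assert math.isclose(gini_from_auc(auc_value), gini_value, abs_tol=1e-12)

# A classifier no better than random guessing has AUC 0.5, which maps to Gini 0.
random_gini = gini_from_auc(0.5)
print(random_gini)
```

So Gini is just a rescaling of AUC from the range [0.5, 1] to the range [0, 1].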


How R2 error is calculated in Generalized Linear Model

What is R2 (R^2 i.e. R-Squared)?

R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression. … 100% indicates that the model explains all the variability of the response data around its mean. (From here)

You can get the full working jupyter notebook for this article from here directly from my Github.

Although this article explains how the R^2 metric is calculated for an H2O GLM (Generalized Linear Model), the same math is used for any other statistical model, so you can apply this calculation anywhere you need it.

Let's build an H2O GLM model first:

import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

h2o.init()

local_url = ""
df = h2o.import_file(local_url)

# Response column (assumed here to be CAPSULE, as in the other prostate examples)
y = "CAPSULE"
feature_names = df.col_names
feature_names.remove(y)

df_train, df_valid, df_test = df.split_frame(ratios=[0.8,0.1])

prostate_glm = H2OGeneralizedLinearEstimator(model_id = "prostate_glm")

prostate_glm.train(x = feature_names, y = y, training_frame=df_train, validation_frame=df_valid)

Now calculate Model Performance based on training, validation and test data:

train_performance = prostate_glm.model_performance(df_train)
valid_performance = prostate_glm.model_performance(df_valid)
test_performance = prostate_glm.model_performance(df_test)

Now let's check the default R^2 metric for the training, validation and test data. Each performance object exposes it through its r2() method, e.g. test_performance.r2().


Now let's get the predictions for the test data, which we kept separate:

predictions = prostate_glm.predict(df_test)

Here is the math used to calculate the R2 metric for the test dataset:

SSE = ((predictions-df_test[y])**2).sum()   # sum of squared errors
y_hat = df_test[y].mean()                   # mean of the actual response
SST = ((df_test[y]-y_hat[0])**2).sum()      # total sum of squares
R2 = 1 - SSE/SST
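
Because the same math applies to any model, here is a plain-Python sketch of the calculation that does not depend on H2O. The function name r_squared and the sample numbers are illustrative only:

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SSE / SST."""
    mean_actual = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))   # squared errors of the model
    sst = sum((a - mean_actual) ** 2 for a in actual)            # squared deviations from the mean
    return 1 - sse / sst

sample_actual = [3.0, 5.0, 7.0, 9.0]
sample_predicted = [2.5, 5.5, 7.0, 9.0]
sample_r2 = r_squared(sample_actual, sample_predicted)
print(sample_r2)
```

Two boundary cases help with interpretation: perfect predictions give SSE = 0 and therefore R^2 = 1, while always predicting the mean of the actual values gives SSE = SST and therefore R^2 = 0.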

Now let's get the R^2 value that model performance reports for the same test data, e.g. with test_performance.r2().


Above we can see that both values, the one given by model performance for the test data and the one we calculated, are the same.

That's it, enjoy!!





Saving H2O model object as text locally

Sometimes you may want to store an H2O model object as text on the local file system. In this example I will show how you can save an H2O model object to local disk as simple text content. You can get the full working Jupyter notebook for this example here from my Github.

In my experience the following example works fine with Python 2.7.12 and Python 3.4. I also found that the H2O model object tables were not saved to the text file when run from a Jupyter notebook; however, when I ran the same code from the command line in a Python shell, all the content was written correctly.

Let's build an H2O GBM model using the public PROSTATE dataset (the following script generates the GBM binomial model):

import sys
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator

h2o.init()

local_url = ""
df = h2o.import_file(local_url)

# Response column (assumed here to be CAPSULE, as in the other prostate examples)
y = "CAPSULE"
feature_names = df.col_names
feature_names.remove(y)
df[y] = df[y].asfactor()

df_train, df_valid = df.split_frame(ratios=[0.9])

prostate_gbm = H2OGradientBoostingEstimator(model_id = "prostate_gbm")  # the original's other hyperparameters are not shown

prostate_gbm.train(x = feature_names, y = y, training_frame=df_train, validation_frame=df_valid)

Now we will save the model details to the disk as below:

old_target = sys.stdout
f = open('/Users/avkashchauhan/Downloads/model_output.txt', 'w')
sys.stdout = f

Let's look at the content of the local file we just created in the step above (it is empty):

!cat /Users/avkashchauhan/Downloads/model_output.txt

Now we will run the following commands, which write the model details as text to the redirected standard output:

print("Model summary>>>")
print(prostate_gbm)  # assumed: the command that prints the model is missing from the original

Now we will restore standard output and close the file, which flushes the buffered text to the local file:

sys.stdout = old_target
f.close()

Now we will check the local file contents again, and this time you will see that the output of the above commands has been written to the file:

!cat /Users/avkashchauhan/Downloads/model_output.txt

You will see the command output stored into the local text file as below:

Model summary>>>
Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  prostate_gbm

ModelMetricsBinomial: gbm
** Reported on train data. **

MSE: 0.036289343297
RMSE: 0.190497620187
LogLoss: 0.170007804527
Mean Per-Class Error: 0.0160045361428
AUC: 0.998865964296
Gini: 0.997731928592
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.487417363665: 
Maximum Metrics: Maximum metrics at their respective thresholds

Gains/Lift Table: Avg response rate: 40.36 %

ModelMetricsBinomial: gbm
** Reported on validation data. **

MSE: 0.161786079676
RMSE: 0.402226403505
LogLoss: 0.483923658542
Mean Per-Class Error: 0.174208144796
AUC: 0.871040723982
Gini: 0.742081447964
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.205076283533: 
Maximum Metrics: Maximum metrics at their respective thresholds

Gains/Lift Table: Avg response rate: 39.53 %

Scoring History: 
Variable Importances:

Note: If you are wondering what the “!” sign does here, it runs a Linux shell command (in this case “cat”) from inside a Jupyter cell.

That's it, enjoy!!
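
As an alternative to swapping sys.stdout by hand, the same redirect-and-restore steps can be sketched with the standard library's contextlib.redirect_stdout (available in Python 3.4 and later). In this sketch the function show_fake_model is a stand-in for printing a real H2O model, and the output goes to a temporary directory rather than the Downloads path used above:

```python
import contextlib
import os
import tempfile

def save_output_as_text(path, produce_output):
    """Run produce_output() with everything it prints redirected into path."""
    with open(path, "w") as f, contextlib.redirect_stdout(f):
        produce_output()
    # Leaving the with-block restores sys.stdout and closes (flushes) the file.

def show_fake_model():
    # Stand-in for printing an H2O model object.
    print("Model summary>>>")
    print("Model Key:  prostate_gbm")

output_path = os.path.join(tempfile.mkdtemp(), "model_output.txt")
save_output_as_text(output_path, show_fake_model)
with open(output_path) as f:
    saved_text = f.read()
print(saved_text)
```

The with-block guarantees that standard output is restored and the file is closed even if printing the model raises an exception, which the manual approach does not.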


Using H2O AutoML for Kaggle Porto Seguro Safe Driver Prediction Competition

If you are into competitive machine learning you must be visiting Kaggle routinely. Currently you can compete for cash and recognition in Porto Seguro’s Safe Driver Prediction competition as well.

I tried the given training dataset as is with H2O AutoML, which ran for about 5 hours, and I was able to reach around 280th position. If you transform the dataset properly before running H2O AutoML, you may be able to get an even higher ranking.

Following is the simplest H2O AutoML Python script, which you can try as well (Note: make sure to change run_automl_for_seconds to the time you want the experiment to run.)

import h2o
import pandas as pd
from h2o.automl import H2OAutoML

train = h2o.import_file('/data/avkash/PortoSeguro/PortoSeguroTrain.csv')
test = h2o.import_file('/data/avkash/PortoSeguro/PortoSeguroTest.csv')
sub_data = h2o.import_file('/data/avkash/PortoSeguro/PortoSeguroSample_submission.csv')

y = 'target'
x = train.columns
x.remove(y)  # keep the response out of the predictors

## Time to run the experiment
run_automl_for_seconds = 18000
## Running AML for 5 hours (18000 seconds)
aml = H2OAutoML(max_runtime_secs = run_automl_for_seconds)
train_final, valid = train.split_frame(ratios=[0.9])
aml.train(x=x, y =y, training_frame=train_final, validation_frame=valid)

leader_model = aml.leader
pred = leader_model.predict(test_data=test)

pred_pd = pred.as_data_frame()
sub = sub_data.as_data_frame()

sub['target'] = pred_pd
sub.to_csv('/data/avkash/PortoSeguro/PortoSeguroResult.csv', header=True, index=False)

That’s it, enjoy!!