Variable Importance and how it is calculated?

What is variable importance (VI):

VI represents the statistical significance of each variable in the data with respect to its affect on the generated model. VI actually is each predictor ranking based on the contribution predictors make to the model. This technique helps data scientists to weed out certain predictors which are contributing to nothing instead adds time to process. Sometime user thinks a variable must contribute to the model and its VI results are very poor, feature engineering can be done to improve predictor existence.

Here is an example of Variable Importance chart and table from H2O machine learning platform:


Question: How Variable Importance is calculated? 

Answer: Variable importance is calculated by sum of the decrease in error when split by a variable. Then the relative importance is the variable importance divided by the highest variable importance value so that values are bounded between 0 and 1.

Question: Is it safe to conclude that zero relative importance means zero contribution to the model? 

Answer: With variable importance if a certain variable or a group of variables importance is shows as 0.0000 it means they’ve never split by the column. Thats why their relative importance is 0.00000 and their contribution to model will be considered zero.

Question: Is it safe to remove zero relative importance variables from the predictor set when building the model?

Answer: Yes, it is safe to remove variables with zero importance as they are contributing zero to model and taking time to process the data. Also removing these zero relative importance predictors shouldn’t deteriorate model performance.

Question: In Partial Dependency plot (PDP) char what to conclude if it is flat?

Answer: In the PDP chart, when changing the values for the variable, if it doesn’t affect the probability coming out of the model and remains flat, it is safe to assume that this particular variable doesn’t contribute to the model. Note sometimes there is very small difference in variables i.e. 0.210000 and 0.210006 which is hard to find unless you scan all predictors and plot another chart by removing all top important variables to highlight very small changes. Overall you can experiment the tail predictors importance by keeping in and out of your model building step to see how it changes and if that is of any significant.



Cross-validation example with time-series data in R and H2O

What is Cross-validation: In k-fold crossvalidation, the original sample is randomly partitioned into k equal sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. learn more at wiki..

When you have time-series data splitting data randomly from random rows does not work because the time part of your data will be mangled so doing cross-validation with time series dataset is done differently.

The following R code script show how it is split first and the passed as validation frame into different algorithms in H2O.


h2o.init(strict_version_check = FALSE)

# show general information on the airquality dataset



print(paste(‘number of months:’,length(unique(airquality$Month)), sep=“”))

# add a year column, so you can create a month, day, year date stamp

airquality$Year <- rep(2017,nrow(airquality))

airquality$Date <- as.Date(with(airquality, paste(Year, Month, Day,sep=“-“)), “%Y-%m-%d”)

# sort the dataset

airquality <- airquality[order(as.Date(airquality$Date, format=“%m/%d/%Y”)),]

# convert the dataset to unix time before converting to an H2OFrame

airquality$Date <- as.numeric(as.POSIXct(airquality$Date, origin=“1970-01-01”, tz = “GMT”))

# convert to an h2o dataframe

air_h2o <- as.h2o(airquality)

# specify the features and the target column

target <- ‘Ozone’

features <- c(“Solar.R”, “Wind”, “Temp”,  “Month”, “Day”, “Date”)

# split dataset in ~half which if you round up is 77 rows (train on the first half of the dataset)

train_1 <- air_h2o[1:ceiling(dim(air_h2o)[1]/2),]

# calculate 14 days in unix time: one day is 86400 seconds in unix time (aka posix time, epoch time)

# use this variable to iterate forward 12 days

add_14_days <- 86400*14

# initialize a counter for the while loop so you can keep track of which fold corresponds to which rmse

counter <- 0

# iterate over the process of testing on the next two weeks

# combine the train_1 and test_1 datasets after each loop

while (dim(train_1)[1] < dim(air_h2o)[1]){

    # get new dataset two weeks out

    # take the last date in Date column and add 14 days to i

    new_end_date <- train_1[nrow(train_1),]$Date + add_14_days

    last_current_date <- train_1[nrow(train_1),]$Date


    # slice with a boolean mask

    mask <- air_h2o[,“Date”] > last_current_date

    temp_df <- air_h2o[mask,]

    mask_2 <- air_h2o[,“Date”] < new_end_date


    # multiply the mask dataframes to get the intersection

    final_mask <- mask*mask_2

    test_1 <- air_h2o[final_mask,]


    # build a basic gbm using the default parameters

    gbm_model <- h2o.gbm(x = features, y = target, training_frame = train_1, validation_frame = test_1, seed = 1234)


    # print the number of rows used for the test_1 dataset

    print(paste(‘number of rows used in test set: ‘, dim(test_1)[1], sep=” “))

    print(paste(‘number of rows used in train set: ‘, dim(train_1)[1], sep=” “))

    # print the validation metrics

    rmse_valid <- h2o.rmse(gbm_model, valid=T)

    print(paste(‘your new rmse value on the validation set is: ‘, rmse_valid,‘ for fold #: ‘, counter, sep=“”))


    # create new training frame

    train_1 <- h2o.rbind(train_1,test_1)

    print(paste(‘shape of new training dataset: ‘,dim(train_1)[1],sep=” “))

    counter <<- counter + 1


Thats all, enjoy!!

Running python and pysparkling with Zeppelin and YARN on Hadoop

Apache Zeppelin is very useful to use cell based notebooks (similar to jupyter) to work with various applications i.e. spark, python, hive, hbase etc by using various interpreters.
With H2O and Sparkling Water you can use Zeppelin on Hadoop cluster with YARN, and then could use Python or Pysparkling to submit jobs.
Here are the steps using Pyspakling with YARN on a hadoop cluster.
1. Get the latest build of sparkling water from here
2. Download and unzip the correct Sparkling Water version comparable with the Spark version into one of the edge node in your Hadoop cluster.
3. Set the following environment variables to the right path before running services:
export MASTER=”yarn-client” // To submit to the Yarn cluster
export SPARK_HOME=“path_to_the_directory_where_spark_unzipped”
export HADOOP_CONF_DIR=“path_to_the_hadoop_installation”export SPARK_SUBMIT_OPTIONS=”–packages ai.h2o:sparkling-water-examples_2.11:2.1.0”
export PYTHONPATH=“_path_to_where_python_installed”
export SPARKLING_EGG=$(ls -t /sparkling-water-2.1.0/py/build/dist/h2o_pysparkling*.egg | head -1)
//path to the Sparkling egg file needs to be updated above
Please make sure to check above version values to reflect the following:
  • 2.11-> refers to the scala version.
  • 2.1.0 —> refers to the spark version.
4. Set the “spark.executor.memory 4g” in Zeppelin either in the configuration file or in the Zeppelin UI if Error 143 is seen while starting the zeppelin server.
Note: To configure it in the Zeppelin UI, goto the dropdown next to the user at theTop right corner , select Interpreters and in the Spark section either edit or add the configuration.
5. Start the Zeppelin server using the command below. This would start Zeppelin in a Yarn container.
bin/ -Pspark-2.1
6. In Zeppelin notebook, create a new note with the markdown as below and add the path to the egg file. This will add the dependency and the classes of pysparkling.
sc.addPyFile(“_path_to_the egg_file_on_disk/h2o_pysparkling_2.1-2.1.99999-py2.7.egg”)
7. Now, one can start calling pysparkling API’s like below:
sc.addPyFile(“_path_to_the egg_file_on_disk/h2o_pysparkling_2.1-2.1.99999-py2.7.egg”)
from pysparkling import *
from pyspark import SparkContext
from pyspark.sql import SQLContext
import h2o hc = H2OContext.getOrCreate(sc)
8. To use the scala Sparkling water, one does not need to add dependency explicitly in the note in Zeppelin. A sample script would look like


import org.apache.spark.h2o._
val rdd = sc.parallelize(1 to 1000, 100).map( v => IntHolder(Some(v)))
val h2oContext = H2OContext.getOrCreate(sc)
Thats all, enjoy!!

Building H2O GLM model using Postgresql database and JDBC driver

Note: Before we jump down, make sure you have postgresql is up and running and database is ready to respond your queries. Check you queries return results as records and are not null.

Download JDBC Driver 42.0.0 JDBC 4:

Note: I have tested H2O with above JDBC driver 4.0 (Build 42.0.0) and Postgresql 9.2.x

In the following test I am connection to DVD Rental DB which is available into Postgresql. Need help to get it working.. visit Here and Here.

Test R (RStudio) for the postgresql connection working:

# Install package if you don't have it
> install.packages("RPostgreSQL")

# User package RPostgreSQL 
> library(RPostgreSQL)

# Code to test database and table:
> drv <- dbDriver("PostgreSQL")
> con <- dbConnect(drv, dbname = "dvdrentaldb", host = "localhost", port = 5432,
> user = "avkash", password = "avkash")
> dbExistsTable(con, "actor")

Start H2O with JDBC driver:

$ java -cp postgresql-42.0.0.jre6.jar:h2o.jar water.H2OApp


  • You must have h2o.jar and postgresql-42.0.0.jre6.jar in the same folder as above.
  • You must start h2o first and then connect to running instance of H2O from R as below.
  • I am connecting to a table name payment below
  • I am using table payment to run H2O GLM model

Connecting H2O from R:

> library(h2o)
> h2o.init()
> h2o.init(strict_version_check = FALSE)
> payment = h2o.import_sql_table(connection_url = “jdbc:postgresql://localhost:5432/h2odb?&useSSL=false”, table= “payment”, username = “avkash”, password = “avkash”)
> aa = names(payment)[-5]
> payment_glm = h2o.glm(x = aa, y = “amount”, training_frame = payment)
> payment_glm

Here is the full code snippet in working:


payment = h2o.import_sql_table(connection_url = “jdbc:postgresql://localhost:5432/h2odb?&useSSL=false”, table= “payment”, username = “avkash”, password = “avkash”)
|=============================================| 100%
> payment
payment_id customer_id staff_id rental_id amount payment_date
1 17503 341 2 1520 7.99 1.171607e+12
2 17504 341 1 1778 1.99 1.171675e+12
3 17505 341 1 1849 7.99 1.171695e+12
4 17506 341 2 2829 2.99 1.171943e+12
5 17507 341 2 3130 7.99 1.172022e+12
6 17508 341 1 3382 5.99 1.172090e+12

[14596 rows x 6 columns]
> aa = names(payment)[-5]
> payment_glm = h2o.glm(x = aa, y = “amount”, training_frame = payment)
|=============================================| 100%
> payment_glm
Model Details:

H2ORegressionModel: glm
Model ID: GLM_model_R_1490053774745_2
GLM Model: summary
family link regularization number_of_predictors_total number_of_active_predictors
1 gaussian identity Elastic Net (alpha = 0.5, lambda = 1.038E-4 ) 5 5
number_of_iterations training_frame
1 0 payment_sql_to_hex

Coefficients: glm coefficients
names coefficients standardized_coefficients
1 Intercept -10.739680 4.200606
2 payment_id -0.000009 -0.038040
3 customer_id 0.000139 0.024262
4 staff_id 0.103740 0.051872
5 rental_id 0.000001 0.003172
6 payment_date 0.000000 0.026343

H2ORegressionMetrics: glm
** Reported on training data. **

MSE: 5.607411
RMSE: 2.367997
MAE: 1.950123
RMSLE: 0.5182649
Mean Residual Deviance : 5.607411
R^2 : 0.0007319098
Null Deviance :81905.72
Null D.o.F. :14595
Residual Deviance :81845.77
Residual D.o.F. :14590
AIC :66600.46


Thats all, enjoy!!


Restoring DVD rental database into postgresql

Get postgresql ready:

Now make sure you have a postgresql installed and running. If you need help please visit my blog:

Get the DVD Rental database:

Next, please download DVD Rental Sample Database from the link below:

Note: The database file is in zipformat ( so you need to extract it to  dvdrental.tar. You dont need to untar it, just keep the .tar file.



$ pwd


$ mkdir dvdrentaldb

$ initdb dvdrentaldb

Make sure the database dvdrentaldb is initialize and fully ready with username avkash.

Restore the Database:

$ pg_restore -U avkash -d dvdrentaldb dvdrental.tar

Now verify the database:

$ psql -U avkash dvdrentaldb

You will have access to postgresql shell and then you can run command as below:

psql (9.6.2)
Type "help" for help.

h2odb=# \dt
 List of relations
 Schema | Name | Type | Owner
 public | actor | table | avkash
 public | address | table | avkash
 public | category | table | avkash
 public | city | table | avkash
 public | country | table | avkash
 public | customer | table | avkash
 public | film | table | avkash
 public | film_actor | table | avkash
 public | film_category | table | avkash
 public | inventory | table | avkash
 public | language | table | avkash
 public | payment | table | avkash
 public | rental | table | avkash
 public | staff | table | avkash
 public | store | table | avkash
(15 rows)

Thats all, enjoy!!


Setup postgresql database on OSX

Install postgres on OSX

$ brew install postgres

$ pg_ctl –version

pg_ctl (PostgreSQL) 9.6.2

Now create a folder name h2odb as below

$ pwd


$ mkdir h2odb

Now initialize the database into h2odb as below:

$ initdb h2odb 

The files belonging to this database system will be owned by user "avkashchauhan".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.UTF-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

creating directory h2odb ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok

WARNING: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.

Success. You can now start the database server using:

pg_ctl -D h2odb -l logfile start

Now visit inside h2odb folder and edit pg_hba.conf by added the following at the end of the file:

$ vi pg_hba.conf

Add/append the following in pg_hba.conf where h2odb is database name and avkash is the database user name:

local h2odb avkash md5
host h2odb avkash md5
local all all md5
host all all ::1/32 md5

Start postgres with h2odb as database (You must be outside h2odb folder or add full path to it):

$ pg_ctl -D h2odb -l logfile start

Now run the following command to setup user avkash with password avkash for the database h2odb as below:

$ psql postgres -c “create user avkash PASSWORD ‘avkash'”

$ psql postgres -c “create database h2odb with owner avkash”

$ psql postgres -c “alter role avkash superuser”

$ psql h2odb -U avkash -c “select version();”

Now you can access the h2odb as below where parameters are “-U  user_name data_base_name” :

$ psql -U avkash h2odb

Above command will give you access to postgres shell where you can run command as:

> \l  (list databases) 
> \dt (list tables for the selected database)
> \?  (Get Help)
> select * from table_name;

To check if postgres is up and running:

$ ps -ef | grep postgres

Stop postgres with h2odb as database

$ pg_ctl -D h2odb -l logfile stop

Thats all!! Enjoy it..



Just upgraded Tensorflow 1.0.1 and Keras 2.0.1

$ pip install –upgrade keras –user

Collecting keras
 Downloading Keras-2.0.1.tar.gz (192kB)
 100% |████████████████████████████████| 194kB 2.9MB/s
Requirement already up-to-date: theano in ./.local/lib/python2.7/site-packages (from keras)
Requirement already up-to-date: pyyaml in ./.local/lib/python2.7/site-packages (from keras)
Requirement already up-to-date: six in ./.local/lib/python2.7/site-packages (from keras)
Requirement already up-to-date: numpy>=1.7.1 in /usr/local/lib/python2.7/dist-packages (from theano->keras)
Collecting scipy>=0.11 (from theano->keras)
 Downloading scipy-0.19.0-cp27-cp27mu-manylinux1_x86_64.whl (45.0MB)
 100% |████████████████████████████████| 45.0MB 34kB/s
Building wheels for collected packages: keras
 Running bdist_wheel for keras ... done
 Stored in directory: /home/avkash/.cache/pip/wheels/fa/15/f9/57473734e407749529bf55e6b5038640dc7279d5718b2c368a
Successfully built keras
Installing collected packages: keras, scipy
 Found existing installation: Keras 1.2.2
 Uninstalling Keras-1.2.2:
 Successfully uninstalled Keras-1.2.2
Successfully installed keras-2.0.1 scipy-0.19.0

$ pip install –upgrade tensorflow-gpu –user

Collecting tensorflow-gpu
 Downloading tensorflow_gpu-1.0.1-cp27-cp27mu-manylinux1_x86_64.whl (94.8MB)
 100% |████████████████████████████████| 94.8MB 16kB/s
Requirement already up-to-date: mock>=2.0.0 in /usr/local/lib/python2.7/dist-packages (from tensorflow-gpu)
Requirement already up-to-date: numpy>=1.11.0 in /usr/local/lib/python2.7/dist-packages (from tensorflow-gpu)
Requirement already up-to-date: protobuf>=3.1.0 in /usr/local/lib/python2.7/dist-packages (from tensorflow-gpu)
Requirement already up-to-date: wheel in /usr/lib/python2.7/dist-packages (from tensorflow-gpu)
Requirement already up-to-date: six>=1.10.0 in ./.local/lib/python2.7/site-packages (from tensorflow-gpu)
Requirement already up-to-date: funcsigs>=1; python_version < "3.3" in /usr/local/lib/python2.7/dist-packages (from mock>=2.0.0->tensorflow-gpu)
Requirement already up-to-date: pbr>=0.11 in ./.local/lib/python2.7/site-packages (from mock>=2.0.0->tensorflow-gpu)
Requirement already up-to-date: setuptools in ./.local/lib/python2.7/site-packages (from protobuf>=3.1.0->tensorflow-gpu)
Requirement already up-to-date: appdirs>=1.4.0 in ./.local/lib/python2.7/site-packages (from setuptools->protobuf>=3.1.0->tensorflow-gpu)
Requirement already up-to-date: packaging>=16.8 in ./.local/lib/python2.7/site-packages (from setuptools->protobuf>=3.1.0->tensorflow-gpu)
Requirement already up-to-date: pyparsing in ./.local/lib/python2.7/site-packages (from packaging>=16.8->setuptools->protobuf>=3.1.0->tensorflow-gpu)
Installing collected packages: tensorflow-gpu
Successfully installed tensorflow-gpu-1.0.1

$ python -c ‘import keras as tf;print tf.version

Using TensorFlow backend.
I tensorflow/stream_executor/] successfully opened CUDA library locally
I tensorflow/stream_executor/] successfully opened CUDA library locally
I tensorflow/stream_executor/] successfully opened CUDA library locally
I tensorflow/stream_executor/] successfully opened CUDA library locally
I tensorflow/stream_executor/] successfully opened CUDA library locally

$ python -c ‘import tensorflow as tf;print tf.version

I tensorflow/stream_executor/] successfully opened CUDA library locally
I tensorflow/stream_executor/] successfully opened CUDA library locally
I tensorflow/stream_executor/] successfully opened CUDA library locally
I tensorflow/stream_executor/] successfully opened CUDA library locally
I tensorflow/stream_executor/] successfully opened CUDA library locally

Deep Learning session from Google Next 2017

TensorFlow and Deep Learning without a PhD:

With TensorFlow, deep machine learning transitions from an area of research to mainstream software engineering. In this video, Martin Gorner demonstrates how to construct and train a neural network that recognizes handwritten digits. Along the way, he’ll describe some “tricks of the trade” used in neural network design, and finally, he’ll bring the recognition accuracy of his model above 99%.

Part 1:

Part 2:


Concurrent model building in H2O using Parallel

Here is the full code snippet which shows how to build any model concurrently using H2O backend and R based parallel library:

 > library(h2o)
 > h2o.init(nthreads = -1)

 ## To simplify only use first 300 rows

 > prostate.hex = h2o.uploadFile(path = system.file("extdata", "prostate.csv", package="h2o"), destination_frame = "prostate.hex")
 > prostate.hex = prostate.hex[1:300,]
 > ones = rep(1, times = 100)
 > zeroes = rep(0, times = 100)
 > prostate.hex$Fold_1 = as.h2o(data.frame( Fold_1 = c(ones, zeroes, zeroes)))
 > prostate.hex$Fold_2 = as.h2o(data.frame( Fold_2 = c(zeroes, ones, zeroes)))
 > prostate.hex$Fold_3 = as.h2o(data.frame( Fold_3 = c(zeroes, zeroes, ones)))
 ## Case 1: Use weights in GLM that will essentially run multiple GLM models on the same frame (so no data replication)

 > glm_weights = c()
 > start = Sys.time()
 > for(i in 1:3) {
 glm_m = h2o.glm(x = c(3:9), y = 2, training_frame = prostate.hex, weights_column = paste0("Fold_", i), model_id = paste0("Fold_", i))
 glm_weights = c(glm_weights, glm_m)
 > end = Sys.time()
 > weightsTime = end - start
 > weightsTime

 ## Case 2: Subset H2OFrame and try to run GLM in a for loop

 > prostate_1 = prostate.hex[1:100, ]
 > prostate_2 = prostate.hex[101:200, ]
 > prostate_3 = prostate.hex[201:300, ]
 > prostate = c(prostate_1,prostate_2,prostate_3)
 > glm_subset = c()
 > start = Sys.time()
 > for(i in 1:3) {
 glm_m = h2o.glm(x = c(3:9), y = 2, training_frame = prostate[[i]], model_id = paste0("Fold_", i))
 glm_subset = c(glm_subset, glm_m)
 > end = Sys.time()
 > subsetTime = end - start
 > subsetTime

 ## Case 3: Use the package parallel to send all the GLM function calls over to H2O and H2O will handle how to run the multiple calls optimumly

 > library(parallel)
 > start = Sys.time()
 > glm_parallel = mclapply(1:3, function(i) 
 > glm_m = h2o.glm(x = c(3:9), y = 2, training_frame = prostate[[i]], model_id = paste0("Fold_", i)) )
 > end = Sys.time()
 > parallelTimes = end - start
 > parallelTimes

 ### Quick check to make sure all the GLM models return the same residual deviance

 > unlist(lapply(glm_parallel, function(x) h2o.residual_deviance(x)))
 > unlist(lapply(glm_weights, function(x) h2o.residual_deviance(x)))
 > unlist(lapply(glm_subset, function(x) h2o.residual_deviance(x)))
 ## Compare the model build time

 > comparison_table = data.frame(Time_Elapsed = c(weightsTime, subsetTime, parallelTimes), row.names = c("Case_1", "Case_2", "Case_3"))