Creating Partial Dependency Plot (PDP) in H2O

H2O added a partial dependency plot (PDP) feature with a Java backend that does the multi-scoring of the dataset with the model. This makes creating a PDP much faster.

To get a PDP in H2O you need the model and the original data set used to build the model. Here are a few ways to create a PDP:

If you want to generate PDP on a single column:

response = h2o.partialPlot(model, data.pdp, cols = column_name)

To generate PDP on the original data set:

response = h2o.partialPlot(model, data.pdp)

If you want to build PDP directly from the model and data set without using the PDP API, you can use the following code:
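model = prostate.gbm
column_name = "AGE"
# Working frame whose column is overwritten in the loop below
data.pdp = data.hex
# Candidate values for the column: its 5%, 10%, ..., 100% quantiles
bins = unique(h2o.quantile(data.hex[, column_name], probs = seq(0.05, 1, 0.05)))
mean_responses = c()

for (bin in bins) {
  # Set the column to the current bin value in every row
  data.pdp[, column_name] = bin
  # Score the full frame, not just the modified column
  response = h2o.predict(model, data.pdp)
  # Average the last prediction column over all rows
  mean_response = mean(response[, ncol(response)])
  mean_responses = c(mean_responses, mean_response)
}

pdp_manual = data.frame(AGE = bins, mean_response = mean_responses)
plot(pdp_manual, type = "l")
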
That's it, enjoy!

Variable Importance and How It Is Calculated

What is variable importance (VI):

VI represents the statistical significance of each variable in the data with respect to its effect on the generated model. In practice, VI is a ranking of the predictors based on the contribution each one makes to the model. This technique helps data scientists weed out predictors that contribute nothing and only add processing time. Sometimes a user believes a variable must contribute to the model and yet its VI results are very poor; in that case, feature engineering can be done to make the predictor more useful.

Here is an example of a variable importance chart and table from the H2O machine learning platform:


Question: How is variable importance calculated?

Answer: Variable importance is calculated by summing the decrease in error over every split made on a variable. The relative importance is then the variable importance divided by the highest variable importance value, so that values are bounded between 0 and 1.
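
The scaling step described above can be sketched in a few lines of base R. This is only an illustration: the variable names and raw importance values are made up, and a real model would supply the summed error reductions itself.

```r
# Hypothetical raw importances: summed decrease in error per variable
# (made-up values, not taken from a real model).
raw_importance <- c(GLEASON = 120.5, PSA = 60.25, AGE = 12.05, RACE = 0)

# Relative importance: divide every value by the largest one,
# so the top variable gets 1 and the rest fall between 0 and 1.
scaled_importance <- raw_importance / max(raw_importance)

print(round(scaled_importance, 3))
```

A variable that was never used in a split keeps a raw value of 0 and therefore a relative importance of 0 as well.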

Question: Is it safe to conclude that zero relative importance means zero contribution to the model? 

Answer: If the importance of a variable or a group of variables is shown as 0.0000, it means the model never split on those columns. That's why their relative importance is 0.00000, and their contribution to the model is considered zero.

Question: Is it safe to remove zero relative importance variables from the predictor set when building the model?

Answer: Yes, it is safe to remove variables with zero importance, as they contribute nothing to the model and only add processing time. Removing these zero relative importance predictors shouldn't degrade model performance.
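
As a rough sketch of that pruning step, the snippet below filters a variable importance table down to the predictors with non-zero importance. The table here is hypothetical and only mimics the shape of what h2o.varimp(model) returns; the variable names and values are assumptions for illustration.

```r
# Hypothetical table shaped like the output of h2o.varimp(model).
# With a live model you could start from:
#   vi <- as.data.frame(h2o.varimp(model))
vi <- data.frame(
  variable = c("GLEASON", "PSA", "AGE", "RACE"),
  scaled_importance = c(1.0, 0.5, 0.1, 0.0),
  stringsAsFactors = FALSE
)

# Keep only the predictors the model actually split on.
predictors <- vi$variable[vi$scaled_importance > 0]

# Predictors that would be dropped from the next model build.
dropped <- setdiff(vi$variable, predictors)
```

The resulting predictors vector can then be passed as the x argument when the model is rebuilt, and the two models compared on a validation set to confirm that performance did not change.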

Question: In a partial dependency plot (PDP) chart, what should you conclude if it is flat?

Answer: In the PDP chart, if changing the values of the variable doesn't affect the probability coming out of the model and the line remains flat, it is safe to assume that this particular variable doesn't contribute to the model. Note that sometimes the differences are very small, e.g. 0.210000 versus 0.210006, which is hard to spot unless you scan all predictors and plot another chart with the top important variables removed to highlight very small changes. Overall, you can test the importance of the tail predictors by building the model with and without them to see how it changes and whether the change is significant.
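
One way to catch such tiny differences without eyeballing the chart is to look at the spread of the mean responses directly. The sketch below assumes a PDP table with a mean_response column, like the one built by the manual loop earlier; the values and the tolerance are made up for illustration.

```r
# Hypothetical PDP table: one row per bin value, with the mean prediction.
pdp <- data.frame(
  AGE = c(50, 55, 60, 65, 70),
  mean_response = c(0.210000, 0.210002, 0.210004, 0.210005, 0.210006)
)

# Spread between the highest and lowest mean response across the bins.
spread <- diff(range(pdp$mean_response))

# Assumed tolerance; pick one that suits the scale of your predictions.
tolerance <- 1e-4
is_flat <- spread < tolerance

print(spread)
print(is_flat)
```

Running the same check over every predictor gives a quick list of variables whose PDP is effectively flat, which can then be tested by building the model with and without them.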