Working with variable importance data from models in H2O

When you build a classification model in H2O, the Flow UI shows a variable importance table. It looks like this:

[Screenshot: variable importance table in the H2O Flow UI]

Most users work in a Python or R shell, so it is often useful to pull this variable importance table into that shell. That is what we will do in the next steps.

To plot the variable importance graph in Python, we can use the following script:

import matplotlib.pyplot as plt
import numpy as np

# mymodel is a trained H2O model; read the columns of its variable importance table
varimp = mymodel._model_json['output']['variable_importances']
variables = varimp['variable']
scaled_importance = varimp['scaled_importance']
fig, ax = plt.subplots()
y_pos = np.arange(len(variables))
ax.barh(y_pos, scaled_importance, align='center', color='green')
ax.set_yticks(y_pos)
ax.set_yticklabels(variables)  # label each bar with its variable name
ax.invert_yaxis()  # first table row at the top
ax.set_xlabel('Scaled Importance')
ax.set_title('Variable Importance')
plt.show()

Here is what the variable importance graph looks like:

[Screenshot: variable importance bar chart]

To see the variable importance metrics directly from the model in Python, we can do the following:
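A minimal sketch of that step, assuming `mymodel` is a trained H2O model that reports variable importance. It relies on the model's `varimp()` method, which with `use_pandas=False` returns rows of (variable, relative_importance, scaled_importance, percentage). The helper name `format_varimp` and the column widths are made up for illustration.

```python
def format_varimp(model):
    """Return the model's variable importance table as printable text lines."""
    # varimp() can return None for models without variable importance,
    # so fall back to an empty table in that case.
    rows = model.varimp(use_pandas=False) or []
    header = "{:<20}{:>22}{:>20}{:>12}".format(
        "variable", "relative_importance", "scaled_importance", "percentage")
    lines = [header]
    for name, relative, scaled, pct in rows:
        lines.append("{:<20}{:>22.6f}{:>20.6f}{:>12.6f}".format(
            name, relative, scaled, pct))
    return lines


# Usage with a trained model (not run here, it needs an H2O cluster):
# for line in format_varimp(mymodel):
#     print(line)
```

If pandas is installed, calling `mymodel.varimp(use_pandas=True)` should give the same table as a DataFrame, which is usually more convenient for further analysis.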


The results are shown below:

[Screenshot: variable importance table printed in the Python shell]

That's it, enjoy!


Variable importance and how it is calculated

What is variable importance (VI)?

VI represents the significance of each variable in the data with respect to its effect on the generated model. In practice, VI ranks each predictor by the contribution it makes to the model. This helps data scientists weed out predictors that contribute nothing and only add processing time. Sometimes a user expects a variable to contribute to the model, yet its VI is very poor; in that case, feature engineering can be used to make the predictor more useful.

Here is an example of a variable importance chart and table from the H2O machine learning platform:


Question: How is variable importance calculated?

Answer: For tree-based models, variable importance is calculated as the sum of the decrease in error over all the splits made on a variable. The relative importance is then the variable importance divided by the highest variable importance value, so that values are bounded between 0 and 1 (this is the scaled_importance column in the H2O table).
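The calculation described in this answer can be sketched in a few lines. This is an illustration only, not H2O's internal code: the split list, the error decreases, and the function names are made up. Following the column names of the H2O table, the raw sums are called relative importance here and the 0-to-1 values scaled importance.

```python
from collections import defaultdict


def sum_error_decrease(splits):
    """Sum the error decrease of every split, grouped by the split variable."""
    totals = defaultdict(float)
    for variable, error_decrease in splits:
        totals[variable] += error_decrease
    return dict(totals)


def scale_by_max(relative, predictors):
    """Divide each importance by the largest one so values fall in [0, 1]."""
    top = max(relative.values(), default=0.0)
    return {
        name: (relative.get(name, 0.0) / top if top > 0 else 0.0)
        for name in predictors
    }


# Made-up (variable, error decrease) pairs, one per split across all trees.
splits = [("age", 4.0), ("income", 1.5), ("age", 2.0), ("income", 0.5)]
predictors = ["age", "income", "zipcode"]

relative = sum_error_decrease(splits)
scaled = scale_by_max(relative, predictors)
for name in predictors:
    print(name, relative.get(name, 0.0), round(scaled[name], 4))
```

Note that `zipcode` never appears in a split, so both its raw and scaled importance come out as zero.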

Question: Is it safe to conclude that zero relative importance means zero contribution to the model? 

Answer: If the importance of a variable or a group of variables is shown as 0.0000, it means the model never split on those columns. That’s why their relative importance is 0.00000, and their contribution to the model is considered zero.

Question: Is it safe to remove zero relative importance variables from the predictor set when building the model?

Answer: Yes, it is safe to remove variables with zero importance, as they contribute nothing to the model and only add processing time. Removing these zero relative importance predictors shouldn’t deteriorate model performance.
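As a rough sketch of how this pruning could look in Python, the helper below keeps only the predictors with non-zero scaled importance. It assumes rows shaped like the ones `varimp(use_pandas=False)` returns on an H2O model, that is (variable, relative_importance, scaled_importance, percentage); the function name and the sample rows are made up.

```python
def nonzero_predictors(varimp_rows, tol=0.0):
    """Return the names of variables whose scaled importance exceeds tol."""
    return [row[0] for row in varimp_rows if row[2] > tol]


# Made-up rows: (variable, relative_importance, scaled_importance, percentage)
rows = [
    ("age", 6.0, 1.0, 0.75),
    ("income", 2.0, 0.333333, 0.25),
    ("zipcode", 0.0, 0.0, 0.0),
]

kept = nonzero_predictors(rows)
print(kept)
```

The resulting list can then be passed as the predictor list when retraining, and comparing a validation metric before and after is a cheap way to confirm that nothing was lost.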

Question: In a partial dependence plot (PDP) chart, what should we conclude if it is flat?

Answer: In the PDP chart, if changing the values of a variable doesn’t affect the probability coming out of the model and the curve remains flat, it is safe to assume that this particular variable doesn’t contribute to the model. Note that the differences are sometimes very small, e.g. 0.210000 versus 0.210006, which is hard to spot unless you scan all predictors and plot another chart without the top important variables, so that the very small changes stand out. Overall, you can experiment with the tail predictors by building the model with and without them to see how the results change and whether the change is significant.
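To make the idea of a flat PDP concrete, here is a small self-contained sketch. It does not use H2O's own partial dependence tooling: `toy_predict` is a stand-in model, the data rows are made up, and the flatness tolerance `tol` is an arbitrary assumption that you would tune to the scale of your own predictions.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction over all rows with `feature` forced to each grid value."""
    curve = []
    for value in grid:
        preds = [predict(dict(row, **{feature: value})) for row in rows]
        curve.append(sum(preds) / len(preds))
    return curve


def is_flat(curve, tol=1e-4):
    """Treat the curve as flat when its total range is within tol."""
    return max(curve) - min(curve) <= tol


def toy_predict(row):
    # Stand-in model: the probability depends on age only and ignores zipcode.
    return min(1.0, 0.1 + 0.01 * row["age"])


rows = [
    {"age": 20, "zipcode": 1},
    {"age": 40, "zipcode": 2},
    {"age": 60, "zipcode": 3},
]

age_curve = partial_dependence(toy_predict, rows, "age", [20, 40, 60])
zip_curve = partial_dependence(toy_predict, rows, "zipcode", [1, 2, 3])
print("age flat:", is_flat(age_curve))
print("zipcode flat:", is_flat(zip_curve))
```

With a tolerance of 1e-4, a difference such as 0.210000 versus 0.210006 would still count as flat, so a tighter tolerance is needed if changes that small matter for your use case.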