Feature Binning in GBM models and prediction

We have a categorical variable with about 900 levels, some of which occur very rarely: out of millions of observations, some levels occur fewer than 500 times. The minimum number of rows chosen to fit the model was 500. In addition, we didn’t want to restrict the number of bins, so we went with the default nbins_cats=1024.

I had a look at the POJO and for that categorical variable we got this part of the script:

// The class representing column C11
class GBM_model_python_1_ColInfo_99 implements java.io.Serializable {
  public static final String[] VALUES = new String[406];
  static {
    GBM_model_python_1_ColInfo_99_0.fill(VALUES);
  }
  static final class GBM_model_python_1_ColInfo_99_0 implements java.io.Serializable {
    static final void fill(String[] sa) {
      sa[0] = "C003";
      …
      sa[405] = "C702";
    }
  }
}

As you can see, only 406 bins are created in the script. My questions are the following:

Question 1: What happens with the other bins up to 900? If, when scoring the model, we have a value for variable C99 from the bins not included in the 406, will that observation be scored? And how exactly will variable C99 be used when scoring the model?

Question 2: If the model sees a value for variable C99 outside the 900 distinct levels we had in our training data, will it score that observation? If yes, how?

Yes, the H2O model (whether in POJO/MOJO form or running in H2O) will score observations that contain unseen categorical levels.

In GBM, during training, the optimal split direction for every feature value (numeric and categorical, including missing values/NAs) is computed, and stored for future use during scoring.

At scoring time, missing numeric values, missing categorical values, and unseen categorical levels are all turned into NAs.
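One way to picture this conversion is a lookup against the list of levels stored in the model, like the VALUES array in the POJO above. The following is only a minimal sketch under that assumption: the class CategoricalEncoder and its encode method are hypothetical names made up for illustration, not part of the generated POJO or the H2O API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not H2O code): maps a categorical level to its index
// in the model's stored domain, or to NaN when the level is missing or unseen.
final class CategoricalEncoder {
    private final Map<String, Integer> levelToIndex = new HashMap<>();

    CategoricalEncoder(String[] domain) {
        // The position in the domain array is the encoded value of each level.
        for (int i = 0; i < domain.length; i++) {
            levelToIndex.put(domain[i], i);
        }
    }

    double encode(String level) {
        if (level == null) {
            return Double.NaN; // missing value -> NA
        }
        Integer idx = levelToIndex.get(level);
        // A level that is not in the stored domain is treated like NA.
        return idx == null ? Double.NaN : idx.doubleValue();
    }
}
```

With a domain such as {"C003", "C702"}, a known level encodes to its array position, while both a null value and a level outside the domain encode to NaN, so they all reach the tree as the same kind of input.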

Specifically, if there are no NAs in the training data, then NAs in the test data follow the majority direction (the direction with the most observations); if there are NAs in the training data, then NAs in the test data follow the direction that is optimal for the NAs of the training data.
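The rule above can be sketched for a single tree node as follows. This is an illustrative simplification, not H2O's implementation: the class NaSplit, its method names, the numeric threshold split, and the tie-break toward the left branch are all assumptions made for the example.

```java
// Hypothetical sketch (not H2O code) of the NA direction rule at one split.
final class NaSplit {
    enum Direction { LEFT, RIGHT }

    // Decided at training time and stored with the split.
    static Direction naDirection(boolean sawNaInTraining,
                                 Direction bestForTrainingNas,
                                 long leftCount,
                                 long rightCount) {
        if (sawNaInTraining) {
            // NAs were present in training: reuse the direction found optimal for them.
            return bestForTrainingNas;
        }
        // No NAs in training: follow the majority direction.
        // Sending ties to the left is an arbitrary choice for this sketch.
        return leftCount >= rightCount ? Direction.LEFT : Direction.RIGHT;
    }

    // Applied at scoring time; NaN stands for a missing or unseen value.
    static Direction route(double value, double threshold, Direction naDirection) {
        if (Double.isNaN(value)) {
            return naDirection;
        }
        return value < threshold ? Direction.LEFT : Direction.RIGHT;
    }
}
```

For instance, a split that saw no NAs in training and sent 10 rows left and 90 rows right would route a NaN to the right, while a split trained with NAs would route it in whichever direction was stored for them.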

For more details we have an updated GBM documentation page at:

http://h2o-release.s3.amazonaws.com/h2o/master/3754/docs-website/h2o-docs/data-science/gbm-faq/missing_values.html
