Treatment of categorical variables in H2O’s DRF algorithm


In DRF categorical_encoding is exposed and the explanation i here:

Question: What is the meaning of AUTO (let the algorithm decide) in DRF?

Answer: Based on the link from our source:

GBM/DRF/K-Means: auto or AUTO: Allow the algorithm to decide (default). For GBM, DRF, and K-Means, the algorithm will perform Enum encoding when auto option is specified.

Question: Could you explain how eigen encoding works, i.e. have you a good online reference?

Answer : eigen or Eigen: k columns per categorical feature, keeping projections of one-hot-encoded matrix onto k-dim eigen space only Eigen uses k=1 only for now

Question: Are there any recommended techniques for randomising the ordering of the categoricals? Let’s say that the categoricals are US States and that large discriminative power comes from separating Alabama and Alaska, but no discrimation comes from separating {AL, AK} from the rest. With nbins_cat set to 5, say, (a compromise across all categoricals) it is likely that the grouping for {AL, AK} vs the other states won’t ever be selected, hence AL will never been considered separate to AK. Obviously in this particular case we can engineer the data, but in general this could be problem.

Following the link you give, the docs say

enum or Enum: Leave the dataset as is, internally map the strings to integers, and use these integers to make splits – either via ordinal nature when nbins_cats is too small to resolve all levels or via bitsets that do a perfect group split.

I wonder what is meant by via ‘bitsets that do a perfect group split’? I have noticed this by examining the model POJO output, though I cannot find this behaviour documented. If the categories are letters of the English alphabet, a=1,…,z=26, then I’ve noticed that groups can be split by appropriate bags of letters (e.g. a split might send a, e, f, x one way with other letters and NA going the other way). Clearly it cannot be doing an exhaustive search over all possible combinations of letters a-z to form the optimal group. But neither is it only looking at the ordinates.

Answer: If nbins_cat is 5 for 52 categorical levels, then there won’t be any bitsets used for splitting the categorical levels. Instead, nbins_cats (5) split points will be considered for splitting the levels into

  • {A … D} Left vs {E … Z} Right
  • {A … M} Left vs {N … Z} Right

The 5 split points are uniformly spaced across all levels present in the node (at the root, that’s A … Z) – those are simple “less-than” splits in the integer space of the levels.

If one of the nbins_cats splits ends up being the best split for the given tree node (across all selected columns of the data), then the next level split decision will have fewer levels to split, and so on. For example, the left node might contain only {A … D} (asuming the first split point was chosen above).

This smaller set of levels will be able to be resolved with nbins_cats = 5, an then a bitset split is created, that looks like this:

  • A : Left
  • B : Right
  • C : Right
  • D : Left

Yes, this is optimal (for every level, we know the training data behavior and can choose to send the points left or right), but without doing an exhaustive search.

The point here is that nbins_cats is an important tuning parameter as it will lead to “perfect” splits once it’s big enough to resolve the categorical levels. Otherwise, you have to hope that the “less-than” splits will lead to good-enough separation to eventually get to the perfect bitsets.



Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s