Tuning parameters in GBM for best modeling


If you are building a GBM, here are practical guidelines and starting ranges for the following parameters:

Tree Depth: With GBMs, a good starting point for tree depth is around 6. It's unusual to go deeper than 10, and a depth of more than about 20 would be very problem specific. Some problems do well with a large number of shallow trees.

Expert Advice:

 I usually start by trying 5, 7, and 11. If 5 is the best then I'll try lower numbers; if 11 is the best then I'll keep increasing (17, 23, etc.) until it stops improving. However, I'd say that about 95% of the time you won't need to go past a depth of 13. Time is also a consideration here -- more depth = longer training times.
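The depth search described above can be sketched in plain Python. This is only an illustration of the strategy, not an H2O API: `search_tree_depth`, `score`, `step_up`, and `max_depth_cap` are hypothetical names, and `score` stands in for whatever you use to train a model at a given depth and return a validation loss (lower is better).

```python
from typing import Callable, Dict, Sequence


def search_tree_depth(
    score: Callable[[int], float],
    start: Sequence[int] = (5, 7, 11),
    step_up: int = 6,          # assumption: 11 -> 17 -> 23, as in the advice
    max_depth_cap: int = 30,   # assumption: a safety limit, not from the post
) -> int:
    """Return the depth with the lowest validation loss found by the search."""
    results: Dict[int, float] = {d: score(d) for d in start}
    best = min(results, key=results.get)

    if best == max(start):
        # Deepest candidate won: keep increasing until it stops improving.
        depth = best + step_up
        while depth <= max_depth_cap:
            results[depth] = score(depth)
            if results[depth] >= results[best]:
                break
            best = depth
            depth += step_up
    elif best == min(start):
        # Shallowest candidate won: try lower numbers.
        depth = best - 1
        while depth >= 1:
            results[depth] = score(depth)
            if results[depth] >= results[best]:
                break
            best = depth
            depth -= 1
    return best


# Toy stand-in for "train a GBM and return validation loss".
def toy_score(depth: int) -> float:
    return (depth - 8) ** 2


print(search_tree_depth(toy_score))
```

In practice `score` would wrap a real training run, so each call is expensive; that is why the search starts from only three depths.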

Learning Rate: The learning rate is related to the number of trees. The more trees, the lower you can make the learning rate. A starting rule of thumb is (1 / number of trees). I would consider 5000 trees to be on the high side. 100 is not unusual.
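As a quick worked example of the (1 / number of trees) rule of thumb, here is a small sketch. The function name is made up for illustration.

```python
def rule_of_thumb_learning_rate(n_trees: int) -> float:
    """Starting learning rate from the 1 / number-of-trees rule of thumb."""
    if n_trees < 1:
        raise ValueError("n_trees must be a positive integer")
    return 1.0 / n_trees


# 100 trees is "not unusual"; 5000 is "on the high side".
for n in (100, 1000, 5000):
    print(n, rule_of_thumb_learning_rate(n))
```

So 100 trees suggests a starting rate of 0.01, and 5000 trees suggests 0.0002.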

Expert Advice:

 I always use early stopping, so I don't really take the number of trees into consideration when choosing my learning rate. If you're not using early stopping then the rule of thumb that Tom provided is good. However, I would highly recommend using early stopping along with cross validation if you aren't using it already. I usually start with a learning rate of 0.05 or 0.10 (depending on the size of the dataset and how long I'm willing to train for). This is just for tuning the other parameters, i.e. getting them in the right vicinity of what is optimal. Things like optimum tree depth are about the same whether the learning rate is 0.10 or 0.01, so there's no sense training for longer periods of time with a lower learning rate. Once I have my parameters set I will then lower the learning rate to 0.01 or 0.001 (again, this depends on how long you're willing to train for and how much improvement you'll get from a lower learning rate -- problem dependent). At that point I might do some more fine tuning with the lower rate (seeing if tree depth should be 9, 10, or 11, for example).
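One way to picture this two-stage workflow is as two parameter sets: a coarse one with a higher learning rate for tuning everything else, and a fine one that reuses the tuned values with a lower rate. The sketch below only builds plain dictionaries. The keys follow H2O GBM parameter names (`learn_rate`, `ntrees`, `stopping_rounds`, `stopping_tolerance`, `nfolds`, `max_depth`, `col_sample_rate`), but the helper functions and the specific values for `ntrees`, `stopping_rounds`, `stopping_tolerance`, and `nfolds` are assumptions for illustration, not recommendations from the post.

```python
from typing import Any, Dict


def coarse_stage_params(learn_rate: float = 0.05) -> Dict[str, Any]:
    """Settings for the first pass: a fairly high learning rate plus early stopping."""
    return {
        "learn_rate": learn_rate,
        "ntrees": 10000,             # assumed upper bound; early stopping picks the real count
        "stopping_rounds": 5,        # assumed value
        "stopping_tolerance": 1e-4,  # assumed value
        "nfolds": 5,                 # cross validation, assumed fold count
    }


def fine_stage_params(
    coarse: Dict[str, Any],
    tuned: Dict[str, Any],
    learn_rate: float = 0.01,
) -> Dict[str, Any]:
    """Reuse the coarse settings and tuned values, lowering only the learning rate."""
    params = dict(coarse)  # copy so the coarse settings are left untouched
    params.update(tuned)
    params["learn_rate"] = learn_rate
    return params


coarse = coarse_stage_params(learn_rate=0.05)
# Suppose the coarse pass settled on these values (hypothetical results).
tuned = {"max_depth": 7, "col_sample_rate": 0.7}
fine = fine_stage_params(coarse, tuned, learn_rate=0.01)
print(fine)
```

With H2O installed, a dictionary like `fine` could presumably be passed as keyword arguments to `H2OGradientBoostingEstimator`; check the parameter documentation linked at the end of this post before relying on any particular value.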

col_sample_rate: Start with a col_sample_rate of around 0.7.

Expert Advice:

 I start with 0.7, then try 0.3; if 0.3 is worse then I try 1.0. For most problems I've encountered, that's usually good enough. If you find 0.3 is best you can of course try 0.2 and 0.4, but I tend not to get any more precise than the nearest tenth, e.g. I rarely use numbers like 0.35 or 0.73 -- tuning too granularly tends to lead to overfitting.
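The same kind of sketch works for the col_sample_rate search. Again, `search_col_sample_rate` and `score` are hypothetical names; `score` is assumed to train a model with the given rate and return a validation loss where lower is better.

```python
from typing import Callable


def search_col_sample_rate(score: Callable[[float], float]) -> float:
    """Try 0.7, then 0.3; fall back to 1.0 if 0.3 is worse."""
    best_rate, best_score = 0.7, score(0.7)

    low_score = score(0.3)
    if low_score < best_score:
        best_rate, best_score = 0.3, low_score
        # 0.3 won: optionally check its neighbours, staying on tenths.
        for rate in (0.2, 0.4):
            s = score(rate)
            if s < best_score:
                best_rate, best_score = rate, s
    else:
        # 0.3 was worse than 0.7: try using all columns instead.
        full_score = score(1.0)
        if full_score < best_score:
            best_rate, best_score = 1.0, full_score
    return best_rate


# Toy stand-in for a real training run.
print(search_col_sample_rate(lambda rate: abs(rate - 0.75)))
```

The search evaluates at most four rates and never leaves the tenths grid, in line with the advice not to tune too granularly.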

min_split_improvement: Start with the default (0.00001) and stay under 0.0005.

Expert Advice:

 I haven't used this on a large number of problems, but I find the default (0.00001) or just setting it to 0 generally works well. If I see that my training loss is much lower than my validation loss then I will sometimes increase this parameter, as it will help keep the model from overfitting. I also believe it will be more effective when tree depth is large: deeper trees are more likely to make splits with small improvements. All of the splits in shallow trees tend to have high improvement, so the minimum will always be exceeded. I don't think I've ever set this value higher than 0.0005.

Here is a pointer to the detailed parameter documentation for GBM:

http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/gbm.html

GBM Parameters:

http://docs.h2o.ai/h2o/latest-stable/h2o-docs/parameters.html
