Python: How to tune parameters in a Random Forest using Scikit Learn?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me) and link to the original: http://stackoverflow.com/questions/36107820/

How to tune parameters in Random Forest, using Scikit Learn?

python, parameters, machine-learning, scikit-learn, random-forest

Asked by O.rka

class sklearn.ensemble.RandomForestClassifier(n_estimators=10,
                                              criterion='gini', 
                                              max_depth=None,
                                              min_samples_split=2,
                                              min_samples_leaf=1, 
                                              min_weight_fraction_leaf=0.0, 
                                              max_features='auto', 
                                              max_leaf_nodes=None, 
                                              bootstrap=True, 
                                              oob_score=False,
                                              n_jobs=1, 
                                              random_state=None,
                                              verbose=0, 
                                              warm_start=False, 
                                              class_weight=None)

I'm using a random forest model with 9 samples and about 7000 attributes. The samples fall into 3 categories that my classifier recognizes.

I know these are far from ideal conditions, but I'm trying to figure out which attributes are the most important for the predictions. Which parameters would be best to tweak to optimize feature importance?

I tried different values of n_estimators and noticed that the number of "significant features" (i.e. nonzero values in the feature_importances_ array) increased dramatically.
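For reference, a minimal sketch of that kind of experiment. The data here is randomly generated as a stand-in for the real 9-sample, ~7000-attribute set, so the numbers it prints are purely illustrative:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-in data: 9 samples, 7000 attributes, 3 balanced classes
rng = np.random.RandomState(0)
X = rng.rand(9, 7000)
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

for n in [10, 100, 500, 1000]:
    forest = RandomForestClassifier(n_estimators=n, random_state=0)
    forest.fit(X, y)
    # "Significant" features = nonzero entries of feature_importances_
    print(n, np.count_nonzero(forest.feature_importances_))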

I've read through the documentation, but if anyone has experience with this, I would like to know which parameters are best to tune, with a brief explanation of why.

Answered by Randy Olson

From my experience, there are three parameters worth exploring with the sklearn RandomForestClassifier, in order of importance:

  • n_estimators

  • max_features

  • criterion

n_estimators is not really worth optimizing. The more estimators you give it, the better it will do. 500 or 1000 is usually sufficient.

max_features is worth exploring for many different values. It may have a large impact on the behavior of the RF because it decides how many features each tree in the RF considers at each split.

criterion may have a small impact, but usually the default is fine. If you have the time, try it out.

Make sure to use sklearn's grid search (preferably GridSearchCV, but your data set size is probably too small for it) when trying out these parameters.
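A minimal sketch of such a search, reusing the X and y stand-ins from the earlier snippet; the grid values are illustrative, not recommendations:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative search space over the three parameters discussed above
param_grid = {
    "n_estimators": [100, 500, 1000],
    "max_features": ["sqrt", "log2", 0.1, 0.5],
    "criterion": ["gini", "entropy"],
}

# cv=3 leaves one sample of each class per fold in the 9-sample case,
# but with this little data the resulting scores will be very noisy
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)  # X, y as in the earlier sketch
print(search.best_params_)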

If I understand your question correctly, though, you only have 9 samples and 3 classes? Presumably 3 samples per class? It's very, very likely that your RF is going to overfit with that small an amount of data, unless those samples are good, representative records.

Answered by lejlot

The crucial parameters are usually these three:

  • number of estimators - usually the bigger the forest the better; there is little risk of overfitting here
  • max depth of each tree (default None, which grows full trees) - reducing the maximum depth helps fight overfitting
  • max features per split (default sqrt(d)) - worth playing around with, since it significantly alters the behaviour of the whole tree. The sqrt heuristic is usually a good starting point, but the actual sweet spot might be somewhere else (see the sketch after this list)
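A minimal sketch of scanning the last two of these, again reusing the X and y stand-ins from above. The out-of-bag score is used as a cheap validation estimate (with only 9 samples it will be very noisy), and all values are illustrative:

from sklearn.ensemble import RandomForestClassifier

for depth in [2, 4, 8, None]:        # None grows full trees (the default)
    for mf in ["sqrt", 0.1, 0.5]:    # features considered at each split
        forest = RandomForestClassifier(n_estimators=500,
                                        max_depth=depth,
                                        max_features=mf,
                                        oob_score=True,  # out-of-bag estimate, no holdout needed
                                        random_state=0)
        forest.fit(X, y)  # X, y as in the earlier sketch
        print(depth, mf, forest.oob_score_)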

Answered by Anant Gupta

This wonderful article has a detailed explanation of tunable parameters, how to track the performance vs speed trade-off, some practical tips, and how to perform a grid search.

Answered by Liu Bei

n_estimators is a good one, as others have said. Increasing it also helps deal with overfitting.

But I think min_samples_split is also helpful when dealing with the overfitting that occurs in a small-sample but high-dimensional dataset.
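A minimal sketch, again with the X and y stand-ins from above; min_samples_split=4 is an arbitrary illustrative value:

from sklearn.ensemble import RandomForestClassifier

# Raising min_samples_split above the default of 2 stops trees from
# splitting tiny groups of samples, limiting how closely they fit noise
forest = RandomForestClassifier(n_estimators=500,
                                min_samples_split=4,
                                random_state=0)
forest.fit(X, y)  # X, y as in the earlier sketch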
