Python: How to tune parameters in Random Forest, using Scikit Learn?
Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/36107820/
How to tune parameters in Random Forest, using Scikit Learn?
Asked by O.rka
class sklearn.ensemble.RandomForestClassifier(
    n_estimators=10,
    criterion='gini',
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    min_weight_fraction_leaf=0.0,
    max_features='auto',
    max_leaf_nodes=None,
    bootstrap=True,
    oob_score=False,
    n_jobs=1,
    random_state=None,
    verbose=0,
    warm_start=False,
    class_weight=None)
I'm using a random forest model with 9 samples and about 7000 attributes. Of these samples, there are 3 categories that my classifier recognizes.
I know this is far from ideal conditions but I'm trying to figure out which attributes are the most important in feature predictions. Which parameters would be the best to tweak for optimizing feature importance?
I tried different values of n_estimators and noticed that the number of "significant features" (i.e. nonzero values in the feature_importances_ array) increased dramatically.
I've read through the documentation, but if anyone has experience with this, I would like to know which parameters are best to tune, with a brief explanation of why.
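For context, a minimal sketch of how the feature_importances_ array mentioned above is produced, using placeholder data with the shapes described in the question (the random X and y below are stand-ins, not real data):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder data: 9 samples, ~7000 attributes, 3 classes.
rng = np.random.RandomState(0)
X = rng.rand(9, 7000)
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# One importance value per attribute; most entries are zero because
# only a small subset of features is ever chosen for a split.
importances = forest.feature_importances_
print("nonzero importances:", np.count_nonzero(importances))
print("top 10 attributes:", np.argsort(importances)[::-1][:10])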
Answered by Randy Olson
From my experience, there are three parameters worth exploring with the sklearn RandomForestClassifier, in order of importance:
n_estimators
max_features
criterion
n_estimators is not really worth optimizing. The more estimators you give it, the better it will do; 500 or 1000 is usually sufficient.
max_features is worth exploring over many different values. It can have a large impact on the behavior of the RF because it decides how many features each tree in the RF considers at each split.
criterion may have a small impact, but usually the default is fine. If you have the time, try it out.
Make sure to use sklearn's GridSearch (preferably GridSearchCV, but your data set size is too small) when trying out these parameters.
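A minimal sketch of what such a grid search could look like with a recent scikit-learn (the grid values are illustrative assumptions, and with only 9 samples each cross-validation fold is necessarily tiny):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder data with the shapes from the question: 9 samples,
# ~7000 attributes, 3 classes (3 samples per class).
rng = np.random.RandomState(0)
X = rng.rand(9, 7000)
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

# Illustrative grid over the three parameters discussed above.
param_grid = {
    "n_estimators": [100, 500, 1000],
    "max_features": ["sqrt", "log2", 0.2],
    "criterion": ["gini", "entropy"],
}

# cv=3 leaves exactly one sample per class in each validation fold.
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)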
If I understand your question correctly, though, you only have 9 samples and 3 classes? Presumably 3 samples per class? It is very, very likely that your RF will overfit with so little data, unless the samples are good, representative records.
Answered by lejlot
The crucial parts are usually these three elements:
- number of estimators - usually, the bigger the forest the better; there is little risk of overfitting here
- max depth of each tree (default None, leading to fully grown trees) - reducing the maximum depth helps fight overfitting
- max features per split (default sqrt(d)) - you might want to play around with this a bit, as it significantly alters the behaviour of the whole tree; the sqrt heuristic is usually a good starting point, but the actual sweet spot might be somewhere else (see the sketch after this list)
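A minimal sketch of these three knobs together, assuming the same kind of placeholder data as in the question (the specific values are assumptions, not recommendations):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X, y = rng.rand(9, 7000), np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

forest = RandomForestClassifier(
    n_estimators=500,     # bigger forest: generally better, little overfitting risk
    max_depth=5,          # cap the depth of each tree to fight overfitting
    max_features="sqrt",  # sqrt(d) heuristic as a starting point
    random_state=0,
)
print(cross_val_score(forest, X, y, cv=3).mean())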
Answered by Liu Bei
n_estimators is a good one, as others have said. Increasing it also helps deal with overfitting.
But I think min_samples_split is also helpful when dealing with overfitting in a dataset with few samples but many features.
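A minimal sketch, assuming the same placeholder data as above; the value 4 is purely illustrative:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X, y = rng.rand(9, 7000), np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

# Requiring more samples before a node may split keeps each tree
# shallower, which can reduce overfitting on small-sample,
# high-dimensional data.
forest = RandomForestClassifier(n_estimators=500, min_samples_split=4,
                                random_state=0)
forest.fit(X, y)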