How do I solve overfitting in random forest of Python sklearn?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original link, and attribute it to the original authors (not me): StackOverflow

Original link: http://stackoverflow.com/questions/20463281/
Asked by Munichong
I am using the RandomForestClassifier implemented in the Python sklearn package to build a binary classification model. Below are the results of cross-validation:
Fold 1 : Train: 164 Test: 40
Train Accuracy: 0.914634146341
Test Accuracy: 0.55
Fold 2 : Train: 163 Test: 41
Train Accuracy: 0.871165644172
Test Accuracy: 0.707317073171
Fold 3 : Train: 163 Test: 41
Train Accuracy: 0.889570552147
Test Accuracy: 0.585365853659
Fold 4 : Train: 163 Test: 41
Train Accuracy: 0.871165644172
Test Accuracy: 0.756097560976
Fold 5 : Train: 163 Test: 41
Train Accuracy: 0.883435582822
Test Accuracy: 0.512195121951
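
For reference, here is a minimal sketch of how such per-fold numbers can be produced with the current sklearn API. The data `X` and `y` are random placeholders standing in for the real Price/quality columns, and the 5-fold split is an assumption inferred from the fold sizes above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

# Random placeholder data: 204 samples, one "Price"-like feature (assumption).
rng = np.random.RandomState(0)
X = rng.rand(204, 1)
y = rng.randint(0, 2, 204)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for i, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    clf = RandomForestClassifier(random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    print("Fold", i, ": Train:", len(train_idx), "Test:", len(test_idx))
    print("Train Accuracy:", clf.score(X[train_idx], y[train_idx]))
    print("Test Accuracy:", clf.score(X[test_idx], y[test_idx]))
```

A large gap between train and test accuracy on every fold, as in the numbers above, is the overfitting signature being discussed.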
I am using the "Price" feature to predict "quality", which is an ordinal value. In each cross-validation fold, there are 163 training examples and 41 test examples.
Apparently, overfitting occurs here. Are there any parameters provided by sklearn that can be used to overcome this problem? I found some parameters here, e.g. min_samples_split and min_samples_leaf, but I do not quite understand how to tune them.

Thanks in advance!
Answered by Simon
I would agree with @Falcon w.r.t. the dataset size. The main problem is likely the small size of the dataset. If possible, the best thing you can do is get more data: the more data you have, the less likely the model is (generally) to overfit, because random patterns that appear predictive start to get drowned out as the dataset size increases.
That said, I would look at the following parameters (a tuning sketch follows the list):
- n_estimators: @Falcon is wrong; in general, the more trees, the less likely the algorithm is to overfit. So try increasing this. The lower this number, the closer the model is to a single decision tree with a restricted feature set.
- max_features: try reducing this number (try 30-50% of the number of features). This determines how many features each tree is randomly assigned. The smaller it is, the less likely the model is to overfit, but too small a value will start to introduce underfitting.
- max_depth: experiment with this. It will reduce the complexity of the learned models, lowering the risk of overfitting. Try starting small, say 5-10, and increasing until you get the best result.
- min_samples_leaf: try setting this to values greater than one. This has a similar effect to the max_depth parameter: it means a branch will stop splitting once its leaves each have that number of samples.
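
As a starting point, here is a minimal, hedged sketch of a more conservative configuration touching each of these parameters. The specific values are illustrative assumptions to be tuned against a development set, not recommendations:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Random placeholder data standing in for the real Price/quality columns.
rng = np.random.RandomState(0)
X, y = rng.rand(204, 1), rng.randint(0, 2, 204)

# Conservative settings aimed at reducing overfitting (values are assumptions):
clf = RandomForestClassifier(
    n_estimators=500,    # more trees lowers variance
    max_features=0.5,    # consider ~50% of features at each split
    max_depth=5,         # cap tree depth to limit model complexity
    min_samples_leaf=5,  # require at least 5 samples per leaf
    random_state=0,
)
scores = cross_val_score(clf, X, y, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```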
Note: be scientific when doing this work. Use three datasets: a training set, a separate 'development' set to tweak your parameters, and a test set to evaluate the final model with the optimal parameters. Change only one parameter at a time and evaluate the result, or experiment with the sklearn grid search machinery to search across these parameters all at once (see the sketch below).
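
A minimal sketch of that grid search using sklearn's GridSearchCV. The parameter grid is an assumption to be adapted to the data, and the held-out test set must not be touched during the search:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Random placeholder data standing in for the real Price/quality columns.
rng = np.random.RandomState(0)
X, y = rng.rand(204, 1), rng.randint(0, 2, 204)

# Hold out a final test set; the grid search only sees the training portion.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

param_grid = {  # illustrative grid, an assumption to adapt per dataset
    "n_estimators": [100, 300, 500],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 5, 10],
}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)
print("Best params:", search.best_params_)
print("Test accuracy:", search.score(X_test, y_test))
```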

