Unbalanced classification using RandomForestClassifier in sklearn
Note: this page is a translation of a popular StackOverflow Q&A, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): StackOverFlow
Original question: http://stackoverflow.com/questions/20082674/
Asked by mlo
I have a dataset where the classes are unbalanced. The classes are either '1' or '0', where the ratio of class '1' to class '0' is 5:1. How do you calculate the prediction error for each class and rebalance the weights accordingly in sklearn with Random Forest, kind of like in the following link: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#balance
Accepted answer by alko
You can pass a sample_weight argument to the Random Forest fit method:
sample_weight : array-like, shape = [n_samples] or None
Sample weights. If None, then samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node. In the case of classification, splits are also ignored if they would result in any single class carrying a negative weight in either child node.
In older versions there was a preprocessing.balance_weights method to generate balance weights for given samples, such that classes become uniformly distributed. It is still there, in the internal but still usable preprocessing._weights module, but it is deprecated and will be removed in future versions. I don't know the exact reasons for this.
Update
Some clarification, as you seem to be confused. sample_weight usage is straightforward once you remember that its purpose is to balance target classes in the training dataset. That is, if you have X as observations and y as classes (labels), then len(X) == len(y) == len(sample_weight), and each element of the sample_weight 1-d array represents the weight for the corresponding (observation, label) pair. For your case, if class 1 is represented 5 times as often as class 0 and you want to balance the class distributions, you could simply use
import numpy as np
sample_weight = np.array([5 if i == 0 else 1 for i in y])
assigning a weight of 5 to all instances of class 0 and a weight of 1 to all instances of class 1. See the link above for a slightly craftier balance_weights weight-evaluation function.
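To make this concrete, here is a minimal sketch (my own illustration on synthetic data, not part of the original answer) of passing such an array to fit:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.randn(600, 4)                  # 600 observations, 4 features
y = np.array([1] * 500 + [0] * 100)   # class 1 : class 0 = 5 : 1

# Upweight the minority class so both classes contribute equally in total.
sample_weight = np.array([5 if label == 0 else 1 for label in y])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y, sample_weight=sample_weight)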
Answered by Meena Mani
If the majority class is 1 and the minority class is 0, and they are in the ratio 5:1, the sample_weight array should be:
sample_weight = np.array([5 if i == 1 else 1 for i in y])
Note that you do not invert the ratios. This also applies to class_weights. The larger number is associated with the majority class.
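As a hypothetical sketch of the class_weights remark, following this answer's convention (majority class 1 gets the larger weight; note the sklearn parameter itself is spelled class_weight):

from sklearn.ensemble import RandomForestClassifier

# Per this answer's convention: the majority class 1 gets the larger weight.
clf = RandomForestClassifier(n_estimators=100,
                             class_weight={1: 5, 0: 1},
                             random_state=0)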
Answered by Anatoly Alekseev
It is really a shame that sklearn's "fit" method does not allow specifying a performance measure to be optimized. No one around seems to understand, question, or be interested in what's actually going on when one calls the fit method on a data sample while solving a classification task.
We (users of the scikit-learn package) are silently left with the suggestion to indirectly use cross-validated grid search with a specific scoring method suitable for unbalanced datasets, in the hope of stumbling upon a set of parameters/metaparameters that produces an appropriate AUC or F1 score.
But think about it: it looks like the "fit" method called under the hood always optimizes accuracy. So in the end, if we aim to maximize the F1 score, GridSearchCV gives us "the model with the best F1 from all models with the best accuracy". Is that not silly? Would it not be better to directly optimize the model's parameters for the maximal F1 score? Remember the good old Matlab ANNs package, where you can set the desired performance metric to RMSE, MAE, or whatever you want, given that a gradient-calculating algorithm is defined. Why is the choice of performance metric silently omitted from sklearn?
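For reference, a minimal sketch (my own assumption of the workaround described above, with an illustrative grid and synthetic dataset) of a cross-validated grid search scored by F1 instead of accuracy:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic unbalanced dataset, roughly 1:5 class ratio (an assumption).
X, y = make_classification(n_samples=600, weights=[0.17, 0.83], random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 5]},
    scoring="f1",  # candidates are ranked by F1 rather than accuracy
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)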
At the very least, why is there no simple option to assign class instance weights automatically to remedy unbalanced-dataset issues? Why do we have to calculate the weights manually? Besides, in many machine learning books/articles I have seen authors praising sklearn's manual as awesome, if not the best source of information on the topic. No, really? Why is the unbalanced-datasets problem (which is obviously of utter importance to data scientists) then not even covered anywhere in the docs? I address these questions to the contributors of sklearn, should they read this. Anyone who knows the reasons for this is welcome to comment and clear things up.
UPDATE
Since scikit-learn 0.17, there is a class_weight='balanced' option which you can pass to at least some classifiers:
The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).
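A small sketch (my own illustration) of what that formula yields for a 5:1 class ratio:

import numpy as np

y = np.array([1] * 500 + [0] * 100)       # class 1 : class 0 = 5 : 1
weights = len(y) / (2 * np.bincount(y))   # n_samples / (n_classes * np.bincount(y))
print(weights)                            # [3.  0.6]: class 0 gets 3.0, class 1 gets 0.6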
Answered by negas
Use the parameter class_weight='balanced'
From the sklearn documentation: the balanced mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data, as n_samples / (n_classes * np.bincount(y))
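A minimal usage sketch (the synthetic data is my own stand-in, not from the answer):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for your unbalanced training data.
X, y = make_classification(n_samples=600, weights=[0.17, 0.83], random_state=0)

clf = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=0)
clf.fit(X, y)  # per-class weights are derived automatically from y at fit time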

