Original source: http://stackoverflow.com/questions/30805192/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
scikit-learn: Random forest class_weight and sample_weight parameters
Asked by user36047
I have a class imbalance problem and have been experimenting with a weighted Random Forest using the implementation in scikit-learn (>= 0.16).
I have noticed that the implementation takes a class_weight parameter in the tree constructor and a sample_weight parameter in the fit method to help solve class imbalance. Those two seem to be multiplied, though, to decide a final weight.
I have trouble understanding the following:
- In what stages of the tree construction/training/prediction are those weights used? I have seen some papers for weighted trees, but I am not sure what scikit implements.
- What exactly is the difference between class_weight and sample_weight?
Accepted answer by Andreus
Random forests are built on decision trees, which are very well documented. Check how trees use sample weighting:
- User guide on decision trees: tells exactly what algorithm is used
- Decision tree API: explains how sample_weight is used by trees (which for random forests, as you have determined, is the product of class_weight and sample_weight).
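To give a feel for where a per-sample weight can enter tree training, here is a minimal sketch of a weighted Gini impurity, in which each example contributes its weight rather than a count of 1 to the class totals. The helper `weighted_gini` is hypothetical and written only for illustration; it is not scikit-learn's internal code, so treat the linked documentation as the authority on the exact criterion.

```python
import numpy as np


def weighted_gini(y, sample_weight):
    """Gini impurity where each example counts with its weight.

    Hypothetical helper for illustration only; assumes `y` holds
    non-negative integer class labels.
    """
    # Sum the weights per class instead of counting examples.
    class_totals = np.bincount(y, weights=sample_weight)
    proportions = class_totals / class_totals.sum()
    return 1.0 - np.sum(proportions ** 2)


y = np.array([0, 0, 0, 1])

# Uniform weights: class proportions are 0.75 and 0.25.
gini_uniform = weighted_gini(y, np.ones(4))

# Up-weighting the single minority example by 3 makes both classes
# carry the same total weight, so the node looks perfectly mixed.
gini_reweighted = weighted_gini(y, np.array([1.0, 1.0, 1.0, 3.0]))

print(gini_uniform, gini_reweighted)
```

Under this view, raising the weight of minority-class examples changes which splits look attractive during training, which is the intuition behind using weights for class imbalance.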
As for the difference between class_weight and sample_weight: much can be determined simply by the nature of their datatypes. sample_weight is a 1D array of length n_samples, assigning an explicit weight to each example used for training. class_weight is either a dictionary mapping each class to a uniform weight for that class (e.g., {1:.9, 2:.5, 3:.01}), or a string telling sklearn how to automatically determine this dictionary.
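The two datatypes can be put side by side in a short sketch. It assumes a recent scikit-learn release, where the string option is spelled "balanced" (older releases such as 0.16 used different strings) and where `compute_class_weight` takes `classes` and `y` as keyword arguments. The toy labels and the choice to up-weight the last example are made up for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy imbalanced labels (illustrative): 8 examples of class 0, 2 of class 1.
y = np.array([0] * 8 + [1] * 2)
classes = np.unique(y)

# The dictionary that the "balanced" string resolves to:
# n_samples / (n_classes * count_of_class).
balanced = compute_class_weight("balanced", classes=classes, y=y)
class_weight_dict = dict(zip(classes.tolist(), balanced.tolist()))

# sample_weight is instead one explicit number per training example,
# independent of the class; here the last example is up-weighted.
sample_weight = np.ones(len(y))
sample_weight[-1] = 3.0

print(class_weight_dict)
print(sample_weight.shape)
```

The dictionary has one entry per class, while the array has one entry per row of the training data, which is the practical difference between the two parameters.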
So the training weight for a given example is the product of its explicitly named sample_weight (or 1 if sample_weight is not provided) and its class_weight (or 1 if class_weight is not provided).
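One way to check the product rule empirically is to fit two forests with the same random seed: one given class_weight and sample_weight separately, the other given only their hand-computed product as sample_weight. This is a sketch under assumptions: the synthetic dataset, the weight values, and the forest size are all arbitrary choices for illustration, and it uses a dictionary class_weight (the string options that reweight per bootstrap sample are not covered here).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced data (illustrative values only).
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.9, 0.1], random_state=0
)

rng = np.random.RandomState(0)
sample_weight = rng.uniform(0.5, 2.0, size=y.shape[0])
class_weight = {0: 1.0, 1: 5.0}

# Forest A: both weights are passed through their own parameters.
rf_a = RandomForestClassifier(
    n_estimators=20, class_weight=class_weight, random_state=0
)
rf_a.fit(X, y, sample_weight=sample_weight)

# Forest B: no class_weight; the product is passed as sample_weight.
per_sample_class_weight = np.array([class_weight[label] for label in y])
combined_weight = sample_weight * per_sample_class_weight
rf_b = RandomForestClassifier(n_estimators=20, random_state=0)
rf_b.fit(X, y, sample_weight=combined_weight)

# If the product rule holds, both forests give the same probabilities.
same_predictions = np.allclose(rf_a.predict_proba(X), rf_b.predict_proba(X))
print(same_predictions)
```

If the two forests agree, class_weight can be read as a convenience for building a per-class component of the final per-sample weight.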