Python: k-fold stratified cross-validation with imbalanced classes

Note: this is an English rendering of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/32615429/

k-fold stratified cross-validation with imbalanced classes

Tags: python, machine-learning, scikit-learn

Asked by eleanora

I have data with 4 classes and I am trying to build a classifier. I have ~1000 vectors for one class, ~10^4 for another, ~10^5 for the third and ~10^6 for the fourth. I was hoping to use cross-validation so I looked at the scikit-learn docs.

My first try was to use StratifiedShuffleSplit, but this gives the same percentage for each class, still leaving the classes drastically imbalanced.

Is there a way to do cross-validation but with the classes balanced in the training and test sets?



As a side note, I couldn't work out the difference between StratifiedShuffleSplit and StratifiedKFold. The descriptions look very similar to me.

Accepted answer by IVlad

My first try was to use StratifiedShuffleSplit but this gives the same percentage for each class, leaving the classes drastically imbalanced still.

I get the feeling that you're confusing what a stratified strategy will do, but you'll need to show your code and your results to say for sure what's going on (the same percentage as their percentage in the original set, or the same percentage within the returned train/test sets? The former is how it's supposed to be).

As a side note, I couldn't work out the difference between StratifiedShuffleSplit and StratifiedKFold. The descriptions look very similar to me.

One of these should definitely work. The description of the first one is admittedly a little confusing, but here's what each of them does.

StratifiedShuffleSplit

Provides train/test indices to split data in train test sets.

This means that it splits your data into a train and a test set. The stratified part means that percentages will be maintained in this split. So if 10% of your data is in class 1 and 90% is in class 2, this will ensure that 10% of your train set will be in class 1 and 90% will be in class 2. The same holds for the test set.

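A minimal sketch of that behavior (the 10%/90% toy labels and the sklearn.model_selection import are illustrative assumptions, not from the original post):

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# synthetic labels: 10% class 1, 90% class 2
y = np.array([1] * 10 + [2] * 90)
X = np.zeros((len(y), 1))  # dummy features

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
for train_idx, test_idx in sss.split(X, y):
    # both halves keep roughly 10% of class 1
    print("train fraction of class 1:", np.mean(y[train_idx] == 1))
    print("test fraction of class 1: ", np.mean(y[test_idx] == 1))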
Your post makes it sound like you'd want 50% of each class in the test set. That isn't what stratification does; stratification maintains the original percentages. You should maintain them, because otherwise you'll give yourself an irrelevant idea about the performance of your classifier: who cares how well it classifies a 50/50 split, when in practice you'll see 10/90 splits?

StratifiedKFold

This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.

See k-fold cross-validation. Without stratification, it just splits your data into k folds. Then, each fold 1 <= i <= k is used once as the test set, while the others are used for training. The results are averaged at the end. It's similar to running ShuffleSplit k times.

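For example, a plain (unstratified) KFold run looks like this; the toy data is an illustrative assumption:

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # toy features
kf = KFold(n_splits=5)
for train_idx, test_idx in kf.split(X):
    # each fold serves as the test set exactly once
    print("train:", train_idx, "test:", test_idx)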
Stratification ensures that the percentages of each class in your entire data set will be the same (or very close to it) within each individual fold.

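A minimal sketch of that guarantee, reusing the assumed 10%/90% labels from above:

import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([1] * 10 + [2] * 90)
X = np.zeros((len(y), 1))  # dummy features

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # every test fold keeps ~10% of class 1
    print(fold, np.mean(y[test_idx] == 1))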


There is a lot of literature that deals with imbalanced classes. Some simple-to-use methods involve class weights and analyzing the ROC curve. I suggest the following resources as starting points (a minimal class-weight sketch follows the list):

  1. A scikit-learn example of using class weights.
  2. A quora question about implementing neural networks for imbalanced data.
  3. This stats.stackexchange question with more in-depth answers.
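A minimal sketch of the class-weight idea (the synthetic data, the choice of LogisticRegression, and ROC-AUC scoring are illustrative assumptions):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(1000, 5)
y = (rng.rand(1000) < 0.05).astype(int)  # ~5% positives

# class_weight='balanced' reweights classes inversely to their frequency
clf = LogisticRegression(class_weight="balanced")
scores = cross_val_score(clf, X, y, cv=StratifiedKFold(n_splits=5),
                         scoring="roc_auc")
print(scores.mean())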

Answered by Jon

K-Fold CV

K-fold CV works by randomly partitioning your data into k (fairly) equal partitions. If your data were evenly balanced across classes, like [0,1,0,1,0,1,0,1,0,1], randomly sampling with (or without) replacement will give you approximately equal sample sizes of 0s and 1s.

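A quick sketch of the balanced case (the tiny alternating label vector is an illustrative assumption):

import numpy as np

y = np.array([0, 1] * 5)      # perfectly balanced labels
for fold in np.array_split(y, 5):
    print(fold, fold.mean())  # each fold is half 0s and half 1s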
However, if your data is more like [0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0], where one class over-represents the data, k-fold CV without weighted sampling would give you erroneous results.

If you use ordinary k-fold CV without adjusting sampling weights from uniform sampling, then you'd obtain something like

import numpy as np

## k-fold CV
# 27 zeros and 5 ones, with the minority class clustered at the end
# (reordered here so the contiguous split reproduces the output below)
y = np.array([0] * 27 + [1] * 5)
k = 5
splits = np.array_split(y, k)
print(splits)

 [array([0, 0, 0, 0, 0, 0, 0]),
 array([0, 0, 0, 0, 0, 0, 0]),
 array([0, 0, 0, 0, 0, 0]),
 array([0, 0, 0, 0, 0, 0]),
 array([0, 1, 1, 1, 1, 1])]

where there are clearly splits without useful representation of both classes.

The point of k-fold CV is to train/test a model across all subsets of data, while at each trial leaving out 1 subset and training on k-1 subsets.

In this scenario, you'd want to split by strata. In the above data set, there are 27 0s and 5 1s. If you'd like to compute k=5 CV, it wouldn't be reasonable to split the stratum of 1s into 5 subsets. A better solution is to split it into k < 5 subsets, such as 2. The stratum of 0s can remain with k=5 splits since it's much larger. Then while training, you'd have a simple 2 x 5 product of splits from the data set. Here is some code to illustrate:

import numpy as np
from itertools import groupby, product

# assumes the sorted y from the previous snippet: 27 zeros then 5 ones
for strata, iterable in groupby(y):
    data = np.array(list(iterable))
    if strata == 0:
        zeros = np.array_split(data, 5)  # majority class: 5 splits
    else:
        ones = np.array_split(data, 2)   # minority class: only 2 splits

cv_splits = list(product(zeros, ones))
print(cv_splits)

for i in range(2):
    for j in range(5):
        # ones[-i+1] / zeros[-j+1] pick one of the held-in splits; a full
        # CV run would concatenate all remaining splits for training instead
        data = np.concatenate((ones[-i+1], zeros[-j+1]))
        print("Leave out ONES split {}, and Leave out ZEROS split {}".format(i, j))
        print("train on: ", data)
        print("test on: ", np.concatenate((ones[i], zeros[j])))



Leave out ONES split 0, and Leave out ZEROS split 0
train on:  [1 1 0 0 0 0 0 0]
test on:  [1 1 1 0 0 0 0 0 0]
Leave out ONES split 0, and Leave out ZEROS split 1
train on:  [1 1 0 0 0 0 0 0]
...
Leave out ONES split 1, and Leave out ZEROS split 4
train on:  [1 1 1 0 0 0 0 0]
test on:  [1 1 0 0 0 0 0]

This method splits the data into partitions such that every partition is eventually left out for testing. Note that not all statistical learning methods allow for weighting, so adjusting methods like CV is essential to account for sampling proportions.

  • Reference: James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning: with Applications in R. Springer.