Difference between StratifiedKFold and StratifiedShuffleSplit in Python sklearn

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/45969390/


difference between StratifiedKFold and StratifiedShuffleSplit in sklearn

Tags: python, scikit-learn, cross-validation

Asked by gabboshow

As the title says, I am wondering what the difference is between


StratifiedKFold with the parameter shuffle = True


StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

and

StratifiedShuffleSplit


StratifiedShuffleSplit(n_splits=10, test_size='default', train_size=None, random_state=0)

and what is the advantage of using StratifiedShuffleSplit?


Answer by Ken Syme

In KFold, the test sets should not overlap, even with shuffle. With KFold and shuffle, the data is shuffled once at the start and then divided into the desired number of splits. The test data is always one of the splits; the train data is the rest.


In ShuffleSplit, the data is shuffled every time, and then split. This means the test sets may overlap between the splits.


See this block for an example of the difference. Note the overlap of the elements in the test sets for ShuffleSplit.


splits = 5

# 10 samples, balanced between two classes (5 of class 0, 5 of class 1)
tx = range(10)
ty = [0] * 5 + [1] * 5

from sklearn.model_selection import StratifiedShuffleSplit, StratifiedKFold

kfold = StratifiedKFold(n_splits=splits, shuffle=True, random_state=42)
shufflesplit = StratifiedShuffleSplit(n_splits=splits, random_state=42, test_size=2)

print("KFold")
for train_index, test_index in kfold.split(tx, ty):
    print("TRAIN:", train_index, "TEST:", test_index)

print("Shuffle Split")
for train_index, test_index in shufflesplit.split(tx, ty):
    print("TRAIN:", train_index, "TEST:", test_index)

Output:


KFold
TRAIN: [0 2 3 4 5 6 7 9] TEST: [1 8]
TRAIN: [0 1 2 3 5 7 8 9] TEST: [4 6]
TRAIN: [0 1 3 4 5 6 8 9] TEST: [2 7]
TRAIN: [1 2 3 4 6 7 8 9] TEST: [0 5]
TRAIN: [0 1 2 4 5 6 7 8] TEST: [3 9]
Shuffle Split
TRAIN: [8 4 1 0 6 5 7 2] TEST: [3 9]
TRAIN: [7 0 3 9 4 5 1 6] TEST: [8 2]
TRAIN: [1 2 5 6 4 8 9 0] TEST: [3 7]
TRAIN: [4 6 7 8 3 5 1 2] TEST: [9 0]
TRAIN: [7 2 6 5 4 3 0 9] TEST: [1 8]

As for when to use them, I tend to use KFold for any cross-validation, and I use ShuffleSplit with a split of 2 for my train/test set splits. But I'm sure there are other use cases for both.

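For the train/test use case, setting n_splits=1 turns StratifiedShuffleSplit into a single stratified hold-out split. A minimal sketch (the data and the test_size ratio below are just placeholders):

from sklearn.model_selection import StratifiedShuffleSplit

X = list(range(10))
y = [0] * 5 + [1] * 5

# n_splits=1 yields exactly one stratified train/test split;
# test_size=0.2 is an arbitrary example ratio
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_index, test_index = next(sss.split(X, y))
print("TRAIN:", train_index, "TEST:", test_index)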

Answer by Catbuilts

@Ken Syme already has a very good answer. I just want to add something.


  • StratifiedKFold is a variation of KFold. First, StratifiedKFold shuffles your data, then splits it into n_splits parts, and it is done. From then on, it simply uses each part in turn as a test set. Note that it only and always shuffles the data one time, before splitting.

With shuffle = True, the shuffling is controlled by your random_state; if you do not pass a random_state, np.random is used by default. For example, with n_splits = 4 and data whose y (the dependent variable) has 3 classes (labels), the 4 test sets cover all the data without any overlap.


[Image: StratifiedKFold with n_splits = 4; the four test sets partition the data with no overlap]
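As a quick illustrative check of the random_state point above (toy data, arbitrary seed), two StratifiedKFold instances with shuffle=True and the same random_state produce identical folds:

from sklearn.model_selection import StratifiedKFold

X = list(range(8))
y = [0, 0, 0, 0, 1, 1, 1, 1]

# same random_state -> same shuffle -> identical fold assignments
skf_a = StratifiedKFold(n_splits=4, shuffle=True, random_state=7)
skf_b = StratifiedKFold(n_splits=4, shuffle=True, random_state=7)

folds_a = [list(test) for _, test in skf_a.split(X, y)]
folds_b = [list(test) for _, test in skf_b.split(X, y)]
print(folds_a == folds_b)   # True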

  • On the other hand, StratifiedShuffleSplit is a variation of ShuffleSplit. First, StratifiedShuffleSplit shuffles your data, and then it also splits the data into n_splits parts. However, it is not done yet. After this step, StratifiedShuffleSplit picks one part to use as a test set. Then it repeats the same process n_splits - 1 more times, to get n_splits - 1 other test sets. Look at the picture below: with the same data, this time the 4 test sets do not cover all the data, i.e. there are overlaps among the test sets.

[Image: StratifiedShuffleSplit with n_splits = 4; the four test sets overlap and do not cover all the data]

So the difference here is that StratifiedKFold shuffles and splits just once, so the test sets do not overlap, while StratifiedShuffleSplit shuffles each time before splitting and splits n_splits times, so the test sets can overlap.
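To make that concrete, the test indices produced by each splitter can be collected and compared. A small sketch (the toy data and parameters are arbitrary):

import numpy as np
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 10 + [1] * 10)

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
sss = StratifiedShuffleSplit(n_splits=4, test_size=5, random_state=0)

# StratifiedKFold: the 4 test sets partition all 20 samples
kfold_tests = [set(test) for _, test in skf.split(X, y)]
print(len(set.union(*kfold_tests)))   # 20, every sample appears in exactly one test set

# StratifiedShuffleSplit: test sets are drawn independently,
# so they can overlap and need not cover every sample
sss_tests = [set(test) for _, test in sss.split(X, y)]
print(len(set.union(*sss_tests)))     # usually less than 20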

  • Note: both methods use "stratified" folds (that is why "stratified" appears in both names). It means each part preserves the same percentage of samples of each class (label) as the original data, as the short sketch after this list illustrates. You can read more in the cross_validation documentation.
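Here is that short sketch (illustrative data with a 2:1 class ratio): counting the labels in each test fold shows that the ratio is preserved.

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.zeros((12, 1))
y = np.array([0] * 8 + [1] * 4)   # 2:1 class ratio overall

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
for _, test_index in skf.split(X, y):
    # every test fold keeps the original 2:1 ratio
    print(np.bincount(y[test_index]))   # prints [2 1] for each fold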

Answer by Black Raven

Pictorial representation: output examples of KFold, StratifiedKFold, StratifiedShuffleSplit (how do I show this picture in this window?)


The above pictorial representation is based on Ken Syme's code:


from sklearn.model_selection import KFold, StratifiedKFold, StratifiedShuffleSplit
SEED = 43
SPLIT = 3

X_train = [0,1,2,3,4,5,6,7,8]
y_train = [0,0,0,0,0,0,1,1,1]   # note 6,7,8 are labelled class '1'

print("KFold, shuffle=False (default)")
kf = KFold(n_splits=SPLIT)  # random_state omitted: it has no effect when shuffle=False, and recent sklearn rejects it
for train_index, test_index in kf.split(X_train, y_train):
    print("TRAIN:", train_index, "TEST:", test_index)

print("KFold, shuffle=True")
kf = KFold(n_splits=SPLIT, shuffle=True, random_state=SEED)
for train_index, test_index in kf.split(X_train, y_train):
    print("TRAIN:", train_index, "TEST:", test_index)

print("\nStratifiedKFold, shuffle=False (default)")
skf = StratifiedKFold(n_splits=SPLIT)  # random_state omitted: it only matters when shuffle=True
for train_index, test_index in skf.split(X_train, y_train):
    print("TRAIN:", train_index, "TEST:", test_index)

print("StratifiedKFold, shuffle=True")
skf = StratifiedKFold(n_splits=SPLIT, shuffle=True, random_state=SEED)
for train_index, test_index in skf.split(X_train, y_train):
    print("TRAIN:", train_index, "TEST:", test_index)

print("\nStratifiedShuffleSplit")
sss = StratifiedShuffleSplit(n_splits=SPLIT, random_state=SEED, test_size=3)
for train_index, test_index in sss.split(X_train, y_train):
    print("TRAIN:", train_index, "TEST:", test_index)

print("\nStratifiedShuffleSplit (can customise test_size)")
sss = StratifiedShuffleSplit(n_splits=SPLIT, random_state=SEED, test_size=2)
for train_index, test_index in sss.split(X_train, y_train):
    print("TRAIN:", train_index, "TEST:", test_index)