Difference between StratifiedKFold and StratifiedShuffleSplit in Python sklearn
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse it, you must follow the same license and attribute the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/45969390/
difference between StratifiedKFold and StratifiedShuffleSplit in sklearn
Asked by gabboshow
As the title says, I am wondering what the difference is between
StratifiedKFold with the parameter shuffle = True
StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
and
StratifiedShuffleSplit(n_splits=10, test_size='default', train_size=None, random_state=0)
and what is the advantage of using StratifiedShuffleSplit
Answered by Ken Syme
In KFold, the test sets do not overlap, even with shuffle. With KFold and shuffle, the data is shuffled once at the start and then divided into the desired number of splits. The test data is always one of the splits; the train data is the rest.
In ShuffleSplit, the data is shuffled every time, and then split. This means the test sets may overlap between the splits.
See this block for an example of the difference. Note the overlap of the elements in the test sets for ShuffleSplit.
from sklearn.model_selection import StratifiedShuffleSplit, StratifiedKFold

splits = 5
tx = list(range(10))          # 10 samples
ty = [0] * 5 + [1] * 5        # two balanced classes

kfold = StratifiedKFold(n_splits=splits, shuffle=True, random_state=42)
shufflesplit = StratifiedShuffleSplit(n_splits=splits, random_state=42, test_size=2)

print("KFold")
for train_index, test_index in kfold.split(tx, ty):
    print("TRAIN:", train_index, "TEST:", test_index)

print("Shuffle Split")
for train_index, test_index in shufflesplit.split(tx, ty):
    print("TRAIN:", train_index, "TEST:", test_index)
Output:
KFold
TRAIN: [0 2 3 4 5 6 7 9] TEST: [1 8]
TRAIN: [0 1 2 3 5 7 8 9] TEST: [4 6]
TRAIN: [0 1 3 4 5 6 8 9] TEST: [2 7]
TRAIN: [1 2 3 4 6 7 8 9] TEST: [0 5]
TRAIN: [0 1 2 4 5 6 7 8] TEST: [3 9]
Shuffle Split
TRAIN: [8 4 1 0 6 5 7 2] TEST: [3 9]
TRAIN: [7 0 3 9 4 5 1 6] TEST: [8 2]
TRAIN: [1 2 5 6 4 8 9 0] TEST: [3 7]
TRAIN: [4 6 7 8 3 5 1 2] TEST: [9 0]
TRAIN: [7 2 6 5 4 3 0 9] TEST: [1 8]
As for when to use them, I tend to use KFolds for any cross validation, and I use ShuffleSplit with a split of 2 for my train/test set splits. But I'm sure there are other use cases for both.
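One way to read that last use is a single stratified train/test split. Below is a minimal sketch of that idea, assuming n_splits=1 and test_size=0.2 (both values are my own choices, not from the answer above):

from sklearn.model_selection import StratifiedShuffleSplit

X = list(range(10))
y = [0] * 5 + [1] * 5

# n_splits=1 yields exactly one stratified train/test split (assumed values)
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_index, test_index = next(sss.split(X, y))
print("TRAIN:", train_index, "TEST:", test_index)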
Answered by Catbuilts
@Ken Syme already has a very good answer. I just want to add something.
- StratifiedKFold is a variation of KFold. First, StratifiedKFold shuffles your data, then splits it into n_splits parts, and it is done. It then uses each part in turn as a test set. Note that it only and always shuffles the data one time, before splitting.
With shuffle = True, the data is shuffled using your random_state; if you do not set one, it is shuffled by np.random (the default). For example, with n_splits = 4 and data whose y (the dependent variable) has 3 classes (labels), the 4 test sets cover all the data without any overlap.
- On the other hand, StratifiedShuffleSplit is a variation of ShuffleSplit. First, StratifiedShuffleSplit shuffles your data and then also splits the data into n_splits parts. However, it is not done yet: after this step, StratifiedShuffleSplit picks one part to use as a test set, then repeats the same process n_splits - 1 more times to get n_splits - 1 other test sets. With the same data as above, this time the 4 test sets do not cover all the data, i.e. there are overlaps among the test sets.
So, the difference here is that StratifiedKFold just shuffles and splits once, therefore its test sets do not overlap, while StratifiedShuffleSplit shuffles each time before splitting and splits n_splits times, so its test sets can overlap.
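As a rough check of that summary, here is a small sketch (my own addition, reusing the 10-sample toy data from the first answer; exact counts depend on the random_state):

import numpy as np
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

X = np.arange(10).reshape(-1, 1)
y = np.array([0] * 5 + [1] * 5)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
sss = StratifiedShuffleSplit(n_splits=5, test_size=2, random_state=42)

# Collect every test index produced by each splitter
kfold_test = np.concatenate([test for _, test in skf.split(X, y)])
shuffle_test = np.concatenate([test for _, test in sss.split(X, y)])

print(len(np.unique(kfold_test)))    # 10: every index is a test sample exactly once
print(len(np.unique(shuffle_test)))  # usually < 10: some indices repeat across test sets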
- Note: the two methods both use "stratified folds" (that is why "stratified" appears in both names). It means each part preserves the same percentage of samples of each class (label) as the original data. You can read more in the cross_validation documentation.
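To see the stratification itself, a short sketch (again my own addition) that counts the class labels in each StratifiedKFold test fold; every fold keeps the 50/50 class ratio of the toy data:

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(10).reshape(-1, 1)
y = np.array([0] * 5 + [1] * 5)   # balanced two-class toy labels

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for _, test_index in skf.split(X, y):
    # np.bincount shows how many samples of each class landed in this test fold
    print(np.bincount(y[test_index]))   # prints [1 1] for every fold, same ratio as y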
Answered by Black Raven
Pictorial representation: output examples of KFold, StratifiedKFold, and StratifiedShuffleSplit (how do I show this picture in this window?)
The above pictorial representation is based on Ken Syme's code:
from sklearn.model_selection import KFold, StratifiedKFold, StratifiedShuffleSplit

SEED = 43
SPLIT = 3

X_train = [0, 1, 2, 3, 4, 5, 6, 7, 8]
y_train = [0, 0, 0, 0, 0, 0, 1, 1, 1]  # note 6, 7, 8 are labelled class '1'

print("KFold, shuffle=False (default)")
kf = KFold(n_splits=SPLIT)  # random_state has no effect when shuffle=False
for train_index, test_index in kf.split(X_train, y_train):
    print("TRAIN:", train_index, "TEST:", test_index)

print("KFold, shuffle=True")
kf = KFold(n_splits=SPLIT, shuffle=True, random_state=SEED)
for train_index, test_index in kf.split(X_train, y_train):
    print("TRAIN:", train_index, "TEST:", test_index)

print("\nStratifiedKFold, shuffle=False (default)")
skf = StratifiedKFold(n_splits=SPLIT)  # random_state has no effect when shuffle=False
for train_index, test_index in skf.split(X_train, y_train):
    print("TRAIN:", train_index, "TEST:", test_index)

print("StratifiedKFold, shuffle=True")
skf = StratifiedKFold(n_splits=SPLIT, shuffle=True, random_state=SEED)
for train_index, test_index in skf.split(X_train, y_train):
    print("TRAIN:", train_index, "TEST:", test_index)

print("\nStratifiedShuffleSplit")
sss = StratifiedShuffleSplit(n_splits=SPLIT, random_state=SEED, test_size=3)
for train_index, test_index in sss.split(X_train, y_train):
    print("TRAIN:", train_index, "TEST:", test_index)

print("\nStratifiedShuffleSplit (can customise test_size)")
sss = StratifiedShuffleSplit(n_splits=SPLIT, random_state=SEED, test_size=2)
for train_index, test_index in sss.split(X_train, y_train):
    print("TRAIN:", train_index, "TEST:", test_index)