sklearn train_test_split on pandas stratify by multiple columns
Original question: http://stackoverflow.com/questions/45516424/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by Caitlin
I'm a relatively new user to sklearn and have run into some unexpected behavior in train_test_split from sklearn.model_selection. I have a pandas dataframe that I would like to split into a training and test set. I would like to stratify my data by at least 2, but ideally 4 columns in my dataframe.
sklearn raised no warnings when I tried to do this, but I later found that there were repeated rows in my final data set. I created a sample test to show this behavior:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# 'a' is a unique row id; 'b' and 'c' are the columns to stratify on
a = np.arange(1000000)
b = a % 10
c = a % 5
df = pd.DataFrame({'a': a, 'b': b, 'c': c})
It seems to work as expected if I stratify by either column:
train, test = train_test_split(df, test_size=0.2, random_state=0, stratify=df[['b']])
print(len(train.a.values)) # prints 800000
print(len(set(train.a.values))) # prints 800000
train, test = train_test_split(df, test_size=0.2, random_state=0, stratify=df[['c']])
print(len(train.a.values)) # prints 800000
print(len(set(train.a.values))) # prints 800000
But when I try to stratify by both columns, I get repeated values:
train, test = train_test_split(df, test_size=0.2, random_state=0, stratify=df[['b', 'c']])
print(len(train.a.values)) # prints 800000
print(len(set(train.a.values))) # prints 640000
Answered by andrew_reece
The reason you're getting duplicates is that train_test_split() eventually defines strata as the unique set of values of whatever you passed into the stratify argument. Since strata are defined from two columns, one row of data may represent more than one stratum, and so sampling may choose the same row twice because it thinks it's sampling from different classes.
The train_test_split() function calls StratifiedShuffleSplit, which uses np.unique() on y (which is what you pass in via stratify). From the source code:
classes, y_indices = np.unique(y, return_inverse=True)
n_classes = classes.shape[0]
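To see what those two lines do with two-column input, here is a minimal sketch using NumPy alone. The array y below is a made-up stand-in for df[['b', 'c']], not data from the question, and the snippet only illustrates the np.unique() call quoted above; how a given scikit-learn release preprocesses y before this point is not shown here.

```python
import numpy as np

# Hypothetical two-column labels, one row per sample: (b, c).
y = np.array([["foo", "y"],
              ["bar", "z"],
              ["foo", "z"]])

# Without an axis argument, np.unique flattens the 2-D input first,
# so values from both columns are pooled into one set of "classes".
classes, y_indices = np.unique(y, return_inverse=True)
n_classes = classes.shape[0]

print(classes)             # ['bar' 'foo' 'y' 'z']
print(n_classes)           # 4
print(np.size(y_indices))  # 6: one label per cell, not one per row

# Treating each row as a unit gives the joint strata instead.
joint_strata = sorted(set(map(tuple, y.tolist())))
print(joint_strata)        # [('bar', 'z'), ('foo', 'y'), ('foo', 'z')]
```

Three rows produce six class labels, which is why a single row can be drawn once per column.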
Here's a simplified sample case, a variation on the example you provided:
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
N = 20
a = np.arange(N)
b = np.random.choice(["foo","bar"], size=N)
c = np.random.choice(["y","z"], size=N)
df = pd.DataFrame({'a':a, 'b':b, 'c':c})
print(df)
a b c
0 0 bar y
1 1 foo y
2 2 bar z
3 3 bar y
4 4 foo z
5 5 bar y
...
The stratification function thinks there are four classes to split on: foo, bar, y, and z. But since these classes are essentially nested, meaning y and z both show up in b == foo and b == bar, we'll get duplicates when the splitter tries to sample from each class.
train, test = train_test_split(df, test_size=0.2, random_state=0,
stratify=df[['b', 'c']])
print(len(train.a.values)) # 16
print(len(set(train.a.values))) # 12
print(train)
a b c
3 3 bar y # selecting a = 3 for b = bar*
5 5 bar y
13 13 foo y
4 4 foo z
14 14 bar z
10 10 foo z
3 3 bar y # selecting a = 3 for c = y
6 6 bar y
16 16 foo y
18 18 bar z
6 6 bar y
8 8 foo y
18 18 bar z
7 7 bar z
4 4 foo z
19 19 bar y
#* We can't be sure which row is selecting for `bar` or `y`,
# I'm just illustrating the idea here.
There's a larger design question here: Do you want to use nested stratified sampling, or do you actually just want to treat each class in df.b and df.c as a separate class to sample from? If the latter, that's what you're already getting. The former is more complicated, and that's not what train_test_split is set up to do.
You might find this discussion of nested stratified sampling useful.
Answered by Sesquipedalism
If you want train_test_split to behave as you expected (stratify by multiple columns with no duplicates), create a new column that is a concatenation of the values in your other columns and stratify on the new column.
df['bc'] = df['b'].astype(str) + df['c'].astype(str)
train, test = train_test_split(df, test_size=0.2, random_state=0, stratify=df[['bc']])
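As a rough self-contained check of this approach, the sketch below builds a small frame, stratifies on the combined key, and counts distinct row ids. The data (1000 rows, b = a % 10, c = a % 3) is an assumption chosen for illustration so that every (b, c) combination has plenty of rows; it is not the data from the question. The key is passed as a 1-D Series here rather than a one-column frame.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative data: 'a' is a unique row id, 'b' and 'c' are the strata columns.
a = np.arange(1000)
df = pd.DataFrame({'a': a, 'b': a % 10, 'c': a % 3})

# One combined key per row, so each row belongs to exactly one stratum.
df['bc'] = df['b'].astype(str) + df['c'].astype(str)

train, test = train_test_split(df, test_size=0.2, random_state=0,
                               stratify=df['bc'])

n_train = len(train)
n_unique = train['a'].nunique()
overlap = set(train['a']) & set(test['a'])

print(n_train)   # 800
print(n_unique)  # 800, so no row was selected twice
print(overlap)   # set(), so train and test share no rows
```

Note that each combined stratum needs at least two rows, otherwise train_test_split raises an error instead of splitting.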
If you're worried about collisions, such as the pairs 11 and 3, and 1 and 13, both producing the concatenated value 113, then you can add an arbitrary separator string in the middle:
df['bc'] = df['b'].astype(str) + "_" + df['c'].astype(str)
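The collision is easy to reproduce in plain Python. This small sketch uses the two value pairs mentioned above; the variable names are only illustrative.

```python
# The two (b, c) pairs from the example above.
pairs = [(11, 3), (1, 13)]

# Plain concatenation maps both pairs to the same key.
plain = [str(b) + str(c) for b, c in pairs]

# A separator keeps the two strata distinct.
separated = [str(b) + "_" + str(c) for b, c in pairs]

print(plain)      # ['113', '113']
print(separated)  # ['11_3', '1_13']
```

The separator only helps if it cannot appear inside the values themselves, so pick one that your columns never contain.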
Answered by Louis T
What version of scikit-learn are you using? You can use sklearn.__version__ to check.
Prior to version 0.19.0, scikit-learn did not handle 2-dimensional stratification correctly. This was patched in 0.19.0.
It is described in issue #9044.
Updating scikit-learn should fix the problem. If you can't update, see the commit history here for the fix.
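If you want to make the version check explicit in a script, something like the following could work. The helper name has_2d_stratify_fix is hypothetical, not part of scikit-learn, and it only compares the major and minor numbers against the 0.19 threshold given in this answer.

```python
import re


def has_2d_stratify_fix(version: str) -> bool:
    """Return True if a version string is at or above 0.19 (major.minor only)."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (0, 19)


print(has_2d_stratify_fix("0.18.2"))  # False
print(has_2d_stratify_fix("0.19.0"))  # True
```

In practice you would call it as has_2d_stratify_fix(sklearn.__version__) after importing sklearn.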