Python sklearn train_test_split on pandas: stratify by multiple columns

Question

Asked by Caitlin

I'm a relatively new user to sklearn and have run into some unexpected behavior in train_test_split from sklearn.model_selection. I have a pandas dataframe that I would like to split into a training and test set. I would like to stratify my data by at least 2, but ideally 4 columns in my dataframe.

There were no warnings from sklearn when I tried to do this, however I found later that there were repeated rows in my final data set. I created a sample test to show this behavior:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

a = np.arange(1000000)  # unique row id
b = a % 10              # first stratification column
c = a % 5               # second stratification column
df = pd.DataFrame({'a': a, 'b': b, 'c': c})

It seems to work as expected if I stratify by either column:

train, test = train_test_split(df, test_size=0.2, random_state=0, stratify=df[['b']])
print(len(train.a.values))  # prints 800000
print(len(set(train.a.values)))  # prints 800000

train, test = train_test_split(df, test_size=0.2, random_state=0, stratify=df[['c']])
print(len(train.a.values))  # prints 800000
print(len(set(train.a.values)))  # prints 800000

But when I try to stratify by both columns, I get repeated values:

train, test = train_test_split(df, test_size=0.2, random_state=0, stratify=df[['b', 'c']])
print(len(train.a.values))  # prints 800000
print(len(set(train.a.values)))  # prints 640000

Answer 1

Answered by andrew_reece

You're getting duplicates because train_test_split() eventually defines strata as the unique set of values of whatever you passed into the stratify argument. Since the strata are defined from two columns, one row of data may represent more than one stratum, so sampling may choose the same row twice because it thinks it's sampling from different classes.

The train_test_split() function calls StratifiedShuffleSplit, which uses np.unique() on y (which is what you pass in via stratify). From the source code:

classes, y_indices = np.unique(y, return_inverse=True)
n_classes = classes.shape[0]
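
To see why a two-column y can turn into pooled classes, here is a minimal sketch of how np.unique() itself behaves on a 2-D array. The toy array is made up for illustration, and this only demonstrates NumPy; whether a given scikit-learn version passes y through in exactly this shape depends on the version.

```python
import numpy as np

# Toy stratification data (hypothetical): one row per sample, columns are b and c.
y = np.array([["foo", "y"],
              ["bar", "z"],
              ["foo", "y"],
              ["bar", "y"]])

# Without an `axis` argument, np.unique flattens the array first,
# so values from both columns are pooled into one set of labels.
flat_classes = np.unique(y)
print(flat_classes)  # ['bar' 'foo' 'y' 'z']

# With axis=0, each distinct row, i.e. each (b, c) pair, is one class.
row_classes = np.unique(y, axis=0)
print(row_classes.shape[0])  # number of distinct (b, c) pairs
```

The flattened call reports four labels even though only three distinct (b, c) pairs occur in the toy data, which is the mismatch described above.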

Here's a simplified sample case, a variation on the example you provided:

from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

N = 20
a = np.arange(N)
b = np.random.choice(["foo","bar"], size=N)
c = np.random.choice(["y","z"], size=N)
df = pd.DataFrame({'a':a, 'b':b, 'c':c})

print(df)
     a    b  c
0    0  bar  y
1    1  foo  y
2    2  bar  z
3    3  bar  y
4    4  foo  z
5    5  bar  y
...

The stratification function thinks there are four classes to split on: foo, bar, y, and z. But since these classes are essentially nested, meaning y and z both show up in b == foo and b == bar, we'll get duplicates when the splitter tries to sample from each class.

train, test = train_test_split(df, test_size=0.2, random_state=0, 
                               stratify=df[['b', 'c']])
print(len(train.a.values))  # 16
print(len(set(train.a.values)))  # 12

print(train)
     a    b  c
3    3  bar  y   # selecting a = 3 for b = bar*
5    5  bar  y
13  13  foo  y
4    4  foo  z
14  14  bar  z
10  10  foo  z
3    3  bar  y   # selecting a = 3 for c = y
6    6  bar  y
16  16  foo  y
18  18  bar  z
6    6  bar  y
8    8  foo  y
18  18  bar  z
7    7  bar  z
4    4  foo  z
19  19  bar  y

#* We can't be sure which row is selecting for `bar` or `y`, 
#  I'm just illustrating the idea here.

There's a larger design question here: do you want to use nested stratified sampling, or do you actually just want to treat each class in df.b and df.c as a separate class to sample from? If the latter, that's what you're already getting. The former is more complicated, and that's not what train_test_split is set up to do.

You might find this discussion of nested stratified sampling useful.
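
For reference, one way to approximate nested stratified sampling outside train_test_split is to sample within each (b, c) group using pandas directly. This is a rough sketch, not what scikit-learn does internally; it assumes pandas 1.1 or later (for GroupBy.sample), and the toy data below is invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): 1000 rows, four (b, c) combinations of 250 rows each.
a = np.arange(1000)
df = pd.DataFrame({'a': a, 'b': a % 2, 'c': (a // 2) % 2})

# Draw 20% of the rows from every (b, c) group for the test set.
test = df.groupby(['b', 'c']).sample(frac=0.2, random_state=0)

# Everything that was not drawn becomes the training set.
train = df.drop(test.index)

print(len(train), len(test))
print(test.groupby(['b', 'c']).size())  # rows drawn per (b, c) group
```

Because each row is drawn at most once and the training set is defined as the remainder, no row can appear twice or in both sets.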

Answer 2

Answered by Sesquipedalism

If you want train_test_split to behave as you expected (stratify by multiple columns with no duplicates), create a new column that is a concatenation of the values in your other columns and stratify on the new column.

df['bc'] = df['b'].astype(str) + df['c'].astype(str)
train, test = train_test_split(df, test_size=0.2, random_state=0, stratify=df[['bc']])

If you're worried about collisions, such as the pairs 11 and 3, and 1 and 13, both producing the concatenated value 113, then you can add some arbitrary string in the middle:

df['bc'] = df['b'].astype(str) + "_" + df['c'].astype(str)
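
As a possible variation on the same idea, the combined label can also be built with pandas' groupby(...).ngroup(), which assigns one integer per distinct (b, c) pair and so sidesteps string collisions entirely. This is a sketch on a smaller copy of the question's data; the variable name strata is illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Smaller version of the question's data: 1000 rows, 10 distinct (b, c) pairs.
a = np.arange(1000)
df = pd.DataFrame({'a': a, 'b': a % 10, 'c': a % 5})

# One integer label per distinct (b, c) combination; no string joining needed.
strata = df.groupby(['b', 'c']).ngroup()

train, test = train_test_split(df, test_size=0.2, random_state=0,
                               stratify=strata)

# If no row is repeated, the row count and the unique-id count match.
print(len(train), train['a'].nunique())
# Rows per (b, c) stratum in the training set.
print(train.groupby(['b', 'c']).size())
```

Passing a single 1-D label like this keeps the input in the shape train_test_split handles on any version, at the cost of one extra line to build the label.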

Answer 3

Answered by Louis T

What version of scikit-learn are you using? You can use sklearn.__version__ to check.

Prior to version 0.19.0, scikit-learn did not handle 2-dimensional stratification correctly. This was fixed in 0.19.0.

It is described in issue #9044.

Updating scikit-learn should fix the problem. If you can't update, see the commit history for the fix.
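
A quick way to check the installed version programmatically might look like the sketch below. The helper names version_tuple and has_stratify_fix are hypothetical, and the 0.19 threshold is simply the release named above.

```python
import re

import sklearn


def version_tuple(version_string):
    """Return (major, minor) as ints, e.g. '0.18.2' -> (0, 18)."""
    match = re.match(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognized version string: {version_string!r}")
    return int(match.group(1)), int(match.group(2))


def has_stratify_fix(version_string):
    # 0.19.0 is the release this answer names as containing the fix.
    return version_tuple(version_string) >= (0, 19)


print(sklearn.__version__, has_stratify_fix(sklearn.__version__))
```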
