How to split data into 3 sets (train, validation and test)?
Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute the content to the original authors on Stack Overflow (not to the maintainer of this page).
Original URL: http://stackoverflow.com/questions/38250710/
Asked by CentAu
I have a pandas dataframe and I wish to divide it into 3 separate sets. I know that using train_test_split from sklearn.cross_validation, one can divide the data into two sets (train and test). However, I couldn't find any solution for splitting the data into three sets. Preferably, I'd like to have the indices of the original data.
I know that a workaround would be to use train_test_split two times and somehow adjust the indices. But is there a more standard / built-in way to split the data into 3 sets instead of 2?
Answered by MaxU
NumPy solution. We will shuffle the whole dataset first (df.sample(frac=1)) and then split it into the following parts:
- 60% - train set,
- 20% - validation set,
- 20% - test set
In [305]: train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])

In [306]: train
Out[306]:
          A         B         C         D         E
0  0.046919  0.792216  0.206294  0.440346  0.038960
2  0.301010  0.625697  0.604724  0.936968  0.870064
1  0.642237  0.690403  0.813658  0.525379  0.396053
9  0.488484  0.389640  0.599637  0.122919  0.106505
8  0.842717  0.793315  0.554084  0.100361  0.367465
7  0.185214  0.603661  0.217677  0.281780  0.938540

In [307]: validate
Out[307]:
          A         B         C         D         E
5  0.806176  0.008896  0.362878  0.058903  0.026328
6  0.145777  0.485765  0.589272  0.806329  0.703479

In [308]: test
Out[308]:
          A         B         C         D         E
4  0.521640  0.332210  0.370177  0.859169  0.401087
3  0.333348  0.964011  0.083498  0.670386  0.169619
[int(.6*len(df)), int(.8*len(df))] is the indices_or_sections array for numpy.split().
Here is a small demo of np.split() usage. Let's split a 20-element array into the following parts: 80%, 10%, 10%:
In [45]: a = np.arange(1, 21)
In [46]: a
Out[46]: array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20])
In [47]: np.split(a, [int(.8 * len(a)), int(.9 * len(a))])
Out[47]:
[array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]),
array([17, 18]),
array([19, 20])]
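The same 60/20/20 split can also be written with positional slicing instead of np.split. The following is a minimal sketch and not part of the original answer: the example frame, the random_state value, and the variable names are assumptions chosen for illustration. Each part keeps the row labels of the original frame, which is one way to get at the original indices the question asks about.

```python
import numpy as np
import pandas as pd

# Hypothetical example frame with a default 0..9 index.
df = pd.DataFrame(np.random.default_rng(0).random((10, 5)), columns=list("ABCDE"))

# Shuffle all rows once; random_state is an arbitrary choice for repeatability.
shuffled = df.sample(frac=1, random_state=42)

n = len(shuffled)
train_end = int(0.6 * n)     # first 60% of the shuffled rows
validate_end = int(0.8 * n)  # next 20%; the rest is the test set

train = shuffled.iloc[:train_end]
validate = shuffled.iloc[train_end:validate_end]
test = shuffled.iloc[validate_end:]

# The index of each part still refers to rows of the original df.
print(train.index.tolist())
print(validate.index.tolist())
print(test.index.tolist())
```

Because the cut points are computed with int(), any leftover rows from rounding end up in the test set.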
Answered by piRSquared
Note:
This function was written to handle seeding of the randomized set creation. You should not rely on set splitting that doesn't randomize the sets.
import numpy as np
import pandas as pd

def train_validate_test_split(df, train_percent=.6, validate_percent=.2, seed=None):
    np.random.seed(seed)
    m = len(df.index)
    # Permute row positions (not index labels) so that .iloc below
    # also works when df does not have a default 0..m-1 index.
    perm = np.random.permutation(m)
    train_end = int(train_percent * m)
    validate_end = int(validate_percent * m) + train_end
    train = df.iloc[perm[:train_end]]
    validate = df.iloc[perm[train_end:validate_end]]
    test = df.iloc[perm[validate_end:]]
    return train, validate, test
Demonstration
np.random.seed([3,1415])
df = pd.DataFrame(np.random.rand(10, 5), columns=list('ABCDE'))
df
train, validate, test = train_validate_test_split(df)
train
validate
test
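If only the index labels of each set are needed, the same idea can be applied to the index itself. This is a hedged sketch rather than the answer's own code: split_index_labels is a hypothetical helper, and it uses a local np.random.default_rng generator instead of the global seed, which is a design choice assumed here so that the split does not affect other random calls.

```python
import numpy as np
import pandas as pd

def split_index_labels(index, train_percent=0.6, validate_percent=0.2, seed=None):
    """Return three arrays of index labels (train, validate, test)."""
    rng = np.random.default_rng(seed)          # local generator, global state untouched
    labels = rng.permutation(index.to_numpy())  # shuffled copy of the labels
    m = len(labels)
    train_end = int(train_percent * m)
    validate_end = int(validate_percent * m) + train_end
    return labels[:train_end], labels[train_end:validate_end], labels[validate_end:]

# Hypothetical frame with a non-default (string) index.
df = pd.DataFrame({"value": range(10)}, index=list("abcdefghij"))

train_idx, validate_idx, test_idx = split_index_labels(df.index, seed=0)

# Labels are looked up with .loc because they are labels, not positions.
train = df.loc[train_idx]
validate = df.loc[validate_idx]
test = df.loc[test_idx]
print(len(train), len(validate), len(test))
```

Calling the helper again with the same seed should reproduce the same three label arrays.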
Answered by blitu12345
One approach to dividing the dataset into train, test, and cv sets with proportions 0.6, 0.2, 0.2 is to use the train_test_split method twice.
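from sklearn.model_selection import train_test_split

# First split: hold out 20% of the data as the test set.
x, x_test, y, y_test = train_test_split(xtrain, labels, test_size=0.2, train_size=0.8)
# Second split: 25% of the remaining 80% (20% overall) becomes the cv set.
x_train, x_cv, y_train, y_cv = train_test_split(x, y, test_size=0.25, train_size=0.75)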
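The two-call approach can also be applied to the dataframe's index, which yields the original row labels for each set. The sketch below assumes scikit-learn is installed; the example frame, the random_state values, and the variable names are illustrative assumptions and do not come from the answer.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 100-row frame with a default integer index.
df = pd.DataFrame({"A": np.arange(100), "B": np.arange(100, 0, -1)})

# First call: 80% kept for further splitting, 20% held out as the test set.
rest_idx, test_idx = train_test_split(df.index.to_numpy(), test_size=0.2, random_state=0)

# Second call: 25% of the remaining 80% is 20% of the original data.
train_idx, cv_idx = train_test_split(rest_idx, test_size=0.25, random_state=0)

df_train = df.loc[train_idx]
df_cv = df.loc[cv_idx]
df_test = df.loc[test_idx]
print(len(df_train), len(df_cv), len(df_test))
```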
Answered by stackoverflowuser2010
Here is a Python function that splits a Pandas dataframe into train, validation, and test dataframes with stratified sampling. It performs this split by calling scikit-learn's function train_test_split() twice.
import pandas as pd
from sklearn.model_selection import train_test_split

def split_stratified_into_train_val_test(df_input, stratify_colname='y',
                                         frac_train=0.6, frac_val=0.15, frac_test=0.25,
                                         random_state=None):
    '''
    Splits a Pandas dataframe into three subsets (train, val, and test)
    following fractional ratios provided by the user, where each subset is
    stratified by the values in a specific column (that is, each subset has
    the same relative frequency of the values in the column). It performs this
    splitting by running train_test_split() twice.

    Parameters
    ----------
    df_input : Pandas dataframe
        Input dataframe to be split.
    stratify_colname : str
        The name of the column that will be used for stratification. Usually
        this column would be for the label.
    frac_train : float
    frac_val   : float
    frac_test  : float
        The ratios with which the dataframe will be split into train, val, and
        test data. The values should be expressed as float fractions and should
        sum to 1.0.
    random_state : int, None, or RandomStateInstance
        Value to be passed to train_test_split().

    Returns
    -------
    df_train, df_val, df_test :
        Dataframes containing the three splits.
    '''

    # Compare with a tolerance: an exact "!= 1.0" test can reject valid
    # fractions because of floating-point rounding.
    if abs(frac_train + frac_val + frac_test - 1.0) > 1e-9:
        raise ValueError('fractions %f, %f, %f do not add up to 1.0' %
                         (frac_train, frac_val, frac_test))

    if stratify_colname not in df_input.columns:
        raise ValueError('%s is not a column in the dataframe' % (stratify_colname))

    X = df_input  # Contains all columns.
    y = df_input[[stratify_colname]]  # Dataframe of just the column on which to stratify.

    # Split original dataframe into train and temp dataframes.
    df_train, df_temp, y_train, y_temp = train_test_split(X,
                                                          y,
                                                          stratify=y,
                                                          test_size=(1.0 - frac_train),
                                                          random_state=random_state)

    # Split the temp dataframe into val and test dataframes.
    relative_frac_test = frac_test / (frac_val + frac_test)
    df_val, df_test, y_val, y_test = train_test_split(df_temp,
                                                      y_temp,
                                                      stratify=y_temp,
                                                      test_size=relative_frac_test,
                                                      random_state=random_state)

    assert len(df_input) == len(df_train) + len(df_val) + len(df_test)

    return df_train, df_val, df_test
Below is a complete working example.
Consider a dataset that has a label upon which you want to perform the stratification. This label has its own distribution in the original dataset, say 75% foo, 15% bar and 10% baz. Now let's split the dataset into train, validation, and test subsets using a 60/20/20 ratio, where each split retains the same distribution of the labels.
Here is the example dataset:
df = pd.DataFrame( { 'A': list(range(0, 100)),
'B': list(range(100, 0, -1)),
'label': ['foo'] * 75 + ['bar'] * 15 + ['baz'] * 10 } )
df.head()
# A B label
# 0 0 100 foo
# 1 1 99 foo
# 2 2 98 foo
# 3 3 97 foo
# 4 4 96 foo
df.shape
# (100, 3)
df.label.value_counts()
# foo 75
# bar 15
# baz 10
# Name: label, dtype: int64
Now, let's call the split_stratified_into_train_val_test() function from above to get train, validation, and test dataframes following a 60/20/20 ratio.
df_train, df_val, df_test = \
split_stratified_into_train_val_test(df, stratify_colname='label', frac_train=0.60, frac_val=0.20, frac_test=0.20)
The three dataframes df_train, df_val, and df_test together contain all the original rows, and their sizes follow the above ratio.
df_train.shape
#(60, 3)
df_val.shape
#(20, 3)
df_test.shape
#(20, 3)
Further, each of the three splits will have the same distribution of the label, namely 75% foo, 15% bar and 10% baz.
df_train.label.value_counts()
# foo 45
# bar 9
# baz 6
# Name: label, dtype: int64
df_val.label.value_counts()
# foo 15
# bar 3
# baz 2
# Name: label, dtype: int64
df_test.label.value_counts()
# foo 15
# bar 3
# baz 2
# Name: label, dtype: int64
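For comparison, a similar stratified 60/20/20 split can be approximated with pandas and NumPy alone by shuffling and slicing each label group separately. This is a rough sketch under stated assumptions, not the answer's method: stratified_three_way is a hypothetical helper, per-group sizes are rounded with round(), and the seed is arbitrary.

```python
import numpy as np
import pandas as pd

def stratified_three_way(df, label_col, frac_train=0.6, frac_val=0.2, seed=0):
    """Split df into train/val/test, slicing each label group separately."""
    rng = np.random.default_rng(seed)
    parts = {"train": [], "val": [], "test": []}
    for _, group in df.groupby(label_col):
        pos = rng.permutation(len(group))            # shuffled row positions
        n_train = int(round(frac_train * len(group)))
        n_val = int(round(frac_val * len(group)))
        parts["train"].append(group.iloc[pos[:n_train]])
        parts["val"].append(group.iloc[pos[n_train:n_train + n_val]])
        parts["test"].append(group.iloc[pos[n_train + n_val:]])  # remainder
    return tuple(pd.concat(parts[name]) for name in ("train", "val", "test"))

# Same example dataset as above: 75 foo, 15 bar, 10 baz.
df = pd.DataFrame({"A": list(range(0, 100)),
                   "B": list(range(100, 0, -1)),
                   "label": ["foo"] * 75 + ["bar"] * 15 + ["baz"] * 10})

df_train, df_val, df_test = stratified_three_way(df, "label")
print(df_train["label"].value_counts().to_dict())
```

With very small label groups the rounding can leave a subset without any rows for that label, so the scikit-learn version above is likely the safer choice in practice.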
Answered by A.Ametov
It is very convenient to use train_test_split, since it requires no reindexing after dividing the data into several sets and no additional code. The best answer above does not mention that calling train_test_split twice without adjusting the partition sizes won't give the initially intended partition:
x_train, x_remain = train_test_split(x, test_size=(val_size + test_size))
Then the proportions of the validation and test sets within x_remain change, and can be computed as:
new_test_size = np.around(test_size / (val_size + test_size), 2)
# To preserve (new_test_size + new_val_size) = 1.0
new_val_size = 1.0 - new_test_size
x_val, x_test = train_test_split(x_remain, test_size=new_test_size)
This way, all the initially intended partition sizes are preserved.
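The size adjustment can be checked with plain arithmetic. The sketch below uses a hypothetical helper, second_split_fraction, and assumed sizes of 0.15 (validation) and 0.25 (test); unlike the snippet above it does not round the fraction, which is a simplification made here for clarity.

```python
def second_split_fraction(val_size, test_size):
    """Fraction of the remainder that must go to the test set."""
    return test_size / (val_size + test_size)

# Assumed target sizes, as fractions of the whole dataset.
val_size, test_size = 0.15, 0.25

remain = val_size + test_size                 # share left after removing train
new_test_size = second_split_fraction(val_size, test_size)
new_val_size = 1.0 - new_test_size            # so the two parts sum to 1.0

# Multiplying back by the remainder recovers the intended overall shares.
overall_val = remain * new_val_size
overall_test = remain * new_test_size
print(new_test_size, overall_val, overall_test)
```

Passing test_size=0.25 directly to the second call would instead take 25% of the remaining 40%, which is only 10% of the original data.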