Notice: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/38250710/

Retrieved: 2020-09-08 15:43:59. Source: igfitidea

How to split data into 3 sets (train, validation and test)?

Tags: pandas, numpy, dataframe, machine-learning, scikit-learn

Asked by CentAu

I have a pandas dataframe and I wish to divide it into 3 separate sets. I know that using train_test_split from sklearn.cross_validation, one can divide the data into two sets (train and test). However, I couldn't find any solution for splitting the data into three sets. Preferably, I'd like to have the indices of the original data.

I know that a workaround would be to use train_test_split twice and somehow adjust the indices. But is there a more standard / built-in way to split the data into 3 sets instead of 2?

Answered by MaxU

NumPy solution. We first shuffle the whole dataset (df.sample(frac=1)) and then split it into the following parts:

  • 60% - train set,
  • 20% - validation set,
  • 20% - test set


In [305]: train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])

In [306]: train
Out[306]:
          A         B         C         D         E
0  0.046919  0.792216  0.206294  0.440346  0.038960
2  0.301010  0.625697  0.604724  0.936968  0.870064
1  0.642237  0.690403  0.813658  0.525379  0.396053
9  0.488484  0.389640  0.599637  0.122919  0.106505
8  0.842717  0.793315  0.554084  0.100361  0.367465
7  0.185214  0.603661  0.217677  0.281780  0.938540

In [307]: validate
Out[307]:
          A         B         C         D         E
5  0.806176  0.008896  0.362878  0.058903  0.026328
6  0.145777  0.485765  0.589272  0.806329  0.703479

In [308]: test
Out[308]:
          A         B         C         D         E
4  0.521640  0.332210  0.370177  0.859169  0.401087
3  0.333348  0.964011  0.083498  0.670386  0.169619

[int(.6*len(df)), int(.8*len(df))] is the indices_or_sections array for numpy.split(): it lists the positions at which the shuffled data is cut.

Here is a small demo of np.split() usage. Let's split a 20-element array into the following parts: 80%, 10%, 10%:

In [45]: a = np.arange(1, 21)

In [46]: a
Out[46]: array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20])

In [47]: np.split(a, [int(.8 * len(a)), int(.9 * len(a))])
Out[47]:
[array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16]),
 array([17, 18]),
 array([19, 20])]
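
Some recent pandas releases emit a deprecation warning when a DataFrame is passed to np.split(). As a minimal sketch, the same 60/20/20 split can be written with positional slicing via iloc instead. The function name split_60_20_20 and its seed parameter are illustrative assumptions, not part of the answer above.

```python
import numpy as np
import pandas as pd

def split_60_20_20(df, seed=None):
    """Shuffle df, then slice it by position into 60% / 20% / 20% parts."""
    # Shuffle the rows; the original index labels travel with them.
    shuffled = df.sample(frac=1, random_state=seed)
    n = len(shuffled)
    train_end = int(0.6 * n)      # first cut point
    validate_end = int(0.8 * n)   # second cut point
    train = shuffled.iloc[:train_end]
    validate = shuffled.iloc[train_end:validate_end]
    test = shuffled.iloc[validate_end:]
    return train, validate, test

df = pd.DataFrame(np.random.rand(10, 5), columns=list("ABCDE"))
train, validate, test = split_60_20_20(df, seed=0)
print(len(train), len(validate), len(test))
```

Because the rows keep their original index labels, train.index, validate.index, and test.index can be used to look the rows up in the original dataframe.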

Answered by piRSquared

Note: this function was written to handle seeding of the randomized set creation. You should not rely on a split that does not randomize the sets.

import numpy as np
import pandas as pd

def train_validate_test_split(df, train_percent=.6, validate_percent=.2, seed=None):
    np.random.seed(seed)  # seed NumPy's global generator so the split is reproducible
    m = len(df.index)
    # Permute row positions (0..m-1), not index labels, because .iloc is positional.
    perm = np.random.permutation(m)
    train_end = int(train_percent * m)
    validate_end = int(validate_percent * m) + train_end
    train = df.iloc[perm[:train_end]]
    validate = df.iloc[perm[train_end:validate_end]]
    test = df.iloc[perm[validate_end:]]  # whatever remains after train and validate
    return train, validate, test
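
A possible variant, sketched under the assumption that a local random generator is preferred over seeding NumPy's global state. It uses np.random.default_rng and permutes row positions, so it should also work when the index is not 0..n-1. The name train_validate_test_split_rng and the sample dataframe are hypothetical.

```python
import numpy as np
import pandas as pd

def train_validate_test_split_rng(df, train_percent=0.6, validate_percent=0.2, seed=None):
    rng = np.random.default_rng(seed)  # local generator; global NumPy state is untouched
    m = len(df)
    perm = rng.permutation(m)          # shuffled row positions 0..m-1
    train_end = int(train_percent * m)
    validate_end = int(validate_percent * m) + train_end
    return (df.iloc[perm[:train_end]],
            df.iloc[perm[train_end:validate_end]],
            df.iloc[perm[validate_end:]])

# A dataframe with string labels as its index, to show that labels are preserved.
df = pd.DataFrame(np.arange(50).reshape(10, 5),
                  columns=list("ABCDE"), index=list("abcdefghij"))
train, validate, test = train_validate_test_split_rng(df, seed=42)
print(train.index.tolist(), validate.index.tolist(), test.index.tolist())
```

Calling the function again with the same seed should return the same three subsets.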

Demonstration

np.random.seed([3,1415])
df = pd.DataFrame(np.random.rand(10, 5), columns=list('ABCDE'))
df

train, validate, test = train_validate_test_split(df)

train

validate

test


Answered by blitu12345

One approach to dividing the dataset into train, test, and cv sets with proportions 0.6, 0.2, and 0.2 is to call the train_test_split function twice.

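from sklearn.model_selection import train_test_split

# xtrain holds the features and labels the targets; both are assumed to be defined already.
# First split: hold out 20% of the data as the test set.
x, x_test, y, y_test = train_test_split(xtrain, labels, test_size=0.2, train_size=0.8)
# Second split: 25% of the remaining 80% (20% of the total) becomes the cv set.
x_train, x_cv, y_train, y_cv = train_test_split(x, y, test_size=0.25, train_size=0.75)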
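
To make the arithmetic concrete, here is a small self-contained sketch of the same two-step split on synthetic data. The arrays X and y and the random_state values are assumptions added for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples with 2 features each, and a binary label.
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# Step 1: hold out 20% of all samples as the test set (80 samples remain).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Step 2: take 25% of the remaining 80 samples as the cv set.
# 0.25 * 0.8 = 0.2, so cv is 20% of the original data and train is 60%.
X_train, X_cv, y_train, y_cv = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_cv), len(X_test))
```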

Answered by stackoverflowuser2010

Here is a Python function that splits a Pandas dataframe into train, validation, and test dataframes with stratified sampling. It performs this split by calling scikit-learn's train_test_split() function twice.

import pandas as pd
from sklearn.model_selection import train_test_split

def split_stratified_into_train_val_test(df_input, stratify_colname='y',
                                         frac_train=0.6, frac_val=0.15, frac_test=0.25,
                                         random_state=None):
    '''
    Splits a Pandas dataframe into three subsets (train, val, and test)
    following fractional ratios provided by the user, where each subset is
    stratified by the values in a specific column (that is, each subset has
    the same relative frequency of the values in the column). It performs this
    splitting by running train_test_split() twice.

    Parameters
    ----------
    df_input : Pandas dataframe
        Input dataframe to be split.
    stratify_colname : str
        The name of the column that will be used for stratification. Usually
        this column would be for the label.
    frac_train : float
    frac_val   : float
    frac_test  : float
        The ratios with which the dataframe will be split into train, val, and
        test data. The values should be expressed as float fractions and should
        sum to 1.0.
    random_state : int, None, or RandomState instance
        Value to be passed to train_test_split().

    Returns
    -------
    df_train, df_val, df_test :
        Dataframes containing the three splits.
    '''

    # Compare with a tolerance: sums of floats such as 0.7 + 0.2 + 0.1 are not exactly 1.0.
    if abs(frac_train + frac_val + frac_test - 1.0) > 1e-9:
        raise ValueError('fractions %f, %f, %f do not add up to 1.0' %
                         (frac_train, frac_val, frac_test))

    if stratify_colname not in df_input.columns:
        raise ValueError('%s is not a column in the dataframe' % (stratify_colname))

    X = df_input # Contains all columns.
    y = df_input[[stratify_colname]] # Dataframe of just the column on which to stratify.

    # Split original dataframe into train and temp dataframes.
    df_train, df_temp, y_train, y_temp = train_test_split(X,
                                                          y,
                                                          stratify=y,
                                                          test_size=(1.0 - frac_train),
                                                          random_state=random_state)

    # Split the temp dataframe into val and test dataframes.
    relative_frac_test = frac_test / (frac_val + frac_test)
    df_val, df_test, y_val, y_test = train_test_split(df_temp,
                                                      y_temp,
                                                      stratify=y_temp,
                                                      test_size=relative_frac_test,
                                                      random_state=random_state)

    assert len(df_input) == len(df_train) + len(df_val) + len(df_test)

    return df_train, df_val, df_test

Below is a complete working example.

Consider a dataset that has a label on which you want to stratify. This label has its own distribution in the original dataset, say 75% foo, 15% bar, and 10% baz. Now let's split the dataset into train, validation, and test subsets using a 60/20/20 ratio, where each split retains the same distribution of the labels.

Here is the example dataset:

df = pd.DataFrame( { 'A': list(range(0, 100)),
                     'B': list(range(100, 0, -1)),
                     'label': ['foo'] * 75 + ['bar'] * 15 + ['baz'] * 10 } )

df.head()
#    A    B label
# 0  0  100   foo
# 1  1   99   foo
# 2  2   98   foo
# 3  3   97   foo
# 4  4   96   foo

df.shape
# (100, 3)

df.label.value_counts()
# foo    75
# bar    15
# baz    10
# Name: label, dtype: int64

Now, let's call the split_stratified_into_train_val_test() function from above to get train, validation, and test dataframes following a 60/20/20 ratio.

df_train, df_val, df_test = \
    split_stratified_into_train_val_test(df, stratify_colname='label', frac_train=0.60, frac_val=0.20, frac_test=0.20)

Together, the three dataframes df_train, df_val, and df_test contain all the original rows, and their sizes follow the above ratio.

df_train.shape
#(60, 3)

df_val.shape
#(20, 3)

df_test.shape
#(20, 3)

Further, each of the three splits will have the same distribution of the label, namely 75% foo, 15% bar, and 10% baz.

df_train.label.value_counts()
# foo    45
# bar     9
# baz     6
# Name: label, dtype: int64

df_val.label.value_counts()
# foo    15
# bar     3
# baz     2
# Name: label, dtype: int64

df_test.label.value_counts()
# foo    15
# bar     3
# baz     2
# Name: label, dtype: int64

Answered by A.Ametov

It is very convenient to use train_test_split, because it requires no reindexing after dividing the data into several sets and no additional code. The best answer above does not mention that calling train_test_split twice without adjusting the partition sizes won't give the initially intended partition:

x_train, x_remain = train_test_split(x, test_size=(val_size + test_size))

The proportions of the validation and test sets within x_remain then change, and can be computed as:

new_test_size = np.around(test_size / (val_size + test_size), 2)
# To preserve (new_test_size + new_val_size) = 1.0 
new_val_size = 1.0 - new_test_size

x_val, x_test = train_test_split(x_remain, test_size=new_test_size)

In this case, all the initially intended partition proportions are preserved.
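
The adjustment above can be sketched end to end on a small dataframe. The values val_size=0.1 and test_size=0.3, the sample data, the row labels, and the random_state arguments are all illustrative assumptions. The sketch also shows that the resulting subsets keep the original index labels.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Fractions of the full dataset (illustrative values); train gets the remaining 0.6.
val_size, test_size = 0.1, 0.3

# Sample data with string row labels so index preservation is visible.
x = pd.DataFrame({"feature": np.arange(100)},
                 index=[f"row{i}" for i in range(100)])

# First split: set aside the combined validation + test portion.
x_train, x_remain = train_test_split(
    x, test_size=val_size + test_size, random_state=0)

# Second split: rescale the test fraction relative to x_remain (0.3 / 0.4 = 0.75).
new_test_size = np.around(test_size / (val_size + test_size), 2)
x_val, x_test = train_test_split(
    x_remain, test_size=new_test_size, random_state=0)

print(len(x_train), len(x_val), len(x_test))  # roughly 60 / 10 / 30 rows
print(x_val.index[:3].tolist())               # labels from the original dataframe
```

Floating-point rounding in the fractions can shift a set size by one row, so exact counts are best checked rather than assumed.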