Python: How to create test and train samples from one dataframe with pandas?

Question

Asked by tooty44

I have a fairly large dataset in the form of a dataframe and I was wondering how I would be able to split the dataframe into two random samples (80% and 20%) for training and testing.

Thanks!

Answer 1

Accepted answer by Andy Hayden

I would just use numpy's rand to build a random boolean mask (randn is only used here to generate the example data):

In [11]: df = pd.DataFrame(np.random.randn(100, 2))

In [12]: msk = np.random.rand(len(df)) < 0.8

In [13]: train = df[msk]

In [14]: test = df[~msk]

And just to see this has worked:

In [15]: len(test)
Out[15]: 21

In [16]: len(train)
Out[16]: 79
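
The mask approach above gives a different split on every run, and the sizes are only approximately 80/20. A minimal sketch of a repeatable variant, assuming a seeded NumPy generator is acceptable (the helper name mask_split and the seed parameter are made up for illustration, not part of the original answer):

```python
import numpy as np
import pandas as pd

def mask_split(df, train_frac=0.8, seed=0):
    """Split df with a random boolean mask; sizes are only approximately train_frac."""
    rng = np.random.default_rng(seed)        # seeded generator, so the split is repeatable
    msk = rng.random(len(df)) < train_frac   # True -> row goes to the training set
    return df[msk], df[~msk]

df = pd.DataFrame(np.random.default_rng(1).standard_normal((100, 2)))
train, test = mask_split(df, train_frac=0.8, seed=42)
print(len(train), len(test))  # the two counts always sum to 100
```

Calling mask_split again with the same seed returns the same rows, which helps when comparing models on the same split.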

Answer 2

Answer by gobrewers14

scikit-learn's train_test_split is a good one.

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)

Answer 3

Answer by Apogentus

You may also consider a stratified division into training and testing sets. Stratified division also generates the training and testing sets randomly, but in such a way that the original class proportions are preserved. This makes the training and testing sets better reflect the properties of the original dataset.

import numpy as np  

def get_train_test_inds(y,train_proportion=0.7):
    '''Generates indices, making random stratified split into training set and testing sets
    with proportions train_proportion and (1-train_proportion) of initial sample.
    y is any iterable indicating classes of each observation in the sample.
    Initial proportions of classes inside training and 
    testing sets are preserved (stratified sampling).
    '''

    y=np.array(y)
    train_inds = np.zeros(len(y),dtype=bool)
    test_inds = np.zeros(len(y),dtype=bool)
    values = np.unique(y)
    for value in values:
        value_inds = np.nonzero(y==value)[0]
        np.random.shuffle(value_inds)
        n = int(train_proportion*len(value_inds))

        train_inds[value_inds[:n]]=True
        test_inds[value_inds[n:]]=True

    return train_inds,test_inds

After calling train_inds, test_inds = get_train_test_inds(y), df[train_inds] and df[test_inds] give you the training and testing sets of your original DataFrame df.
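
The same stratified idea can probably be written with pandas alone. The sketch below assumes pandas 1.1 or later (for groupby(...).sample) and a DataFrame with a unique index; the helper name stratified_split and the column name "label" are hypothetical.

```python
import pandas as pd

def stratified_split(df, label_col, train_frac=0.7, seed=0):
    """Sample train_frac of the rows within each class; the rest form the test set."""
    # Sampling inside each group keeps the class proportions of the original data
    train = df.groupby(label_col).sample(frac=train_frac, random_state=seed)
    test = df.drop(train.index)  # relies on df having a unique index
    return train, test

# Toy data: 80 rows of class "a" and 20 rows of class "b"
df = pd.DataFrame({"x": range(100), "label": ["a"] * 80 + ["b"] * 20})
train, test = stratified_split(df, "label", train_frac=0.75, seed=0)
print(train["label"].value_counts().to_dict())  # 60 "a" rows and 15 "b" rows
```

Both sets keep the 4:1 class ratio of the toy data, which a plain random mask does not guarantee.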

Answer 4

Answer by Anarcho-Chossid

This is what I wrote when I needed to split a DataFrame. I considered using Andy's approach above, but didn't like that I could not control the size of the data sets exactly (i.e., it would be sometimes 79, sometimes 81, etc.).

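def make_sets(data_df, test_portion):
    import random as rnd

    # Row positions, so the split works with any index
    tot_ix = range(len(data_df))
    # Draw exactly int(test_portion * n) distinct positions for the test set
    test_ix = sorted(rnd.sample(tot_ix, int(test_portion * len(data_df))))
    # Everything that was not drawn goes to the training set
    train_ix = sorted(set(tot_ix) - set(test_ix))

    # .iloc selects by position (.ix was removed in pandas 1.0)
    test_df = data_df.iloc[test_ix]
    train_df = data_df.iloc[train_ix]

    return train_df, test_df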


train_df, test_df = make_sets(data_df, 0.2)
test_df.head()
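
If exact set sizes are the goal, a shuffled array of row positions is another way to get there. This is only a sketch of the same idea with NumPy; the name exact_split and the seed parameter are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

def exact_split(df, test_frac=0.2, seed=0):
    """Return (train, test) where the test set has exactly int(test_frac * len(df)) rows."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(df))      # shuffled row positions 0..n-1
    n_test = int(test_frac * len(df))    # exact test-set size, rounded down
    test_pos = np.sort(perm[:n_test])    # sort to keep the original row order
    train_pos = np.sort(perm[n_test:])
    return df.iloc[train_pos], df.iloc[test_pos]

df = pd.DataFrame({"a": np.arange(100)})
train_df, test_df = exact_split(df, test_frac=0.2)
print(len(train_df), len(test_df))  # 80 20
```

Because the selection is positional (iloc), it does not depend on what the DataFrame's index looks like.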

Answer 5

Answer by Napitupulu Jon

I would use scikit-learn's own train_test_split, and generate the split from the index:

from sklearn.model_selection import train_test_split


y = df.pop('output')
X = df

X_train,X_test,y_train,y_test = train_test_split(X.index,y,test_size=0.2)
X.loc[X_train]  # returns the training dataframe (X_train holds index labels, so use .loc)

Answer 6

Answer by Johnny V

If your wish is to have one dataframe in and two dataframes out (not numpy arrays), this should do the trick:

import numpy as np

def split_data(df, train_perc=0.8):
    # Adds a boolean 'train' column to df in place: True marks training rows
    df['train'] = np.random.rand(len(df)) < train_perc

    train = df[df.train]
    test = df[~df.train]

    return {'train': train, 'test': test}

Answer 7

Answer by Hakim

I think you also need to get a copy, not a slice, of the dataframe if you want to add columns later.

msk = np.random.rand(len(df)) < 0.8
train, test = df[msk].copy(deep = True), df[~msk].copy(deep = True)
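
To illustrate why the copy matters, here is a small sketch with made-up column names. With an explicit copy, adding a column to the training set leaves the original DataFrame untouched, and pandas versions that emit SettingWithCopyWarning have nothing to warn about.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(10)})
msk = np.random.default_rng(0).random(len(df)) < 0.8

train = df[msk].copy()                 # independent copy of the selected rows
train["x_squared"] = train["x"] ** 2   # modifies only the copy

print("x_squared" in df.columns)  # False
```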

Answer 8

Answer by kiran6

You can use df.to_numpy() (df.as_matrix() in older pandas; it was removed in pandas 1.0) to create a NumPy array and pass that.

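from sklearn.model_selection import train_test_split

Y = df.pop('target')  # 'target' is a placeholder: use the name of your label column
X = df.to_numpy()
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2)
model.fit(x_train, y_train)  # model is any scikit-learn estimator created beforehand
model.predict(x_test)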

Answer 9

Answer by PagMax

pandas' DataFrame.sample will also work:

train=df.sample(frac=0.8,random_state=200) #random state is a seed value
test=df.drop(train.index)
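
One caveat with the sample/drop pattern: df.drop(train.index) removes rows by label, so it only leaves the correct test set when the index is unique. A minimal sketch, assuming it is acceptable to discard the old index with reset_index first:

```python
import pandas as pd

# Toy frame with duplicated index labels
df = pd.DataFrame({"x": range(10)}, index=[0, 0, 1, 1, 2, 2, 3, 3, 4, 4])

flat = df.reset_index(drop=True)                  # give every row a unique label
train = flat.sample(frac=0.8, random_state=200)   # random_state is a seed value
test = flat.drop(train.index)
print(len(train), len(test))  # 8 2
```

Without the reset, dropping a duplicated label would remove every row that shares it, so the test set could come out smaller than 20%.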

Answer 10

Answer by Akash Jain

How about this? df is my dataframe

import math

total_size = len(df)

# about 2/3 of the dataset
train_size = math.floor(0.66 * total_size)

# training dataset: the first train_size rows (not shuffled)
train = df.head(train_size)
# test dataset: the remaining rows
test = df.tail(len(df) - train_size)
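
Note that head and tail take rows in their stored order, so this split is random only if the rows are already shuffled. A sketch of one way to shuffle first, assuming df.sample(frac=1) is acceptable for that (the toy data and the random_state value are arbitrary):

```python
import math
import pandas as pd

df = pd.DataFrame({"x": range(100)})

shuffled = df.sample(frac=1, random_state=0)    # all rows, in random order
train_size = math.floor(0.66 * len(shuffled))   # about 2/3 of the dataset
train = shuffled.head(train_size)
test = shuffled.tail(len(shuffled) - train_size)
print(len(train), len(test))  # 66 34
```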
