Python: how to split/partition a dataset into training and test datasets, e.g., for cross validation?
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/3674409/
How to split/partition a dataset into training and test datasets for, e.g., cross validation?
Asked by erik
What is a good way to split a NumPy array randomly into training and testing/validation datasets? Something similar to the cvpartition or crossvalind functions in Matlab.
Accepted answer by pberkes
If you want to split the data set once into two halves, you can use numpy.random.shuffle, or numpy.random.permutation if you need to keep track of the indices:
import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
numpy.random.shuffle(x)
training, test = x[:80,:], x[80:,:]
or
import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
indices = numpy.random.permutation(x.shape[0])
training_idx, test_idx = indices[:80], indices[80:]
training, test = x[training_idx,:], x[test_idx,:]
There are many ways to repeatedly partition the same data set for cross validation. One strategy is to resample from the dataset with replacement:
import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
training_idx = numpy.random.randint(x.shape[0], size=80)
test_idx = numpy.random.randint(x.shape[0], size=20)
training, test = x[training_idx,:], x[test_idx,:]
Finally, sklearn contains several cross validation methods (k-fold, leave-n-out, ...). It also includes more advanced "stratified sampling" methods that create a partition of the data that is balanced with respect to some features, for example to make sure that there is the same proportion of positive and negative examples in the training and test set.
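As a minimal sketch of the k-fold API (in current releases it lives in sklearn.model_selection; earlier versions exposed it from sklearn.cross_validation):
import numpy as np
from sklearn.model_selection import KFold

x = np.random.rand(100, 5)  # x is your dataset, as above
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(x):
    training, test = x[train_idx], x[test_idx]
    # fit and evaluate your model on this fold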
Answered by Colin
I wrote a function for my own project to do this (it doesn't use numpy, though):
def partition(seq, chunks):
    """Splits the sequence into equally sized chunks and returns them as a list."""
    result = []
    for i in range(chunks):
        chunk = []
        for element in seq[i:len(seq):chunks]:
            chunk.append(element)
        result.append(chunk)
    return result
If you want the chunks to be randomized, just shuffle the list before passing it in.
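For instance, a hypothetical train/test split that holds out one chunk for testing (the variable names here are mine, not the answerer's):
import random

data = list(range(10))
random.shuffle(data)          # randomize before partitioning
folds = partition(data, 5)    # 5 chunks of 2 elements each
test = folds[0]               # hold out one chunk for testing
training = [el for fold in folds[1:] for el in fold]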
Answered by Paulo Malvar
There is another option that entails using scikit-learn. As scikit's wiki describes, you can use the following instructions:
import numpy as np
from sklearn.model_selection import train_test_split

data, labels = np.arange(10).reshape((5, 2)), range(5)
data_train, data_test, labels_train, labels_test = train_test_split(data, labels, test_size=0.20, random_state=42)
This way you can keep the labels in sync with the data you are splitting into training and test sets.
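In recent scikit-learn versions, train_test_split also accepts a stratify keyword that preserves class proportions in both halves; a small sketch with made-up balanced labels:
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(20).reshape((10, 2))
labels = np.array([0, 1] * 5)  # two balanced classes
data_train, data_test, labels_train, labels_test = train_test_split(
    data, labels, test_size=0.20, random_state=42, stratify=labels)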
Answered by Apogentus
You may also consider a stratified division into training and testing sets. A stratified division also generates the training and testing sets randomly, but in such a way that the original class proportions are preserved. This makes the training and testing sets better reflect the properties of the original dataset.
import numpy as np

def get_train_test_inds(y, train_proportion=0.7):
    '''Generates indices, making a random stratified split into training and testing sets
    with proportions train_proportion and (1-train_proportion) of the initial sample.
    y is any iterable indicating the class of each observation in the sample.
    The initial proportions of the classes inside the training and
    testing sets are preserved (stratified sampling).
    '''
    y = np.array(y)
    train_inds = np.zeros(len(y), dtype=bool)
    test_inds = np.zeros(len(y), dtype=bool)
    values = np.unique(y)
    for value in values:
        value_inds = np.nonzero(y == value)[0]
        np.random.shuffle(value_inds)
        n = int(train_proportion * len(value_inds))
        train_inds[value_inds[:n]] = True
        test_inds[value_inds[n:]] = True
    return train_inds, test_inds
y = np.array([1, 1, 2, 2, 3, 3])
train_inds, test_inds = get_train_test_inds(y, train_proportion=0.5)
print(y[train_inds])
print(y[test_inds])
This code outputs:
[1 2 3]
[1 2 3]
Answered by offwhitelotus
Just a note. In case you want train, test, AND validation sets, you can do this:
from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation
X = get_my_X()
y = get_my_y()
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
x_test, x_val, y_test, y_val = train_test_split(x_test, y_test, test_size=0.5)
These parameters give 70% to the training set, and 15% each to the test and validation sets, since the second split halves the 30% that was held out. Hope this helps.
Answered by prashanth
Here is code to split the data into n=5 folds in a stratified manner:
# X = data array
# y = class labels
from sklearn.model_selection import StratifiedKFold  # formerly sklearn.cross_validation

skf = StratifiedKFold(n_splits=5)
for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
Answered by Zahran
Thanks, pberkes, for your answer. I just modified it to avoid (1) sampling with replacement and (2) duplicate instances appearing in both training and testing:
import numpy as np  # X is your dataset, as in the accepted answer

# Either of the following lines draws 80% of the indices without replacement:
training_idx = np.random.choice(X.shape[0], int(np.round(X.shape[0] * 0.8)), replace=False)
training_idx = np.random.permutation(np.arange(X.shape[0]))[:int(np.round(X.shape[0] * 0.8))]
# The test set is everything not picked for training, so the two sets cannot overlap:
test_idx = np.setdiff1d(np.arange(0, X.shape[0]), training_idx)
Answered by M. Mashaye
As the sklearn.cross_validation module was deprecated, you can use:
import numpy as np
from sklearn.model_selection import train_test_split
X, y = np.arange(10).reshape((5, 2)), range(5)
X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=0.2, random_state=42)
Answered by rotem
After doing some reading and taking into account the (many...) different ways of splitting the data into train and test sets, I decided to timeit!
I used 4 different methods (none of them use the sklearn library, which I'm sure would give the best results, given that it is well designed and tested code):
1. Shuffle the whole matrix arr and then split the data into train and test
2. Shuffle the indices and then assign them to x and y to split the data
3. Same as method 2, but done in a more efficient way
4. Use a pandas dataframe to split
Going by the timings below, method 4 won by far with the shortest time, followed by method 1; methods 2 and 3 turned out to be really inefficient.
The code for the 4 different methods I timed:
import numpy as np
arr = np.random.rand(100, 3)
X = arr[:, :2]
Y = arr[:, 2]
spl = 0.7
N = len(arr)
sample = int(spl * N)

#%% Method 1: shuffle the whole matrix arr and then split
np.random.shuffle(arr)  # shuffles rows in place; X and Y are views into arr
x_train, x_test, y_train, y_test = X[:sample, :], X[sample:, :], Y[:sample], Y[sample:]

#%% Method 2: shuffle the indices and then apply them to X and Y
train_idx = np.random.choice(N, sample, replace=False)  # replace=False avoids duplicate rows
Xtrain = X[train_idx]
Ytrain = Y[train_idx]
test_idx = [idx for idx in range(N) if idx not in train_idx]
Xtest = X[test_idx]
Ytest = Y[test_idx]

#%% Method 3: shuffle indices without a for loop
idx = np.random.permutation(arr.shape[0])  # can also use random.shuffle
train_idx, test_idx = idx[:sample], idx[sample:]
x_train, x_test, y_train, y_test = X[train_idx, :], X[test_idx, :], Y[train_idx], Y[test_idx]

#%% Method 4: using a pandas dataframe to split
import pandas as pd
df = pd.read_csv(file_path, header=None)  # some csv file (I used one with 3 columns)
train = df.sample(frac=0.7, random_state=200)
test = df.drop(train.index)
As for the timings, the minimum execution time out of 3 repetitions of 1000 loops was as follows (a sketch of the timing harness appears after the list):
- Method 1: 0.35883826200006297 seconds
- Method 2: 1.7157016959999964 seconds
- Method 3: 1.7876616719995582 seconds
- Method 4: 0.07562861499991413 seconds
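The answer does not show the timing harness itself; a plausible reconstruction using the standard library's timeit module, assuming each method is wrapped in a zero-argument function such as method_1:
import timeit

# method_1 ... method_4 are assumed to wrap the snippets above
best = min(timeit.repeat(method_1, repeat=3, number=1000))
print(best)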
I hope that's helpful!
Answered by B.Mr.W.
Likely you will not only need to split into train and test sets, but will also need a validation set to make sure your model generalizes. Here I am assuming 70% training data, 20% validation, and 10% holdout/test data.
Check out np.split:
If indices_or_sections is a 1-D array of sorted integers, the entries indicate where along axis the array is split. For example, [2, 3] would, for axis=0, result in
ary[:2] ary[2:3] ary[3:]
# df is your pandas DataFrame; shuffle it, then cut at the 70% and 90% marks
t, v, h = np.split(df.sample(frac=1, random_state=1), [int(0.7*len(df)), int(0.9*len(df))])
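A quick sanity check of the proportions, using a made-up 100-row DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100, 3))
t, v, h = np.split(df.sample(frac=1, random_state=1), [int(0.7*len(df)), int(0.9*len(df))])
print(len(t), len(v), len(h))  # 70 20 10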

