Python: How to randomly split data into a training set and a test set?

Note: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must likewise follow the CC BY-SA license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/17412439/

How to split data into trainset and testset randomly?

python, file-io

Asked by Freya Ren

I have a large dataset and want to split it into a training set (50%) and a testing set (50%).

Say I have 100 examples stored in the input file, one example per line. I need to choose 50 lines as the training set and the other 50 lines as the testing set.

My idea is to first generate a random list of length 100 (values ranging from 1 to 100), then use the first 50 elements as the line numbers of the 50 training examples, and the remaining 50 for the testing set.

This could be achieved easily in Matlab:

fid=fopen(datafile);
C = textscan(fid, '%s','delimiter', '\n');
plist=randperm(100);
for i=1:50
    trainstring = C{plist(i)};
    fprintf(train_file,trainstring);
end
for i=51:100
    teststring = C{plist(i)};
    fprintf(test_file,teststring);
end

But how could I accomplish this in Python? I'm new to Python and don't know whether I can read the whole file into an array and then choose certain lines.

Accepted answer by ijmarshall

This can be done similarly in Python using lists (note that the whole list is shuffled in place):

import random

with open("datafile.txt") as f:
    data = f.read().splitlines()  # text mode; splitlines() drops the trailing newlines

random.shuffle(data)

train_data = data[:50]
test_data = data[50:]

Answer by aehs29

Well, first of all, there's no such thing as an "array" in core Python; Python uses lists, and that does make a difference. I suggest you use NumPy, which is a pretty good library for Python and adds a lot of Matlab-like functionality. You can get started here: Numpy for Matlab users

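For example, a minimal sketch of the asker's index-permutation idea using NumPy (the file name and the 50/50 sizes are assumed from the question):

import numpy as np

# read every line of the input file into a list
with open("datafile.txt") as f:
    lines = f.read().splitlines()

# random permutation of the line indices, like Matlab's randperm
plist = np.random.permutation(len(lines))
train_lines = [lines[i] for i in plist[:50]]  # first 50 shuffled indices -> training set
test_lines = [lines[i] for i in plist[50:]]   # remaining 50 -> testing set
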
Answer by Lord Henry Wotton

The following produces more general k-fold cross-validation splits. Your 50-50 partitioning would be achieved by setting k=2 below; all you would have to do is pick one of the two partitions produced. Note: I haven't tested the code, but I'm pretty sure it should work.

import random, math

def k_fold(myfile, myseed=11109, k=3):
    # Load data
    data = open(myfile).readlines()

    # Shuffle input
    random.seed(myseed)  # seed must be called, not assigned
    random.shuffle(data)

    # Compute partition size given input k
    len_part=int(math.ceil(len(data)/float(k)))

    # Create one partition per fold
    train={}
    test={}
    for ii in range(k):
        test[ii]  = data[ii*len_part:ii*len_part+len_part]
        train[ii] = [jj for jj in data if jj not in test[ii]]  # everything outside this fold's test slice

    return train, test      
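
For the asker's 50-50 case, a usage sketch might look like this (the file name is assumed; with k=2, fold 0 gives one 50-50 partition):

train, test = k_fold("datafile.txt", k=2)
train_data = train[0]  # roughly half of the lines
test_data = test[0]    # the other half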

Answer by JLT

You could also use numpy. When your data is stored in a numpy.ndarray:

import numpy as np
from random import sample

l = 100  # length of the data
f = 50   # number of elements you need
indices = sample(range(l), f)  # f distinct random indices

train_data = data[indices]            # fancy indexing picks the sampled elements
test_data = np.delete(data, indices)  # everything that was not sampled
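
Note that data is assumed to already exist as a one-dimensional array here. A self-contained sketch that loads it from the question's one-example-per-line file (file name assumed) could be:

import numpy as np
from random import sample

# one array element per line of the input file
data = np.array(open("datafile.txt").read().splitlines())

indices = sample(range(len(data)), len(data) // 2)  # half of the indices, at random
train_data = data[indices]
test_data = np.delete(data, indices)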

Answer by Roman Gherta

You can try this approach:

import pandas
import sklearn.cross_validation  # deprecated module; see the update below
csv = pandas.read_csv('data.csv')
train, test = sklearn.cross_validation.train_test_split(csv, train_size=0.5)

UPDATE: train_test_split was moved to model_selection, so the current way (scikit-learn 0.22.2) to do it is this:

import pandas
import sklearn.model_selection
csv = pandas.read_csv('data.csv')
train, test = sklearn.model_selection.train_test_split(csv, train_size=0.5)

Answer by shubhranshu

from sklearn.model_selection import train_test_split
import numpy

with open("datafile.txt") as f:
    data = f.read().splitlines()  # text mode; one list element per line
    data = numpy.array(data)      # convert the list to a numpy array

    x_train, x_test = train_test_split(data, test_size=0.5)  # test_size=0.5 puts half the data in the test set

Answer by subin sahayam

To answer @desmond.carros's question, I modified the best answer as follows:

import random

data = list()
file = open("datafile.txt", "r")
for line in file:
    data.append(line.split())  # split each line on your preferred delimiter (whitespace used here)
file.close()
random.shuffle(data)
train_data = data[:int((len(data)+1)*.80)]  # first 80% of the shuffled data -> training set
test_data = data[int(len(data)*.80+1):]     # remaining lines -> test set

The code splits the entire dataset into 80% training and 20% test data.

Answer by Andrew

sklearn.cross_validation has been deprecated since version 0.18; instead you should use sklearn.model_selection, as shown below.

from sklearn.model_selection import train_test_split
import numpy

with open("datafile.txt") as f:
    data = f.read().splitlines()  # text mode; one list element per line
    data = numpy.array(data)      # convert the list to a numpy array

    x_train, x_test = train_test_split(data, test_size=0.5)  # test_size=0.5 puts half the data in the test set

Answer by lee

A quick note on the answer from @subin sahayam:

import random

data = list()
file = open("datafile.txt", "r")
for line in file:
    data.append(line.split())  # split each line on your preferred delimiter (whitespace used here)
file.close()
random.shuffle(data)
train_data = data[:int((len(data)+1)*.80)]  # first 80% of the shuffled data -> training set
test_data = data[int(len(data)*.80+1):]     # remaining lines -> test set

If your list size is an even number, you should not add the 1 in the line below. Instead, check the size of the list first and then decide whether you need to add the 1.

test_data = data[int(len(data)*.80+1):]

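One way to sidestep the off-by-one entirely (my suggestion, not part of the original answer) is to compute a single split index and use it for both slices; then no element is dropped or duplicated whether the length is even or odd:

split_idx = int(len(data) * 0.80)  # 80% boundary, rounded down
train_data = data[:split_idx]      # first 80%
test_data = data[split_idx:]       # remaining 20%, nothing skipped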