Python: How do I split a custom dataset into training and test datasets?
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me) and distribute it under the same CC BY-SA license: StackOverflow
Original URL: http://stackoverflow.com/questions/50544730/
How do I split a custom dataset into training and test datasets?
Asked by nirvair
import pandas as pd
import numpy as np
import cv2
from torch.utils.data.dataset import Dataset

class CustomDatasetFromCSV(Dataset):
    def __init__(self, csv_path, transform=None):
        self.data = pd.read_csv(csv_path)
        self.labels = pd.get_dummies(self.data['emotion']).as_matrix()
        self.height = 48
        self.width = 48
        self.transform = transform

    def __getitem__(self, index):
        pixels = self.data['pixels'].tolist()
        faces = []
        for pixel_sequence in pixels:
            face = [int(pixel) for pixel in pixel_sequence.split(' ')]
            # print(np.asarray(face).shape)
            face = np.asarray(face).reshape(self.width, self.height)
            face = cv2.resize(face.astype('uint8'), (self.width, self.height))
            faces.append(face.astype('float32'))
        faces = np.asarray(faces)
        faces = np.expand_dims(faces, -1)
        return faces, self.labels

    def __len__(self):
        return len(self.data)
This is what I managed to put together using references from other repositories. However, I want to split this dataset into train and test sets.
How can I do that inside this class? Or do I need to make a separate class to do that?
Answered by Fábio Perez
Starting in PyTorch 0.4.1, you can use random_split:
train_size = int(0.8 * len(full_dataset))
test_size = len(full_dataset) - train_size
train_dataset, test_dataset = torch.utils.data.random_split(full_dataset, [train_size, test_size])
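As an aside (my addition, not part of the original answer): random_split returns torch.utils.data.Subset objects, which can be handed straight to DataLoader. A minimal, self-contained sketch, with a random TensorDataset standing in for full_dataset:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset: 100 samples with 10 features each, binary labels.
full_dataset = TensorDataset(torch.randn(100, 10), torch.randint(0, 2, (100,)))

train_size = int(0.8 * len(full_dataset))
test_size = len(full_dataset) - train_size
# Recent PyTorch versions also accept a generator for a reproducible split.
train_dataset, test_dataset = torch.utils.data.random_split(
    full_dataset, [train_size, test_size],
    generator=torch.Generator().manual_seed(42))

# The splits work directly with DataLoader.
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16)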
Answered by benjaminplanche
Using PyTorch's SubsetRandomSampler:
import torch
import numpy as np
import pandas as pd
import cv2
from torch.utils.data import Dataset
from torchvision import datasets
from torchvision import transforms
from torch.utils.data.sampler import SubsetRandomSampler

class CustomDatasetFromCSV(Dataset):
    def __init__(self, csv_path, transform=None):
        self.data = pd.read_csv(csv_path)
        # .as_matrix() was removed in pandas 1.0; use .to_numpy() on modern pandas
        self.labels = pd.get_dummies(self.data['emotion']).as_matrix()
        self.height = 48
        self.width = 48
        self.transform = transform

    def __getitem__(self, index):
        # This method should return only 1 sample and label
        # (according to "index"), not the whole dataset
        # So probably something like this for you:
        pixel_sequence = self.data['pixels'][index]
        face = [int(pixel) for pixel in pixel_sequence.split(' ')]
        face = np.asarray(face).reshape(self.width, self.height)
        face = cv2.resize(face.astype('uint8'), (self.width, self.height))
        label = self.labels[index]
        return face, label

    def __len__(self):
        return len(self.labels)

dataset = CustomDatasetFromCSV(my_path)
batch_size = 16
validation_split = .2
shuffle_dataset = True
random_seed = 42

# Creating data indices for training and validation splits:
dataset_size = len(dataset)
indices = list(range(dataset_size))
split = int(np.floor(validation_split * dataset_size))
if shuffle_dataset:
    np.random.seed(random_seed)
    np.random.shuffle(indices)
train_indices, val_indices = indices[split:], indices[:split]

# Creating PT data samplers and loaders:
train_sampler = SubsetRandomSampler(train_indices)
valid_sampler = SubsetRandomSampler(val_indices)

train_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size,
                                           sampler=train_sampler)
validation_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size,
                                                sampler=valid_sampler)

# Usage Example:
num_epochs = 10
for epoch in range(num_epochs):
    # Train:
    for batch_index, (faces, labels) in enumerate(train_loader):
        ...  # training step goes here
Answered by Shital Shah
The current answers do random splits, which has the disadvantage that the number of samples per class is not guaranteed to be balanced. This is especially problematic when you want a small number of samples per class. For example, MNIST has 60,000 examples, i.e. 6,000 per digit. Assume that you want only 30 examples per digit in your training set. In this case, a random split may produce an imbalance between classes (one digit having more training data than the others). So you want to make sure each digit has precisely 30 examples. This is called stratified sampling.
One way to do this is to use the sampler interface in PyTorch; sample code is here.
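For illustration, here is a minimal sketch (mine, not the linked sample code) of a stratified split: scikit-learn's train_test_split computes the stratified index split, which then plugs into the sampler approach from the previous answer. Here, dataset and labels are assumed to be any map-style Dataset and a matching per-sample class-label array:

import numpy as np
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader
from torch.utils.data.sampler import SubsetRandomSampler

# Assumed to exist: `dataset` (any map-style Dataset) and `labels`
# (one class label per sample, in the same order as the dataset).
indices = np.arange(len(labels))
train_idx, test_idx = train_test_split(
    indices, test_size=0.2, stratify=labels, random_state=42)

train_loader = DataLoader(dataset, batch_size=16,
                          sampler=SubsetRandomSampler(train_idx))
test_loader = DataLoader(dataset, batch_size=16,
                         sampler=SubsetRandomSampler(test_idx))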
Another way is to just hack your way through :). For example, below is a simple implementation for MNIST, where ds is the MNIST dataset and k is the number of samples needed for each class.
import torch
from torch.utils.data import TensorDataset

def sampleFromClass(ds, k):
    class_counts = {}
    train_data = []
    train_label = []
    test_data = []
    test_label = []
    for data, label in ds:
        # Note: torchvision's MNIST yields integer labels; .item() assumes
        # a tensor label, so adapt this line if your dataset returns ints.
        c = label.item()
        class_counts[c] = class_counts.get(c, 0) + 1
        if class_counts[c] <= k:
            train_data.append(data)
            train_label.append(torch.unsqueeze(label, 0))
        else:
            test_data.append(data)
            test_label.append(torch.unsqueeze(label, 0))
    train_data = torch.cat(train_data)
    train_label = torch.cat(train_label)
    test_data = torch.cat(test_data)
    test_label = torch.cat(test_label)
    return (TensorDataset(train_data, train_label),
            TensorDataset(test_data, test_label))
You can use this function like this:
from torchvision import datasets, transforms

def main():
    train_ds = datasets.MNIST('../data', train=True, download=True,
                              transform=transforms.Compose([
                                  transforms.ToTensor()
                              ]))
    train_ds, test_ds = sampleFromClass(train_ds, 3)
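A quick sanity check (my addition, not from the answer) that the stratification worked: inside main(), after the split, each class should appear exactly k = 3 times in the training set, and TensorDataset keeps the labels as its second tensor:

    # Each digit should occur exactly 3 times in the training labels.
    print(torch.bincount(train_ds.tensors[1]))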
Answered by prosti
This answer relies on the PyTorch Subset class, which is what the random_split method returns; the same mechanism underlies SubsetRandomSampler.
For MNIST, if we use random_split:
import torch
import torchvision
from torch.utils.data import DataLoader

loader = DataLoader(
    torchvision.datasets.MNIST('/data/mnist', train=True, download=True,
                               transform=torchvision.transforms.Compose([
                                   torchvision.transforms.ToTensor(),
                                   torchvision.transforms.Normalize(
                                       (0.5,), (0.5,))
                               ])),
    batch_size=16, shuffle=False)

print(loader.dataset.data.shape)
test_ds, valid_ds = torch.utils.data.random_split(loader.dataset, (50000, 10000))
print(test_ds, valid_ds)
print(test_ds.indices, valid_ds.indices)
# (On recent PyTorch versions .indices is a plain list, so .shape may not exist.)
print(test_ds.indices.shape, valid_ds.indices.shape)
We get:
torch.Size([60000, 28, 28])
<torch.utils.data.dataset.Subset object at 0x0000020FD1880B00> <torch.utils.data.dataset.Subset object at 0x0000020FD1880C50>
tensor([ 1520, 4155, 45472, ..., 37969, 45782, 34080]) tensor([ 9133, 51600, 22067, ..., 3950, 37306, 31400])
torch.Size([50000]) torch.Size([10000])
Our test_ds.indices and valid_ds.indices will be random, drawn from the range (0, 60000). But if I would like to get a sequence of indices from (0, 49999) and from (50000, 59999), I cannot do that at the moment, unfortunately, except this way.
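(One way to build such a deterministic, sequential split by hand — my sketch, not part of the original answer — is to pass explicit index ranges to torch.utils.data.Subset:)

from torch.utils.data import Subset

# First 50,000 samples in one split, the remaining 10,000 in the other.
test_ds = Subset(loader.dataset, range(0, 50000))
valid_ds = Subset(loader.dataset, range(50000, 60000))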
This is handy in case you run the MNIST benchmark, where it is predefined what should be the test and what should be the validation dataset.
Answered by prosti
"Custom dataset" has a special meaning in PyTorch, but I think you meant any dataset. Let's check out the MNIST dataset (this is probably the most famous dataset for beginners).
import torch, torchvision
import torchvision.datasets as datasets
from torch.utils.data import DataLoader, Dataset, TensorDataset

train_loader = DataLoader(
    torchvision.datasets.MNIST('/data/mnist', train=True, download=True,
                               transform=torchvision.transforms.Compose([
                                   torchvision.transforms.ToTensor(),
                                   torchvision.transforms.Normalize(
                                       (0.5,), (0.5,))
                               ])),
    batch_size=16, shuffle=False)

print(train_loader.dataset.data.shape)
test_ds = train_loader.dataset.data[:50000, :, :]
valid_ds = train_loader.dataset.data[50000:, :, :]
print(test_ds.shape)
print(valid_ds.shape)
test_dst = train_loader.dataset.targets.data[:50000]
valid_dst = train_loader.dataset.targets.data[50000:]
print(test_dst, test_dst.shape)
print(valid_dst, valid_dst.shape)
What this will output is the size of the original dataset, [60000, 28, 28], then the splits: [50000, 28, 28] for test and [10000, 28, 28] for validation:
torch.Size([60000, 28, 28])
torch.Size([50000, 28, 28])
torch.Size([10000, 28, 28])
tensor([5, 0, 4, ..., 8, 4, 8]) torch.Size([50000])
tensor([3, 8, 6, ..., 5, 6, 8]) torch.Size([10000])
Additional info, in case you actually plan to pair images and labels (targets) together:
bs = 16
test_dl = DataLoader(TensorDataset(test_ds, test_dst), batch_size=bs, shuffle=True)

for xb, yb in test_dl:
    ...  # do your work
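One caveat worth adding (my note, not from the answer): slicing dataset.data bypasses the ToTensor/Normalize transforms, so these tensors hold raw uint8 values in [0, 255]. If you train on them directly, convert and scale first, for example:

# Convert to float in [0, 1], then apply the same normalization the
# DataLoader transform would have applied.
test_ds = test_ds.float().div(255.0)
test_ds = (test_ds - 0.5) / 0.5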