Python: How do I split a custom dataset into training and test datasets?
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me) and distribute it under the same CC BY-SA license: StackOverflow
Original URL: http://stackoverflow.com/questions/50544730/
How do I split a custom dataset into training and test datasets?
Asked by nirvair
import pandas as pd
import numpy as np
import cv2
from torch.utils.data.dataset import Dataset

class CustomDatasetFromCSV(Dataset):
    def __init__(self, csv_path, transform=None):
        self.data = pd.read_csv(csv_path)
        self.labels = pd.get_dummies(self.data['emotion']).as_matrix()
        self.height = 48
        self.width = 48
        self.transform = transform

    def __getitem__(self, index):
        pixels = self.data['pixels'].tolist()
        faces = []
        for pixel_sequence in pixels:
            face = [int(pixel) for pixel in pixel_sequence.split(' ')]
            # print(np.asarray(face).shape)
            face = np.asarray(face).reshape(self.width, self.height)
            face = cv2.resize(face.astype('uint8'), (self.width, self.height))
            faces.append(face.astype('float32'))
        faces = np.asarray(faces)
        faces = np.expand_dims(faces, -1)
        return faces, self.labels

    def __len__(self):
        return len(self.data)
This is what I managed to put together using references from other repositories. However, I want to split this dataset into train and test sets.
How can I do that inside this class? Or do I need to make a separate class to do that?
Answered by Fábio Perez
Starting in PyTorch 0.4.1, you can use random_split:
train_size = int(0.8 * len(full_dataset))
test_size = len(full_dataset) - train_size
train_dataset, test_dataset = torch.utils.data.random_split(full_dataset, [train_size, test_size])
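As an aside (my addition, not part of the original answer): random_split returns torch.utils.data.Subset objects, which can be handed straight to DataLoader. A minimal, self-contained sketch, with a random TensorDataset standing in for full_dataset:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset: 100 samples with 10 features each, binary labels.
full_dataset = TensorDataset(torch.randn(100, 10), torch.randint(0, 2, (100,)))

train_size = int(0.8 * len(full_dataset))
test_size = len(full_dataset) - train_size
# Recent PyTorch versions also accept a generator for a reproducible split.
train_dataset, test_dataset = torch.utils.data.random_split(
    full_dataset, [train_size, test_size],
    generator=torch.Generator().manual_seed(42))

# The splits work directly with DataLoader.
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16)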
Answered by benjaminplanche
Using PyTorch's SubsetRandomSampler:
import torch
import numpy as np
import pandas as pd
import cv2
from torch.utils.data import Dataset
from torchvision import datasets
from torchvision import transforms
from torch.utils.data.sampler import SubsetRandomSampler

class CustomDatasetFromCSV(Dataset):
    def __init__(self, csv_path, transform=None):
        self.data = pd.read_csv(csv_path)
        # .as_matrix() was removed in pandas 1.0; use .to_numpy() on modern pandas
        self.labels = pd.get_dummies(self.data['emotion']).as_matrix()
        self.height = 48
        self.width = 48
        self.transform = transform

    def __getitem__(self, index):
        # This method should return only 1 sample and label
        # (according to "index"), not the whole dataset
        # So probably something like this for you:
        pixel_sequence = self.data['pixels'][index]
        face = [int(pixel) for pixel in pixel_sequence.split(' ')]
        face = np.asarray(face).reshape(self.width, self.height)
        face = cv2.resize(face.astype('uint8'), (self.width, self.height))
        label = self.labels[index]
        return face, label

    def __len__(self):
        return len(self.labels)

dataset = CustomDatasetFromCSV(my_path)
batch_size = 16
validation_split = .2
shuffle_dataset = True
random_seed = 42

# Creating data indices for training and validation splits:
dataset_size = len(dataset)
indices = list(range(dataset_size))
split = int(np.floor(validation_split * dataset_size))
if shuffle_dataset:
    np.random.seed(random_seed)
    np.random.shuffle(indices)
train_indices, val_indices = indices[split:], indices[:split]

# Creating PT data samplers and loaders:
train_sampler = SubsetRandomSampler(train_indices)
valid_sampler = SubsetRandomSampler(val_indices)

train_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size,
                                           sampler=train_sampler)
validation_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size,
                                                sampler=valid_sampler)

# Usage Example:
num_epochs = 10
for epoch in range(num_epochs):
    # Train:
    for batch_index, (faces, labels) in enumerate(train_loader):
        ...  # training step goes here
Answered by Shital Shah
The current answers do random splits, which has the disadvantage that the number of samples per class is not guaranteed to be balanced. This is especially problematic when you want a small number of samples per class. For example, MNIST has 60,000 examples, i.e. 6,000 per digit. Assume that you want only 30 examples per digit in your training set. In this case, a random split may produce an imbalance between classes (one digit having more training data than the others). So you want to make sure each digit has precisely 30 examples. This is called stratified sampling.
One way to do this is to use the sampler interface in PyTorch; sample code is here.
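For illustration, here is a minimal sketch (mine, not the linked sample code) of a stratified split: scikit-learn's train_test_split computes the stratified index split, which then plugs into the sampler approach from the previous answer. Here, dataset and labels are assumed to be any map-style Dataset and a matching per-sample class-label array:

import numpy as np
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader
from torch.utils.data.sampler import SubsetRandomSampler

# Assumed to exist: `dataset` (any map-style Dataset) and `labels`
# (one class label per sample, in the same order as the dataset).
indices = np.arange(len(labels))
train_idx, test_idx = train_test_split(
    indices, test_size=0.2, stratify=labels, random_state=42)

train_loader = DataLoader(dataset, batch_size=16,
                          sampler=SubsetRandomSampler(train_idx))
test_loader = DataLoader(dataset, batch_size=16,
                         sampler=SubsetRandomSampler(test_idx))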
Another way is to just hack your way through :). For example, below is a simple implementation for MNIST, where ds is the MNIST dataset and k is the number of samples needed for each class.
import torch
from torch.utils.data import TensorDataset

def sampleFromClass(ds, k):
    class_counts = {}
    train_data = []
    train_label = []
    test_data = []
    test_label = []
    for data, label in ds:
        # Note: torchvision's MNIST yields integer labels; .item() assumes
        # a tensor label, so adapt this line if your dataset returns ints.
        c = label.item()
        class_counts[c] = class_counts.get(c, 0) + 1
        if class_counts[c] <= k:
            train_data.append(data)
            train_label.append(torch.unsqueeze(label, 0))
        else:
            test_data.append(data)
            test_label.append(torch.unsqueeze(label, 0))
    train_data = torch.cat(train_data)
    train_label = torch.cat(train_label)
    test_data = torch.cat(test_data)
    test_label = torch.cat(test_label)
    return (TensorDataset(train_data, train_label),
            TensorDataset(test_data, test_label))
You can use this function like this:
from torchvision import datasets, transforms

def main():
    train_ds = datasets.MNIST('../data', train=True, download=True,
                              transform=transforms.Compose([
                                  transforms.ToTensor()
                              ]))
    train_ds, test_ds = sampleFromClass(train_ds, 3)
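A quick sanity check (my addition, not from the answer) that the stratification worked: inside main(), after the split, each class should appear exactly k = 3 times in the training set, and TensorDataset keeps the labels as its second tensor:

    # Each digit should occur exactly 3 times in the training labels.
    print(torch.bincount(train_ds.tensors[1]))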
Answered by prosti
This answer relies on the PyTorch Subset class, which is what the random_split method returns; the same mechanism underlies SubsetRandomSampler.
For MNIST, if we use random_split:
import torch
import torchvision
from torch.utils.data import DataLoader

loader = DataLoader(
    torchvision.datasets.MNIST('/data/mnist', train=True, download=True,
                               transform=torchvision.transforms.Compose([
                                   torchvision.transforms.ToTensor(),
                                   torchvision.transforms.Normalize(
                                       (0.5,), (0.5,))
                               ])),
    batch_size=16, shuffle=False)

print(loader.dataset.data.shape)
test_ds, valid_ds = torch.utils.data.random_split(loader.dataset, (50000, 10000))
print(test_ds, valid_ds)
print(test_ds.indices, valid_ds.indices)
# (On recent PyTorch versions .indices is a plain list, so .shape may not exist.)
print(test_ds.indices.shape, valid_ds.indices.shape)
We get:
torch.Size([60000, 28, 28])
<torch.utils.data.dataset.Subset object at 0x0000020FD1880B00> <torch.utils.data.dataset.Subset object at 0x0000020FD1880C50>
tensor([ 1520, 4155, 45472, ..., 37969, 45782, 34080]) tensor([ 9133, 51600, 22067, ..., 3950, 37306, 31400])
torch.Size([50000]) torch.Size([10000])
Our test_ds.indices and valid_ds.indices will be random, drawn from the range (0, 60000). But if I would like to get a sequence of indices from (0, 49999) and from (50000, 59999), I cannot do that at the moment, unfortunately, except this way.
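(One way to build such a deterministic, sequential split by hand — my sketch, not part of the original answer — is to pass explicit index ranges to torch.utils.data.Subset:)

from torch.utils.data import Subset

# First 50,000 samples in one split, the remaining 10,000 in the other.
test_ds = Subset(loader.dataset, range(0, 50000))
valid_ds = Subset(loader.dataset, range(50000, 60000))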
This is handy in case you run the MNIST benchmark, where it is predefined what should be the test and what should be the validation dataset.
Answered by prosti
"Custom dataset" has a special meaning in PyTorch, but I think you meant any dataset. Let's check out the MNIST dataset (this is probably the most famous dataset for beginners).
import torch, torchvision
import torchvision.datasets as datasets
from torch.utils.data import DataLoader, Dataset, TensorDataset

train_loader = DataLoader(
    torchvision.datasets.MNIST('/data/mnist', train=True, download=True,
                               transform=torchvision.transforms.Compose([
                                   torchvision.transforms.ToTensor(),
                                   torchvision.transforms.Normalize(
                                       (0.5,), (0.5,))
                               ])),
    batch_size=16, shuffle=False)

print(train_loader.dataset.data.shape)
test_ds = train_loader.dataset.data[:50000, :, :]
valid_ds = train_loader.dataset.data[50000:, :, :]
print(test_ds.shape)
print(valid_ds.shape)
test_dst = train_loader.dataset.targets.data[:50000]
valid_dst = train_loader.dataset.targets.data[50000:]
print(test_dst, test_dst.shape)
print(valid_dst, valid_dst.shape)
What this will output is the size of the original dataset, [60000, 28, 28], then the splits: [50000, 28, 28] for test and [10000, 28, 28] for validation:
torch.Size([60000, 28, 28])
torch.Size([50000, 28, 28])
torch.Size([10000, 28, 28])
tensor([5, 0, 4, ..., 8, 4, 8]) torch.Size([50000])
tensor([3, 8, 6, ..., 5, 6, 8]) torch.Size([10000])
Additional info, in case you actually plan to pair images and labels (targets) together:
bs = 16
test_dl = DataLoader(TensorDataset(test_ds, test_dst), batch_size=bs, shuffle=True)

for xb, yb in test_dl:
    ...  # do your work
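One caveat worth adding (my note, not from the answer): slicing dataset.data bypasses the ToTensor/Normalize transforms, so these tensors hold raw uint8 values in [0, 255]. If you train on them directly, convert and scale first, for example:

# Convert to float in [0, 1], then apply the same normalization the
# DataLoader transform would have applied.
test_ds = test_ds.float().div(255.0)
test_ds = (test_ds - 0.5) / 0.5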