Python: How to get mini-batches in pytorch in a clean and efficient way?

Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me), citing the original source: http://stackoverflow.com/questions/45113245/


How to get mini-batches in pytorch in a clean and efficient way?

python, numpy, machine-learning, deep-learning, pytorch

Asked by Charlie Parker

I was trying to do a simple thing, which was to train a linear model with Stochastic Gradient Descent (SGD) using torch:


import numpy as np

import torch
from torch.autograd import Variable

import pdb

def get_batch2(X,Y,M,dtype):
    X,Y = X.data.numpy(), Y.data.numpy()
    N = len(Y)
    valid_indices = np.array( range(N) )
    batch_indices = np.random.choice(valid_indices,size=M,replace=False)
    batch_xs = torch.FloatTensor(X[batch_indices,:]).type(dtype)
    batch_ys = torch.FloatTensor(Y[batch_indices]).type(dtype)
    return Variable(batch_xs, requires_grad=False), Variable(batch_ys, requires_grad=False)

def poly_kernel_matrix( x,D ):
    N = len(x)
    Kern = np.zeros( (N,D+1) )
    for n in range(N):
        for d in range(D+1):
            Kern[n,d] = x[n]**d;
    return Kern

## data params
N=5 # data set size
Degree=4 # number dimensions/features
D_sgd = Degree+1
##
x_true = np.linspace(0,1,N) # the real data points
y = np.sin(2*np.pi*x_true)
y.shape = (N,1)
## TORCH
dtype = torch.FloatTensor
# dtype = torch.cuda.FloatTensor # Uncomment this to run on GPU
X_mdl = poly_kernel_matrix( x_true,Degree )
X_mdl = Variable(torch.FloatTensor(X_mdl).type(dtype), requires_grad=False)
y = Variable(torch.FloatTensor(y).type(dtype), requires_grad=False)
## SGD mdl
w_init = torch.zeros(D_sgd,1).type(dtype)
W = Variable(w_init, requires_grad=True)
M = 5 # mini-batch size
eta = 0.1 # step size
for i in range(500):
    batch_xs, batch_ys = get_batch2(X_mdl,y,M,dtype)
    # Forward pass: compute predicted y using operations on Variables
    y_pred = batch_xs.mm(W)
    # Compute and print loss using operations on Variables. Now loss is a Variable of shape (1,) and loss.data is a Tensor of shape (1,); loss.data[0] is a scalar value holding the loss.
    loss = (1/N)*(y_pred - batch_ys).pow(2).sum()
    # Use autograd to compute the backward pass. Now w will have gradients
    loss.backward()
    # Update weights using gradient descent; w1.data are Tensors,
    # w.grad are Variables and w.grad.data are Tensors.
    W.data -= eta * W.grad.data
    # Manually zero the gradients after updating weights
    W.grad.data.zero_()

#
c_sgd = W.data.numpy()
X_mdl = X_mdl.data.numpy()
y = y.data.numpy()
#
Xc_pinv = np.dot(X_mdl,c_sgd)
print('J(c_sgd) = ', (1/N)*(np.linalg.norm(y-Xc_pinv)**2) )
print('loss = ',loss.data[0])

The code runs fine, and although my get_batch2 method seems really dumb/naive (probably because I am new to pytorch), I have not found a good place where they discuss how to retrieve data batches. I went through their tutorials (http://pytorch.org/tutorials/beginner/pytorch_with_examples.html) and through the data loading tutorial (http://pytorch.org/tutorials/beginner/data_loading_tutorial.html) with no luck. The tutorials all seem to assume that one already has the batch and batch-size at the beginning and then proceeds to train with that data without changing it (specifically look at http://pytorch.org/tutorials/beginner/pytorch_with_examples.html#pytorch-variables-and-autograd).


So my question is: do I really need to turn my data back into numpy so that I can fetch some random sample of it and then turn it back into a pytorch Variable to be able to train in memory? Is there no way to get mini-batches with torch?


I looked at a few functions torch provides but with no luck:


#pdb.set_trace()
#valid_indices = torch.arange(0,N).numpy()
#valid_indices = np.array( range(N) )
#batch_indices = np.random.choice(valid_indices,size=M,replace=False)
#indices = torch.LongTensor(batch_indices)
#batch_xs, batch_ys = torch.index_select(X_mdl, 0, indices), torch.index_select(y, 0, indices)
#batch_xs,batch_ys = torch.index_select(X_mdl, 0, indices), torch.index_select(y, 0, indices)

Even though the code I provided works fine, I am worried that it is not an efficient implementation, and that if I were to use GPUs there would be a considerable further slowdown (because my guess is that putting things in CPU memory and then fetching them back onto the GPU like that is silly).




I implemented a new one based on the answer that suggested using torch.index_select():


def get_batch2(X,Y,M):
    '''
    get batch for pytorch model
    '''
    # TODO fix and make it nicer, there is pytorch forum question
    #X,Y = X.data.numpy(), Y.data.numpy()
    X,Y = X, Y
    N = X.size()[0]
    batch_indices = torch.LongTensor( np.random.randint(0,N+1,size=M) )
    pdb.set_trace()
    batch_xs = torch.index_select(X,0,batch_indices)
    batch_ys = torch.index_select(Y,0,batch_indices)
    return Variable(batch_xs, requires_grad=False), Variable(batch_ys, requires_grad=False)

However, this seems to have issues because it does not work if X, Y are NOT Variables... which is really odd. I added this to the pytorch forum: https://discuss.pytorch.org/t/how-to-get-mini-batches-in-pytorch-in-a-clean-and-efficient-way/10322


Right now what I am struggling with is making this work on the GPU. My most current version:


def get_batch2(X,Y,M,dtype):
    '''
    get batch for pytorch model
    '''
    # TODO fix and make it nicer, there is pytorch forum question
    #X,Y = X.data.numpy(), Y.data.numpy()
    X,Y = X, Y
    N = X.size()[0]
    if dtype ==  torch.cuda.FloatTensor:
        batch_indices = torch.cuda.LongTensor( np.random.randint(0,N,size=M) )# without replacement
    else:
        batch_indices = torch.LongTensor( np.random.randint(0,N,size=M) ).type(dtype)  # without replacement
    pdb.set_trace()
    batch_xs = torch.index_select(X,0,batch_indices)
    batch_ys = torch.index_select(Y,0,batch_indices)
    return Variable(batch_xs, requires_grad=False), Variable(batch_ys, requires_grad=False)

The error:


RuntimeError: tried to construct a tensor from a int sequence, but found an item of type numpy.int64 at index (0)

I don't get it, do I really have to do:


ints = [ random.randint(0,N) for i in range(M)]

to get the integers?


It would also be ideal if the data could be a Variable. It seems that torch.index_select does not work for Variable type data.


This list-of-integers approach still doesn't work:


TypeError: torch.addmm received an invalid combination of arguments - got (int, torch.cuda.FloatTensor, int, torch.cuda.FloatTensor, torch.FloatTensor, out=torch.cuda.FloatTensor), but expected one of:
 * (torch.cuda.FloatTensor source, torch.cuda.FloatTensor mat1, torch.cuda.FloatTensor mat2, *, torch.cuda.FloatTensor out)
 * (torch.cuda.FloatTensor source, torch.cuda.sparse.FloatTensor mat1, torch.cuda.FloatTensor mat2, *, torch.cuda.FloatTensor out)
 * (float beta, torch.cuda.FloatTensor source, torch.cuda.FloatTensor mat1, torch.cuda.FloatTensor mat2, *, torch.cuda.FloatTensor out)
 * (torch.cuda.FloatTensor source, float alpha, torch.cuda.FloatTensor mat1, torch.cuda.FloatTensor mat2, *, torch.cuda.FloatTensor out)
 * (float beta, torch.cuda.FloatTensor source, torch.cuda.sparse.FloatTensor mat1, torch.cuda.FloatTensor mat2, *, torch.cuda.FloatTensor out)
 * (torch.cuda.FloatTensor source, float alpha, torch.cuda.sparse.FloatTensor mat1, torch.cuda.FloatTensor mat2, *, torch.cuda.FloatTensor out)
 * (float beta, torch.cuda.FloatTensor source, float alpha, torch.cuda.FloatTensor mat1, torch.cuda.FloatTensor mat2, *, torch.cuda.FloatTensor out)
      didn't match because some of the arguments have invalid types: (int, torch.cuda.FloatTensor, int, torch.cuda.FloatTensor, torch.FloatTensor, out=torch.cuda.FloatTensor)
 * (float beta, torch.cuda.FloatTensor source, float alpha, torch.cuda.sparse.FloatTensor mat1, torch.cuda.FloatTensor mat2, *, torch.cuda.FloatTensor out)
      didn't match because some of the arguments have invalid types: (int, torch.cuda.FloatTensor, int, torch.cuda.FloatTensor, torch.FloatTensor, out=torch.cuda.FloatTensor)

Answered by Mo Hossny

Use data loaders.


Data Set


First you define a dataset. You can use the prepackaged datasets in torchvision.datasets or the ImageFolder dataset class, which follows the structure of ImageNet.


trainset=torchvision.datasets.ImageFolder(root='/path/to/your/data/trn', transform=generic_transform)
testset=torchvision.datasets.ImageFolder(root='/path/to/your/data/val', transform=generic_transform)

Transforms


Transforms are very useful for preprocessing loaded data on the fly. If you are using images, you have to use the ToTensor() transform to convert loaded images from PIL to torch.Tensor. More transforms can be packed into a composite transform as follows.


generic_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.ToPILImage(),
    #transforms.CenterCrop(size=128),
    transforms.Lambda(lambda x: myimresize(x, (128, 128))),  # myimresize is a user-defined resize helper
    transforms.ToTensor(),
    transforms.Normalize((0., 0., 0.), (6, 6, 6))
])

Data Loader


Then you define a data loader, which prepares the next batch while training. You can set the number of worker processes for data loading.


trainloader=torch.utils.data.DataLoader(trainset, batch_size=32, shuffle=True, num_workers=8)
testloader=torch.utils.data.DataLoader(testset, batch_size=32, shuffle=False, num_workers=8)

For training, you just enumerate over the data loader.


for i, data in enumerate(trainloader, 0):
    inputs, labels = data
    inputs, labels = Variable(inputs.cuda()), Variable(labels.cuda())
    # continue training...

NumPy Stuff


Yes. You have to convert torch.Tensor to numpy using the .numpy() method to work on it. If you are using CUDA, you have to move the data from GPU to CPU first using the .cpu() method before calling .numpy(). Personally, coming from a MATLAB background, I prefer to do most of the work with torch tensors, then convert data to numpy only for visualisation. Also bear in mind that torch stores data in channel-first order while numpy and PIL work channel-last. This means you need to use np.rollaxis to move the channel axis to the end. A sample code is below.


np.rollaxis(make_grid(mynet.ftrextractor(inputs).data, nrow=8, padding=1).cpu().numpy(), 0, 3)
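For context, a slightly expanded sketch of the same idea (mynet.ftrextractor and inputs are taken from the one-liner above and are assumed to exist in your project; the rest is standard torchvision/matplotlib usage):

import numpy as np
import matplotlib.pyplot as plt
from torchvision.utils import make_grid

# grab a batch of feature maps from the (assumed) feature extractor
features = mynet.ftrextractor(inputs).data      # CUDA tensor, shape (B, C, H, W)
grid = make_grid(features, nrow=8, padding=1)   # one CHW image tiling the batch
grid_np = grid.cpu().numpy()                    # move to CPU before calling .numpy()
grid_hwc = np.rollaxis(grid_np, 0, 3)           # channel-first -> channel-last for plotting
plt.imshow(grid_hwc)
plt.show()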

Logging


The best method I found to visualise the feature maps is using TensorBoard. Code is available at yunjey/pytorch-tutorial.

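As a rough illustration only (not from the original answer), newer PyTorch versions ship torch.utils.tensorboard, so logging a grid of feature maps might look like the sketch below; mynet.ftrextractor and inputs are again assumed placeholders:

from torch.utils.tensorboard import SummaryWriter
from torchvision.utils import make_grid

writer = SummaryWriter(log_dir='runs/feature_maps')  # hypothetical log directory
# make_grid returns a single CHW tensor, which is what add_image expects by default
grid = make_grid(mynet.ftrextractor(inputs).data, nrow=8, padding=1)
writer.add_image('features', grid, global_step=0)
writer.close()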

Answered by Forcetti

Not sure what you were trying to do. W.r.t. batching, you wouldn't have to convert to numpy. You could just use index_select(), e.g.:


for epoch in range(500):
    k = 0
    loss = 0
    while k < X_mdl.size(0):

        # draw M random row indices (here with replacement)
        random_batch = [0]*M
        for i in range(M):
            random_batch[i] = np.random.choice(N)
        random_batch = torch.LongTensor(random_batch)
        batch_xs = X_mdl.index_select(0, random_batch)
        batch_ys = y.index_select(0, random_batch)

        # Forward pass: compute predicted y using operations on Variables
        y_pred = batch_xs.mm(W)
        # etc..
        k += M

The rest of the code would have to be changed as well though.




My guess is you would like to create a get_batch function that concatenates your X tensors and Y tensors. Something like:


def make_batch(list_of_tensors):
    X, y = list_of_tensors[0]
    # may need to unsqueeze X and y to get right dimensions
    for i, (sample, label) in enumerate(list_of_tensors[1:]):
        X = torch.cat((X, sample), dim=0)
        y = torch.cat((y, label), dim=0)
    return X, y

Then, during training, you select examples through slicing with e.g. max_batch_size = 32:


for epoch in range(n_epochs):
    X, y = make_batch(list_of_tensors)
    X = Variable(X, requires_grad=False)
    y = Variable(y, requires_grad=False)

    k = 0
    while k < X.size(0):
        inputs = X[k:k+max_batch_size,:]
        labels = y[k:k+max_batch_size,:]
        # some computation
        k += max_batch_size

Answered by saetch_g

If I'm understanding your code correctly, your get_batch2 function appears to be taking random mini-batches from your dataset without tracking which indices you've already used in an epoch. The issue with this implementation is that it likely will not make use of all of your data.


The way I usually do batching is by creating a random permutation of all the possible indices using torch.randperm(N) and looping through them in batches. For example:


n_epochs = 100 # or whatever
batch_size = 128 # or whatever

for epoch in range(n_epochs):

    # X is a torch Variable
    permutation = torch.randperm(X.size()[0])

    for i in range(0,X.size()[0], batch_size):
        optimizer.zero_grad()

        indices = permutation[i:i+batch_size]
        batch_x, batch_y = X[indices], Y[indices]

        # in case you wanted a semi-full example
        outputs = model.forward(batch_x)
        loss = lossfunction(outputs,batch_y)

        loss.backward()
        optimizer.step()

If you like to copy and paste, make sure you define your optimizer, model, and lossfunction somewhere before the start of the epoch loop.


With regards to your error, try using torch.from_numpy(np.random.randint(0,N,size=M)).long() instead of torch.LongTensor(np.random.randint(0,N,size=M)). I'm not sure if this will solve the error you are getting, but it will solve a future error.

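For illustration, here is a hedged sketch of how the questioner's get_batch2 might look with that suggestion applied (not part of the original answer; it assumes X and Y are torch tensors or Variables that already live on the right device):

import numpy as np
import torch

def get_batch2(X, Y, M):
    '''Sample a mini-batch of M rows from X and Y (sampling with replacement).'''
    N = X.size(0)
    # build the index tensor via from_numpy(...).long(), as suggested above
    batch_indices = torch.from_numpy(np.random.randint(0, N, size=M)).long()
    if X.is_cuda:
        # indices must live on the same device as the data being indexed
        batch_indices = batch_indices.cuda()
    batch_xs = X.index_select(0, batch_indices)
    batch_ys = Y.index_select(0, batch_indices)
    return batch_xs, batch_ys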

Answered by gary69

Create a class that is a subclass of torch.utils.data.Dataset and pass it to a torch.utils.data.DataLoader. Below is an example from my project.


class CandidateDataset(Dataset):
    def __init__(self, x, y):
        self.len = x.shape[0]
        if torch.cuda.is_available():
            device = 'cuda'
        else:
            device = 'cpu'
        self.x_data = torch.as_tensor(x, device=device, dtype=torch.float)
        self.y_data = torch.as_tensor(y, device=device, dtype=torch.long)

    def __getitem__(self, index):
        return self.x_data[index], self.y_data[index]

    def __len__(self):
        return self.len

# fit() is a method of a trainer class in the original project; fill_matrices is a project-specific helper
def fit(self, candidate_count):
    feature_matrix = np.empty(shape=(candidate_count, 600))
    target_matrix = np.empty(shape=(candidate_count, 1))
    fill_matrices(feature_matrix, target_matrix)
    candidate_ds = CandidateDataset(feature_matrix, target_matrix)
    train_loader = DataLoader(dataset=candidate_ds, batch_size=self.BATCH_SIZE, shuffle=True)
    for epoch in range(self.N_EPOCHS):
        print('starting epoch ' + str(epoch))
        for batch_idx, (inputs, labels) in enumerate(train_loader):
            print('starting batch ' + str(batch_idx) + ' epoch ' + str(epoch))
            inputs, labels = Variable(inputs), Variable(labels)
            self.optimizer.zero_grad()
            inputs = inputs.view(1, inputs.size()[0], 600)
            # init hidden with number of rows in input
            y_pred = self.model(inputs, self.model.initHidden(inputs.size()[1]))
            labels.squeeze_()
            # labels should be a tensor with batch_size rows; each entry is the class index (0 or 1)
            loss = self.loss_f(y_pred, labels)
            loss.backward()
            self.optimizer.step()
            print('done batch ' + str(batch_idx) + ' epoch ' + str(epoch))

Answered by Jibin Mathew

You can use torch.utils.data


Assuming you have loaded the data from the directory into train and test numpy arrays, you can inherit from the torch.utils.data.Dataset class to create your dataset object:


class MyDataset(Dataset):
    def __init__(self, x, y):
        super(MyDataset, self).__init__()
        assert x.shape[0] == y.shape[0] # assuming shape[0] = dataset size
        self.x = x
        self.y = y


    def __len__(self):
        return self.y.shape[0]

    def __getitem__(self, index):
        return self.x[index], self.y[index]

Then, create your dataset object


traindata = MyDataset(train_x, train_y)

Finally, use DataLoader to create your mini-batches:


trainloader = torch.utils.data.DataLoader(traindata, batch_size=64, shuffle=True)
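
For completeness, a minimal training-loop sketch over this loader (not part of the original answer; n_epochs is an assumed variable):

for epoch in range(n_epochs):
    for x_batch, y_batch in trainloader:
        # each iteration yields up to batch_size (here 64) samples
        pass  # forward pass, loss, backward and optimizer step go here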