How to get mini-batches in pytorch in a clean and efficient way?
Note: this page is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/45113245/
Asked by Charlie Parker
I was trying to do a simple thing, which was to train a linear model with Stochastic Gradient Descent (SGD) using torch:
import numpy as np
import torch
from torch.autograd import Variable
import pdb

def get_batch2(X,Y,M,dtype):
    X,Y = X.data.numpy(), Y.data.numpy()
    N = len(Y)
    valid_indices = np.array( range(N) )
    batch_indices = np.random.choice(valid_indices,size=M,replace=False)
    batch_xs = torch.FloatTensor(X[batch_indices,:]).type(dtype)
    batch_ys = torch.FloatTensor(Y[batch_indices]).type(dtype)
    return Variable(batch_xs, requires_grad=False), Variable(batch_ys, requires_grad=False)

def poly_kernel_matrix( x,D ):
    N = len(x)
    Kern = np.zeros( (N,D+1) )
    for n in range(N):
        for d in range(D+1):
            Kern[n,d] = x[n]**d
    return Kern

## data params
N=5 # data set size
Degree=4 # number dimensions/features
D_sgd = Degree+1
##
x_true = np.linspace(0,1,N) # the real data points
y = np.sin(2*np.pi*x_true)
y.shape = (N,1)

## TORCH
dtype = torch.FloatTensor
# dtype = torch.cuda.FloatTensor # Uncomment this to run on GPU
X_mdl = poly_kernel_matrix( x_true,Degree )
X_mdl = Variable(torch.FloatTensor(X_mdl).type(dtype), requires_grad=False)
y = Variable(torch.FloatTensor(y).type(dtype), requires_grad=False)

## SGD mdl
w_init = torch.zeros(D_sgd,1).type(dtype)
W = Variable(w_init, requires_grad=True)
M = 5 # mini-batch size
eta = 0.1 # step size
for i in range(500):
    batch_xs, batch_ys = get_batch2(X_mdl,y,M,dtype)
    # Forward pass: compute predicted y using operations on Variables
    y_pred = batch_xs.mm(W)
    # Compute and print loss using operations on Variables. Now loss is a Variable of
    # shape (1,) and loss.data is a Tensor of shape (1,); loss.data[0] is a scalar value
    # holding the loss.
    loss = (1/N)*(y_pred - batch_ys).pow(2).sum()
    # Use autograd to compute the backward pass. Now w will have gradients
    loss.backward()
    # Update weights using gradient descent; w1.data are Tensors,
    # w.grad are Variables and w.grad.data are Tensors.
    W.data -= eta * W.grad.data
    # Manually zero the gradients after updating weights
    W.grad.data.zero_()

#
c_sgd = W.data.numpy()
X_mdl = X_mdl.data.numpy()
y = y.data.numpy()
#
Xc_pinv = np.dot(X_mdl,c_sgd)
print('J(c_sgd) = ', (1/N)*(np.linalg.norm(y-Xc_pinv)**2) )
print('loss = ',loss.data[0])
The code runs fine, although my get_batch2 method seems really dumb/naive. That's probably because I am new to pytorch, but I have not found a good place where they discuss how to retrieve data batches. I went through their tutorials (http://pytorch.org/tutorials/beginner/pytorch_with_examples.html) and through the data loading tutorial (http://pytorch.org/tutorials/beginner/data_loading_tutorial.html) with no luck. The tutorials all seem to assume that one already has the batch and batch-size at the beginning and then proceeds to train with that data without changing it (specifically look at http://pytorch.org/tutorials/beginner/pytorch_with_examples.html#pytorch-variables-and-autograd).
So my question is: do I really need to turn my data back into numpy so that I can fetch some random sample of it, and then turn it back into a pytorch Variable to be able to train in memory? Is there no way to get mini-batches with torch?
I looked at a few functions torch provides but with no luck:
#pdb.set_trace()
#valid_indices = torch.arange(0,N).numpy()
#valid_indices = np.array( range(N) )
#batch_indices = np.random.choice(valid_indices,size=M,replace=False)
#indices = torch.LongTensor(batch_indices)
#batch_xs, batch_ys = torch.index_select(X_mdl, 0, indices), torch.index_select(y, 0, indices)
#batch_xs,batch_ys = torch.index_select(X_mdl, 0, indices), torch.index_select(y, 0, indices)
Even though the code I provided works fine, I am worried that it's not an efficient implementation, and that if I were to use GPUs there would be a considerable further slowdown (because my guess is that putting things in memory and then fetching them back to put them on the GPU like that is silly).
I implemented a new one based on the answer that suggested using torch.index_select():
def get_batch2(X,Y,M):
    '''
    get batch for pytorch model
    '''
    # TODO fix and make it nicer, there is pytorch forum question
    #X,Y = X.data.numpy(), Y.data.numpy()
    X,Y = X, Y
    N = X.size()[0]
    batch_indices = torch.LongTensor( np.random.randint(0,N+1,size=M) )
    pdb.set_trace()
    batch_xs = torch.index_select(X,0,batch_indices)
    batch_ys = torch.index_select(Y,0,batch_indices)
    return Variable(batch_xs, requires_grad=False), Variable(batch_ys, requires_grad=False)
However, this seems to have issues because it does not work if X,Y are NOT Variables...which is really odd. I added this to the pytorch forum: https://discuss.pytorch.org/t/how-to-get-mini-batches-in-pytorch-in-a-clean-and-efficient-way/10322
Right now what I am struggling with is making this work on the GPU. My most current version:
def get_batch2(X,Y,M,dtype):
    '''
    get batch for pytorch model
    '''
    # TODO fix and make it nicer, there is pytorch forum question
    #X,Y = X.data.numpy(), Y.data.numpy()
    X,Y = X, Y
    N = X.size()[0]
    if dtype == torch.cuda.FloatTensor:
        batch_indices = torch.cuda.LongTensor( np.random.randint(0,N,size=M) ) # without replacement
    else:
        batch_indices = torch.LongTensor( np.random.randint(0,N,size=M) ).type(dtype) # without replacement
    pdb.set_trace()
    batch_xs = torch.index_select(X,0,batch_indices)
    batch_ys = torch.index_select(Y,0,batch_indices)
    return Variable(batch_xs, requires_grad=False), Variable(batch_ys, requires_grad=False)
The error:
RuntimeError: tried to construct a tensor from a int sequence, but found an item of type numpy.int64 at index (0)
I don't get it, do I really have to do:
ints = [ random.randint(0,N) for i in range(M)]
to get the integers?
It would also be ideal if the data could be a Variable. It seems that torch.index_select does not work for Variable type data.
This list-of-integers approach still doesn't work:
TypeError: torch.addmm received an invalid combination of arguments - got (int, torch.cuda.FloatTensor, int, torch.cuda.FloatTensor, torch.FloatTensor, out=torch.cuda.FloatTensor), but expected one of:
* (torch.cuda.FloatTensor source, torch.cuda.FloatTensor mat1, torch.cuda.FloatTensor mat2, *, torch.cuda.FloatTensor out)
* (torch.cuda.FloatTensor source, torch.cuda.sparse.FloatTensor mat1, torch.cuda.FloatTensor mat2, *, torch.cuda.FloatTensor out)
* (float beta, torch.cuda.FloatTensor source, torch.cuda.FloatTensor mat1, torch.cuda.FloatTensor mat2, *, torch.cuda.FloatTensor out)
* (torch.cuda.FloatTensor source, float alpha, torch.cuda.FloatTensor mat1, torch.cuda.FloatTensor mat2, *, torch.cuda.FloatTensor out)
* (float beta, torch.cuda.FloatTensor source, torch.cuda.sparse.FloatTensor mat1, torch.cuda.FloatTensor mat2, *, torch.cuda.FloatTensor out)
* (torch.cuda.FloatTensor source, float alpha, torch.cuda.sparse.FloatTensor mat1, torch.cuda.FloatTensor mat2, *, torch.cuda.FloatTensor out)
* (float beta, torch.cuda.FloatTensor source, float alpha, torch.cuda.FloatTensor mat1, torch.cuda.FloatTensor mat2, *, torch.cuda.FloatTensor out)
didn't match because some of the arguments have invalid types: (int, torch.cuda.FloatTensor, int, torch.cuda.FloatTensor, torch.FloatTensor, out=torch.cuda.FloatTensor)
* (float beta, torch.cuda.FloatTensor source, float alpha, torch.cuda.sparse.FloatTensor mat1, torch.cuda.FloatTensor mat2, *, torch.cuda.FloatTensor out)
didn't match because some of the arguments have invalid types: (int, torch.cuda.FloatTensor, int, torch.cuda.FloatTensor, torch.FloatTensor, out=torch.cuda.FloatTensor)
Answered by Mo Hossny
Use data loaders.
Data Set
First you define a dataset. You can use the packaged datasets in torchvision.datasets or the ImageFolder dataset class, which follows the structure of Imagenet.
trainset=torchvision.datasets.ImageFolder(root='/path/to/your/data/trn', transform=generic_transform)
testset=torchvision.datasets.ImageFolder(root='/path/to/your/data/val', transform=generic_transform)
Transforms
Transforms are very useful for preprocessing loaded data on the fly. If you are using images, you have to use the ToTensor() transform to convert loaded images from PIL to torch.tensor. More transforms can be packed into a composite transform as follows.
generic_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.ToPILImage(),
    #transforms.CenterCrop(size=128),
    transforms.Lambda(lambda x: myimresize(x, (128, 128))),
    transforms.ToTensor(),
    transforms.Normalize((0., 0., 0.), (6, 6, 6))
])
Data Loader
Then you define a data loader, which prepares the next batch while training. You can set the number of worker threads for data loading.
trainloader=torch.utils.data.DataLoader(trainset, batch_size=32, shuffle=True, num_workers=8)
testloader=torch.utils.data.DataLoader(testset, batch_size=32, shuffle=False, num_workers=8)
For training, you just enumerate over the data loader.
for i, data in enumerate(trainloader, 0):
    inputs, labels = data
    inputs, labels = Variable(inputs.cuda()), Variable(labels.cuda())
    # continue training...
NumPy Stuff
Yes. You have to convert torch.tensor to numpy using the .numpy() method to work on it. If you are using CUDA you have to download the data from GPU to CPU first using the .cpu() method before calling .numpy(). Personally, coming from a MATLAB background, I prefer to do most of the work with torch tensors, then convert data to numpy only for visualisation. Also bear in mind that torch stores data in channel-first order, while numpy and PIL work channel-last. This means you need to use np.rollaxis to move the channel axis to the end. A sample code is below.
np.rollaxis(make_grid(mynet.ftrextractor(inputs).data, nrow=8, padding=1).cpu().numpy(), 0, 3)
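For illustration, here is a minimal, self-contained sketch of that channel-first to channel-last conversion (the mynet, ftrextractor, and myimresize names above come from the answer author's own project and are not defined here):

import numpy as np
import torch

t = torch.rand(3, 128, 128)                # torch image tensor: (C, H, W), channel-first
img = np.rollaxis(t.cpu().numpy(), 0, 3)   # move the channel axis to the end: (H, W, C)
print(img.shape)                           # (128, 128, 3), ready for matplotlib/PIL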
Logging
The best method I found to visualise the feature maps is using TensorBoard. Code is available at yunjey/pytorch-tutorial.
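As a rough illustration (not the code from that tutorial, which ships its own logger class), a minimal sketch of logging a scalar and a feature-map grid with the torch.utils.tensorboard.SummaryWriter API available in more recent PyTorch versions might look like this:

import torch
from torch.utils.tensorboard import SummaryWriter
from torchvision.utils import make_grid

writer = SummaryWriter(log_dir='runs/feature_maps')  # illustrative log directory

fmaps = torch.rand(16, 1, 28, 28)            # stand-in for feature maps from some layer
grid = make_grid(fmaps, nrow=8, padding=1)   # tile the maps into one image

writer.add_image('feature_maps', grid, global_step=0)   # visualise the grid
writer.add_scalar('train/loss', 0.42, global_step=0)    # log a scalar value
writer.close()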
Answered by Forcetti
Not sure what you were trying to do. W.r.t. batching you wouldn't have to convert to numpy. You could just use index_select(), e.g.:
for epoch in range(500):
    k=0
    loss = 0
    while k < X_mdl.size(0):
        random_batch = [0]*M
        for i in range(M):
            random_batch[i] = np.random.choice(N-1)
        random_batch = torch.LongTensor(random_batch)
        batch_xs = X_mdl.index_select(0, random_batch)
        batch_ys = y.index_select(0, random_batch)
        # Forward pass: compute predicted y using operations on Variables
        y_pred = batch_xs.mul(W)
        # etc.. (compute the loss, update W, then advance k, e.g. k += M)
The rest of the code would have to be changed as well though.
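For instance, a rough, self-contained sketch of what one adapted update step might look like (assuming the same W, eta, and squared-error loss as in the question, and normalising by the batch size M instead of N):

import torch
from torch.autograd import Variable

M, D = 5, 5                                   # illustrative batch size and feature count
W = Variable(torch.zeros(D, 1), requires_grad=True)
eta = 0.1
batch_xs = Variable(torch.rand(M, D))         # stand-ins for a sampled mini-batch
batch_ys = Variable(torch.rand(M, 1))

y_pred = batch_xs.mm(W)                            # forward pass on the mini-batch
loss = (1/M) * (y_pred - batch_ys).pow(2).sum()    # average squared error over the batch
loss.backward()                                    # gradients w.r.t. W
W.data -= eta * W.grad.data                        # SGD update
W.grad.data.zero_()                                # reset gradients before the next batch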
My guess is that you would like to create a get_batch function that concatenates your X tensors and Y tensors. Something like:
def make_batch(list_of_tensors):
    X, y = list_of_tensors[0]
    # may need to unsqueeze X and y to get right dimensions
    for i, (sample, label) in enumerate(list_of_tensors[1:]):
        X = torch.cat((X, sample), dim=0)
        y = torch.cat((y, label), dim=0)
    return X, y
Then, during training, you select examples through slicing, e.g. with max_batch_size = 32.
for epoch in range(n_epochs):   # n_epochs defined elsewhere
    X, y = make_batch(list_of_tensors)
    X = Variable(X, requires_grad=False)
    y = Variable(y, requires_grad=False)
    k = 0
    while k < X.size(0):
        inputs = X[k:k+max_batch_size,:]
        labels = y[k:k+max_batch_size,:]
        # some computation
        k += max_batch_size
Answered by saetch_g
If I'm understanding your code correctly, your get_batch2 function appears to be taking random mini-batches from your dataset without tracking which indices you've used already in an epoch. The issue with this implementation is that it likely will not make use of all of your data.
The way I usually do batching is by creating a random permutation of all the possible indices using torch.randperm(N) and looping through them in batches. For example:
n_epochs = 100 # or whatever
batch_size = 128 # or whatever

for epoch in range(n_epochs):
    # X is a torch Variable
    permutation = torch.randperm(X.size()[0])

    for i in range(0, X.size()[0], batch_size):
        optimizer.zero_grad()

        indices = permutation[i:i+batch_size]
        batch_x, batch_y = X[indices], Y[indices]

        # in case you wanted a semi-full example
        outputs = model.forward(batch_x)
        loss = lossfunction(outputs, batch_y)

        loss.backward()
        optimizer.step()
If you like to copy and paste, make sure you define your optimizer, model, and lossfunction somewhere before the start of the epoch loop.
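For instance, a minimal sketch of such definitions, assuming a plain linear model like the one in the question (the layer sizes and learning rate here are illustrative):

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(5, 1)                              # e.g. D_sgd=5 input features, 1 output
lossfunction = nn.MSELoss()                          # squared-error loss, as in the question
optimizer = optim.SGD(model.parameters(), lr=0.1)    # lr plays the role of eta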
With regards to your error, try using torch.from_numpy(np.random.randint(0,N,size=M)).long() instead of torch.LongTensor(np.random.randint(0,N,size=M)). I'm not sure if this will solve the error you are getting, but it will solve a future error.
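As a rough sketch of how that index construction could fit into the batching function (get_batch_indices is a hypothetical helper, and it assumes X and Y already live on the device you want to train on):

import numpy as np
import torch

def get_batch_indices(N, M, use_cuda=False):
    # build a LongTensor of M random row indices in [0, N)
    batch_indices = torch.from_numpy(np.random.randint(0, N, size=M)).long()
    if use_cuda:
        batch_indices = batch_indices.cuda()   # index tensor must be on the same device as X, Y
    return batch_indices

# usage: batch_xs = torch.index_select(X, 0, get_batch_indices(X.size(0), M, use_cuda=True))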
Answered by gary69
Create a class that is a subclass of torch.utils.data.Dataset and pass it to a torch.utils.data.DataLoader. Below is an example from my project.
class CandidateDataset(Dataset):
    def __init__(self, x, y):
        self.len = x.shape[0]
        if torch.cuda.is_available():
            device = 'cuda'
        else:
            device = 'cpu'
        self.x_data = torch.as_tensor(x, device=device, dtype=torch.float)
        self.y_data = torch.as_tensor(y, device=device, dtype=torch.long)

    def __getitem__(self, index):
        return self.x_data[index], self.y_data[index]

    def __len__(self):
        return self.len

def fit(self, candidate_count):
    feature_matrix = np.empty(shape=(candidate_count, 600))
    target_matrix = np.empty(shape=(candidate_count, 1))
    fill_matrices(feature_matrix, target_matrix)
    candidate_ds = CandidateDataset(feature_matrix, target_matrix)
    train_loader = DataLoader(dataset = candidate_ds, batch_size = self.BATCH_SIZE, shuffle = True)
    for epoch in range(self.N_EPOCHS):
        print('starting epoch ' + str(epoch))
        for batch_idx, (inputs, labels) in enumerate(train_loader):
            print('starting batch ' + str(batch_idx) + ' epoch ' + str(epoch))
            inputs, labels = Variable(inputs), Variable(labels)
            self.optimizer.zero_grad()
            inputs = inputs.view(1, inputs.size()[0], 600)
            # init hidden with number of rows in input
            y_pred = self.model(inputs, self.model.initHidden(inputs.size()[1]))
            labels.squeeze_()
            # labels should be tensor with batch_size rows. Column the index of the class (0 or 1)
            loss = self.loss_f(y_pred, labels)
            loss.backward()
            self.optimizer.step()
            print('done batch ' + str(batch_idx) + ' epoch ' + str(epoch))
Answered by Jibin Mathew
You can use torch.utils.data
Assuming you have loaded the data from the directory into train and test numpy arrays, you can inherit from the torch.utils.data.Dataset class to create your dataset object:
class MyDataset(Dataset):
    def __init__(self, x, y):
        super(MyDataset, self).__init__()
        assert x.shape[0] == y.shape[0] # assuming shape[0] = dataset size
        self.x = x
        self.y = y

    def __len__(self):
        return self.y.shape[0]

    def __getitem__(self, index):
        return self.x[index], self.y[index]
Then, create your dataset object
traindata = MyDataset(train_x, train_y)
Finally, use DataLoader to create your mini-batches:
trainloader = torch.utils.data.DataLoader(traindata, batch_size=64, shuffle=True)
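To complete the picture, a minimal sketch of how those mini-batches might then be consumed in a training loop (the model, loss, optimizer, and epoch count below are illustrative placeholders, not part of the original answer):

import torch
import torch.nn as nn

model = nn.Linear(10, 2)                                   # placeholder model; match your feature count
criterion = nn.CrossEntropyLoss()                          # placeholder loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # placeholder optimizer

for epoch in range(10):                       # placeholder epoch count
    for batch_x, batch_y in trainloader:      # trainloader from above yields batches of 64
        optimizer.zero_grad()
        outputs = model(batch_x.float())
        loss = criterion(outputs, batch_y.long())
        loss.backward()
        optimizer.step()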