Python: Why do we need to call zero_grad() in PyTorch?
Notice: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/48001598/
Why do we need to call zero_grad() in PyTorch?
Asked by user1424739
The method zero_grad() needs to be called during training, but the documentation is not very helpful:
| zero_grad(self)
| Sets gradients of all model parameters to zero.
Why do we need to call this method?
Answered by kmario23
In PyTorch, we need to set the gradients to zero before starting backpropagation because PyTorch accumulates the gradients on subsequent backward passes. This is convenient when training RNNs. So the default action is to accumulate (i.e. sum) the gradients on every loss.backward() call.
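The accumulation described above can be seen on a single parameter. This is a minimal sketch, assuming a recent PyTorch version; the scalar w, its starting value 2.0, and the helper loss_fn are made up for illustration and are not part of the original answer.

```python
import torch

# A single scalar parameter; the starting value 2.0 is arbitrary
w = torch.tensor(2.0, requires_grad=True)

def loss_fn(w):
    # d(loss)/dw = 2 * w, so the gradient at w = 2.0 is 4.0
    return w ** 2

loss_fn(w).backward()
grad_after_first = w.grad.item()    # 4.0

# Second backward pass without zeroing: the new gradient is added to the old one
loss_fn(w).backward()
grad_after_second = w.grad.item()   # 8.0, not 4.0

# Zero the stored gradient so the next backward pass starts from scratch
w.grad.zero_()
loss_fn(w).backward()
grad_after_zeroing = w.grad.item()  # 4.0

print(grad_after_first, grad_after_second, grad_after_zeroing)
```

The parameter never changes here, so the true gradient is 4.0 every time; the 8.0 comes purely from the stored value in w.grad being summed with the new one.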
Because of this, when you start your training loop, you should ideally zero out the gradients so that the parameter update is computed correctly. Otherwise the gradient would be a combination of the old gradients, which have already been used to update the parameters, and the newly computed gradient, and it would point in some direction other than the intended one towards the minimum (or maximum, in the case of maximization objectives).
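To make the effect on the update direction concrete, here is a small sketch that minimises a toy objective with and without zeroing. The objective (w - 3)^2, the learning rate, the step count, and the helper run are all assumptions chosen for illustration.

```python
import torch

def run(zero_each_step, steps=50, lr=0.1):
    """Minimise (w - 3)^2 with plain gradient descent; return every w visited."""
    w = torch.tensor(0.0, requires_grad=True)
    history = []
    for _ in range(steps):
        # .grad is None until the first backward() call
        if zero_each_step and w.grad is not None:
            w.grad.zero_()
        loss = (w - 3.0) ** 2
        loss.backward()
        # update the parameter without recording the update in the graph
        with torch.no_grad():
            w -= lr * w.grad
        history.append(w.item())
    return history

with_zeroing = run(zero_each_step=True)
without_zeroing = run(zero_each_step=False)

print(with_zeroing[-1])      # close to the minimum at 3
print(max(without_zeroing))  # overshoots well past 3
```

With zeroing, each step uses only the current gradient and w approaches 3 steadily. Without it, every step is driven by the sum of all past gradients, so w keeps being pushed in the old direction even after it has passed the minimum.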
Here is a simple example:
import torch
import torch.optim as optim

def linear_model(x, W, b):
    return torch.matmul(x, W) + b

data, targets = ...

# Variable is deprecated since PyTorch 0.4; plain tensors track gradients
W = torch.randn(4, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

optimizer = optim.Adam([W, b])

for sample, target in zip(data, targets):
    # clear out the gradients of all tensors
    # in this optimizer (i.e. W, b)
    optimizer.zero_grad()
    output = linear_model(sample, W, b)
    # reduce to a scalar: backward() without arguments needs a scalar loss
    loss = ((output - target) ** 2).sum()
    loss.backward()
    optimizer.step()
Alternatively, if you're doing vanilla gradient descent, then:
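W = torch.randn(4, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
learning_rate = 0.01  # assumed value; the original leaves it undefined

for sample, target in zip(data, targets):
    # clear out the gradients of W and b;
    # .grad is None until the first backward() call
    if W.grad is not None:
        W.grad.zero_()
        b.grad.zero_()
    output = linear_model(sample, W, b)
    # reduce to a scalar: backward() without arguments needs a scalar loss
    loss = ((output - target) ** 2).sum()
    loss.backward()
    # update in place without recording the update in the autograd graph
    with torch.no_grad():
        W -= learning_rate * W.grad
        b -= learning_rate * b.grad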
Note: The accumulation (i.e. sum) of gradients happens when .backward() is called on the loss tensor.
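Because the accumulation happens inside .backward(), several backward passes over parts of a batch can add up to the same gradient as one pass over the whole batch. The sketch below checks this; the tensor shapes, the seed, and the summed squared-error loss are assumptions for illustration, not something the original answer specifies.

```python
import torch

torch.manual_seed(0)   # arbitrary seed, only for reproducibility
x = torch.randn(8, 4)  # made-up inputs
y = torch.randn(8, 3)  # made-up targets
W = torch.randn(4, 3, requires_grad=True)

# One backward pass over the whole batch
((x @ W - y) ** 2).sum().backward()
full_batch_grad = W.grad.clone()

# Reset, then run two backward passes over half the batch each
W.grad.zero_()
for xs, ys in ((x[:4], y[:4]), (x[4:], y[4:])):
    ((xs @ W - ys) ** 2).sum().backward()  # each call adds into W.grad
accumulated_grad = W.grad.clone()

grads_match = torch.allclose(full_batch_grad, accumulated_grad, atol=1e-4)
print(grads_match)
```

This only holds when the loss is a sum over samples; with a mean-reduced loss each partial loss would need to be rescaled before calling .backward().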