How to fix this strange error: "RuntimeError: CUDA error: out of memory"
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must keep the same license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/54374935/
Asked by xiaoding chen
I ran some deep-learning network code. First I trained the network, which worked well, but this error occurred when the code reached the validation step.
I have five epochs, and every epoch has a training phase and a validation phase. I hit the error during validation in the first epoch. If I don't run the validation code, the code can run into the second epoch with no error.
My code:
for epoch in range(10, 15):  # epochs 10 through 14
    if options["training"]["train"]:
        trainer.epoch(model, epoch)
    if options["validation"]["validate"]:
        # if epoch == 14:
        validator.epoch(model)
I suspect the validation code has a bug, but I cannot find it.
Answered by K. Khanda
The error you provided appears because you ran out of memory on your GPU. One way to solve it is to reduce the batch size until your code runs without this error.
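For example, if the data is fed through a standard PyTorch DataLoader, this is a one-line change; a minimal sketch, assuming a hypothetical validation set (the dataset shape and sizes here are illustrative, not from the question):

import torch
from torch.utils.data import DataLoader, TensorDataset

# a hypothetical validation set, standing in for the questioner's data
val_dataset = TensorDataset(torch.randn(1000, 20), torch.randint(0, 2, (1000,)))

# halve batch_size until the validation pass fits in GPU memory, e.g. 64 -> 32 -> 16
val_loader = DataLoader(val_dataset, batch_size=16, shuffle=False)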
Answered by YoungMin Park
1. When you only perform validation, not training, you don't need to calculate gradients for the forward and backward passes. In that situation, your code can be placed under torch.no_grad():
with torch.no_grad():
    ...
    net = Net()
    pred_for_validation = net(input)
    ...
The code above doesn't build the autograd graph, so no GPU memory is allocated for the intermediate results that a backward pass would need.
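Applied to the questioner's validation phase, a minimal runnable sketch might look like this (the tiny model and random data are illustrative assumptions, standing in for whatever validator.epoch does internally):

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# hypothetical model and data, just to make the sketch runnable
model = nn.Linear(20, 2).cuda()
criterion = nn.CrossEntropyLoss()
val_loader = DataLoader(
    TensorDataset(torch.randn(256, 20), torch.randint(0, 2, (256,))),
    batch_size=16)

model.eval()                   # switch dropout/batch-norm layers to eval mode
with torch.no_grad():          # don't record operations for backward
    total_loss = 0.0
    for inputs, labels in val_loader:
        inputs, labels = inputs.cuda(), labels.cuda()
        total_loss += criterion(model(inputs), labels).item()
print("validation loss:", total_loss / len(val_loader))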
2. If you use the += operator on a tensor in your code, it can continuously accumulate the gradient graph across iterations. In that case, you need to detach the value with float(), as described at the following page:
https://pytorch.org/docs/stable/notes/faq.html#my-model-reports-cuda-runtime-error-2-out-of-memory
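A minimal sketch of the pattern that FAQ entry describes (criterion, model, optimizer, and train_loader are illustrative names, not from the question):

total_loss = 0.0
for inputs, labels in train_loader:        # hypothetical training loop
    optimizer.zero_grad()
    loss = criterion(model(inputs), labels)
    loss.backward()
    optimizer.step()
    # float(loss) yields a plain Python number; accumulating the tensor
    # itself would keep every iteration's graph alive in GPU memory
    total_loss += float(loss)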
Although the docs suggest float(), in my case item() also worked:
entire_loss = 0.0
for i in range(100):
    one_loss = loss_function(prediction, label)
    entire_loss += one_loss.item()
3. If you use a for loop in your training code, the data can stay alive until the entire for loop ends. So, in that case, you can explicitly delete intermediate variables after calling optimizer.step():
for one_epoch in range(100):
    ...
    optimizer.step()
    del intermediate_variable1, intermediate_variable2, ...
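Concretely, the forward output and the loss are the kind of intermediates worth deleting; a hypothetical sketch (model, batch, target, criterion, and optimizer are illustrative stand-ins):

for one_epoch in range(100):
    output = model(batch)           # hypothetical forward pass
    loss = criterion(output, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    del output, loss                # drop references so their memory can be reused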
Answered by Alessandro Suglia
It might happen for a number of reasons, which I try to cover in the following list:
- Module parameters: check the number of dimensions of your modules. A linear layer that transforms a big input tensor (e.g., size 1000) into another big output tensor (e.g., size 1000) requires a matrix whose size is (1000, 1000); see the sketch after this list for a rough memory estimate.
- RNN decoder maximum steps: if you're using an RNN decoder in your architecture, avoid looping for a big number of steps. Usually, you fix a given number of decoding steps that is reasonable for your dataset.
- Tensors usage: minimise the number of tensors that you create. The garbage collector won't release them until they go out of scope.
- Batch size: incrementally increase your batch size until you go out of memory. It's a common trick that even famous libraries implement (see the biggest_batch_first description for the BucketIterator in AllenNLP).
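As a rough sanity check for the first point, here is a small sketch estimating the parameter memory of a hypothetical Linear(1000, 1000) layer:

import torch.nn as nn

layer = nn.Linear(1000, 1000)
# 1000*1000 weights + 1000 biases; float32 takes 4 bytes per parameter,
# and training adds more memory for gradients and optimizer state
n_params = sum(p.numel() for p in layer.parameters())
print(n_params, "parameters, about", n_params * 4 / 1e6, "MB in float32")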
In addition, I recommend having a look at the official PyTorch documentation: https://pytorch.org/docs/stable/notes/faq.html