Python: How to solve NaN loss?
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/40158633/
How to solve nan loss?
Asked by Swind D.C. Xu
Problem
I'm running a deep neural network on MNIST, where the loss is defined as follows:
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(pred, label))
The program seems to run correctly until I get a NaN loss in the 10000+th minibatch. Sometimes, the program runs correctly until it finishes. I think tf.nn.softmax_cross_entropy_with_logits is giving me this error. This is strange, because the code only contains mul and add operations.
Possible Solution
Maybe I can use:
if cost == "nan":
    optimizer = an empty optimizer
else:
    ...
    optimizer = real optimizer
But I cannot find the type of nan. How can I check whether a variable is nan or not? How else can I solve this problem?
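One concrete way to perform that check (a sketch, not from the original post; sess, optimizer, cost, feed_dict, and num_steps are assumed to come from the surrounding training code) is to fetch the loss value at every step and test it on the Python side:

import math

for step in range(num_steps):
    # sess.run returns the fetched scalar loss as a plain float,
    # so math.isnan works on it directly.
    _, cost_value = sess.run([optimizer, cost], feed_dict=feed_dict)
    if math.isnan(cost_value):
        print('NaN loss at step', step)
        break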
Answered by Ilyakom
Check your learning rate. The bigger your network, the more parameters there are to learn. That means you also need to decrease the learning rate.
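For example, lowering the learning rate is a one-line change (a sketch; GradientDescentOptimizer and the concrete values are placeholders for whatever the original training code uses):

# If training diverges to NaN, try an order of magnitude smaller learning rate,
# e.g. 0.001 instead of 0.01.
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(cost)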
Answered by demianzhang
I found a similar problem here: TensorFlow cross_entropy NaN problem.
Thanks to the author user1111929:
Writing the cross-entropy as -tf.reduce_sum(y_*tf.log(y_conv)) (which is what tf.nn.softmax_cross_entropy_with_logits effectively computes) is actually a horrible way of computing the cross-entropy. In some samples, certain classes could be excluded with certainty after a while, resulting in y_conv=0 for that sample. That's normally not a problem since you're not interested in those, but in the way cross_entropy is written there, it yields 0*log(0) for that particular sample/class. Hence the NaN.
Replacing it with
cross_entropy = -tf.reduce_sum(y_*tf.log(y_conv + 1e-10))
or
cross_entropy = -tf.reduce_sum(y_*tf.log(tf.clip_by_value(y_conv,1e-10,1.0)))
solved the NaN problem.
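For reference, a minimal sketch of the clipped formulation plugged into the asker's setup (assuming pred holds the raw logits and label is one-hot, as in the question; the intermediate names are illustrative):

# Convert logits to probabilities, then clip before taking the log so that
# log(0) can never occur and the loss stays finite.
probs = tf.nn.softmax(pred)
cross_entropy = -tf.reduce_sum(
    label * tf.log(tf.clip_by_value(probs, 1e-10, 1.0)), reduction_indices=[1])
cost = tf.reduce_mean(cross_entropy)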
Answered by Greg K
The reason you are getting NaNs is most likely that somewhere in your cost function or softmax you are trying to take the log of zero, which is not a number. But to answer your specific question about detecting NaN, Python has a built-in capability to test for NaN in the math module. For example:
import math

val = float('nan')
if math.isnan(val):
    print('Detected NaN')
    import pdb; pdb.set_trace()  # Break into debugger to look around
Answered by Fematich
I don't have your code or data. But tf.nn.softmax_cross_entropy_with_logits should be stable with a valid probability distribution (more info here). I assume your data does not meet this requirement. An analogous problem was also discussed here. That would lead you to either:
1. Implement your own softmax_cross_entropy_with_logits function, e.g. try (source):

epsilon = tf.constant(value=0.00001, shape=shape)
logits = logits + epsilon
softmax = tf.nn.softmax(logits)
cross_entropy = -tf.reduce_sum(labels * tf.log(softmax), reduction_indices=[1])

2. Update your data so that it does have a valid probability distribution (see the sketch below).
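For the second option, a minimal sketch of what a valid probability distribution means for dense labels (the tensor name label is taken from the question; the normalization itself is an illustrative assumption, not code from the answer): every label row should be non-negative and sum to 1.

# Clamp away negative entries, then rescale each row so it sums to 1,
# turning every label row into a proper probability distribution.
label = tf.maximum(label, 0.0)
label = label / tf.reduce_sum(label, reduction_indices=[1], keep_dims=True)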