Python: How to solve NaN loss?
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/40158633/
How to solve nan loss?
Asked by Swind D.C. Xu
Problem
I'm running a deep neural network on MNIST, where the loss is defined as follows:
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(pred, label))
The program seems to run correctly until I get a NaN loss in the 10000+th minibatch. Sometimes, the program runs correctly until it finishes. I think tf.nn.softmax_cross_entropy_with_logits is giving me this error. This is strange, because the code only contains mul and add operations.
Possible Solution
Maybe I can use:
if cost == "nan":
    optimizer = an empty optimizer
else:
    ...
    optimizer = real optimizer
But I cannot find the type of nan. How can I check whether a variable is nan or not? How else can I solve this problem?
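One concrete way to perform that check (a sketch, not from the original post; sess, optimizer, cost, feed_dict, and num_steps are assumed to come from the surrounding training code) is to fetch the loss value at every step and test it on the Python side:

import math

for step in range(num_steps):
    # sess.run returns the fetched scalar loss as a plain float,
    # so math.isnan works on it directly.
    _, cost_value = sess.run([optimizer, cost], feed_dict=feed_dict)
    if math.isnan(cost_value):
        print('NaN loss at step', step)
        break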
Answered by Ilyakom
Check your learning rate. The bigger your network, the more parameters there are to learn. That means you also need to decrease the learning rate.
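For example, lowering the learning rate is a one-line change (a sketch; GradientDescentOptimizer and the concrete values are placeholders for whatever the original training code uses):

# If training diverges to NaN, try an order of magnitude smaller learning rate,
# e.g. 0.001 instead of 0.01.
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(cost)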
Answered by demianzhang
I found a similar problem here: TensorFlow cross_entropy NaN problem.
Thanks to the author user1111929:
Writing the cross-entropy as -tf.reduce_sum(y_*tf.log(y_conv)) (which is what tf.nn.softmax_cross_entropy_with_logits effectively computes) is actually a horrible way of computing the cross-entropy. In some samples, certain classes could be excluded with certainty after a while, resulting in y_conv=0 for that sample. That's normally not a problem since you're not interested in those, but in the way cross_entropy is written there, it yields 0*log(0) for that particular sample/class. Hence the NaN.
Replacing it with
cross_entropy = -tf.reduce_sum(y_*tf.log(y_conv + 1e-10))
or
cross_entropy = -tf.reduce_sum(y_*tf.log(tf.clip_by_value(y_conv,1e-10,1.0)))
solved the NaN problem.
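For reference, a minimal sketch of the clipped formulation plugged into the asker's setup (assuming pred holds the raw logits and label is one-hot, as in the question; the intermediate names are illustrative):

# Convert logits to probabilities, then clip before taking the log so that
# log(0) can never occur and the loss stays finite.
probs = tf.nn.softmax(pred)
cross_entropy = -tf.reduce_sum(
    label * tf.log(tf.clip_by_value(probs, 1e-10, 1.0)), reduction_indices=[1])
cost = tf.reduce_mean(cross_entropy)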
Answered by Greg K
The reason you are getting NaNs is most likely that somewhere in your cost function or softmax you are trying to take the log of zero, which is not a number. But to answer your specific question about detecting NaN, Python has a built-in capability to test for NaN in the math module. For example:
import math

val = float('nan')
if math.isnan(val):
    print('Detected NaN')
    import pdb; pdb.set_trace()  # Break into debugger to look around
Answered by Fematich
I don't have your code or data. But tf.nn.softmax_cross_entropy_with_logits should be stable with a valid probability distribution (more info here). I assume your data does not meet this requirement. An analogous problem was also discussed here. That would lead you to either:
1. Implement your own softmax_cross_entropy_with_logits function, e.g. try (source):

epsilon = tf.constant(value=0.00001, shape=shape)
logits = logits + epsilon
softmax = tf.nn.softmax(logits)
cross_entropy = -tf.reduce_sum(labels * tf.log(softmax), reduction_indices=[1])

2. Update your data so that it does have a valid probability distribution (see the sketch below).
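For the second option, a minimal sketch of what a valid probability distribution means for dense labels (the tensor name label is taken from the question; the normalization itself is an illustrative assumption, not code from the answer): every label row should be non-negative and sum to 1.

# Clamp away negative entries, then rescale each row so it sums to 1,
# turning every label row into a proper probability distribution.
label = tf.maximum(label, 0.0)
label = label / tf.reduce_sum(label, reduction_indices=[1], keep_dims=True)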