Python: How to solve nan loss?

Disclaimer: this page is a translated mirror of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/40158633/

How to solve nan loss?

python, tensorflow, nan

Asked by Swind D.C. Xu

Problem

I'm running a deep neural network on MNIST, where the loss is defined as follows:

cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(pred, label))

The program seems to run correctly until I get a nan loss in the 10000+th minibatch. Sometimes the program runs correctly until it finishes. I think tf.nn.softmax_cross_entropy_with_logits is giving me this error. This is strange, because the code just contains mul and add operations.

Possible Solution

Maybe I can use:

if cost == "nan":
  optimizer = an empty optimizer 
else:
  ...
  optimizer = real optimizer

But I cannot find the type of nan. How can I check whether a variable is nan or not?

How else can I solve this problem?

Answered by Ilyakom

Check your learning rate. The bigger your network, the more parameters there are to learn. That means you also need to decrease the learning rate.

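With the TF 1.x-style setup from the question, that just means passing a smaller value to the optimizer. A minimal sketch, assuming the cost tensor from the question and a plain gradient-descent optimizer (the exact value is something to tune, not a recommendation):

# Hypothetical training op; 1e-4 is only an illustrative, smaller learning rate.
learning_rate = 1e-4
train_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)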

Answered by demianzhang

I found a similar problem here: TensorFlow cross_entropy NaN problem

Thanks to the author user1111929:

tf.nn.softmax_cross_entropy_with_logits => -tf.reduce_sum(y_*tf.log(y_conv))

is actually a horrible way of computing the cross-entropy. In some samples, certain classes could be excluded with certainty after a while, resulting in y_conv=0 for that sample. That's normally not a problem since you're not interested in those, but in the way cross_entropy is written there, it yields 0*log(0) for that particular sample/class. Hence the NaN.

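A quick numerical illustration of that 0*log(0) case (plain NumPy, not part of the original answer):

import numpy as np
# log(0) is -inf, and 0 * -inf is nan -- exactly the term the sum picks up
# once y_conv hits exactly 0 for some class.
print(np.float64(0.0) * np.log(np.float64(0.0)))   # nan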

Replacing it with

cross_entropy = -tf.reduce_sum(y_*tf.log(y_conv + 1e-10))

Or

cross_entropy = -tf.reduce_sum(y_*tf.log(tf.clip_by_value(y_conv,1e-10,1.0)))

solved the nan problem.

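Plugged back into the question's code, a minimal sketch might look like this (assuming pred holds the raw logits and label the one-hot targets, as in the question; TF 1.x API):

y_conv = tf.nn.softmax(pred)   # explicit softmax instead of the fused op
# Clip so log() never sees an exact zero, then average over the batch.
cross_entropy = -tf.reduce_sum(label * tf.log(tf.clip_by_value(y_conv, 1e-10, 1.0)), axis=1)
cost = tf.reduce_mean(cross_entropy)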

Answered by Greg K

The reason you are getting NaNs is most likely that somewhere in your cost function or softmax you are trying to take the log of zero, which is not a number. But to answer your specific question about detecting NaN, Python has built-in support for testing for NaN in the math module. For example:

import math
val = float('nan')
if math.isnan(val):
    print('Detected NaN')
    import pdb; pdb.set_trace()  # Break into debugger to look around
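
In a TF 1.x training loop the fetched loss comes back as a plain Python/NumPy float, so the same check can guard each training step. A small sketch, with sess, optimizer, cost and feed_dict assumed to be the ones from your own loop (names are illustrative):

import math
_, loss_value = sess.run([optimizer, cost], feed_dict=feed_dict)
if math.isnan(loss_value):
    print('Detected NaN loss -- stop and inspect the batch / learning rate')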

Answered by Fematich

I don't have your code or data. But tf.nn.softmax_cross_entropy_with_logits should be stable with a valid probability distribution (more info here). I assume your data does not meet this requirement. An analogous problem was also discussed here. That leads you to either:

  1. Implement your own softmax_cross_entropy_with_logits function, e.g. try (source):

    # `shape` and the `logits`/`labels` tensors are assumed to come from the
    # surrounding code (`shape` matching the shape of `logits`).
    epsilon = tf.constant(value=0.00001, shape=shape)
    logits = logits + epsilon
    softmax = tf.nn.softmax(logits)
    cross_entropy = -tf.reduce_sum(labels * tf.log(softmax), reduction_indices=[1])
    
  2. Update your data so that it does have a valid probability distribution (see the sketch after this list).

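For option 2, a minimal sanity-check/normalization sketch in NumPy (assuming labels is a 2-D float array of non-negative class scores per example; purely illustrative):

import numpy as np
row_sums = labels.sum(axis=1, keepdims=True)
assert np.all(row_sums > 0), 'every example needs at least one positive label weight'
labels = labels / row_sums   # each row now sums to 1, i.e. a valid distribution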