Python 如何在 TensorFlow 中调试 NaN 值?
声明：本页面是 StackOverflow 热门问题的中英对照翻译，遵循 CC BY-SA 4.0 协议。如果您需要使用它，必须同样遵循 CC BY-SA 许可，注明原文地址和作者信息，并将其归于原作者（不是我）：StackOverflow
原文地址: http://stackoverflow.com/questions/38810424/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me):
StackOverflow
How does one debug NaN values in TensorFlow?
提问 by Pinocchio
I was running TensorFlow and I happen to have something yielding a NaN. I'd like to know what it is but I do not know how to do this. The main issue is that in a "normal" procedural program I would just write a print statement just before the operation is executed. The issue with TensorFlow is that I cannot do that because I first declare (or define) the graph, so adding print statements to the graph definition does not help. Are there any rules, advice, heuristics, anything to track down what might be causing the NaN?
我正在运行 TensorFlow,我碰巧有一些产生 NaN 的东西。我想知道它是什么,但我不知道如何做到这一点。主要问题是,在“正常”过程程序中,我只会在执行操作之前编写一个打印语句。TensorFlow 的问题是我不能这样做,因为我首先声明(或定义)了图形,因此向图形定义添加打印语句无济于事。是否有任何规则、建议、启发式方法或任何方法可以追踪可能导致 NaN 的原因?
In this case I know more precisely what line to look at because I have the following:
在这种情况下,我更准确地知道要查看哪一行,因为我有以下几点:
Delta_tilde = 2.0*tf.matmul(x,W) - tf.add(WW, XX) # note: this quantity should always be positive because it's a pair-wise Euclidean distance
Z = tf.sqrt(Delta_tilde)
Z = Transform(Z) # potentially some transform, currently I have it to return Z for debugging (the identity)
Z = tf.pow(Z, 2.0)
A = tf.exp(Z)
When this line is present, it returns NaN, as reported by my summary writers. Why is this? Is there a way to at least inspect the value of Z after the square root is taken?
当这一行出现时，它会返回 NaN（由我的 summary writer 报告）。为什么会这样？有没有办法至少查看 Z 在开平方之后的值？
For the specific example I posted, I tried tf.Print(0, Z), but with no success; it printed nothing. As in:
对于我发布的特定示例，我尝试过 tf.Print(0, Z)，但没有成功，它什么也没打印。如下：
Delta_tilde = 2.0*tf.matmul(x,W) - tf.add(WW, XX) # note: this quantity should always be positive because it's a pair-wise Euclidean distance
Z = tf.sqrt(Delta_tilde)
tf.Print(0,[Z]) # <-------- TF PRINT STATEMENT
Z = Transform(Z) # potentially some transform, currently I have it to return Z for debugging (the identity)
Z = tf.pow(Z, 2.0)
A = tf.exp(Z)
I actually don't understand what tf.Print is supposed to do. Why does it need two arguments? If I want to print one tensor, why would I need to pass two? Seems bizarre to me.
我实际上不明白 tf.Print 应该做什么。为什么它需要两个参数？如果我只想打印一个张量，为什么需要传入两个？这对我来说很奇怪。
I was looking at the function tf.add_check_numerics_ops(), but it doesn't say how to use it (plus the docs don't seem super helpful). Does anyone know how to use this?
我查看了函数 tf.add_check_numerics_ops()，但文档没有说明如何使用它（而且文档似乎帮助不大）。有谁知道怎么使用它？
Since I've had comments suggesting the data might be bad: I am using standard MNIST. However, I am computing a quantity that is positive (the pair-wise Euclidean distance) and then square rooting it. Thus, I don't see how the data specifically would be an issue.
由于有评论提出数据可能有问题：我使用的是标准 MNIST。但是，我计算的是一个正的量（成对欧几里得距离），然后对它开平方根。因此，我看不出数据本身怎么会是问题。
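A note on how clean data like MNIST can still yield NaN here: the pair-wise distance expansion is only non-negative in exact arithmetic, and floating-point round-off can push it slightly below zero, so tf.sqrt produces NaN, which then propagates through every later op until the summary writer reports it. A minimal pure-Python sketch of that propagation, plus an illustrative clamp guard (safe_sqrt is an assumption for illustration, not from the original code):

```python
import math

# In exact arithmetic a pair-wise squared distance is never negative, but
# floating-point round-off can push it just below zero, and sqrt of a
# negative number is NaN. Once produced, NaN propagates through pow, exp, ...
z = float("nan")          # stand-in for sqrt of a slightly negative distance
z = z ** 2.0              # still NaN
a = math.exp(z)           # still NaN -- this is what the summaries report
print(math.isnan(a))      # True

# Illustrative guard (an assumption, not part of the original code): clamp
# the squared distance at zero, optionally adding a tiny epsilon, before sqrt.
def safe_sqrt(d, eps=1e-12):
    return math.sqrt(max(d, 0.0) + eps)

print(safe_sqrt(-1e-16))  # a tiny positive number instead of NaN
```

The clamp changes the result by at most sqrt(eps), which is negligible next to genuine distances but removes the NaN entirely.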
回答 by Phillip Bock
There are a couple of reasons why you can get a NaN result. Often it is because of too high a learning rate, but plenty of other reasons are possible, for example corrupt data in your input queue or a log(0) calculation.
得到 NaN 结果的原因有好几种。通常是因为学习率太高，但也可能有很多其他原因，例如输入队列中的损坏数据，或者计算了 log(0)。
Anyhow, the kind of print-debugging you describe cannot be done with a simple print (as this would only print the tensor information inside the graph, not any actual values).
无论如何，您描述的那种打印调试不能用简单的 print 来完成（因为那只会打印图中的张量信息，而不会打印任何实际值）。
However, if you use tf.Print as an op when building the graph, then when the graph gets executed you will get the actual values printed (and it IS a good exercise to watch these values to debug and understand the behavior of your net).
但是，如果您在构建图时把 tf.Print 用作一个操作，那么当图被执行时就会打印出实际的值（观察这些值来调试并理解你的网络的行为是一个很好的练习）。
However, you are not using the print statement in quite the correct manner. It is an op, so you need to pass it a tensor and use the result tensor later on in the executing graph. Otherwise the op is not going to be executed and no printing occurs. Try this:
但是，您使用打印语句的方式不完全正确。它是一个操作，因此您需要向它传入一个张量，并在之后的执行图中使用它返回的结果张量。否则该操作不会被执行，也不会打印任何内容。试试这个：
Z = tf.sqrt(Delta_tilde)
Z = tf.Print(Z, [Z], message="my Z-values:") # <-------- TF PRINT STATEMENT
Z = Transform(Z) # potentially some transform, currently I have it to return Z for debugging (the identity)
Z = tf.pow(Z, 2.0)
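The dataflow semantics behind this can be mimicked in a few lines of plain Python (an illustrative toy, not TensorFlow's actual implementation): a node's function only runs when it lies on the path to the fetched result, which is why an unthreaded tf.Print(0, [Z]) never prints anything.

```python
# Toy deferred-execution graph (illustrative only, not TensorFlow internals):
# a node's function runs only when it is on the path to the fetched result.
class Node:
    def __init__(self, fn, *deps):
        self.fn, self.deps = fn, deps

    def run(self):
        return self.fn(*[d.run() for d in self.deps])

log = []
x = Node(lambda: 4.0)
z = Node(lambda v: v ** 0.5, x)                 # like Z = tf.sqrt(...)
orphan = Node(lambda v: log.append(v) or v, z)  # like tf.Print(0, [Z]) -- result unused
z2 = Node(lambda v: log.append(v) or v, z)      # like Z = tf.Print(Z, [Z]) -- threaded
out = Node(lambda v: v ** 2.0, z2)              # like Z = tf.pow(Z, 2.0)

out.run()
print(log)  # [2.0] -- only the threaded print ran; the orphan never executed
```

The orphan node is built but never requested, so its function (the "print") simply never fires, exactly as with the original tf.Print(0, [Z]).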
回答 by Lerner Zhang
I used to find it much tougher to pinpoint where the NaNs and Infs may occur than to fix the bug. As a complement to @scai's answer, I'd like to add some points here:
我曾经发现，确定 NaN 和 Inf 可能出现的位置比修复错误本身要困难得多。作为对 @scai 回答的补充，我想在这里补充几点：
The debug module, which you can import via:
调试模块（可以通过以下方式导入）：
from tensorflow.python import debug as tf_debug
is much better than any print or assert.
比任何打印或断言要好得多。
You can add the debug function simply by wrapping your session:
您只需将会话包裹起来即可添加调试功能：
sess = tf_debug.LocalCLIDebugWrapperSession(sess)
sess.add_tensor_filter("has_inf_or_nan", tf_debug.has_inf_or_nan)
This will bring up a command-line interface, where you enter:
run -f has_inf_or_nan
and lt -f has_inf_or_nan
to find where the NaNs or Infs are. The first hit is the first place the catastrophe occurs, and from the variable name you can trace its origin in your code.
这会弹出一个命令行界面，然后输入：
run -f has_inf_or_nan
并lt -f has_inf_or_nan
来找到 NaN 或 Inf 出现的位置。第一个命中就是灾难最先发生的地方，通过变量名可以在代码中追溯其来源。
Reference: https://developers.googleblog.com/2017/02/debug-tensorflow-models-with-tfdbg.html
参考：https://developers.googleblog.com/2017/02/debug-tensorflow-models-with-tfdbg.html
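The has_inf_or_nan filter itself boils down to a simple elementwise test. A rough pure-Python analogue of that test (illustrative; the real filter operates on tensor values, not Python lists):

```python
import math

def has_inf_or_nan(values):
    """Rough pure-Python analogue of tfdbg's has_inf_or_nan tensor filter:
    true when any element of a flat list of floats is NaN or infinite."""
    return any(math.isnan(v) or math.isinf(v) for v in values)

print(has_inf_or_nan([1.0, 2.0, float("nan")]))  # True
print(has_inf_or_nan([1.0, float("inf")]))       # True
print(has_inf_or_nan([1.0, 2.0, 3.0]))           # False
```

tfdbg applies this kind of predicate to every intermediate tensor of a run, which is what makes `run -f has_inf_or_nan` stop at the offending step.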
回答 by chasep255
It looks like you can call it after you are done building the graph.
看起来你可以在构建完图之后调用它。
check = tf.add_check_numerics_ops()
I think this will add a check for all floating point operations. Then in the session's run function you can add the check operation.
我认为这会为所有浮点运算添加检查。然后在会话的 run 函数中，您可以添加这个检查操作。
sess.run([check, ...])
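Conceptually, each check op that tf.add_check_numerics_ops() inserts validates the output of a floating-point op and aborts the run, naming the offender, as soon as a non-finite value appears. A pure-Python sketch of that behavior (illustrative only, not the real implementation):

```python
import math

def check_numerics(value, message):
    """Sketch of a check op: raise as soon as a non-finite value appears,
    naming the offending computation so the failure site is obvious."""
    if math.isnan(value) or math.isinf(value):
        raise ValueError(f"{message}: found {value}")
    return value

print(check_numerics(4.0, "Z after sqrt"))       # 4.0 passes through unchanged
try:
    check_numerics(float("nan"), "Z after sqrt")
except ValueError as e:
    print(e)  # Z after sqrt: found nan
```

Because the check runs at every guarded op, the error message points at the first bad op rather than at the end of the graph where the NaN finally surfaces.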
回答 by Shanqing Cai
As of version 0.12, TensorFlow is shipped with a built-in debugger called tfdbg. It optimizes the workflow of debugging bad-numerical-value issues of this type (like inf and nan). The documentation is at:
https://www.tensorflow.org/programmers_guide/debugger
从 0.12 版开始，TensorFlow 附带了一个名为 tfdbg 的内置调试器。它优化了调试这类错误数值问题（如 inf 和 nan）的工作流程。文档位于：
https://www.tensorflow.org/programmers_guide/debugger
回答 by Alex Joz
First of all, you need to check your input data properly. In most cases this is the reason. But not always, of course.
首先,您需要检查您输入的数据是否正确。在大多数情况下,这就是原因。但并非总是如此,当然。
I usually use Tensorboard to see what's happening during training, so you can see the values at each step with:
我通常使用 Tensorboard 来查看训练过程中发生的情况，这样你就可以看到每一步的值：
Z = tf.pow(Z, 2.0)
summary_z = tf.scalar_summary('z', Z)
#etc..
summary_merge = tf.merge_all_summaries()
#on each desired step save:
summary_str = sess.run(summary_merge)
summary_writer.add_summary(summary_str, i)
Also you can simply eval and print the current value:
您也可以简单地评估并打印当前值:
print(sess.run(Z))
回答 by Yuq Wang
The current implementation of tfdbg.has_inf_or_nan seems not to break immediately on hitting any tensor containing a NaN. When it does stop, the huge list of tensors displayed is not sorted in order of execution. A possible hack to find the first appearance of NaNs is to dump all tensors to a temporary directory and inspect them afterwards. Here is a quick-and-dirty example to do that. (Assuming the NaNs appear in the first few runs.)
tfdbg.has_inf_or_nan 目前的实现似乎不会在遇到包含 NaN 的张量时立即中断。当它停止时，显示的巨大张量列表并没有按执行顺序排序。找到 NaN 第一次出现位置的一个可行技巧是：把所有张量转储到一个临时目录，之后再检查。这里有一个快速而粗糙的例子。（假设 NaN 出现在前几次运行中。）
回答 by sOvr9000
I was able to fix my NaN issues by getting rid of all of my dropout layers in the network model. I suspected that maybe for some reason a unit (neuron?) in the network lost too many input connections (so it had zero after the dropout), so then when information was fed through, it had a value of NaN. I don't see how that could happen over and over again with dropout=0.8 on layers with more than a hundred units each, so the problem was probably fixed for a different reason. Either way, commenting out the dropout layers fixed my issue.
通过去掉网络模型中所有的 dropout 层，我解决了我的 NaN 问题。我怀疑可能出于某种原因，网络中的某个单元（神经元？）丢失了太多输入连接（因此在 dropout 之后输出为零），于是当信息继续传递时就产生了 NaN 值。我不明白在每层有一百多个单元、dropout=0.8 的情况下这怎么会一再发生，所以问题很可能是出于别的原因被解决的。无论如何，注释掉 dropout 层解决了我的问题。
EDIT: Oops! I realized that I added a dropout layer after my final output layer which consists of three units. Now that makes more sense. So, don't do that!
编辑:哎呀!我意识到我在由三个单元组成的最终输出层之后添加了一个 dropout 层。现在这更有意义。所以,不要那样做!
回答 by fxtentacle
For TensorFlow 2, inject some x = tf.debugging.check_numerics(x, 'x is nan') into your code. It will throw an InvalidArgument error if x has any values that are not a number (NaN) or infinity (Inf).
对于 TensorFlow 2，在代码中插入一些 x = tf.debugging.check_numerics(x, 'x is nan')。如果 x 含有任何非数字（NaN）或无穷大（Inf）的值，它就会抛出 InvalidArgument 错误。
Oh and for the next person finding this when hunting a TF2 NaN issue, my case turned out to be an exploding gradient. The gradient itself got to 1e+20, which was not quite NaN yet, but adding that to the variable then turned out too big. The diagnosis that I did was
哦,对于下一个在寻找 TF2 NaN 问题时发现这个问题的人,我的情况证明是一个爆炸梯度。梯度本身达到了 1e+20,这还不是 NaN,但是将其添加到变量中却变得太大了。我所做的诊断是
gradients = tape.gradient(loss, training_variables)
for g, v in zip(gradients, training_variables):
    tf.print(v.name, tf.reduce_max(g))
optimizer.apply_gradients(zip(gradients, training_variables))
which revealed the overly large numbers. Running the exact same network on CPU worked fine, but it failed on the GTX 1080 TI in my workstation, thus making a CUDA numerical stability issue likely as the root cause. But since it only occurred sometimes, I duct-taped the whole thing by going with:
这揭示了过大的数值。完全相同的网络在 CPU 上运行良好，但在我工作站的 GTX 1080 TI 上却失败了，因此根本原因很可能是 CUDA 的数值稳定性问题。但由于它只是偶尔发生，我就用下面的权宜之计把整个问题应付过去了：
gradients = tape.gradient(loss, training_variables)
gradients = [tf.clip_by_norm(g, 10.0) for g in gradients]
optimizer.apply_gradients(zip(gradients, training_variables))
which will just clip exploding gradients to a sane value. For a network where gradients are always high that wouldn't help, but since the magnitudes were high only sporadically, this fixed the problem and now the network also trains nicely on GPU.
这只会把爆炸的梯度裁剪到一个合理的值。对于梯度始终很高的网络，这无济于事；但由于数值只是偶尔偏大，这解决了问题，现在网络在 GPU 上也能很好地训练。
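The clipping step above is just a vector rescale. A pure-Python sketch of the math behind tf.clip_by_norm (illustrative, operating on flat lists of floats instead of tensors):

```python
import math

def clip_by_norm(vec, clip_norm):
    """Sketch of tf.clip_by_norm's math: rescale the vector so its L2 norm
    equals clip_norm when it exceeds that bound, leave it untouched otherwise."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm <= clip_norm:
        return list(vec)
    scale = clip_norm / norm
    return [v * scale for v in vec]

print(clip_by_norm([3.0, 4.0], 10.0))    # norm 5 <= 10, untouched
print(clip_by_norm([30.0, 40.0], 10.0))  # norm 50 > 10, rescaled to norm 10
```

Because the direction of the gradient is preserved and only its magnitude is capped, the occasional 1e+20 spike gets tamed without disturbing the ordinary training steps.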