NaN loss when training a regression network in Python
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/37232782/
NaN loss when training regression network
Asked by The_Anomaly
I have a data matrix in "one-hot encoding" (all ones and zeros) with 260,000 rows and 35 columns. I am using Keras to train a simple neural network to predict a continuous variable. The code to make the network is the following:
# Imports needed for this snippet (Keras 1.x API, matching the lr/nb_epoch arguments below)
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.optimizers import SGD
from keras.callbacks import EarlyStopping

model = Sequential()
model.add(Dense(1024, input_shape=(n_train,)))
model.add(Activation('relu'))
model.add(Dropout(0.1))
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.1))
model.add(Dense(256))
model.add(Activation('relu'))
model.add(Dropout(0.1))
model.add(Dense(1))
sgd = SGD(lr=0.01, nesterov=True);
#rms = RMSprop()
#model.compile(loss='categorical_crossentropy', optimizer=rms, metrics=['accuracy'])
model.compile(loss='mean_absolute_error', optimizer=sgd)
model.fit(X_train, Y_train, batch_size=32, nb_epoch=3, verbose=1, validation_data=(X_test,Y_test), callbacks=[EarlyStopping(monitor='val_loss', patience=4)] )
However, during the training process, I see the loss decrease nicely, but during the middle of the second epoch, it goes to nan:
Train on 260000 samples, validate on 64905 samples
Epoch 1/3
260000/260000 [==============================] - 254s - loss: 16.2775 - val_loss: 13.4925
Epoch 2/3
88448/260000 [=========>....................] - ETA: 161s - loss: nan
I tried using RMSProp instead of SGD, I tried tanh instead of relu, and I tried with and without dropout, all to no avail. I tried a smaller model, i.e. with only one hidden layer, and had the same issue (it becomes nan at a different point). However, it does work with fewer features, i.e. with only 5 columns, and gives quite good predictions. There seems to be some kind of overflow, but I can't imagine why, since the loss is not unreasonably large at all.
Python version 2.7.11, running on a Linux machine, CPU only. I tested it with the latest version of Theano and I also get NaNs, so I tried going back to Theano 0.8.2 and have the same problem. The latest version of Keras has the same problem, and so does version 0.3.2.
Answered by 1''
Regression with neural networks is hard to get working because the output is unbounded, so you are especially prone to the exploding gradients problem (the likely cause of the NaNs).
Historically, one key solution to exploding gradients was to reduce the learning rate, but with the advent of per-parameter adaptive learning rate algorithms like Adam, you no longer need to set a learning rate to get good performance. There is very little reason to use SGD with momentum anymore unless you're a neural network fiend and know how to tune the learning schedule.
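As a rough sketch of that suggestion (assuming the Keras 1.x API used elsewhere in this thread, and the questioner's model and loss), compiling with Adam at its default settings would look like:

from keras.optimizers import Adam

# Adam adapts a per-parameter learning rate; the defaults are usually a reasonable starting point
model.compile(loss='mean_absolute_error', optimizer=Adam())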
Here are some things you could potentially try:
Normalize your outputs by quantile normalizing or z-scoring. To be rigorous, compute this transformation on the training data, not on the entire dataset. For example, with quantile normalization, if an example is in the 60th percentile of the training set, it gets a value of 0.6. (You can also shift the quantile-normalized values down by 0.5 so that the 0th percentile is -0.5 and the 100th percentile is +0.5.) A minimal z-scoring sketch follows this list.
Add regularization, either by increasing the dropout rate or by adding L1 and L2 penalties to the weights. L1 regularization is analogous to feature selection, and since you said that reducing the number of features to 5 gives good performance, L1 may help as well.
If these still don't help, reduce the size of your network. This is not always the best idea since it can harm performance, but in your case you have a large number of first-layer neurons (1024) relative to input features (35) so it may help.
Increase the batch size from 32 to 128. 128 is fairly standard and could potentially increase the stability of the optimization.
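A minimal sketch of the z-scoring option, assuming Y_train and Y_test are the NumPy target arrays from the question; the statistics are computed on the training split only:

import numpy as np

# Fit the scaling on the training targets, then apply the same transform to the test targets
y_mean, y_std = np.mean(Y_train), np.std(Y_train)
Y_train_scaled = (Y_train - y_mean) / y_std
Y_test_scaled = (Y_test - y_mean) / y_std

# Predictions come back in the scaled space and must be mapped back to the original units
preds = model.predict(X_test) * y_std + y_mean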
Answered by pir
The answer by 1" is quite good. However, all of the fixes seems to fix the issue indirectly rather than directly. I would recommend using gradient clipping, which will clip any gradients that are above a certain value.
1" 的答案非常好。但是,所有修复似乎都间接而不是直接解决了问题。我建议使用渐变裁剪,这将裁剪任何高于某个值的渐变。
In Keras you can use clipnorm=1 (see https://keras.io/optimizers/) to simply clip all gradients with a norm above 1.
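Applied to the questioner's optimizer, the change is a single extra argument (a sketch, assuming the Keras 1.x signatures used above):

# clipnorm rescales each gradient so its L2 norm never exceeds 1 before the update is applied
sgd = SGD(lr=0.01, nesterov=True, clipnorm=1.)
model.compile(loss='mean_absolute_error', optimizer=sgd)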
Answered by HenryZhao
I faced the same problem before. I searched and found this question and its answers. All of the tricks mentioned above are important for training a deep neural network. I tried them all, but still got NaN.
I also found this issue: https://github.com/fchollet/keras/issues/2134. I cite the author's summary as follows:
I wanted to point this out so that it's archived for others who may experience this problem in future. I was running into my loss function suddenly returning a nan after it got so far into the training process. I checked the relus, the optimizer, the loss function, my dropout in accordance with the relus, the size of my network and the shape of the network. I was still getting loss that eventually turned into a nan and I was getting quite frustrated.
Then it dawned on me. I may have some bad input. It turns out, one of the images that I was handing to my CNN (and doing mean normalization on) was nothing but 0's. I wasn't checking for this case when I subtracted the mean and normalized by the std deviation and thus I ended up with an exemplar matrix which was nothing but nan's. Once I fixed my normalization function, my network now trains perfectly.
I agree with the above viewpoint: the input is sensitive for your network. In my case, I used the log value of a density estimate as an input. The absolute values could be very large, which may result in NaN after several gradient steps. I think an input check is necessary. First, you should make sure the input does not include -inf or inf, or some extremely large numbers in absolute value.
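A quick sanity check along those lines, assuming the arrays from the question (np.isfinite flags NaN as well as +/-inf):

import numpy as np

# All of these should print True before training starts
print(np.all(np.isfinite(X_train)), np.all(np.isfinite(Y_train)))
print(np.all(np.isfinite(X_test)), np.all(np.isfinite(Y_test)))
# Also look at the largest magnitudes; extremely large values can overflow after a few gradient steps
print(np.abs(X_train).max(), np.abs(Y_train).max())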
Answered by Arnav
I faced a very similar problem, and this is how I got it to run.
The first thing you can try is changing your activation to LeakyReLU instead of using ReLU or Tanh. The reason is that often, many of the nodes within your layers have an activation of zero, and backpropagation doesn't update the weights for these nodes because their gradient is also zero. This is also called the 'dying ReLU' problem (you can read more about it here: https://datascience.stackexchange.com/questions/5706/what-is-the-dying-relu-problem-in-neural-networks).
To do this, you can import the LeakyReLU activation using:
from keras.layers.advanced_activations import LeakyReLU
and incorporate it within your layers like this:
model.add(Dense(800,input_shape=(num_inputs,)))
model.add(LeakyReLU(alpha=0.1))
Additionally, it is possible that the output feature (the continuous variable you are trying to predict) is an imbalanced data set and has too many 0s. One way to fix this issue is to use smoothing. You can do this by adding 1 to the numerator of all your values in this column and dividing each of the values in this column by 1/(average of all the values in this column).
This essentially shifts all the values from 0 to a value greater than 0 (which may still be very small). This prevents the curve from predicting 0s and minimizing the loss (eventually making it NaN). Smaller values are more greatly impacted than larger values, but on the whole, the average of the data set remains the same.
Answered by javac
I faced the same problem using LSTM. The problem was that my data had some NaN values after standardization. Therefore, check the model's input data after standardization to see whether it contains NaN values:
print(np.any(np.isnan(X_test)))
print(np.any(np.isnan(y_test)))
You can solve this by adding a small value (0.000001) to the std, like this:
import numpy as np

def standardize(train, test):
    # Compute the statistics on the training split only
    mean = np.mean(train, axis=0)
    # Add a small epsilon so constant columns (std == 0) don't produce NaN
    std = np.std(train, axis=0) + 0.000001
    X_train = (train - mean) / std
    X_test = (test - mean) / std
    return X_train, X_test
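A quick usage sketch of the function above (X_train_raw and X_test_raw are hypothetical names for the unscaled arrays), re-running the NaN check afterwards:

X_train, X_test = standardize(X_train_raw, X_test_raw)
print(np.any(np.isnan(X_train)), np.any(np.isnan(X_test)))  # both should now print False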
Answered by Rorschach
I had the same problem with my RNN with Keras LSTM layers, so I tried each solution from above. I had already scaled my data (with sklearn.preprocessing.MinMaxScaler); there were no NaN values in my data after scaling. Solutions like using LeakyReLU or changing the learning rate didn't help.
So I decided to change the scaler from MinMaxScaler to StandardScaler, even though I had no NaN values. I found it odd, but it worked!
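A minimal sketch of that swap, assuming the features sit in 2-D NumPy arrays (scale before reshaping into sequence windows); as with any scaler, the transform is fitted on the training split only:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit on the training data
X_test = scaler.transform(X_test)        # reuse the same mean/std for the test data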
Answered by Krithi07
I was getting the loss as nan in the very first epoch, as soon as the training started. A solution as simple as removing the NaNs from the input data worked for me (df.dropna()).
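A short sketch of that check, assuming df is the pandas DataFrame holding the raw data (as referenced in the answer):

print(df.isnull().sum())  # count of missing values per column
df = df.dropna()          # drop every row that still contains a NaN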
I hope this helps someone encountering a similar problem.
Answered by Not_Dave
I had a similar issue, with my logloss, MAE and other metrics all being NAs. I looked into the data and found a few features with NAs in them. I imputed the NAs with approximate values and was able to solve the issue.
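The answer doesn't say how the values were approximated; one common hedge is a per-column median fill with pandas (df is a hypothetical DataFrame of numeric features here):

# Replace each missing value with the median of its column
df = df.fillna(df.median())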
Answered by Clay Coleman
I tried every suggestion on this page and many others to no avail. We were importing csv files with pandas, then using the Keras Tokenizer with text input to create vocabularies and word vector matrices. After noticing that some CSV files led to nan while others worked, we suddenly looked at the encoding of the files and realized that ascii files were NOT working with Keras, leading to a nan loss and an accuracy of 0.0000e+00; however, utf-8 and utf-16 files were working! Breakthrough.
If you're performing textual analysis and getting nan loss after trying these suggestions, use file -i {input} (linux) or file -I {input} (osx) to discover your file type. If you have ISO-8859-1 or us-ascii, try converting to utf-8 or utf-16le. I haven't tried the latter, but I'd imagine it would work as well. Hopefully this helps someone very, very frustrated!
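A small sketch of the conversion done in Python itself (the file names are placeholders; io.open works on both Python 2 and 3):

import io

# Read the badly encoded file and rewrite it as UTF-8 before handing it to pandas / the Tokenizer
with io.open('input.csv', 'r', encoding='iso-8859-1') as src:
    text = src.read()
with io.open('input_utf8.csv', 'w', encoding='utf-8') as dst:
    dst.write(text)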
Answered by from_mars
My Keras CNN had the same problem. Like others, I tried all the above solutions: decreasing the learning rate, dropping nulls from the training data, normalizing the data, adding a dropout layer, and so on, but none of them solved the nan problem. Then I tried changing the activation function of the classifier (last) layer from sigmoid to softmax. It worked! Try changing the activation function of the last layer to softmax!
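A sketch of that change for the classification case this answer describes (num_classes is a placeholder; the labels would need to be one-hot encoded to match):

# Classifier head: one unit per class with a softmax activation instead of a single sigmoid unit
model.add(Dense(num_classes))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])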