Python: Ordering of batch normalization and dropout?

Disclaimer: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original source: http://stackoverflow.com/questions/39691902/


Ordering of batch normalization and dropout?

python, neural-network, tensorflow, conv-neural-network

Asked by golmschenk

The original question was in regard to TensorFlow implementations specifically. However, the answers are for implementations in general. This general answer is also the correct answer for TensorFlow.


When using batch normalization and dropout in TensorFlow (specifically using the contrib.layers) do I need to be worried about the ordering?


It seems possible that if I use dropout followed immediately by batch normalization there might be trouble. For example, if the shift in the batch normalization trains to the larger scale numbers of the training outputs, but then that same shift is applied to the smaller (due to the compensation for having more outputs) scale numbers without dropout during testing, then that shift may be off. Does the TensorFlow batch normalization layer automatically compensate for this? Or does this not happen for some reason I'm missing?

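A quick numeric illustration of this concern (a sketch in plain numpy, assuming standard inverted dropout, where surviving activations are scaled by 1/keep_prob during training; this is not TensorFlow's internal code):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)                     # activations feeding the next layer
keep_prob = 0.5

# Training view: inverted dropout zeroes units and rescales the survivors by 1/keep_prob.
mask = rng.random(x.shape) < keep_prob
train_view = np.where(mask, x / keep_prob, 0.0)

# Test view: dropout is disabled and activations pass through unchanged.
test_view = x

print(train_view.var())   # ~2.0: what a following batch norm layer sees during training
print(test_view.var())    # ~1.0: what it sees at test time

Under these assumptions, the running mean/variance that a batch normalization layer accumulates during training would not match the statistics it sees at test time.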

Also, are there other pitfalls to look out for when using these two together? For example, assuming I'm using them in the correct order in regards to the above (assuming there is a correct order), could there be trouble with using both batch normalization and dropout on multiple successive layers? I don't immediately see a problem with that, but I might be missing something.


Thank you much!


UPDATE:


An experimental test seems to suggest that ordering does matter. I ran the same network twice with only the batch norm and dropout reversed. When the dropout is before the batch norm, validation loss seems to be going up as training loss is going down. They're both going down in the other case. But in my case the movements are slow, so things may change after more training and it's just a single test. A more definitive and informed answer would still be appreciated.


Answered by Zhongyu Kuang

In Ioffe and Szegedy 2015, the authors state that "we would like to ensure that for any parameter values, the network always produces activations with the desired distribution". So the Batch Normalization layer is actually inserted right after a Conv layer/Fully Connected layer, but before feeding into ReLu (or any other kind of) activation. See this video at around the 53-minute mark for more details.


As far as dropout goes, I believe dropout is applied after the activation layer. In figure 3b of the dropout paper, the dropout factor/probability matrix r(l) for hidden layer l is applied on y(l), where y(l) is the result after applying the activation function f.


So in summary, the order of using batch normalization and dropout is:


-> CONV/FC -> BatchNorm -> ReLu(or other activation) -> Dropout -> CONV/FC ->

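A minimal tf.keras sketch of that ordering for a fully connected block (layer sizes, the input shape, and the dropout rate are arbitrary placeholders, not values from the answer):

import tensorflow as tf

# FC -> BatchNorm -> ReLU -> Dropout, then the next layer
model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(256, use_bias=False),   # bias is redundant right before BatchNorm
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.ReLU(),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation="softmax"),
])

Here use_bias=False is just an optional tidy-up: BatchNorm's own shift parameter makes the preceding bias redundant.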

Answered by MiloMinderbinder

As noted in the comments, an amazing resource to read up on the order of layers is here. I have gone through the comments and it is the best resource on this topic I have found on the internet.


My 2 cents:


Dropout is meant to block information from certain neurons completely, to make sure the neurons do not co-adapt. So batch normalization has to come after dropout; otherwise you are passing information through the normalization statistics.


If you think about it, in typical ML problems, this is the reason we don't compute the mean and standard deviation over the entire data and then split it into train, test and validation sets. We split first, then compute the statistics over the train set, and use them to normalize and center the validation and test datasets.

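The same discipline, illustrated with plain numpy (a generic sketch of the analogy; the array shapes and split ratio are made up):

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))

# Split first...
train, test = data[:800], data[800:]

# ...then compute the statistics on the training split only...
mu, sigma = train.mean(axis=0), train.std(axis=0)

# ...and reuse those training statistics for the held-out data.
train_norm = (train - mu) / sigma
test_norm = (test - mu) / sigma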

So I suggest Scheme 1 (this takes pseudomarvin's comment on the accepted answer into consideration):


-> CONV/FC -> ReLu(or other activation) -> Dropout -> BatchNorm -> CONV/FC


as opposed to Scheme 2


-> CONV/FC -> BatchNorm -> ReLu(or other activation) -> Dropout -> CONV/FC -> (the ordering in the accepted answer)


Please note that this means the network under Scheme 2 should show more over-fitting than the network under Scheme 1, but the OP ran some tests (as mentioned in the question) and they support Scheme 2.

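For reference, a hedged tf.keras sketch of Scheme 1 for a single fully connected block (the layer size, input shape and dropout rate are placeholders, not values from the answer):

import tensorflow as tf

# Scheme 1: FC -> ReLU -> Dropout -> BatchNorm
block = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.BatchNormalization(),
])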

Answered by xtluo

Usually, just drop the Dropout when you have BN (a BN-only block is sketched after the list below):


  • "BN eliminates the need for Dropoutin some cases cause BN provides similar regularization benefits as Dropout intuitively"
  • "Architectures like ResNet, DenseNet, etc. not using Dropout
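
For example, a BN-only block in the spirit of those architectures could be sketched like this (a hedged illustration, not the actual ResNet/DenseNet code; the filter count and input shape are placeholders):

import tensorflow as tf

def conv_bn_relu(x, filters):
    # Conv -> BatchNorm -> ReLU, with no Dropout at all
    x = tf.keras.layers.Conv2D(filters, 3, padding="same", use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.layers.ReLU()(x)

inputs = tf.keras.Input(shape=(32, 32, 3))
features = conv_bn_relu(inputs, 64)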

For more details, refer to this paper [Understanding the Disharmony between Dropout and Batch Normalization by Variance Shift] as already mentioned by @Haramoz in the comments.


Answered by Renu

Based on the research paper, for better performance we should use BN before applying Dropout.


Answered by salehinejad

The correct order is: Conv > Normalization > Activation > Dropout > Pooling

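Expressed as a single tf.keras block (a hedged sketch; the filter count, dropout rate, input shape and pool size are assumptions, not values from the answer):

import tensorflow as tf

# Conv > Normalization > Activation > Dropout > Pooling
block = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(32, (3, 3), padding="same"),  # Conv
    tf.keras.layers.BatchNormalization(),                # Normalization
    tf.keras.layers.Activation("relu"),                  # Activation
    tf.keras.layers.Dropout(0.25),                       # Dropout
    tf.keras.layers.MaxPooling2D((2, 2)),                # Pooling
])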

Answered by mohamed Adel

I found a paper that explains the disharmony between Dropout and Batch Norm. The key idea is what they call the "variance shift". This is due to the fact that dropout behaves differently between the training and testing phases, which shifts the input statistics that BN learns. The main idea can be found in a figure taken from the paper: https://arxiv.org/abs/1801.05134


A small demo for this effect can be found in this notebook https://github.com/adelizer/kaggle-sandbox/blob/master/drafts/dropout_bn.ipynb


Answered by Lukas Nießen

Conv - Activation - DropOut - BatchNorm - Pool --> Test_loss: 0.04261355847120285

Conv - Activation - DropOut - Pool - BatchNorm --> Test_loss: 0.050065308809280396

Conv - Activation - BatchNorm - Pool - DropOut --> Test_loss: 0.04911309853196144

Conv - Activation - BatchNorm - DropOut - Pool --> Test_loss: 0.06809622049331665

Conv - BatchNorm - Activation - DropOut - Pool --> Test_loss: 0.038886815309524536

Conv - BatchNorm - Activation - Pool - DropOut --> Test_loss: 0.04126095026731491

Conv - BatchNorm - DropOut - Activation - Pool --> Test_loss: 0.05142546817660332

Conv - DropOut - Activation - BatchNorm - Pool --> Test_loss: 0.04827788099646568

Conv - DropOut - Activation - Pool - BatchNorm --> Test_loss: 0.04722036048769951

Conv - DropOut - BatchNorm - Activation - Pool --> Test_loss: 0.03238215297460556



Trained on the MNIST dataset (20 epochs) with 2 convolutional modules (see below), followed each time with


model.add(layers.Flatten())                         # flatten the conv features
model.add(layers.Dense(512, activation="elu"))
model.add(layers.Dense(10, activation="softmax"))   # 10 MNIST classes

The convolutional layers have a kernel size of (3,3) and default padding, and the activation is elu. The pooling is a MaxPooling with pool size (2,2). The loss is categorical_crossentropy and the optimizer is adam.


The corresponding Dropout probabilities are 0.2 and 0.3, respectively. The numbers of feature maps are 32 and 64, respectively.

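As a hedged reconstruction of one of the tested configurations (Conv - BatchNorm - Activation - DropOut - Pool), assembled only from the hyperparameters stated above; the input shape and the exact layer arguments beyond those hyperparameters are assumptions:

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    # Module 1: 32 feature maps, dropout 0.2
    layers.Conv2D(32, (3, 3)),
    layers.BatchNormalization(),
    layers.Activation("elu"),
    layers.Dropout(0.2),
    layers.MaxPooling2D((2, 2)),
    # Module 2: 64 feature maps, dropout 0.3
    layers.Conv2D(64, (3, 3)),
    layers.BatchNormalization(),
    layers.Activation("elu"),
    layers.Dropout(0.3),
    layers.MaxPooling2D((2, 2)),
    # Head as quoted above
    layers.Flatten(),
    layers.Dense(512, activation="elu"),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])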

Edit: When I dropped the Dropout, as recommended in some answers, it converged faster but had worse generalization ability than when I used BatchNorm and Dropout.
