Python: Where do I call the BatchNormalization function in Keras?

Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/34716454/

Where do I call the BatchNormalization function in Keras?

python keras neural-network data-science batch-normalization

Asked by pr338

If I want to use the BatchNormalization function in Keras, then do I need to call it once only at the beginning?

I read this documentation for it: http://keras.io/layers/normalization/

I don't see where I'm supposed to call it. Below is my code attempting to use it:

model = Sequential()
keras.layers.normalization.BatchNormalization(epsilon=1e-06, mode=0, momentum=0.9, weights=None)
model.add(Dense(64, input_dim=14, init='uniform'))
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Dense(64, init='uniform'))
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Dense(2, init='uniform'))
model.add(Activation('softmax'))

sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', optimizer=sgd)
model.fit(X_train, y_train, nb_epoch=20, batch_size=16, show_accuracy=True, validation_split=0.2, verbose = 2)

I ask because I get similar outputs whether I run the code with the second line (the batch normalization) or without it. So either I'm not calling the function in the right place, or I guess it doesn't make that much of a difference.

Accepted answer by Lucas Ramadan

Just to answer this question in a little more detail, and as Pavel said, Batch Normalization is just another layer, so you can use it as such to create your desired network architecture.

The general use case is to use BN between the linear and non-linear layers in your network, because it normalizes the input to your activation function, so that you're centered in the linear section of the activation function (such as Sigmoid). There's a small discussion of it here.

In your case above, this might look like:


# imports for the old Keras API used in the question
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.layers.normalization import BatchNormalization
from keras.optimizers import SGD

# instantiate model
model = Sequential()

# we can think of this chunk as the input layer
model.add(Dense(64, input_dim=14, init='uniform'))
model.add(BatchNormalization())
model.add(Activation('tanh'))
model.add(Dropout(0.5))

# we can think of this chunk as the hidden layer    
model.add(Dense(64, init='uniform'))
model.add(BatchNormalization())
model.add(Activation('tanh'))
model.add(Dropout(0.5))

# we can think of this chunk as the output layer
model.add(Dense(2, init='uniform'))
model.add(BatchNormalization())
model.add(Activation('softmax'))

# setting up the optimization of our weights 
sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', optimizer=sgd)

# running the fitting
model.fit(X_train, y_train, nb_epoch=20, batch_size=16, show_accuracy=True, validation_split=0.2, verbose = 2)


Hope this clarifies things a bit more.

Answer by Pavel Surmenok

It is another type of layer, so you should add it as a layer at an appropriate place in your model:

model.add(keras.layers.normalization.BatchNormalization())

See an example here: https://github.com/fchollet/keras/blob/master/examples/kaggle_otto_nn.py

Answer by stochastic_zeitgeist

It's almost become a trend now to have a Conv2D followed by a ReLU followed by a BatchNormalization layer. So I made up a small function to call all of them at once. It makes the model definition look a whole lot cleaner and easier to read.

def Conv2DReluBatchNorm(n_filter, w_filter, h_filter, inputs):
    conv = Convolution2D(n_filter, w_filter, h_filter, border_mode='same')(inputs)
    act = Activation(activation='relu')(conv)
    return BatchNormalization()(act)
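
As a usage sketch (assuming the Keras 1.x functional API implied by the Convolution2D signature above; the imports, input shape and filter counts here are illustrative assumptions, not part of the original answer):

from keras.layers import Input
from keras.layers.core import Activation
from keras.layers.convolutional import Convolution2D
from keras.layers.normalization import BatchNormalization
from keras.models import Model

inputs = Input(shape=(64, 64, 3))          # hypothetical 64x64 RGB input
x = Conv2DReluBatchNorm(32, 3, 3, inputs)  # Conv2D -> ReLU -> BatchNormalization
x = Conv2DReluBatchNorm(64, 3, 3, x)
model = Model(input=inputs, output=x)      # Keras 1.x Model signature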

Answer by dontloo

Keras now supports the use_bias=False option, so we can save some computation by writing, for example:

model.add(Dense(64, use_bias=False))
model.add(BatchNormalization(axis=bn_axis))
model.add(Activation('tanh'))

or

model.add(Convolution2D(64, 3, 3, use_bias=False))
model.add(BatchNormalization(axis=bn_axis))
model.add(Activation('relu'))
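
Note that bn_axis is not defined in the snippets above; a reasonable assumption is that it refers to the feature (channel) axis of your data, for example:

bn_axis = -1   # assumption: channels_last data, so features sit on the last axis
# bn_axis = 1 would be the usual choice for channels_first image data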

Answer by jmancuso

This thread is misleading. Tried commenting on Lucas Ramadan's answer, but I don't have the right privileges yet, so I'll just put this here.

Batch normalization works best after the activation function, and here or here is why: it was developed to prevent internal covariate shift. Internal covariate shift occurs when the distribution of the activations of a layer shifts significantly throughout training. Batch normalization is used so that the distribution of the inputs (and these inputs are literally the result of an activation function) to a specific layer doesn't change over time due to parameter updates from each batch (or at least, allows it to change in an advantageous way). It uses batch statistics to do the normalizing, and then uses the batch normalization parameters (gamma and beta in the original paper) "to make sure that the transformation inserted in the network can represent the identity transform" (quote from the original paper). But the point is that we're trying to normalize the inputs to a layer, so it should always go immediately before the next layer in the network. Whether or not that's after an activation function is dependent on the architecture in question.

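For concreteness, the placement this answer describes (batch normalization after the activation) would look roughly like the sketch below, reusing the question's layer sizes and the old Keras Sequential style; the dropout layers are omitted for brevity, and this is only an illustration of the ordering, not code from the original answer:

from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.layers.normalization import BatchNormalization

model = Sequential()
model.add(Dense(64, input_dim=14, init='uniform'))
model.add(Activation('tanh'))    # non-linearity first...
model.add(BatchNormalization())  # ...then normalize its outputs
model.add(Dense(64, init='uniform'))
model.add(Activation('tanh'))
model.add(BatchNormalization())
model.add(Dense(2, init='uniform'))
model.add(Activation('softmax'))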

Answer by user12340

This thread has some considerable debate about whether BN should be applied before the non-linearity of the current layer or to the activations of the previous layer.

Although there is no correct answer, the authors of Batch Normalization say that it should be applied immediately before the non-linearity of the current layer. The reason (quoted from the original paper):

"We add the BN transform immediately before the nonlinearity, by normalizing x = Wu+b. We could have also normalized the layer inputs u, but since u is likely the output of another nonlinearity, the shape of its distribution is likely to change during training, and constraining its first and second moments would not eliminate the covariate shift. In contrast, Wu + b is more likely to have a symmetric, non-sparse distribution, that is “more Gaussian” (Hyv¨arinen & Oja, 2000); normalizing it is likely to produce activations with a stable distribution."

Answer by Aishwarya Radhakrishnan

Batch Normalization is used to normalize the input layer as well as the hidden layers by adjusting the mean and scale of the activations. Because of this normalizing effect of the additional layer in deep neural networks, the network can use a higher learning rate without vanishing or exploding gradients. Furthermore, batch normalization regularizes the network so that it is easier to generalize, and it is thus unnecessary to use dropout to mitigate overfitting.

Right after calculating the linear function using, say, Dense() or Conv2D() in Keras, we use BatchNormalization(), which normalizes the linear output of that layer, and then we add the non-linearity to the layer using Activation().

from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.layers.normalization import BatchNormalization
from keras.optimizers import SGD
model = Sequential()
model.add(Dense(64, input_dim=14, init='uniform'))
model.add(BatchNormalization(epsilon=1e-06, mode=0, momentum=0.9, weights=None))
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Dense(64, init='uniform'))
model.add(BatchNormalization(epsilon=1e-06, mode=0, momentum=0.9, weights=None))
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Dense(2, init='uniform'))
model.add(BatchNormalization(epsilon=1e-06, mode=0, momentum=0.9, weights=None))
model.add(Activation('softmax'))

sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', optimizer=sgd)
model.fit(X_train, y_train, nb_epoch=20, batch_size=16, show_accuracy=True, 
validation_split=0.2, verbose = 2)

How is Batch Normalization applied?

Suppose we have input a[l-1] to a layer l. Also we have weights W[l] and a bias unit b[l] for the layer l. Let a[l] be the activation vector calculated for the layer l (i.e. after adding the non-linearity) and z[l] be the vector before adding the non-linearity.

  1. Using a[l-1] and W[l] we can calculate z[l] for the layer l.
  2. Usually in feed-forward propagation we would add the bias unit to z[l] at this stage, as z[l]+b[l], but in Batch Normalization this addition of b[l] is not required and no b[l] parameter is used.
  3. Calculate the mean of z[l] and subtract it from each element.
  4. Divide (z[l] - mean) by the standard deviation. Call the result Z_temp[l].
  5. Now define new parameters γ and β that will change the scale of the hidden layer as follows:

    z_norm[l] = γ * Z_temp[l] + β

In this code excerpt, the Dense() takes a[l-1], uses W[l] and calculates z[l]. Then the BatchNormalization() that immediately follows performs the above steps to give z_norm[l]. Finally, the Activation() that follows computes tanh(z_norm[l]) to give a[l], i.e.

a[l] = tanh(z_norm[l])
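
A minimal NumPy sketch of steps 1-5 above (the toy batch and the initial values gamma = 1, beta = 0 are illustrative assumptions; the training updates of gamma/beta and the running statistics used at inference time are omitted):

import numpy as np

# toy pre-activations z[l] for a batch of 4 examples and 3 units
z = np.array([[ 0.5, -1.2,  3.0],
              [ 1.5,  0.7, -0.3],
              [-0.5,  2.2,  1.0],
              [ 0.0, -0.7,  0.6]])

eps = 1e-6                         # small constant for numerical stability
mean = z.mean(axis=0)              # step 3: per-feature batch mean
std = z.std(axis=0)                # step 4: per-feature batch standard deviation
Z_temp = (z - mean) / (std + eps)

gamma, beta = 1.0, 0.0             # step 5: learnable scale and shift (initial values)
z_norm = gamma * Z_temp + beta

a = np.tanh(z_norm)                # the subsequent Activation('tanh') gives a[l]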