Python: How to initialize weights in PyTorch?

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow, original question: http://stackoverflow.com/questions/49433936/

How to initialize weights in PyTorch?

Tags: python, neural-network, deep-learning, pytorch

Asked by Fábio Perez

How to initialize the weights and biases (for example, with He or Xavier initialization) in a network in PyTorch?

Answered by Fábio Perez

Single layer

To initialize the weights of a single layer, use a function from torch.nn.init. For instance:

conv1 = torch.nn.Conv2d(...)
torch.nn.init.xavier_uniform(conv1.weight)

Alternatively, you can modify the parameters by writing to conv1.weight.data (which is a torch.Tensor). Example:

conv1.weight.data.fill_(0.01)

The same applies for biases:

conv1.bias.data.fill_(0.01)
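
The question also asks about He initialization; torch.nn.init provides kaiming_uniform_ and kaiming_normal_ for that. A minimal sketch (the layer sizes are arbitrary):

import torch

conv1 = torch.nn.Conv2d(3, 16, kernel_size=3)

# He (Kaiming) initialization, typically used with ReLU activations
torch.nn.init.kaiming_normal_(conv1.weight, nonlinearity='relu')
torch.nn.init.zeros_(conv1.bias)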

nn.Sequential or custom nn.Module

Pass an initialization function to torch.nn.Module.apply. It will initialize the weights in the entire nn.Module recursively.

apply(fn): Applies fn recursively to every submodule (as returned by .children()) as well as self. Typical use includes initializing the parameters of a model (see also torch-nn-init).

Example:

import torch
import torch.nn as nn

def init_weights(m):
    if type(m) == nn.Linear:
        torch.nn.init.xavier_uniform(m.weight)
        m.bias.data.fill_(0.01)

net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))
net.apply(init_weights)
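
The same apply pattern extends to networks that mix layer types; a hedged sketch (the init_weights_mixed name and layer sizes are made up for illustration) that treats Conv2d and Linear layers differently:

import torch.nn as nn

def init_weights_mixed(m):
    if isinstance(m, nn.Conv2d):
        # He init for conv layers followed by ReLU
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
        if m.bias is not None:
            nn.init.zeros_(m.bias)
    elif isinstance(m, nn.Linear):
        # Xavier init for fully connected layers
        nn.init.xavier_uniform_(m.weight)
        nn.init.zeros_(m.bias)

net = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(), nn.Flatten(), nn.Linear(8, 10))
net.apply(init_weights_mixed)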

Answered by ashunigion

We compare different modes of weight initialization using the same neural network (NN) architecture.

All Zeros or Ones

If you follow the principle of Occam's razor, you might think setting all the weights to 0 or 1 would be the best solution. This is not the case.

With every weight the same, all the neurons at each layer are producing the same output. This makes it hard to decide which weights to adjust.

    # initialize two NN's with 0 and 1 constant weights
    model_0 = Net(constant_weight=0)
    model_1 = Net(constant_weight=1)
  • After 2 epochs:

[Plot: training loss with constant weight initialization]

Validation Accuracy
9.625% -- All Zeros
10.050% -- All Ones
Training Loss
2.304  -- All Zeros
1552.281  -- All Ones
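
The Net class used in these snippets is not shown in the answer; the following is only an assumed minimal sketch of what it could look like (a small fully connected MNIST-style classifier with an optional constant_weight argument), so the snippets above have something to run against:

    import torch.nn as nn
    import torch.nn.functional as F

    class Net(nn.Module):
        def __init__(self, hidden_1=512, hidden_2=256, constant_weight=None):
            super(Net, self).__init__()
            self.fc1 = nn.Linear(28 * 28, hidden_1)
            self.fc2 = nn.Linear(hidden_1, hidden_2)
            self.fc3 = nn.Linear(hidden_2, 10)
            self.dropout = nn.Dropout(0.3)
            # optionally overwrite the default initialization with a constant
            if constant_weight is not None:
                for m in self.modules():
                    if isinstance(m, nn.Linear):
                        nn.init.constant_(m.weight, constant_weight)
                        nn.init.constant_(m.bias, 0)

        def forward(self, x):
            x = x.view(x.size(0), -1)          # flatten the image
            x = F.relu(self.fc1(x))
            x = self.dropout(x)
            x = F.relu(self.fc2(x))
            x = self.dropout(x)
            return self.fc3(x)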

Uniform Initialization

A uniform distribution has an equal probability of picking any number from a set of numbers.

Let's see how well the neural network trains using a uniform weight initialization, where low=0.0 and high=1.0.

Below, we'll see another way (besides in the Net class code) to initialize the weights of a network. To define weights outside of the model definition, we can:

  1. Define a function that assigns weights by the type of network layer, then
  2. Apply those weights to an initialized model using model.apply(fn), which applies a function to each model layer.
    # takes in a module and applies the specified weight initialization
    def weights_init_uniform(m):
        classname = m.__class__.__name__
        # for every Linear layer in a model..
        if classname.find('Linear') != -1:
            # apply a uniform distribution to the weights and a bias=0
            m.weight.data.uniform_(0.0, 1.0)
            m.bias.data.fill_(0)

    model_uniform = Net()
    model_uniform.apply(weights_init_uniform)
  • After 2 epochs:

[Plot: training loss with uniform weight initialization]

Validation Accuracy
36.667% -- Uniform Weights
Training Loss
3.208  -- Uniform Weights

General rule for setting weights

The general rule for setting the weights in a neural network is to set them to be close to zero without being too small.

Good practice is to start your weights in the range of [-y, y] where y=1/sqrt(n) (n is the number of inputs to a given neuron).

    import numpy as np

    # takes in a module and applies the specified weight initialization
    def weights_init_uniform_rule(m):
        classname = m.__class__.__name__
        # for every Linear layer in a model..
        if classname.find('Linear') != -1:
            # get the number of the inputs
            n = m.in_features
            y = 1.0/np.sqrt(n)
            m.weight.data.uniform_(-y, y)
            m.bias.data.fill_(0)

    # create a new model with these weights
    model_rule = Net()
    model_rule.apply(weights_init_uniform_rule)

Below we compare the performance of a NN whose weights are initialized with a uniform distribution over [-0.5, 0.5) against one whose weights are initialized using the general rule.

  • After 2 epochs:

[Plot: performance of uniform weight initialization versus the general rule]

Validation Accuracy
75.817% -- Centered Weights [-0.5, 0.5)
85.208% -- General Rule [-y, y)
Training Loss
0.705  -- Centered Weights [-0.5, 0.5)
0.469  -- General Rule [-y, y)

Normal distribution to initialize the weights

The normal distribution should have a mean of 0 and a standard deviation of y=1/sqrt(n), where n is the number of inputs to the NN.

    import numpy as np

    ## takes in a module and applies the specified weight initialization
    def weights_init_normal(m):
        '''Takes in a module and initializes all linear layers with weight
           values taken from a normal distribution.'''

        classname = m.__class__.__name__
        # for every Linear layer in a model
        if classname.find('Linear') != -1:
            n = m.in_features
            # m.weight.data should be taken from a normal distribution
            m.weight.data.normal_(0.0, 1/np.sqrt(n))
            # m.bias.data should be 0
            m.bias.data.fill_(0)
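
Presumably this function is then applied to a fresh model in the same way as the earlier initializers (the model_normal name here is just for illustration):

    # create a new model with the normal-distribution weights
    model_normal = Net()
    model_normal.apply(weights_init_normal)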

Below we show the performance of two NNs, one initialized using a uniform distribution and the other using a normal distribution.

  • After 2 epochs:

[Plot: weight initialization using a uniform distribution versus a normal distribution]

Validation Accuracy
85.775% -- Uniform Rule [-y, y)
84.717% -- Normal Distribution
Training Loss
0.329  -- Uniform Rule [-y, y)
0.443  -- Normal Distribution

Answered by prosti

To initialize layers you typically don't need to do anything.

PyTorch will do it for you. If you think about it, this makes a lot of sense. Why should we initialize layers ourselves, when PyTorch can do that following the latest research?

Check, for instance, the Linear layer.

In the __init__ method it will call the Kaiming He init function.

    def reset_parameters(self):
        init.kaiming_uniform_(self.weight, a=math.sqrt(3))
        if self.bias is not None:
            fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
            bound = 1 / math.sqrt(fan_in)
            init.uniform_(self.bias, -bound, bound)

It is similar for other layer types. For conv2d, for instance, check here.

Note: the gain from proper initialization is faster training. If your problem calls for a special initialization, you can do it afterwards.
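
As a sketch of what doing it afterwards could look like (layer sizes arbitrary), you simply overwrite the default initialization once the module has been constructed:

    import torch.nn as nn

    layer = nn.Linear(128, 64)            # already Kaiming-initialized by PyTorch

    # override the defaults only if the problem calls for it
    nn.init.xavier_normal_(layer.weight)
    nn.init.zeros_(layer.bias)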

Answered by Luca Di Liello

Sorry for being so late, I hope my answer will help.

To initialise weights with a normal distribution use:

torch.nn.init.normal_(tensor, mean=0, std=1)

Or to initialise with a constant value write:

torch.nn.init.constant_(tensor, value)

Or to use a uniform distribution:

torch.nn.init.uniform_(tensor, a=0, b=1) # a: lower_bound, b: upper_bound
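
These functions operate in place on any tensor, so for an actual layer you would typically pass its weight and bias parameters; a small sketch (sizes arbitrary):

import torch
import torch.nn as nn

layer = nn.Linear(20, 10)
torch.nn.init.normal_(layer.weight, mean=0, std=1)   # weights ~ N(0, 1)
torch.nn.init.constant_(layer.bias, 0)               # biases set to 0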

You can check other methods to initialise tensors here

Answered by Duane

    import torch.nn as nn

    # example sizes, assumed here only so the snippet runs
    in_features, h_size = 10, 16

    # a simple network
    rand_net = nn.Sequential(nn.Linear(in_features, h_size),
                             nn.BatchNorm1d(h_size),
                             nn.ReLU(),
                             nn.Linear(h_size, h_size),
                             nn.BatchNorm1d(h_size),
                             nn.ReLU(),
                             nn.Linear(h_size, 1),
                             nn.ReLU())

    # initialization function, first checks the module type,
    # then applies the desired changes to the weights
    def init_uniform(m):
        if type(m) == nn.Linear:
            nn.init.uniform_(m.weight)

    # use the module's apply function to recursively apply the initialization
    rand_net.apply(init_uniform)

Answered by ted

Iterate over parameters

If you cannot use apply, for instance if the model does not implement Sequential directly:

Same for all

# see UNet at https://github.com/milesial/Pytorch-UNet/tree/master/unet


def init_all(model, init_func, *params, **kwargs):
    for p in model.parameters():
        init_func(p, *params, **kwargs)

model = UNet(3, 10)
init_all(model, torch.nn.init.normal_, mean=0., std=1) 
# or
init_all(model, torch.nn.init.constant_, 1.) 

Depending on shape

def init_all(model, init_funcs):
    for p in model.parameters():
        init_func = init_funcs.get(len(p.shape), init_funcs["default"])
        init_func(p)

model = UNet(3, 10)
init_funcs = {
    1: lambda x: torch.nn.init.normal_(x, mean=0., std=1.), # can be bias
    2: lambda x: torch.nn.init.xavier_normal_(x, gain=1.), # can be weight
    3: lambda x: torch.nn.init.xavier_uniform_(x, gain=1.), # can be conv1D filter
    4: lambda x: torch.nn.init.xavier_uniform_(x, gain=1.), # can be conv2D filter
    "default": lambda x: torch.nn.init.constant(x, 1.), # everything else
}

init_all(model, init_funcs)

You can try with torch.nn.init.constant_(x, len(x.shape)) to check that they are appropriately initialized:

init_funcs = {
    "default": lambda x: torch.nn.init.constant_(x, len(x.shape))
}
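
Running init_all with this checking dictionary and then inspecting the parameters should confirm the mapping; a quick sanity-check sketch, assuming the UNet model from above:

model = UNet(3, 10)
init_all(model, init_funcs)

# every parameter should now contain a single value equal to len(p.shape)
for name, p in model.named_parameters():
    print(name, p.shape, p.unique())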

Answered by Nicolas Gervais

If you want some extra flexibility, you can also set the weights manually.

Say you have an input of all ones:

import torch
import torch.nn as nn

input = torch.ones((8, 8))
print(input)
tensor([[1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.]])

And you want to make a dense layer with no bias (so we can visualize):

d = nn.Linear(8, 8, bias=False)

Set all the weights to 0.5 (or anything else):

d.weight.data = torch.full((8, 8), 0.5)
print(d.weight.data)

The weights:

Out[14]: 
tensor([[0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000]])

All your weights are now 0.5. Pass the data through:

d(input)
Out[13]: 
tensor([[4., 4., 4., 4., 4., 4., 4., 4.],
        [4., 4., 4., 4., 4., 4., 4., 4.],
        [4., 4., 4., 4., 4., 4., 4., 4.],
        [4., 4., 4., 4., 4., 4., 4., 4.],
        [4., 4., 4., 4., 4., 4., 4., 4.],
        [4., 4., 4., 4., 4., 4., 4., 4.],
        [4., 4., 4., 4., 4., 4., 4., 4.],
        [4., 4., 4., 4., 4., 4., 4., 4.]], grad_fn=<MmBackward>)

Remember that each neuron receives 8 inputs, each with weight 0.5 and value 1 (and no bias), so each output sums to 8 × 0.5 × 1 = 4.

Answered by Glory Chen

Because I don't have enough reputation so far, I can't add a comment under the answer posted by prosti on Jun 26 '19 at 13:16.

    def reset_parameters(self):
        init.kaiming_uniform_(self.weight, a=math.sqrt(3))
        if self.bias is not None:
            fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
            bound = 1 / math.sqrt(fan_in)
            init.uniform_(self.bias, -bound, bound)

But I want to point out that some of the assumptions in the paper by Kaiming He, Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, are not strictly appropriate, even though the deliberately designed initialization method is a hit in practice.

E.g., within the Backward Propagation Case subsection, they assume that $w_l$ and $\delta y_l$ are independent of each other. But as we all know, taking the score map $\delta y^L_i$ as an instance, it often is $y_i - softmax(y^L_i) = y_i - softmax(w^L_i x^L_i)$ if we use a typical cross-entropy loss objective.
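
For reference, the standard softmax cross-entropy gradient behind this remark is shown below (notation assumed here: $t_i$ is the one-hot target and $y^L = w^L x^L$ the last-layer pre-activation; the answer above writes it with the opposite sign convention):

$$\ell = -\sum_j t_j \log \mathrm{softmax}(y^L)_j, \qquad \frac{\partial \ell}{\partial y^L_i} = \mathrm{softmax}(y^L)_i - t_i = \mathrm{softmax}(w^L x^L)_i - t_i$$

So the error signal $\delta y^L$ reaching the last layer is an explicit function of $w^L$, which is why the independence assumption between weights and gradients does not strictly hold there.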

So I think the true underlying reason why He initialization works so well remains to be unravelled. Everyone has witnessed its power in boosting deep learning training.

Answered by Joseph Konan

If you see a deprecation warning (@Fábio Perez)...

def init_weights(m):
    if type(m) == nn.Linear:
        torch.nn.init.xavier_uniform_(m.weight)
        m.bias.data.fill_(0.01)

net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))
net.apply(init_weights)