How to set layer-wise learning rates in TensorFlow with Python

Note: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/34945554/

Time: 2020-08-19 15:44:27  Source: igfitidea

How to set layer-wise learning rate in Tensorflow?

Tags: python, deep-learning, tensorflow

Asked by Tong Shen

I am wondering if there is a way to use different learning rates for different layers, as in Caffe. I am trying to modify a pre-trained model and use it for other tasks. What I want is to speed up training for the newly added layers and keep the trained layers at a low learning rate to prevent them from being distorted. For example, I have a pre-trained model with 5 conv layers. Now I add a new conv layer and fine-tune it. The first 5 layers would have a learning rate of 0.00001 and the last one would have 0.001. Any idea how to achieve this?

Accepted answer by Rafał Józefowicz

It can be achieved quite easily with 2 optimizers:

# TF 1.x graph-mode API; `loss` is assumed to be defined already.
var_list1 = [...]  # placeholder: variables from the first 5 layers
var_list2 = [...]  # placeholder: the rest of the variables
train_op1 = tf.train.GradientDescentOptimizer(0.00001).minimize(loss, var_list=var_list1)
train_op2 = tf.train.GradientDescentOptimizer(0.0001).minimize(loss, var_list=var_list2)
train_op = tf.group(train_op1, train_op2)

One disadvantage of this implementation is that it computes tf.gradients(.) twice inside the optimizers and thus it might not be optimal in terms of execution speed. This can be mitigated by explicitly calling tf.gradients(.), splitting the list into 2 and passing corresponding gradients to both optimizers.

Related question: Holding variables constant during optimizer

EDIT: Added more efficient but longer implementation:

var_list1 = [...]  # placeholder: variables from the first 5 layers
var_list2 = [...]  # placeholder: the rest of the variables
opt1 = tf.train.GradientDescentOptimizer(0.00001)
opt2 = tf.train.GradientDescentOptimizer(0.0001)
# Compute all gradients once, then split them per variable group.
grads = tf.gradients(loss, var_list1 + var_list2)
grads1 = grads[:len(var_list1)]
grads2 = grads[len(var_list1):]
train_op1 = opt1.apply_gradients(zip(grads1, var_list1))
train_op2 = opt2.apply_gradients(zip(grads2, var_list2))
train_op = tf.group(train_op1, train_op2)

You can use tf.trainable_variables() to get all trainable variables and select from them. The difference is that in the first implementation tf.gradients(.) is called twice inside the optimizers. This may cause some redundant operations to be executed (e.g. gradients on the first layer can reuse some computations for the gradients of the following layers).
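
The "compute once, split, apply separately" pattern can be sketched without TensorFlow. The following is a minimal plain-Python illustration, not the answer's code: the loss (a sum of squares), the parameter values, and the helper names `gradients` and `sgd_step` are all assumptions made for the example.

```python
# Plain-Python sketch (no TensorFlow): one gradient pass, two learning rates.
def gradients(params):
    """Gradient of loss = sum(p ** 2) for every parameter, computed in one pass."""
    return [2.0 * p for p in params]

def sgd_step(params, grads, learning_rate):
    """Plain gradient-descent update for one group of parameters."""
    return [p - learning_rate * g for p, g in zip(params, grads)]

var_list1 = [1.0, 2.0]  # stands in for the pre-trained layers
var_list2 = [3.0]       # stands in for the newly added layer

# Compute all gradients once, then split the list by group.
grads = gradients(var_list1 + var_list2)
grads1 = grads[:len(var_list1)]
grads2 = grads[len(var_list1):]

# Each group is updated with its own learning rate.
new_list1 = sgd_step(var_list1, grads1, learning_rate=0.00001)
new_list2 = sgd_step(var_list2, grads2, learning_rate=0.0001)
print(new_list1, new_list2)
```

The slicing relies on the gradients coming back in the same order as the concatenated variable list, which is also what the TensorFlow version above depends on.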

Answer by Yaroslav Bulatov

Update Jan 22: the recipe below is only a good idea for GradientDescentOptimizer. Other optimizers that keep a running average apply the learning rate before the parameter update, so the recipe below won't affect that part of the equation.
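
One way to see why scaling gradients may not act like scaling the learning rate is to look at an optimizer that normalizes by a running average. The sketch below uses the first Adam update as an assumed example (the answer does not name a specific optimizer), written in plain Python with the commonly used default hyperparameters.

```python
import math

def adam_first_step(grad, learning_rate=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of the very first Adam update for a scalar gradient."""
    m = (1.0 - beta1) * grad          # first-moment running average
    v = (1.0 - beta2) * grad * grad   # second-moment running average
    m_hat = m / (1.0 - beta1)         # bias correction at step 1
    v_hat = v / (1.0 - beta2)
    return learning_rate * m_hat / (math.sqrt(v_hat) + eps)

def sgd_step(grad, learning_rate=0.001):
    """Size of a plain gradient-descent update."""
    return learning_rate * grad

g = 0.5
sgd_ratio = sgd_step(2 * g) / sgd_step(g)
adam_ratio = adam_first_step(2 * g) / adam_first_step(g)
print(sgd_ratio, adam_ratio)
```

Doubling the gradient doubles the plain gradient-descent step, while the Adam step stays almost unchanged because the gradient is divided by its own magnitude.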

In addition to Rafal's approach, you could use the compute_gradients and apply_gradients interface of Optimizer. For instance, here's a toy network where I use 2x the learning rate for the second parameter:

import tensorflow as tf  # TF 1.x graph-mode API

x = tf.Variable(tf.ones([]))
y = tf.Variable(tf.zeros([]))
loss = tf.square(x-y)
global_step = tf.Variable(0, name="global_step", trainable=False)

opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
grads_and_vars = opt.compute_gradients(loss, [x, y])
ygrad, _ = grads_and_vars[1]
# Doubling y's gradient is equivalent to doubling its learning rate under plain SGD.
train_op = opt.apply_gradients([grads_and_vars[0], (ygrad*2, y)], global_step=global_step)

init_op = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init_op)
for i in range(5):
  sess.run([train_op, loss, global_step])
  print(sess.run([x, y]))

You should see

[0.80000001, 0.40000001]
[0.72000003, 0.56]
[0.68800002, 0.62400001]
[0.67520005, 0.64960003]
[0.67008007, 0.65984005]
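
The printed values can be checked by hand. This plain-Python sketch repeats the same arithmetic without TensorFlow (the variable names `grad_x`, `grad_y`, and `history` are introduced here for illustration); results agree with the output above up to float32 rounding.

```python
# Reproduce the toy update rule: loss = (x - y)^2, learning rate 0.1,
# with y's gradient doubled before it is applied.
x, y = 1.0, 0.0
learning_rate = 0.1
history = []
for _ in range(5):
    grad_x = 2.0 * (x - y)    # d/dx (x - y)^2
    grad_y = -2.0 * (x - y)   # d/dy (x - y)^2
    x -= learning_rate * grad_x
    y -= learning_rate * (2.0 * grad_y)  # doubled gradient acts as a 2x learning rate
    history.append((x, y))
    print(round(x, 5), round(y, 5))
```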

Answer by Sergey Demyanov

Collect learning rate multipliers for each variable like:

self.lr_multipliers[var.op.name] = lr_mult

and then apply them before applying the gradients, like:

def _train_op(self):
  tf.scalar_summary('learning_rate', self._lr_placeholder)
  opt = tf.train.GradientDescentOptimizer(self._lr_placeholder)
  grads_and_vars = opt.compute_gradients(self._loss)
  grads_and_vars_mult = []
  for grad, var in grads_and_vars:
    grad *= self._network.lr_multipliers[var.op.name]
    grads_and_vars_mult.append((grad, var))
    tf.histogram_summary('variables/' + var.op.name, var)
    tf.histogram_summary('gradients/' + var.op.name, grad)
  return opt.apply_gradients(grads_and_vars_mult)

You can find the whole example here.

Answer by Nicolas Pinchaud

The first 5 layers would have learning rate of 0.00001 and the last one would have 0.001. Any idea how to achieve this?

There is an easy way to do that using tf.stop_gradient. Here is an example with 3 layers:

x = layer1(input)
x = layer2(x)
output = layer3(x)

You can shrink your gradient in the first two layers by a ratio of 1/100:

x = layer1(input)
x = layer2(x)
# Use float literals: under Python 2, 1/100 is integer division and evaluates to 0.
x = 0.01 * x + (1 - 0.01) * tf.stop_gradient(x)
output = layer3(x)

At layer2, the "flow" is split into two branches. One branch, with a contribution of 1/100, computes its gradient as usual but with the gradient magnitude shrunk by a factor of 100. The other branch provides the remaining "flow" without contributing to the gradient, because of the tf.stop_gradient operator. As a result, if you use a learning rate of 0.001 in your model optimizer, the first two layers will effectively have a learning rate of 0.00001.
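
The effect of the mixing trick can be checked with a tiny forward-mode differentiation sketch in plain Python. Everything here (the `Dual` class, `stop_gradient`, `scale_gradient`, and the toy "layers" that just multiply by constants) is a hypothetical stand-in written for this illustration, not TensorFlow's implementation.

```python
class Dual:
    """A value together with its derivative with respect to one chosen input."""
    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)
    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule.
        return Dual(self.value * other.value,
                    self.value * other.deriv + self.deriv * other.value)
    __rmul__ = __mul__

def _lift(x):
    """Treat plain numbers as constants (derivative 0)."""
    return x if isinstance(x, Dual) else Dual(float(x))

def stop_gradient(x):
    """Same forward value, but the derivative is dropped."""
    return Dual(x.value, 0.0)

def scale_gradient(x, ratio):
    """Forward value of x is kept; its derivative is multiplied by ratio."""
    return ratio * x + (1.0 - ratio) * stop_gradient(x)

w = Dual(3.0, 1.0)   # differentiate with respect to w
h = w * 2.0          # toy "layer2" output
plain = h * 5.0      # toy "layer3" without the trick
scaled = scale_gradient(h, 0.01) * 5.0  # toy "layer3" with the trick
print(plain.value, plain.deriv, scaled.value, scaled.deriv)
```

The forward values match, while the derivative that reaches `w` is 100 times smaller in the scaled version, which is what makes the earlier layers behave as if they had a smaller learning rate under plain gradient descent.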

Answer by P-Gn

TensorFlow 1.7 introduced tf.custom_gradient, which greatly simplifies setting learning rate multipliers in a way that is compatible with any optimizer, including those that accumulate gradient statistics. For example:

import tensorflow as tf

def lr_mult(alpha):
  @tf.custom_gradient
  def _lr_mult(x):
    def grad(dy):
      return dy * alpha * tf.ones_like(x)
    return x, grad
  return _lr_mult

x0 = tf.Variable(1.)
x1 = tf.Variable(1.)
loss = tf.square(x0) + tf.square(lr_mult(0.1)(x1))

step = tf.train.GradientDescentOptimizer(learning_rate=0.1).minimize(loss)

sess = tf.InteractiveSession()
tf.global_variables_initializer().run()
tf.local_variables_initializer().run()

for _ in range(5):
  sess.run([step])
  print(sess.run([x0, x1, loss]))
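
The answer does not list the expected output, so here is a plain-Python sketch of the same arithmetic, assuming plain gradient descent as in the snippet above. The names `grad_x0`, `grad_x1`, and `trace` are introduced for the illustration; TensorFlow's float32 results would differ slightly in the last digits.

```python
# loss = x0^2 + x1^2, where the gradient reaching x1 is scaled by alpha.
learning_rate = 0.1
alpha = 0.1  # multiplier applied to x1's gradient in the backward pass
x0, x1 = 1.0, 1.0
trace = []
for _ in range(5):
    grad_x0 = 2.0 * x0           # d/dx0 of x0^2
    grad_x1 = alpha * 2.0 * x1   # d/dx1 of x1^2, scaled by alpha
    x0 -= learning_rate * grad_x0
    x1 -= learning_rate * grad_x1
    loss = x0 ** 2 + x1 ** 2     # the forward value is unaffected by alpha
    trace.append((x0, x1, loss))
    print(x0, x1, loss)
```

Each step multiplies x0 by 0.8 but x1 only by 0.98, so x1 moves as if its learning rate were 0.01 instead of 0.1.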

Answer by Lewis Smith

A slight variation of Sergey Demyanov's answer, where you only have to specify the learning rates you would like to change:

from collections import defaultdict

self.learning_rates = defaultdict(lambda: 1.0)
...
x = tf.layers.Dense(3)(x)
self.learning_rates[x.op.name] = 2.0
...
optimizer = tf.train.MomentumOptimizer(learning_rate=1e-3, momentum=0.9)
grads_and_vars = optimizer.compute_gradients(loss)
grads_and_vars_mult = []
for grad, var in grads_and_vars:
    grad *= self.learning_rates[var.op.name]
    grads_and_vars_mult.append((grad, var))
train_op = optimizer.apply_gradients(grads_and_vars_mult, tf.train.get_global_step())
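
The lookup logic can be tried in isolation. In this plain-Python sketch the variable names and gradient values are hypothetical, and the list of pairs stands in for what optimizer.compute_gradients(loss) would return.

```python
from collections import defaultdict

# Unlisted variables fall back to a multiplier of 1.0.
learning_rates = defaultdict(lambda: 1.0)
learning_rates["dense/kernel"] = 2.0  # hypothetical variable name

# (gradient, variable name) pairs standing in for compute_gradients output.
grads_and_vars = [(0.5, "conv1/kernel"), (0.5, "dense/kernel")]

grads_and_vars_mult = [
    (grad * learning_rates[name], name) for grad, name in grads_and_vars
]
print(grads_and_vars_mult)
```

Note that a multiplier only takes effect if its key matches the name looked up in the loop (var.op.name in the snippet above); any other key silently falls back to 1.0.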

Answer by Morty

If you happen to be using tf.slim + slim.learning.create_train_op there is a nice example here: https://github.com/google-research/tf-slim/blob/master/tf_slim/learning.py#L65

# Create the train_op and scale the gradients by providing a map from variable
# name (or variable) to a scaling coefficient:
gradient_multipliers = {
    'conv0/weights': 1.2,
    'fc8/weights': 3.4,
}
train_op = slim.learning.create_train_op(
    total_loss,
    optimizer,
    gradient_multipliers=gradient_multipliers)

Unfortunately it doesn't seem possible to use a tf.Variable instead of a float value if you want to gradually modify the multiplier.
