Python: How to set an adaptive learning rate for GradientDescentOptimizer?
Disclaimer: This page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me) on StackOverflow.
Original question: http://stackoverflow.com/questions/33919948/
How to set an adaptive learning rate for GradientDescentOptimizer?
Asked by displayname
I am using TensorFlow to train a neural network. This is how I am initializing the GradientDescentOptimizer:
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)
mse = tf.reduce_mean(tf.square(out - out_))
train_step = tf.train.GradientDescentOptimizer(0.3).minimize(mse)
The thing here is that I don't know how to set an update rule or a decay value for the learning rate.
How can I use an adaptive learning rate here?
Accepted answer by mrry
First of all, tf.train.GradientDescentOptimizer is designed to use a constant learning rate for all variables in all steps. TensorFlow also provides out-of-the-box adaptive optimizers, including tf.train.AdagradOptimizer and tf.train.AdamOptimizer, and these can be used as drop-in replacements.
However, if you want to control the learning rate with otherwise-vanilla gradient descent, you can take advantage of the fact that the learning_rate argument to the tf.train.GradientDescentOptimizer constructor can be a Tensor object. This allows you to compute a different value for the learning rate in each step, for example:
learning_rate = tf.placeholder(tf.float32, shape=[])
# ...
train_step = tf.train.GradientDescentOptimizer(
    learning_rate=learning_rate).minimize(mse)
sess = tf.Session()
# Feed different values for learning rate to each training step.
sess.run(train_step, feed_dict={learning_rate: 0.1})
sess.run(train_step, feed_dict={learning_rate: 0.1})
sess.run(train_step, feed_dict={learning_rate: 0.01})
sess.run(train_step, feed_dict={learning_rate: 0.01})
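In practice the fed value would typically come from a schedule computed on the Python side. Here is a minimal sketch of that idea (the particular decay schedule below is illustrative, not part of the original answer):
# Feed a value computed by an arbitrary Python-side schedule at each step.
for step in range(1000):
    lr_value = 0.1 * (0.95 ** (step // 100))  # e.g. shrink the rate every 100 steps
    sess.run(train_step, feed_dict={learning_rate: lr_value})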
Alternatively, you could create a scalar tf.Variable that holds the learning rate, and assign to it each time you want to change the learning rate.
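A minimal sketch of that variable-based approach (the names here are illustrative, and mse is assumed to be the loss tensor from the question):
lr = tf.Variable(0.1, trainable=False)         # learning rate held in a variable
new_lr = tf.placeholder(tf.float32, shape=[])
update_lr = tf.assign(lr, new_lr)              # op that overwrites the variable
train_step = tf.train.GradientDescentOptimizer(lr).minimize(mse)
# (variables must be initialized before running, as in the question)
sess.run(update_lr, feed_dict={new_lr: 0.01})  # change the rate whenever desired
sess.run(train_step)                           # subsequent steps now use 0.01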
Answer by dga
TensorFlow provides an op to automatically apply an exponential decay to a learning rate tensor: tf.train.exponential_decay. For an example of it in use, see this line in the MNIST convolutional model example. Then use @mrry's suggestion above to supply this variable as the learning_rate parameter to your optimizer of choice.
The key excerpt to look at is:
# Optimizer: set up a variable that's incremented once per batch and
# controls the learning rate decay.
batch = tf.Variable(0)
learning_rate = tf.train.exponential_decay(
    0.01,                # Base learning rate.
    batch * BATCH_SIZE,  # Current index into the dataset.
    train_size,          # Decay step.
    0.95,                # Decay rate.
    staircase=True)
# Use simple momentum for the optimization.
optimizer = tf.train.MomentumOptimizer(learning_rate,
                                       0.9).minimize(loss,
                                                     global_step=batch)
Note the global_step=batch parameter to minimize(). That tells the optimizer to helpfully increment the 'batch' parameter for you every time it trains.
Answer by Prakash Vanapalli
From the TensorFlow official docs:
global_step = tf.Variable(0, trainable=False)
starter_learning_rate = 0.1
learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step,
                                           100000, 0.96, staircase=True)
# Passing global_step to minimize() will increment it at each step.
learning_step = (
    tf.train.GradientDescentOptimizer(learning_rate)
    .minimize(...my loss..., global_step=global_step))
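Since the decayed rate is itself a tensor, you can evaluate it to watch the schedule as training progresses (a small sketch; sess is assumed to be an active tf.Session with variables initialized):
# Inspect the decayed learning rate at the current global step.
current_lr, step = sess.run([learning_rate, global_step])
print("step %d: learning rate %g" % (step, current_lr))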
Answer by Salvador Dali
The gradient descent algorithm uses the constant learning rate which you provide during initialization. You can pass various learning rates in the way shown by mrry.
But instead you can also use more advanced optimizers, which have a faster convergence rate and adapt to the situation.
Here is a brief explanation based on my understanding (a short usage sketch follows the list):
- Momentum helps SGD navigate along the relevant directions and softens the oscillations in the irrelevant ones. It simply adds a fraction of the direction of the previous step to the current step. This amplifies speed in the correct direction and softens oscillation in wrong directions. This fraction is usually in the (0, 1) range. It also makes sense to use adaptive momentum. At the beginning of learning a big momentum will only hinder your progress, so it makes sense to use something like 0.01, and once all the high gradients have disappeared you can use a bigger momentum. There is one problem with momentum: when we are very close to the goal, our momentum is in most cases very high and it does not know that it should slow down. This can cause it to miss or oscillate around the minima.
- Nesterov accelerated gradient overcomes this problem by starting to slow down early. In momentum we first compute the gradient and then make a jump in that direction, amplified by whatever momentum we had previously. NAG does the same thing but in another order: at first we make a big jump based on our stored information, and then we calculate the gradient and make a small correction. This seemingly irrelevant change gives significant practical speedups.
- AdaGrad, or adaptive gradient, allows the learning rate to adapt based on the parameters. It performs larger updates for infrequent parameters and smaller updates for frequent ones. Because of this it is well suited for sparse data (NLP or image recognition). Another advantage is that it basically eliminates the need to tune the learning rate. Each parameter has its own learning rate, and due to the peculiarities of the algorithm the learning rate is monotonically decreasing. This causes the biggest problem: at some point the learning rate is so small that the system stops learning.
- AdaDelta resolves the problem of the monotonically decreasing learning rate in AdaGrad. In AdaGrad the learning rate was calculated approximately as one divided by the sum of square roots. At each stage you add another square root to the sum, which causes the denominator to constantly decrease. In AdaDelta, instead of summing all past square roots, it uses a sliding window which allows the sum to decrease. RMSprop is very similar to AdaDelta.
- Adam, or adaptive momentum, is an algorithm similar to AdaDelta. But in addition to storing learning rates for each of the parameters, it also stores momentum changes for each of them separately.
A couple of visualizations (the images are in the original answer and are not reproduced here).
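As a usage sketch, each of these optimizers is available in the TF 1.x tf.train module and can replace GradientDescentOptimizer as a drop-in; the hyperparameter values below are illustrative, and mse is assumed to be the loss from the question:
# Pick one of these as a drop-in replacement for GradientDescentOptimizer:
train_step = tf.train.MomentumOptimizer(0.01, momentum=0.9, use_nesterov=True).minimize(mse)  # (Nesterov) momentum
train_step = tf.train.AdagradOptimizer(0.01).minimize(mse)    # AdaGrad
train_step = tf.train.AdadeltaOptimizer().minimize(mse)       # AdaDelta
train_step = tf.train.RMSPropOptimizer(0.001).minimize(mse)   # RMSprop
train_step = tf.train.AdamOptimizer(0.001).minimize(mse)      # Adam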
Answer by Ben
If you want to set specific learning rates for intervals of epochs like 0 < a < b < c < ..., you can define your learning rate as a conditional tensor, conditional on the global step, and feed it as normal to the optimiser.
You could achieve this with a bunch of nested tf.cond statements, but it's easier to build the tensor recursively:
def make_learning_rate_tensor(reduction_steps, learning_rates, global_step):
    assert len(reduction_steps) + 1 == len(learning_rates)
    if len(reduction_steps) == 1:
        return tf.cond(
            global_step < reduction_steps[0],
            lambda: learning_rates[0],
            lambda: learning_rates[1]
        )
    else:
        return tf.cond(
            global_step < reduction_steps[0],
            lambda: learning_rates[0],
            lambda: make_learning_rate_tensor(
                reduction_steps[1:],
                learning_rates[1:],
                global_step,)
        )
Then to use it, you need to know how many training steps there are in a single epoch, so that we can use the global step to switch at the right time, and finally define the epochs and learning rates you want. So if I want the learning rates [0.1, 0.01, 0.001, 0.0001] during the epoch intervals of [0, 19], [20, 59], [60, 99], [100, \infty] respectively, I would do:
global_step = tf.train.get_or_create_global_step()
learning_rates = [0.1, 0.01, 0.001, 0.0001]
steps_per_epoch = 225
epochs_to_switch_at = [20, 60, 100]
epochs_to_switch_at = [x * steps_per_epoch for x in epochs_to_switch_at]
learning_rate = make_learning_rate_tensor(epochs_to_switch_at, learning_rates, global_step)
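To plug this into training (a sketch under the assumption that mse is the loss from the question), pass the tensor to the optimizer and let minimize() advance the global step so the schedule switches automatically:
# The conditional learning-rate tensor plugs into the optimizer like any other;
# passing global_step lets minimize() increment it so the schedule advances.
train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(
    mse, global_step=global_step)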