Python: How to apply gradient clipping in TensorFlow?

Warning: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must likewise follow CC BY-SA and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/36498127/


How to apply gradient clipping in TensorFlow?

python, machine-learning, tensorflow, deep-learning, lstm

Asked by Arsenal Fanatic

Considering the example code.

I would like to know how to apply gradient clipping on this network, on the RNN, where there is a possibility of exploding gradients.

tf.clip_by_value(t, clip_value_min, clip_value_max, name=None)

This is an example that could be used, but where do I introduce it? In the definition of the RNN?

    lstm_cell = rnn_cell.BasicLSTMCell(n_hidden, forget_bias=1.0)
    # Split data because rnn cell needs a list of inputs for the RNN inner loop
    _X = tf.split(0, n_steps, _X) # n_steps
    tf.clip_by_value(_X, -1, 1, name=None)

But this doesn't make sense, as the tensor _X is the input and not the gradient, which is what should be clipped?

Do I have to define my own optimizer for this, or is there a simpler option?

Answered by Styrke

Gradient clipping needs to happen after computing the gradients, but before applying them to update the model's parameters. In your example, both of those things are handled by the AdamOptimizer.minimize() method.

In order to clip your gradients you'll need to explicitly compute, clip, and apply them as described in this section of TensorFlow's API documentation. Specifically, you'll need to substitute the call to the minimize() method with something like the following:

optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
gvs = optimizer.compute_gradients(cost)
capped_gvs = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gvs]
train_op = optimizer.apply_gradients(capped_gvs)
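
One caveat with this snippet (a note of mine, not from the original answer): compute_gradients() returns None for variables that do not influence cost, and tf.clip_by_value() cannot handle a None gradient, so a slightly more defensive variant skips those pairs:

# Skip variables that received no gradient; clip the rest elementwise to [-1, 1]
capped_gvs = [(tf.clip_by_value(grad, -1., 1.), var)
              for grad, var in gvs if grad is not None]
train_op = optimizer.apply_gradients(capped_gvs)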

Answered by danijar

Despite what seems to be popular, you probably want to clip the whole gradient by its global norm:

optimizer = tf.train.AdamOptimizer(1e-3)
gradients, variables = zip(*optimizer.compute_gradients(loss))
gradients, _ = tf.clip_by_global_norm(gradients, 5.0)
optimize = optimizer.apply_gradients(zip(gradients, variables))

Clipping each gradient matrix individually changes their relative scale but is also possible:

optimizer = tf.train.AdamOptimizer(1e-3)
gradients, variables = zip(*optimizer.compute_gradients(loss))
gradients = [
    None if gradient is None else tf.clip_by_norm(gradient, 5.0)
    for gradient in gradients]
optimize = optimizer.apply_gradients(zip(gradients, variables))

In TensorFlow 2, a tape computes the gradients, the optimizers come from Keras, and we don't need to store the update op because it runs automatically without passing it to a session:

optimizer = tf.keras.optimizers.Adam(1e-3)
# ...
with tf.GradientTape() as tape:
  loss = ...
variables = ...
gradients = tape.gradient(loss, variables)
gradients, _ = tf.clip_by_global_norm(gradients, 5.0)
optimizer.apply_gradients(zip(gradients, variables))
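
For reference, a complete runnable version of this TF2 pattern might look like the sketch below; the small Dense model, the random data, and the squared-error loss are placeholders of mine, not part of the original answer:

import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(10), tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.Adam(1e-3)
x = tf.random.normal([32, 20])   # dummy inputs
y = tf.random.normal([32, 1])    # dummy targets

with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.square(model(x) - y))
variables = model.trainable_variables
gradients = tape.gradient(loss, variables)
gradients, _ = tf.clip_by_global_norm(gradients, 5.0)  # clip by global norm
optimizer.apply_gradients(zip(gradients, variables))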

Answered by Salvador Dali

This is actually properly explained in the documentation:

Calling minimize() takes care of both computing the gradients and applying them to the variables. If you want to process the gradients before applying them you can instead use the optimizer in three steps:

  • Compute the gradients with compute_gradients().
  • Process the gradients as you wish.
  • Apply the processed gradients with apply_gradients().

And in the example they provide, they use these three steps:

# Create an optimizer.
opt = GradientDescentOptimizer(learning_rate=0.1)

# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(loss, <list of variables>)

# grads_and_vars is a list of tuples (gradient, variable).  Do whatever you
# need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(MyCapper(gv[0]), gv[1]) for gv in grads_and_vars]

# Ask the optimizer to apply the capped gradients.
opt.apply_gradients(capped_grads_and_vars)

Here MyCapper is any function that caps your gradient. The list of useful functions (other than tf.clip_by_value()) is here.
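
For illustration, MyCapper could be as simple as a wrapper around tf.clip_by_norm that passes None gradients through untouched; the function body and the 5.0 threshold below are my own choices, not something prescribed by the documentation:

def MyCapper(gradient, clip_norm=5.0):
    # Variables that received no gradient stay untouched; otherwise cap the L2 norm.
    if gradient is None:
        return None
    return tf.clip_by_norm(gradient, clip_norm)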

Answered by kmario23

For those who would like to understand the idea of gradient clipping (by norm):

Whenever the gradient norm is greater than a particular threshold, we clip the gradient norm so that it stays within the threshold. This threshold is sometimes set to 5.

Let the gradient be g and the max_norm_threshold be j.

Now, if ||g|| > j, we do:

g = (j * g) / ||g||

This is the implementation done in tf.clip_by_norm.
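
To make the formula concrete, here is a small sketch of mine (not from the answer) that applies g = (j * g) / ||g|| by hand and compares the result with tf.clip_by_norm; it assumes eager execution:

import tensorflow as tf

g = tf.constant([3.0, 4.0])   # gradient with ||g|| = 5
j = 2.0                       # max_norm_threshold

norm = tf.norm(g)
# Clip only if the norm exceeds the threshold
clipped_manually = (j * g) / norm if norm > j else g

print(clipped_manually.numpy())        # [1.2 1.6], i.e. the norm is now 2.0
print(tf.clip_by_norm(g, j).numpy())   # same values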

Answered by Ido Cohn

IMO the best solution is wrapping your optimizer with TF's estimator decorator tf.contrib.estimator.clip_gradients_by_norm:

original_optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
optimizer = tf.contrib.estimator.clip_gradients_by_norm(original_optimizer, clip_norm=5.0)
train_op = optimizer.minimize(loss)

This way you only have to define this once, and not run it after every gradient calculation.

Documentation: https://www.tensorflow.org/api_docs/python/tf/contrib/estimator/clip_gradients_by_norm
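
Note that tf.contrib was removed in TensorFlow 2, so this decorator only exists in 1.x. On TF2, a comparable "define it once" effect can be had through the clipnorm / clipvalue arguments of the Keras optimizers; a minimal sketch (the tiny model and the 5.0 threshold are placeholders of mine):

import tensorflow as tf

# clipnorm caps each gradient tensor's L2 norm before the update is applied;
# clipvalue would instead clip gradients elementwise to [-value, value].
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=5.0)

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer=optimizer, loss="mse")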

Answered by Raj

Gradient clipping basically helps in case of exploding or vanishing gradients. Say your loss is too high, which will result in exponentially large gradients flowing through the network, which may produce NaN values. To overcome this, we clip gradients within a specific range (-1 to 1, or any range as per the condition).

clip_range = 1.0  # or any range as per the condition
clipped_value = [(tf.clip_by_value(grad, -clip_range, clip_range), var)
                 for grad, var in grads_and_vars]

where grads_and_vars are the pairs of gradients (which you calculate via optimizer.compute_gradients()) and the variables they will be applied to.

After clipping, we simply apply them using the optimizer: optimizer.apply_gradients(clipped_value)