Original URL: http://stackoverflow.com/questions/50895110/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me):
StackOverFlow
What do I need K.clear_session() and del model for (Keras with Tensorflow-gpu)?
Asked by benjamin
What I am doing
I am training and using a convolutional neural network (CNN) for image classification using Keras with Tensorflow-gpu as backend.
What I am using
- PyCharm Community 2018.1.2
- both Python 2.7 and 3.5 (but not both at a time)
- Ubuntu 16.04
- Keras 2.2.0
- Tensorflow-GPU 1.8.0 as backend
What I want to know
In a lot of code, I see people using
from keras import backend as K
# Do some code, e.g. train and save model
K.clear_session()
or deleting the model after using it:
del model
The Keras documentation says regarding clear_session: "Destroys the current TF graph and creates a new one. Useful to avoid clutter from old models / layers." - https://keras.io/backend/
What is the point of doing that and should I do it as well? When loading or creating a new model my model gets overwritten anyway, so why bother?
Answered by Chris Swinchatt
K.clear_session() is useful when you're creating multiple models in succession, such as during hyperparameter search or cross-validation. Each model you train adds nodes (potentially numbering in the thousands) to the graph. TensorFlow executes the entire graph whenever you (or Keras) call tf.Session.run() or tf.Tensor.eval(), so your models will become slower and slower to train, and you may also run out of memory. Clearing the session removes all the nodes left over from previous models, freeing memory and preventing slowdown.
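For example, a minimal sketch of that pattern (build_model, x_train, and y_train are placeholders, not part of the original answer):

from keras import backend as K

for units in [32, 64, 128]:  # hypothetical hyperparameter search
    model = build_model(units)  # placeholder: returns a compiled Keras model
    model.fit(x_train, y_train, epochs=10)
    # ... record the validation score for this configuration ...
    K.clear_session()  # drop this model's graph nodes before building the next one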
Edit 21/06/19:
TensorFlow is lazy-evaluated by default. TensorFlow operations aren't evaluated immediately: creating a tensor or performing operations on it creates nodes in a dataflow graph. The results are calculated by evaluating the relevant parts of the graph in one go when you call tf.Session.run() or tf.Tensor.eval(). This is so TensorFlow can build an execution plan that allocates operations that can be performed in parallel to different devices. It can also fold adjacent nodes together or remove redundant ones (e.g. if you concatenated two tensors and later split them apart again unchanged). For more details, see https://www.tensorflow.org/guide/graphs
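As a minimal illustration of this lazy evaluation, assuming TensorFlow 1.x graph mode (the version this answer refers to):

import tensorflow as tf

a = tf.constant([1.0, 2.0])
b = tf.constant([3.0, 4.0])
c = a * b  # no computation happens yet; this only adds a node to the graph

with tf.Session() as sess:
    print(sess.run(c))  # the graph is evaluated here, printing [3. 8.]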
All of your TensorFlow models are stored in the graph as a series of tensors and tensor operations. The basic operation of machine learning is the tensor dot product - the output of a neural network is the dot product of the input matrix and the network weights. If you have a single-layer perceptron and 1,000 training samples, then each epoch creates at least 1,000 tensor operations. If you have 1,000 epochs, then your graph contains at least 1,000,000 nodes at the end, before taking into account preprocessing, postprocessing, and more complex models such as recurrent nets, encoder-decoders, attentional models, etc.
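As a toy illustration of that dot product (the shapes and values here are made up):

import numpy as np

x = np.array([0.5, -1.0, 2.0])  # one input sample with 3 features
W = np.ones((3, 2))             # weight matrix of a 3-in, 2-out layer
y = x @ W                       # the layer's output is this dot product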
The problem is that eventually the graph would be too large to fit into video memory (6 GB in my case), so TF would shuttle parts of the graph from video to main memory and back. Eventually it would even get too large for main memory (12 GB) and start moving between main memory and the hard disk. Needless to say, this made things incredibly, and increasingly, slow as training went on. Before developing this save-model/clear-session/reload-model flow, I calculated that, at the per-epoch rate of slowdown I experienced, my model would have taken longer than the age of the universe to finish training.
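A sketch of what such a save/clear/reload flow can look like (the checkpoint filename is a placeholder):

from keras import backend as K
from keras.models import load_model

model.save('checkpoint.h5')          # persist architecture and weights
K.clear_session()                    # destroy the bloated graph
model = load_model('checkpoint.h5')  # rebuild the model in a fresh graph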
Disclaimer: I haven't used TensorFlow in almost a year, so this might have changed. I remember there being quite a few GitHub issues around this so hopefully it has since been fixed.
Answered by Tawej
del will delete a variable in Python, and since model is a variable, del model will delete it, but the TF graph is unchanged (TF is your Keras backend). That said, K.clear_session() will destroy the current TF graph and create a new one. Creating a new model seems to be an independent step, but don't forget the backend :)
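A sketch of the difference (build_model is a placeholder for any function returning a compiled Keras model):

from keras import backend as K

model = build_model()  # hypothetical model constructor
del model              # removes the Python reference; the backend graph is untouched
K.clear_session()      # destroys the whole TF graph and starts a fresh one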
Answered by brethvoice
During cross-validation, I wanted to run number_of_replicates folds (a.k.a. replicates) to get an average validation loss as a basis for comparison to another algorithm. So I needed to perform cross-validation for two separate algorithms, and I have multiple GPUs available, so I figured this would not be a problem.
Unfortunately, I started seeing layer names get things like _2, _3, etc. appended to them in my loss logs. I also noticed that if I ran through the replicates (a.k.a. folds) sequentially by using a loop in a single script, I ran out of memory on the GPUs.
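Those suffixes come from Keras uniquifying auto-generated layer names within a single session; a small sketch of the effect (exact names can vary by version):

from tensorflow.keras import layers
from tensorflow.keras.backend import clear_session

print(layers.Dense(4).name)  # e.g. 'dense'
print(layers.Dense(4).name)  # e.g. 'dense_1' -- the counter keeps growing
clear_session()
print(layers.Dense(4).name)  # counter reset: 'dense' again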
This strategy worked for me; I have been running for hours on end now in tmux sessions on an Ubuntu lambda machine, sometimes seeing memory leaks, but they are killed off by a timeout function. It requires estimating the length of time it could take to complete each cross-validation fold/replicate; in the code below that number is max_duration_in_seconds (best to double the number of trips through the loop in case half of them get killed off):
from multiprocessing import Process

# establish target for process workers
def machine():
    import tensorflow as tf
    from tensorflow.keras.backend import clear_session
    from tensorflow.python.framework.ops import disable_eager_execution
    import gc

    clear_session()
    disable_eager_execution()

    nEpochs = 999  # set lower if not using tf.keras.callbacks.EarlyStopping in callbacks
    callbacks = ...  # establish early stopping, logging, etc. if desired

    algorithm_model = ...  # define layers, output(s), etc.
    opt_algorithm = ...  # choose your optimizer
    loss_metric = ...  # choose your loss function(s) (in a list for multiple outputs)
    algorithm_model.compile(optimizer=opt_algorithm, loss=loss_metric)

    trainData = ...  # establish which data to train on (for this fold/replicate only)
    validateData = ...  # establish which data to validate on (same caveat as above)
    algorithm_model.fit(
        x=trainData,
        steps_per_epoch=len(trainData),
        validation_data=validateData,
        validation_steps=len(validateData),
        epochs=nEpochs,
        callbacks=callbacks
    )

    gc.collect()
    del algorithm_model
    return

# establish main loop to start each process
def main_loop():
    for validation_replicate in range(2 * number_of_replicates):
        if validation_replicate % 2 == 0:
            print('\nStarting cross-validation replicate {} of {}:\n'.format(
                int(validation_replicate / 2), number_of_replicates))
        p = Process(target=machine)
        p.start()
        p.join(max_duration_in_seconds)
    return

# enable running of this script from command line
if __name__ == "__main__":
    main_loop()
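Note that running each fold in its own Process means the operating system reclaims all of that process's memory, including GPU allocations, when the process exits - a stronger guarantee than calling clear_session() inside a single long-lived interpreter.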