Very low GPU usage during training in TensorFlow

Note: this is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not the translator): StackOverflow
Original question: http://stackoverflow.com/questions/46146757/
Asked by Aleksei Petrenko
I am trying to train a simple multi-layer perceptron for a 10-class image classification task, which is a part of the assignment for the Udacity Deep-Learning course. To be more precise, the task is to classify letters rendered from various fonts (the dataset is called notMNIST).
The code I ended up with looks fairly simple, but no matter what I do, GPU usage during training stays very low. I measure the load with GPU-Z and it shows just 25-30%.
Here is my current code:
import logging

import tensorflow as tf
from tensorflow.contrib.data import Dataset

# train_data, train_labels, test_data, test_labels and NUM_CLASSES come from the
# notMNIST preprocessing, which is not shown here; a standard logging.Logger is
# assumed for the logger used below.
logger = logging.getLogger(__name__)

graph = tf.Graph()
with graph.as_default():
    tf.set_random_seed(52)

    # dataset definition
    dataset = Dataset.from_tensor_slices({'x': train_data, 'y': train_labels})
    dataset = dataset.shuffle(buffer_size=20000)
    dataset = dataset.batch(128)
    iterator = dataset.make_initializable_iterator()
    sample = iterator.get_next()
    x = sample['x']
    y = sample['y']

    # actual computation graph
    keep_prob = tf.placeholder(tf.float32)
    is_training = tf.placeholder(tf.bool, name='is_training')

    fc1 = dense_batch_relu_dropout(x, 1024, is_training, keep_prob, 'fc1')
    fc2 = dense_batch_relu_dropout(fc1, 300, is_training, keep_prob, 'fc2')
    fc3 = dense_batch_relu_dropout(fc2, 50, is_training, keep_prob, 'fc3')
    logits = dense(fc3, NUM_CLASSES, 'logits')

    with tf.name_scope('accuracy'):
        accuracy = tf.reduce_mean(
            tf.cast(tf.equal(tf.argmax(y, 1), tf.argmax(logits, 1)), tf.float32),
        )
        accuracy_percent = 100 * accuracy

    with tf.name_scope('loss'):
        loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y))

    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
        # ensures that we execute the update_ops before performing the train_op
        # needed for batch normalization (apparently)
        train_op = tf.train.AdamOptimizer(learning_rate=1e-3, epsilon=1e-3).minimize(loss)

with tf.Session(graph=graph) as sess:
    tf.global_variables_initializer().run()
    step = 0
    epoch = 0
    while True:
        sess.run(iterator.initializer, feed_dict={})
        while True:
            step += 1
            try:
                sess.run(train_op, feed_dict={keep_prob: 0.5, is_training: True})
            except tf.errors.OutOfRangeError:
                logger.info('End of epoch #%d', epoch)
                break

        # end of epoch: evaluate on the full train and test sets
        train_l, train_ac = sess.run(
            [loss, accuracy_percent],
            feed_dict={x: train_data, y: train_labels, keep_prob: 1, is_training: False},
        )
        test_l, test_ac = sess.run(
            [loss, accuracy_percent],
            feed_dict={x: test_data, y: test_labels, keep_prob: 1, is_training: False},
        )
        logger.info('Train loss: %f, train accuracy: %.2f%%', train_l, train_ac)
        logger.info('Test loss: %f, test accuracy: %.2f%%', test_l, test_ac)
        epoch += 1
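The dense and dense_batch_relu_dropout helpers are not included in the question. Judging from their call sites, they presumably look something like this sketch (an assumption, not the asker's actual code):

def dense(x, size, scope):
    # Plain fully-connected layer with no activation (assumed signature).
    return tf.contrib.layers.fully_connected(x, size, activation_fn=None, scope=scope)

def dense_batch_relu_dropout(x, size, is_training, keep_prob, scope):
    # Dense -> batch norm -> ReLU -> dropout, which would explain why the
    # training op above needs the tf.GraphKeys.UPDATE_OPS dependency.
    with tf.variable_scope(scope):
        h = dense(x, size, 'dense')
        h = tf.contrib.layers.batch_norm(h, center=True, scale=True,
                                         is_training=is_training, scope='bn')
        h = tf.nn.relu(h, name='relu')
        return tf.nn.dropout(h, keep_prob)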
Here's what I tried so far:
- I changed the input pipeline from a simple feed_dict to tensorflow.contrib.data.Dataset. As far as I understood, it is supposed to take care of the efficiency of the input, e.g. load data in a separate thread, so there should not be any bottleneck associated with the input.
- I collected traces as suggested here: https://github.com/tensorflow/tensorflow/issues/1824#issuecomment-225754659 (see the sketch after this list). However, these traces didn't really show anything interesting: >90% of the train step is matmul operations.
- I changed the batch size. When I change it from 128 to 512, the load increases from ~30% to ~38%; when I increase it further to 2048, the load goes to ~45%. I have 6 GB of GPU memory and the dataset is single-channel 28x28 images. Am I really supposed to use such a big batch size? Should I increase it further?
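For reference, one common way to collect such traces in TF 1.x is the timeline API. Below is a minimal sketch of tracing a single training step; it reuses sess, train_op, keep_prob and is_training from the code above and is illustrative, not the asker's exact tracing code:

from tensorflow.python.client import timeline

# Run one training step with full tracing enabled.
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
sess.run(train_op,
         feed_dict={keep_prob: 0.5, is_training: True},
         options=run_options,
         run_metadata=run_metadata)

# Convert the collected step stats into a trace that can be opened in chrome://tracing.
trace = timeline.Timeline(run_metadata.step_stats)
with open('timeline.json', 'w') as f:
    f.write(trace.generate_chrome_trace_format())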
Generally, should I worry about the low load? Is it really a sign that I am training inefficiently?
Here are the GPU-Z screenshots with 128 images in the batch. You can see low load with occasional spikes to 100%, which happen when I measure accuracy on the entire dataset after each epoch.
Answered by Yaroslav Bulatov
MNIST-sized networks are tiny, and it's hard to achieve high GPU (or CPU) efficiency for them; I think 30% is not unusual for your application. You will get higher computational efficiency with a larger batch size, meaning you can process more examples per second, but you will also get lower statistical efficiency, meaning you need to process more examples in total to reach the target accuracy. So it's a trade-off. For tiny character models like yours, statistical efficiency drops off very quickly after a batch size of about 100, so it's probably not worth trying to grow the batch size for training. For inference, you should use the largest batch size you can.
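To make the computational-efficiency half of this trade-off concrete, here is a small self-contained benchmark that times training steps of a toy MLP on random data at a few batch sizes. This is an illustration using TF 1.x APIs, not the asker's model, and absolute numbers will vary by GPU:

import time
import tensorflow as tf

def examples_per_second(batch_size, n_steps=50):
    # Build a small MLP on random data and time n_steps training steps.
    tf.reset_default_graph()
    x = tf.random_normal([batch_size, 784])
    labels = tf.random_uniform([batch_size], maxval=10, dtype=tf.int32)
    hidden = tf.layers.dense(x, 1024, activation=tf.nn.relu)
    logits = tf.layers.dense(hidden, 10)
    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
    train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(train_op)  # warm-up step: graph setup, memory allocation
        start = time.time()
        for _ in range(n_steps):
            sess.run(train_op)
        elapsed = time.time() - start
    return batch_size * n_steps / elapsed

for bs in (128, 512, 2048):
    print('batch size %5d: %8.0f examples/sec' % (bs, examples_per_second(bs)))

Larger batches should show more examples per second, which is the computational efficiency the answer refers to; the statistical-efficiency cost is not captured by a throughput measurement like this.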
Answered by Contango
On my nVidia GTX 1080, if I use a convolutional neural network on the MNIST database, the GPU load is ~68%.
If I switch to a simple, non-convolutional network, then the GPU load is ~20%.
You can replicate these results by building successively more advanced models from the tutorial Building Autoencoders in Keras by François Chollet.
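Below is a minimal sketch of this comparison (an assumption: standalone Keras with the TensorFlow backend, and generic MNIST classifiers rather than the autoencoders from the tutorial), meant only to show the kind of models whose GPU load differs so much:

from keras import layers, models
from keras.datasets import mnist
from keras.utils import to_categorical

(x_train, y_train), _ = mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1).astype('float32') / 255.0
y_train = to_categorical(y_train, 10)

def dense_model():
    # Simple non-convolutional network: typically leaves the GPU mostly idle.
    m = models.Sequential()
    m.add(layers.Flatten(input_shape=(28, 28, 1)))
    m.add(layers.Dense(512, activation='relu'))
    m.add(layers.Dense(10, activation='softmax'))
    return m

def conv_model():
    # Small convolutional network: noticeably higher GPU utilisation.
    m = models.Sequential()
    m.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
    m.add(layers.MaxPooling2D((2, 2)))
    m.add(layers.Conv2D(64, (3, 3), activation='relu'))
    m.add(layers.MaxPooling2D((2, 2)))
    m.add(layers.Flatten())
    m.add(layers.Dense(10, activation='softmax'))
    return m

for build in (dense_model, conv_model):
    model = build()
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    model.fit(x_train, y_train, batch_size=128, epochs=1)  # watch GPU load while this runs

Watching GPU-Z or nvidia-smi while each model trains should reproduce the difference in load described above.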