Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use it, you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/37126108/
How to read data into TensorFlow batches from example queue?
Asked by JohnAllen
How do I get TensorFlow example queues into proper batches for training?
I've got some images and labels:
IMG_6642.JPG 1
IMG_6643.JPG 2
(feel free to suggest another label format; I think I may need another dense to sparse step...)
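If a dense-to-sparse step does turn out to be needed, the conversion itself is small. A plain-Python sketch (`to_one_hot` is a hypothetical helper, not part of TensorFlow; labels are assumed to be 0-indexed):

```python
def to_one_hot(labels, num_classes):
    """Convert dense integer labels into one-hot vectors."""
    vectors = []
    for label in labels:
        vec = [0] * num_classes
        vec[label] = 1  # assumes 0-indexed labels
        vectors.append(vec)
    return vectors

# The labels in the listing above are 1 and 2; shifted to 0-based:
print(to_one_hot([0, 1], num_classes=3))  # [[1, 0, 0], [0, 1, 0]]
```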
I've read through quite a few tutorials but don't quite have it all together yet. Here's what I have, with comments indicating the steps required from TensorFlow's Reading Data page.
- The list of filenames (optional steps removed for the sake of simplicity)
- Filename queue
- A Reader for the file format
- A decoder for a record read by the reader
- Example queue
And after the example queue I need to get this queue into batches for training; that's where I'm stuck...
1. List of filenames
files = tf.train.match_filenames_once('*.JPG')
4. Filename queue
filename_queue = tf.train.string_input_producer(files, num_epochs=None, shuffle=True, seed=None, shared_name=None, name=None)
5. A reader
reader = tf.TextLineReader()
key, value = reader.read(filename_queue)
6. A decoder
record_defaults = [[""], [1]]
col1, col2 = tf.decode_csv(value, record_defaults=record_defaults)
(I don't think I need this step below because I already have my label in a tensor but I include it anyways)
features = tf.pack([col2])
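One thing to watch here: tf.decode_csv splits records on commas, while the label listing at the top is space-separated. A plain-Python sketch of the equivalent parse, just to make the expected line format explicit (`parse_label_line` is a hypothetical helper, not from the original):

```python
def parse_label_line(line):
    """Split a 'FILENAME LABEL' line into (filename, int_label)."""
    filename, label = line.strip().split()
    return filename, int(label)

print(parse_label_line("IMG_6642.JPG 1"))  # ('IMG_6642.JPG', 1)
```

To use tf.decode_csv as written above, the label file would need comma-separated fields instead.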
The documentation page has an example to run one image, not get the images and labels into batches:
for i in range(1200):
    # Retrieve a single instance:
    example, label = sess.run([features, col5])
And then below it has a batching section:
def read_my_file_format(filename_queue):
    reader = tf.SomeReader()
    key, record_string = reader.read(filename_queue)
    example, label = tf.some_decoder(record_string)
    processed_example = some_processing(example)
    return processed_example, label

def input_pipeline(filenames, batch_size, num_epochs=None):
    filename_queue = tf.train.string_input_producer(
        filenames, num_epochs=num_epochs, shuffle=True)
    example, label = read_my_file_format(filename_queue)
    # min_after_dequeue defines how big a buffer we will randomly sample
    # from -- bigger means better shuffling but slower start up and more
    # memory used.
    # capacity must be larger than min_after_dequeue and the amount larger
    # determines the maximum we will prefetch. Recommendation:
    # min_after_dequeue + (num_threads + a small safety margin) * batch_size
    min_after_dequeue = 10000
    capacity = min_after_dequeue + 3 * batch_size
    example_batch, label_batch = tf.train.shuffle_batch(
        [example, label], batch_size=batch_size, capacity=capacity,
        min_after_dequeue=min_after_dequeue)
    return example_batch, label_batch
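The capacity recommendation in the comment can be sanity-checked with plain arithmetic (the thread count and safety margin below are illustrative values, not from the original):

```python
batch_size = 32
min_after_dequeue = 10000
num_threads = 3
safety_margin = 1  # "a small safety margin", as the comment suggests

# Recommended: min_after_dequeue + (num_threads + margin) * batch_size
recommended_capacity = min_after_dequeue + (num_threads + safety_margin) * batch_size
print(recommended_capacity)  # 10128

# The snippet itself uses the simpler min_after_dequeue + 3 * batch_size:
snippet_capacity = min_after_dequeue + 3 * batch_size
print(snippet_capacity)  # 10096
```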
My question is: how do I use the above example code with the code I have above? I need batches to work with, and most of the tutorials come with mnist batches already.
with tf.Session() as sess:
    sess.run(init)
    # Training cycle
    for epoch in range(training_epochs):
        total_batch = int(mnist.train.num_examples / batch_size)
        # Loop over all batches
        for i in range(total_batch):
            batch_xs, batch_ys = mnist.train.next_batch(batch_size)
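For comparison, the behaviour that mnist.train.next_batch provides can be sketched in plain Python. This only illustrates what the queue-based pipeline has to replace (a simple in-memory slicer, not TensorFlow code; shuffling omitted):

```python
def batches(examples, labels, batch_size):
    """Yield successive (examples, labels) batches; drops the remainder."""
    for start in range(0, len(examples) - batch_size + 1, batch_size):
        yield (examples[start:start + batch_size],
               labels[start:start + batch_size])

xs = list(range(10))
ys = [x * 2 for x in xs]
for batch_xs, batch_ys in batches(xs, ys, batch_size=4):
    print(batch_xs, batch_ys)
# [0, 1, 2, 3] [0, 2, 4, 6]
# [4, 5, 6, 7] [8, 10, 12, 14]
```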
Answered by user5869947
If you wish to make this input pipeline work, you will need to add an asynchronous queueing mechanism that generates batches of examples. This is done by creating a tf.RandomShuffleQueue or a tf.FIFOQueue and inserting JPEG images that have been read, decoded, and preprocessed.
You can use handy constructs that generate the queues and the corresponding threads for running them via tf.train.shuffle_batch_join or tf.train.batch_join. Here is a simplified example of what this would look like. Note that this code is untested:
# Let's assume there is a queue that maintains a list of all filenames,
# called 'filename_queue'.
reader = tf.WholeFileReader()
_, file_buffer = reader.read(filename_queue)

# Decode the JPEG image and resize it to a fixed shape so that
# examples can be batched together.
image = tf.image.decode_jpeg(file_buffer, channels=3)
image = tf.image.resize_images(image, 224, 224)

# Generate batches of images of this size.
batch_size = 32

# Depends on the number of files and the training speed.
min_queue_examples = batch_size * 100
images_batch = tf.train.shuffle_batch_join(
    [[image]],
    batch_size=batch_size,
    capacity=min_queue_examples + 3 * batch_size,
    min_after_dequeue=min_queue_examples)

# Run your network on this batch of images.
predictions = my_inference(images_batch)
Depending on how you need to scale up your job, you might need to run multiple independent threads that read/decode/preprocess images and dump them into your example queue. A complete example of such a pipeline is provided in the Inception/ImageNet model. Take a look at batch_inputs:
https://github.com/tensorflow/models/blob/master/inception/inception/image_processing.py#L407
Finally, if you are working with >O(1000) JPEG images, keep in mind that it is extremely inefficient to individually read thousands of small files. This will slow down your training quite a bit.
A more robust and faster solution is to convert the dataset of images to a sharded TFRecord of Example protos. Here is a fully worked script for converting the ImageNet data set to such a format. And here is a set of instructions for running a generic version of this preprocessing script on an arbitrary directory containing JPEG images.
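The sharding itself is simple bookkeeping: each file index is mapped to one of N output shards so shard sizes stay balanced. A plain-Python sketch (the shard count and file names are illustrative, not from the scripts linked above):

```python
def shard_for(index, num_files, num_shards):
    """Map a file index to a shard, keeping shard sizes balanced."""
    return index * num_shards // num_files

files = ["img_%d.jpg" % i for i in range(10)]
shards = [shard_for(i, len(files), 4) for i in range(len(files))]
print(shards)  # [0, 0, 0, 1, 1, 2, 2, 2, 3, 3]
```

Each shard would then be written out as one TFRecord file of serialized Example protos.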