
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/37126108/

Date: 2020-08-19 18:54:14  Source: igfitidea

How to read data into TensorFlow batches from example queue?

python, numpy, classification, tensorflow

Asked by JohnAllen

How do I get TensorFlow example queues into proper batches for training?


I've got some images and labels:


IMG_6642.JPG 1
IMG_6643.JPG 2

(feel free to suggest another label format; I think I may need another dense to sparse step...)

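For reference, a label file in this format can be parsed into parallel filename/label lists with plain Python before it ever reaches TensorFlow. This is a toy helper for illustration only; the function name is made up:

```python
# Parse "IMG_6642.JPG 1"-style lines into parallel filename/label lists.
# Illustrative helper, not part of the TensorFlow pipeline itself.
def parse_label_lines(lines):
    filenames, labels = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        name, label = line.rsplit(" ", 1)
        filenames.append(name)
        labels.append(int(label))
    return filenames, labels

filenames, labels = parse_label_lines(["IMG_6642.JPG 1", "IMG_6643.JPG 2"])
print(filenames)  # ['IMG_6642.JPG', 'IMG_6643.JPG']
print(labels)     # [1, 2]
```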

I've read through quite a few tutorials but don't quite have it all together yet. Here's what I have, with comments indicating the steps required from TensorFlow's Reading Data page.


  1. The list of filenames (optional steps removed for the sake of simplicity)
  2. Filename queue
  3. A Reader for the file format
  4. A decoder for a record read by the reader
  5. Example queue
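For intuition, the queue stages above can be mimicked with a chain of plain-Python generators. This is only a toy analogy (the real TF queues run asynchronously across threads), and all names here are illustrative:

```python
import random

def toy_filename_queue(filenames, shuffle=True, seed=0):
    # Stage 2: yield filenames, optionally shuffled,
    # like tf.train.string_input_producer.
    names = list(filenames)
    if shuffle:
        random.Random(seed).shuffle(names)
    yield from names

def toy_reader(name_stream):
    # Stage 3: "read" each file, yielding (key, record) pairs.
    for name in name_stream:
        yield name, f"{name} contents"

def toy_decoder(records):
    # Stage 4: decode each record into an example dict.
    for key, value in records:
        yield {"key": key, "data": value}

examples = list(toy_decoder(toy_reader(
    toy_filename_queue(["a.JPG", "b.JPG"], shuffle=False))))
print(len(examples))  # 2
```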

And after the example queue I need to get this queue into batches for training; that's where I'm stuck...


1. List of filenames


files = tf.train.match_filenames_once('*.JPG')


2. Filename queue

filename_queue = tf.train.string_input_producer(files, num_epochs=None, shuffle=True, seed=None, shared_name=None, name=None)


3. A reader

reader = tf.TextLineReader()
key, value = reader.read(filename_queue)

4. A decoder

record_defaults = [[""], [1]]
col1, col2 = tf.decode_csv(value, record_defaults=record_defaults)

(I don't think I need the step below because I already have my label in a tensor, but I include it anyway:)

features = tf.pack([col2])


The documentation page has an example that retrieves a single instance, not one that gets the images and labels into batches:


for i in range(1200):
  # Retrieve a single instance:
  example, label = sess.run([features, col5])

And then below it has a batching section:


def read_my_file_format(filename_queue):
  reader = tf.SomeReader()
  key, record_string = reader.read(filename_queue)
  example, label = tf.some_decoder(record_string)
  processed_example = some_processing(example)
  return processed_example, label

def input_pipeline(filenames, batch_size, num_epochs=None):
  filename_queue = tf.train.string_input_producer(
      filenames, num_epochs=num_epochs, shuffle=True)
  example, label = read_my_file_format(filename_queue)
  # min_after_dequeue defines how big a buffer we will randomly sample
  #   from -- bigger means better shuffling but slower start up and more
  #   memory used.
  # capacity must be larger than min_after_dequeue and the amount larger
  #   determines the maximum we will prefetch.  Recommendation:
  #   min_after_dequeue + (num_threads + a small safety margin) * batch_size
  min_after_dequeue = 10000
  capacity = min_after_dequeue + 3 * batch_size
  example_batch, label_batch = tf.train.shuffle_batch(
      [example, label], batch_size=batch_size, capacity=capacity,
      min_after_dequeue=min_after_dequeue)
  return example_batch, label_batch
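The capacity recommendation in that comment is simple arithmetic, and the snippet's `3 * batch_size` is one instance of it. A quick check with concrete numbers (`num_threads` and the safety margin are illustrative assumptions):

```python
batch_size = 32
min_after_dequeue = 10000

# The snippet's choice:
capacity = min_after_dequeue + 3 * batch_size
print(capacity)  # 10096

# The comment's recommendation, assuming e.g. 2 reader threads
# plus 1 batch of safety margin -- the same total:
num_threads = 2
safety_margin = 1
recommended = min_after_dequeue + (num_threads + safety_margin) * batch_size
print(recommended)  # 10096
```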

My question is: how do I use that example code with the code I have above? I need batches to work with, and most of the tutorials come with MNIST batches already.


with tf.Session() as sess:
  sess.run(init)

  # Training cycle
  for epoch in range(training_epochs):
    total_batch = int(mnist.train.num_examples/batch_size)
    # Loop over all batches
    for i in range(total_batch):
      batch_xs, batch_ys = mnist.train.next_batch(batch_size)
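For comparison, `mnist.train.next_batch` simply slices successive windows out of an in-memory dataset. A minimal stand-in (a toy, not the queue-based pipeline the question is about; the wrap-around handling is deliberately simplistic):

```python
class ToyDataset:
    # Mimics the epoch loop's use of next_batch on an in-memory dataset.
    def __init__(self, xs, ys):
        self.xs, self.ys, self.pos = xs, ys, 0

    def next_batch(self, batch_size):
        # Return the next window of examples and advance the cursor.
        start = self.pos
        self.pos = (self.pos + batch_size) % len(self.xs)
        return (self.xs[start:start + batch_size],
                self.ys[start:start + batch_size])

data = ToyDataset(list(range(10)), [i * 10 for i in range(10)])
batch_xs, batch_ys = data.next_batch(4)
print(batch_xs)  # [0, 1, 2, 3]
print(batch_ys)  # [0, 10, 20, 30]
```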

Answered by user5869947

If you wish to make this input pipeline work, you will need to add an asynchronous queueing mechanism that generates batches of examples. This is performed by creating a tf.RandomShuffleQueue or a tf.FIFOQueue and inserting JPEG images that have been read, decoded and preprocessed.


You can use handy constructs that will generate the queues and the corresponding threads for running them via tf.train.shuffle_batch_join or tf.train.batch_join. Here is a simplified example of what this would look like. Note that this code is untested:


# Let's assume there is a Queue that maintains a list of all filenames
# called 'filename_queue'
_, file_buffer = reader.read(filename_queue)

# Decode the JPEG images
images = []
image = decode_jpeg(file_buffer)

# Generate batches of images of this size.
batch_size = 32

# Depends on the number of files and the training speed.
min_queue_examples = batch_size * 100
images_batch = tf.train.shuffle_batch_join(
    image,
    batch_size=batch_size,
    capacity=min_queue_examples + 3 * batch_size,
    min_after_dequeue=min_queue_examples)

# Run your network on this batch of images.
predictions = my_inference(images_batch)
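Conceptually, what the shuffle-batch machinery does is maintain a buffer of at least `min_after_dequeue` examples and draw random elements from it to form each batch. A toy single-threaded simulation of that idea (illustrative only; the real op runs concurrently with training):

```python
import random

def toy_shuffle_batch(stream, batch_size, min_after_dequeue, seed=0):
    # Fill a buffer; once it holds min_after_dequeue + batch_size items,
    # pop random elements to form a batch, then keep refilling.
    rng = random.Random(seed)
    buffer, batches = [], []
    for item in stream:
        buffer.append(item)
        if len(buffer) >= min_after_dequeue + batch_size:
            batch = [buffer.pop(rng.randrange(len(buffer)))
                     for _ in range(batch_size)]
            batches.append(batch)
    return batches

batches = toy_shuffle_batch(range(100), batch_size=8, min_after_dequeue=20)
print(len(batches))                      # 10
print(all(len(b) == 8 for b in batches)) # True
```

A bigger `min_after_dequeue` means each batch is sampled from a larger buffer, hence better shuffling, at the cost of memory and start-up time, exactly as the comment in the question's snippet says.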

Depending on how you need to scale up your job, you might need to run multiple independent threads that read/decode/preprocess images and dump them in your example queue. A complete example of such a pipeline is provided in the Inception/ImageNet model. Take a look at batch_inputs:


https://github.com/tensorflow/models/blob/master/inception/inception/image_processing.py#L407

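The multiple-threads idea has a direct analogue in plain Python: several worker threads read/decode/preprocess files and dump results into one shared example queue. A toy sketch with the standard library (`preprocess` is a made-up stand-in for the real decode step):

```python
import queue
import threading

def preprocess(name):
    # Stand-in for reading, decoding and preprocessing one image file.
    return name.lower()

def worker(filenames, example_queue):
    for name in filenames:
        example_queue.put(preprocess(name))

example_queue = queue.Queue()
files = [f"IMG_{i}.JPG" for i in range(6)]

# Two independent preprocessing threads feeding one example queue.
threads = [
    threading.Thread(target=worker, args=(files[0::2], example_queue)),
    threading.Thread(target=worker, args=(files[1::2], example_queue)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

examples = sorted(example_queue.get() for _ in range(len(files)))
print(examples[0])  # img_0.jpg
```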

Finally, if you are working with >O(1000) JPEG images, keep in mind that it is extremely inefficient to individually read thousands of small files. This will slow down your training quite a bit.


A more robust and faster solution is to convert the dataset of images to a sharded TFRecord of Example protos. Here is a fully worked script for converting the ImageNet data set to such a format. And here is a set of instructions for running a generic version of this preprocessing script on an arbitrary directory containing JPEG images.

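Sharding itself is just a policy for spreading records across N files so that readers can pull from many large files instead of thousands of tiny ones. A toy round-robin assignment (one inner list per would-be TFRecord file; the writing step is omitted):

```python
def shard(records, num_shards):
    # Assign each record to a shard round-robin, as a stand-in for
    # writing one TFRecord file per shard.
    shards = [[] for _ in range(num_shards)]
    for i, rec in enumerate(records):
        shards[i % num_shards].append(rec)
    return shards

shards = shard([f"IMG_{i}.JPG" for i in range(10)], num_shards=4)
print([len(s) for s in shards])  # [3, 3, 2, 2]
```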