Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me) on StackOverflow.
Original question: http://stackoverflow.com/questions/37151895/
TensorFlow - Read all examples from a TFRecords at once?
Asked by golmschenk
How do you read all examples from a TFRecords at once?
I've been using tf.parse_single_example to read out individual examples, using code similar to that given in the read_and_decode method in the fully_connected_reader example. However, I want to run the network against my entire validation dataset at once, and so would like to load the examples in their entirety instead.
I'm not entirely sure, but the documentation seems to suggest I can use tf.parse_example instead of tf.parse_single_example to load the entire TFRecords file at once. I can't seem to get this to work, though. I'm guessing it has to do with how I specify the features, but I'm not sure how to state in the feature specification that there are multiple examples.
In other words, my attempt at using something similar to:
reader = tf.TFRecordReader()
_, serialized_example = reader.read(filename_queue)
features = tf.parse_example(serialized_example, features={
    'image_raw': tf.FixedLenFeature([], tf.string),
    'label': tf.FixedLenFeature([], tf.int64),
})
isn't working, and I assume it's because the features aren't expecting multiple examples at once (but again, I'm not sure). [This results in an error of ValueError: Shape () must have rank 1.]
Is this the proper way to read all the records at once? And if so, what do I need to change to actually read the records? Thank you much!
Answered by Andrew Pierno
Just for clarity: I have a few thousand images in a single .tfrecords file; they're 720 by 720 RGB PNG files. The labels are one of 0, 1, 2, or 3.
I also tried using parse_example and couldn't make it work, but this solution works with parse_single_example.
The downside is that right now I have to know how many items are in each .tfrecords file, which is kind of a bummer. If I find a better way, I'll update the answer. Also, be careful not to go past the number of records in the .tfrecords file: if you loop past the last record, it will start over at the first one.
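For reference, one way to get that count beforehand is to scan the file once with tf.python_io.tf_record_iterator; a minimal sketch (the path is a placeholder):

import tensorflow as tf

def count_records(tfrecords_path):
    # Iterate over the serialized records once and count them.
    return sum(1 for _ in tf.python_io.tf_record_iterator(tfrecords_path))

num_records = count_records('/path/to/train-0.tfrecords')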
The trick was to have the queue runner use a coordinator.
I left some code in here to save the images as they're being read in so that you can verify the image is correct.
from PIL import Image
import numpy as np
import tensorflow as tf

def read_and_decode(filename_queue):
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)
    features = tf.parse_single_example(
        serialized_example,
        # Defaults are not specified since both keys are required.
        features={
            'image_raw': tf.FixedLenFeature([], tf.string),
            'label': tf.FixedLenFeature([], tf.int64),
            'height': tf.FixedLenFeature([], tf.int64),
            'width': tf.FixedLenFeature([], tf.int64),
            'depth': tf.FixedLenFeature([], tf.int64)
        })
    image = tf.decode_raw(features['image_raw'], tf.uint8)
    label = tf.cast(features['label'], tf.int32)
    height = tf.cast(features['height'], tf.int32)
    width = tf.cast(features['width'], tf.int32)
    depth = tf.cast(features['depth'], tf.int32)
    return image, label, height, width, depth

def get_all_records(FILE):
    with tf.Session() as sess:
        filename_queue = tf.train.string_input_producer([FILE])
        image, label, height, width, depth = read_and_decode(filename_queue)
        # tf.pack was renamed tf.stack in TF 1.0.
        image = tf.reshape(image, tf.pack([height, width, 3]))
        image.set_shape([720, 720, 3])
        init_op = tf.initialize_all_variables()
        sess.run(init_op)
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(coord=coord)
        # 2053 is the number of records in this particular file; see the note above.
        for i in range(2053):
            example, l = sess.run([image, label])
            img = Image.fromarray(example, 'RGB')
            img.save("output/" + str(i) + '-train.png')
            print(example, l)
        coord.request_stop()
        coord.join(threads)

get_all_records('/path/to/train-0.tfrecords')
Answered by sygi
To read all the data just once, you need to pass num_epochs to the string_input_producer. When all the records have been read, the .read method of the reader will throw an error (tf.errors.OutOfRangeError), which you can catch. Simplified example:
import tensorflow as tf

def read_and_decode(filename_queue):
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)
    features = tf.parse_single_example(
        serialized_example,
        features={
            'image_raw': tf.FixedLenFeature([], tf.string)
        })
    image = tf.decode_raw(features['image_raw'], tf.uint8)
    return image

def get_all_records(FILE):
    with tf.Session() as sess:
        filename_queue = tf.train.string_input_producer([FILE], num_epochs=1)
        image = read_and_decode(filename_queue)
        # num_epochs is tracked with a local variable, so local variables
        # need initializing as well as the global ones.
        init_op = tf.group(tf.initialize_all_variables(),
                           tf.initialize_local_variables())
        sess.run(init_op)
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(coord=coord)
        try:
            while True:
                example = sess.run([image])
        except tf.errors.OutOfRangeError as e:
            coord.request_stop(e)
        finally:
            coord.request_stop()
            coord.join(threads)

get_all_records('/path/to/train-0.tfrecords')
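If you want to keep the examples rather than just run through them, a small variation of the try/while loop inside get_all_records (a sketch) accumulates them in a Python list:

images = []
try:
    while True:
        # Each sess.run fetches one decoded example until the queue is exhausted.
        images.append(sess.run(image))
except tf.errors.OutOfRangeError:
    pass  # all records have been consumed after one epoch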
And to use tf.parse_example (which is faster than tf.parse_single_example) you need to first batch the examples like this:
batch = tf.train.batch([serialized_example], num_examples, capacity=num_examples)
parsed_examples = tf.parse_example(batch, feature_spec)
Unfortunately, this way you'd need to know the number of examples beforehand.
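For context, here is a fuller, self-contained sketch of that pattern (my assumption: num_examples is known beforehand, and the features are the image_raw/label pair from the question):

import tensorflow as tf

num_examples = 1000  # assumed to be known beforehand

filename_queue = tf.train.string_input_producer(['/path/to/train-0.tfrecords'])
reader = tf.TFRecordReader()
_, serialized_example = reader.read(filename_queue)

# Collect num_examples serialized protos into one rank-1 batch,
# then parse them all with a single tf.parse_example call.
batch = tf.train.batch([serialized_example], num_examples, capacity=num_examples)
parsed_examples = tf.parse_example(batch, features={
    'image_raw': tf.FixedLenFeature([], tf.string),
    'label': tf.FixedLenFeature([], tf.int64),
})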
Answered by Salvador Dali
If you need to read all the data from a TFRecord at once, you can write a much simpler solution in just a few lines of code using tf_record_iterator:
An iterator that reads the records from a TFRecords file.
To do this, you just:
- create an example
- iterate over records from the iterator
- parse each record and read each feature depending on its type
Here is an example with an explanation of how to read each type.
example = tf.train.Example()
for record in tf.python_io.tf_record_iterator(<tfrecord_file>):
    example.ParseFromString(record)
    f = example.features.feature
    v1 = f['int64 feature'].int64_list.value[0]
    v2 = f['float feature'].float_list.value[0]
    v3 = f['bytes feature'].bytes_list.value[0]
    # For bytes you might want to represent them in a different way
    # (based on what they were before saving), e.g.
    # np.fromstring(f['img'].bytes_list.value[0], dtype=np.uint8)
    # Now do something with your v1/v2/v3
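As a further sketch, here is how this approach might look with the image_raw/label layout from the question (feature names taken from the question; assuming image_raw was written as raw uint8 bytes, as in the other answers):

import numpy as np
import tensorflow as tf

images, labels = [], []
for record in tf.python_io.tf_record_iterator('/path/to/train-0.tfrecords'):
    example = tf.train.Example()
    example.ParseFromString(record)
    f = example.features.feature
    labels.append(f['label'].int64_list.value[0])
    # image_raw holds raw uint8 bytes, so decode it back into an array.
    images.append(np.fromstring(f['image_raw'].bytes_list.value[0], dtype=np.uint8))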
Answered by Shen Fei
You can also use tf.python_io.tf_record_iterator to manually iterate over all the examples in a TFRecord.
I tested it with the illustrative code below:
import tensorflow as tf

X = [[1, 2],
     [3, 4],
     [5, 6]]

def _int_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))

def dump_tfrecord(data, out_file):
    writer = tf.python_io.TFRecordWriter(out_file)
    for x in data:
        example = tf.train.Example(
            features=tf.train.Features(feature={
                'x': _int_feature(x)
            })
        )
        writer.write(example.SerializeToString())
    writer.close()

def load_tfrecord(file_name):
    features = {'x': tf.FixedLenFeature([2], tf.int64)}
    data = []
    for s_example in tf.python_io.tf_record_iterator(file_name):
        example = tf.parse_single_example(s_example, features=features)
        data.append(tf.expand_dims(example['x'], 0))
    return tf.concat(0, data)

if __name__ == "__main__":
    dump_tfrecord(X, 'test_tfrecord')
    print('dump ok')
    data = load_tfrecord('test_tfrecord')
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        Y = sess.run([data])
        print(Y)
Of course, you have to use your own feature specification.
The disadvantage is that I don't know how to use multiple threads this way. However, the occasion where we most often read all examples is when evaluating the validation dataset, which is usually not very big, so efficiency is probably not a bottleneck.
I also hit another issue when testing this, which is that I have to specify the feature length. Instead of tf.FixedLenFeature([], tf.int64), I have to write tf.FixedLenFeature([2], tf.int64); otherwise, an InvalidArgumentError occurred. I have no idea how to avoid this.
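One possible way around the fixed length (an assumption on my part, not something verified in this test) is tf.VarLenFeature, which parses the feature as a SparseTensor of whatever length was written:

features = {'x': tf.VarLenFeature(tf.int64)}
example = tf.parse_single_example(s_example, features=features)
# VarLenFeature yields a SparseTensor; convert it back to a dense tensor.
dense_x = tf.sparse_tensor_to_dense(example['x'])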
Python: 3.4
Tensorflow: 0.12.0
Answered by user2538491
I don't know whether this is still an active topic, but I'd like to share the best practice I know so far, even though the question is from a year ago.
In TensorFlow, we have a very useful method for a problem like this: read or iterate over the whole bunch of input data and randomly generate training or testing datasets. tf.train.shuffle_batch can generate a dataset based on an input stream (like reader.read()). For example, you can generate batches of 1000 examples by providing an argument list like this:
reader = tf.TFRecordReader()
_, serialized = reader.read(filename_queue)
features = tf.parse_single_example(
    serialized,
    features={
        'label': tf.FixedLenFeature([], tf.string),
        'image': tf.FixedLenFeature([], tf.string)
    }
)
record_image = tf.decode_raw(features['image'], tf.uint8)
image = tf.reshape(record_image, [500, 500, 1])
label = tf.cast(features['label'], tf.string)

min_after_dequeue = 10
batch_size = 1000
capacity = min_after_dequeue + 3 * batch_size
image_batch, label_batch = tf.train.shuffle_batch(
    [image, label], batch_size=batch_size, capacity=capacity,
    min_after_dequeue=min_after_dequeue
)
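A minimal usage sketch for the batch above (assuming filename_queue was created with tf.train.string_input_producer, as in the other answers):

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    # Each run returns one shuffled batch of 1000 images and labels.
    images, labels = sess.run([image_batch, label_batch])
    coord.request_stop()
    coord.join(threads)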
Answered by user2538491
Besides, if you don't think tf.train.shuffle_batch is what you need, you may also try a combination of tf.TFRecordReader().read_up_to() and tf.parse_example(). Here's an example for your reference:
import glob

import tensorflow as tf

def read_tfrecords(folder_name, bs):
    filename_queue = tf.train.string_input_producer(
        tf.train.match_filenames_once(glob.glob(folder_name + "/*.tfrecords")))
    reader = tf.TFRecordReader()
    # read_up_to returns up to bs serialized examples at once.
    _, serialized = reader.read_up_to(filename_queue, bs)
    features = tf.parse_example(
        serialized,
        features={
            'label': tf.FixedLenFeature([], tf.string),
            'image': tf.FixedLenFeature([], tf.string)
        }
    )
    record_image = tf.decode_raw(features['image'], tf.uint8)
    image = tf.reshape(record_image, [-1, 250, 151, 1])
    label = tf.cast(features['label'], tf.string)
    return image, label
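And a usage sketch for this function (the folder name is a placeholder; note that match_filenames_once creates a local variable, which needs initializing):

image_batch, label_batch = read_tfrecords('/path/to/records', bs=1000)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.local_variables_initializer())  # for match_filenames_once
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    # read_up_to may return fewer than bs examples if the queue runs short.
    imgs, lbls = sess.run([image_batch, label_batch])
    coord.request_stop()
    coord.join(threads)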