Python: How to *actually* read CSV data in TensorFlow?

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/37091899/


How to *actually* read CSV data in TensorFlow?

Tags: python, csv, tensorflow

Asked by Rob

I'm relatively new to the world of TensorFlow, and pretty perplexed by how you'd *actually* read CSV data into usable example/label tensors in TensorFlow. The example from the TensorFlow tutorial on reading CSV data is pretty fragmented and only gets you part of the way to being able to train on CSV data.


Here's my code that I've pieced together, based off that CSV tutorial:


from __future__ import print_function
import tensorflow as tf

def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

filename = "csv_test_data.csv"

# setup text reader
file_length = file_len(filename)
filename_queue = tf.train.string_input_producer([filename])
reader = tf.TextLineReader(skip_header_lines=1)
_, csv_row = reader.read(filename_queue)

# setup CSV decoding
record_defaults = [[0],[0],[0],[0],[0]]
col1,col2,col3,col4,col5 = tf.decode_csv(csv_row, record_defaults=record_defaults)

# turn features back into a tensor
features = tf.stack([col1,col2,col3,col4])

print("loading, " + str(file_length) + " line(s)\n")
with tf.Session() as sess:
  tf.initialize_all_variables().run()

  # start populating filename queue
  coord = tf.train.Coordinator()
  threads = tf.train.start_queue_runners(coord=coord)

  for i in range(file_length):
    # retrieve a single instance
    example, label = sess.run([features, col5])
    print(example, label)

  coord.request_stop()
  coord.join(threads)
  print("\ndone loading")

And here is a brief example from the CSV file I'm loading - pretty basic data - 4 feature columns, and 1 label column:


0,0,0,0,0
0,15,0,0,0
0,30,0,0,0
0,45,0,0,0

All the code above does is print each example from the CSV file, one by one, which, while nice, is pretty darn useless for training.


What I'm struggling with here is how you'd actually turn those individual examples, loaded one-by-one, into a training dataset. For example, here's a notebook I was working on in the Udacity Deep Learning course. I basically want to take the CSV data I'm loading, and plop it into something like train_dataset and train_labels:


def reformat(dataset, labels):
  dataset = dataset.reshape((-1, image_size * image_size)).astype(np.float32)
  # Map 2 to [0.0, 1.0, 0.0 ...], 3 to [0.0, 0.0, 1.0 ...]
  labels = (np.arange(num_labels) == labels[:,None]).astype(np.float32)
  return dataset, labels
train_dataset, train_labels = reformat(train_dataset, train_labels)
valid_dataset, valid_labels = reformat(valid_dataset, valid_labels)
test_dataset, test_labels = reformat(test_dataset, test_labels)
print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)

I've tried using tf.train.shuffle_batch, like this, but it just inexplicably hangs:


  for i in range(file_length):
    # retrieve a single instance
    example, label = sess.run([features, colRelevant])
    example_batch, label_batch = tf.train.shuffle_batch([example, label], batch_size=file_length, capacity=file_length, min_after_dequeue=10000)
    print(example, label)

So to sum up, here are my questions:


  • What am I missing about this process?
    • It feels like there is some key intuition that I'm missing about how to properly build an input pipeline.
  • Is there a way to avoid having to know the length of the CSV file?
    • It feels pretty inelegant to have to know the number of lines you want to process (the for i in range(file_length) loop above)


Edit: As soon as Yaroslav pointed out that I was likely mixing up imperative and graph-construction parts here, it started to become clearer. I was able to pull together the following code, which I think is closer to what would typically be done when training a model from CSV (excluding any model training code):


from __future__ import print_function
import numpy as np
import tensorflow as tf
import math as math
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('dataset')
args = parser.parse_args()

def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

def read_from_csv(filename_queue):
  reader = tf.TextLineReader(skip_header_lines=1)
  _, csv_row = reader.read(filename_queue)
  record_defaults = [[0],[0],[0],[0],[0]]
  colHour,colQuarter,colAction,colUser,colLabel = tf.decode_csv(csv_row, record_defaults=record_defaults)
  features = tf.stack([colHour,colQuarter,colAction,colUser])  
  label = tf.stack([colLabel])  
  return features, label

def input_pipeline(batch_size, num_epochs=None):
  filename_queue = tf.train.string_input_producer([args.dataset], num_epochs=num_epochs, shuffle=True)  
  example, label = read_from_csv(filename_queue)
  min_after_dequeue = 10000
  capacity = min_after_dequeue + 3 * batch_size
  example_batch, label_batch = tf.train.shuffle_batch(
      [example, label], batch_size=batch_size, capacity=capacity,
      min_after_dequeue=min_after_dequeue)
  return example_batch, label_batch

file_length = file_len(args.dataset) - 1
examples, labels = input_pipeline(file_length, 1)

with tf.Session() as sess:
  tf.initialize_all_variables().run()

  # start populating filename queue
  coord = tf.train.Coordinator()
  threads = tf.train.start_queue_runners(coord=coord)

  try:
    while not coord.should_stop():
      example_batch, label_batch = sess.run([examples, labels])
      print(example_batch)
  except tf.errors.OutOfRangeError:
    print('Done training, epoch reached')
  finally:
    coord.request_stop()

  coord.join(threads) 
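For completeness, here is a rough sketch (not part of the original question) of how the example_batch/label_batch tensors from input_pipeline above could drive an actual training step. The model is an arbitrary linear regression chosen purely for illustration; all variable names and hyperparameters here are made up.

examples, labels = input_pipeline(batch_size=128, num_epochs=5)

# toy linear model over the 4 integer feature columns
weights = tf.Variable(tf.truncated_normal([4, 1], stddev=0.1))
bias = tf.Variable(tf.zeros([1]))
predictions = tf.matmul(tf.to_float(examples), weights) + bias
loss = tf.reduce_mean(tf.square(predictions - tf.to_float(labels)))
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

with tf.Session() as sess:
  tf.initialize_all_variables().run()

  coord = tf.train.Coordinator()
  threads = tf.train.start_queue_runners(coord=coord)

  try:
    while not coord.should_stop():
      _, loss_value = sess.run([train_op, loss])  # one batch per step
  except tf.errors.OutOfRangeError:
    print('Done training, epoch limit reached')
  finally:
    coord.request_stop()

  coord.join(threads)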

Accepted answer by Yaroslav Bulatov

I think you are mixing up imperative and graph-construction parts here. The operation tf.train.shuffle_batch creates a new queue node, and a single node can be used to process the entire dataset. So I think you are hanging because you created a bunch of shuffle_batch queues in your for loop and didn't start queue runners for them.


Normal input pipeline usage looks like this:


  1. Add nodes like shuffle_batch to the input pipeline
  2. (optional, to prevent unintentional graph modification) finalize the graph

--- end of graph construction, beginning of imperative programming --


  1. tf.train.start_queue_runners
  2. while(True): session.run()

To be more scalable (to avoid the Python GIL), you could generate all of your data using a TensorFlow pipeline. However, if performance is not critical, you can hook up a numpy array to an input pipeline by using slice_input_producer. Here's an example with some Print nodes to see what's going on (messages from Print go to stdout when the node is run):


import numpy as np
import tensorflow as tf

tf.reset_default_graph()

num_examples = 5
num_features = 2
data = np.reshape(np.arange(num_examples*num_features), (num_examples, num_features))
print(data)

(data_node,) = tf.train.slice_input_producer([tf.constant(data)], num_epochs=1, shuffle=False)
data_node_debug = tf.Print(data_node, [data_node], "Dequeueing from data_node ")
data_batch = tf.train.batch([data_node_debug], batch_size=2)
data_batch_debug = tf.Print(data_batch, [data_batch], "Dequeueing from data_batch ")

sess = tf.InteractiveSession()
sess.run(tf.initialize_all_variables())
sess.run(tf.initialize_local_variables())  # the num_epochs counter is a local variable on newer versions
tf.get_default_graph().finalize()
tf.train.start_queue_runners()

try:
  while True:
    print(sess.run(data_batch_debug))
except tf.errors.OutOfRangeError as e:
  print("No more inputs.")

You should see something like this


[[0 1]
 [2 3]
 [4 5]
 [6 7]
 [8 9]]
[[0 1]
 [2 3]]
[[4 5]
 [6 7]]
No more inputs.

The "8, 9" numbers didn't fill up a full batch, so they didn't get produced. Also, tf.Print messages are printed to sys.stdout, so they show up separately in the terminal for me.

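If you do want that trailing partial batch, later TF 1.x versions of the batching ops accept an allow_smaller_final_batch flag; a one-line variation of the batching line above:

data_batch = tf.train.batch([data_node_debug], batch_size=2,
                            allow_smaller_final_batch=True)  # also emits the final [8 9]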

PS: a minimal example of connecting batch to a manually initialized queue is in GitHub issue 2193.

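For reference, here is a rough sketch of that idea (my own reconstruction, not the code from the issue): a FIFOQueue that is filled and closed by hand, with its dequeue op wired into tf.train.batch.

import numpy as np
import tensorflow as tf

data = np.arange(10, dtype=np.float32)

queue = tf.FIFOQueue(capacity=10, dtypes=[tf.float32], shapes=[()])
enqueue_placeholder = tf.placeholder(tf.float32, shape=[None])
enqueue_op = queue.enqueue_many([enqueue_placeholder])
close_op = queue.close()

item = queue.dequeue()
batch = tf.train.batch([item], batch_size=2)

with tf.Session() as sess:
  # fill and close the queue imperatively, then let the batch queue runner drain it
  sess.run(enqueue_op, feed_dict={enqueue_placeholder: data})
  sess.run(close_op)
  coord = tf.train.Coordinator()
  threads = tf.train.start_queue_runners(sess=sess, coord=coord)
  try:
    while True:
      print(sess.run(batch))
  except tf.errors.OutOfRangeError:
    print("No more inputs.")
  coord.request_stop()
  coord.join(threads)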

Also, for debugging purposes you might want to set a timeout on your session so that your IPython notebook doesn't hang on empty queue dequeues. I use this helper function for my sessions:


def create_session():
  config = tf.ConfigProto(log_device_placement=True)
  config.gpu_options.per_process_gpu_memory_fraction=0.3 # don't hog all vRAM
  config.operation_timeout_in_ms=60000   # terminate on long hangs
  # create interactive session to register a default session
  sess = tf.InteractiveSession("", config=config)
  return sess

Scalability Notes:


  1. tf.constant inlines a copy of your data into the graph. There's a fundamental 2GB limit on the size of the graph definition, so that's an upper limit on the size of your data.
  2. You could get around that limit by using v = tf.Variable and saving the data into it by running an assign op on v with a tf.placeholder on the right-hand side, feeding the numpy array to the placeholder (feed_dict); see the sketch after this list.
  3. That still creates two copies of the data, so to save memory you could make your own version of slice_input_producer which operates on numpy arrays and uploads rows one at a time using feed_dict.
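A small sketch of point 2 (names and shapes are illustrative, loosely following the old preloaded-data pattern): the numpy array is pushed through a placeholder into a variable, so it never gets baked into the graph definition as a constant.

import numpy as np
import tensorflow as tf

big_array = np.random.rand(10000, 4).astype(np.float32)

data_placeholder = tf.placeholder(tf.float32, shape=big_array.shape)
data_var = tf.Variable(tf.zeros(big_array.shape), trainable=False, collections=[])
assign_op = data_var.assign(data_placeholder)

(row,) = tf.train.slice_input_producer([data_var], shuffle=False)
batch = tf.train.batch([row], batch_size=32)

sess = tf.Session()
sess.run(assign_op, feed_dict={data_placeholder: big_array})  # upload the data once
tf.train.start_queue_runners(sess=sess)
print(sess.run(batch).shape)  # (32, 4)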

Answer by Nagarjun Gururaj

Or you could try this: the code loads the Iris dataset into TensorFlow using pandas and numpy, and a simple one-neuron output is printed in the session. Hope it helps with a basic understanding. [I haven't added the one-hot decoding of the labels; a rough sketch of that follows the code below.]


import tensorflow as tf 
import numpy
import pandas as pd
df=pd.read_csv('/home/nagarjun/Desktop/Iris.csv',usecols = [0,1,2,3,4],skiprows = [0],header=None)
d = df.values
l = pd.read_csv('/home/nagarjun/Desktop/Iris.csv',usecols = [5] ,header=None)
labels = l.values
data = numpy.float32(d)
labels = numpy.array(l,'str')
#print data, labels

#tensorflow
x = tf.placeholder(tf.float32,shape=(150,5))
x = data
w = tf.random_normal([100,150],mean=0.0, stddev=1.0, dtype=tf.float32)
y = tf.nn.softmax(tf.matmul(w,x))

with tf.Session() as sess:
    print(sess.run(y))
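A possible way to produce the one-hot labels that the snippet above skips (assuming column 5 of Iris.csv holds the species name as a string; the column index and file path are taken from the code above):

import numpy as np
import pandas as pd

l = pd.read_csv('/home/nagarjun/Desktop/Iris.csv', usecols=[5], skiprows=[0], header=None)
class_names = sorted(l[5].unique())                     # e.g. the three species names
class_index = {name: i for i, name in enumerate(class_names)}
indices = l[5].map(class_index).values                  # integer class ids
one_hot_labels = np.eye(len(class_names), dtype=np.float32)[indices]
print(one_hot_labels.shape)                             # (num_rows, num_classes)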

Answer by Adarsh Kumar

You can use the newer tf.data API:


dataset = tf.contrib.data.make_csv_dataset(filepath, batch_size=32)  # batch_size is required
iterator = dataset.make_initializable_iterator()
columns = iterator.get_next()
with tf.Session() as sess:
   sess.run([iterator.initializer])
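As written, the snippet above only initializes the iterator and never pulls a batch. A self-contained sketch of actually consuming the dataset (filepath and batch_size are placeholders for your own values; num_epochs=1 makes the loop terminate, since the default repeats indefinitely):

dataset = tf.contrib.data.make_csv_dataset(filepath, batch_size=32, num_epochs=1)
iterator = dataset.make_initializable_iterator()
columns = iterator.get_next()  # dict of column name -> batched tensor
with tf.Session() as sess:
    sess.run(iterator.initializer)
    try:
        while True:
            print(sess.run(columns))
    except tf.errors.OutOfRangeError:
        pass  # end of the single epoch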

Answer by Hasan Rafiq

If anyone came here searching for a simple way to read absolutely large and sharded CSV files with the tf.estimator API, please see my code below:


CSV_COLUMNS = ['ID','text','class']
LABEL_COLUMN = 'class'
DEFAULTS = [['x'],['no'],[0]]  #Default values

def read_dataset(filename, mode, batch_size = 512):
    def _input_fn(v_test=False):
#         def decode_csv(value_column):
#             columns = tf.decode_csv(value_column, record_defaults = DEFAULTS)
#             features = dict(zip(CSV_COLUMNS, columns))
#             label = features.pop(LABEL_COLUMN)
#             return add_engineered(features), label

        # Create list of files that match pattern
        file_list = tf.gfile.Glob(filename)

        # Create dataset from file list
        #dataset = tf.data.TextLineDataset(file_list).map(decode_csv)
        dataset = tf.contrib.data.make_csv_dataset(file_list,
                                                   batch_size=batch_size,
                                                   column_names=CSV_COLUMNS,
                                                   column_defaults=DEFAULTS,
                                                   label_name=LABEL_COLUMN)

        if mode == tf.estimator.ModeKeys.TRAIN:
            num_epochs = None # indefinitely
            dataset = dataset.shuffle(buffer_size = 10 * batch_size)
        else:
            num_epochs = 1 # end-of-input after this

        batch_features, batch_labels = dataset.make_one_shot_iterator().get_next()

        #Begins - Uncomment for testing only -----------------------------------------------------<
        if v_test == True:
            with tf.Session() as sess:
                print(sess.run(batch_features))
        #End - Uncomment for testing only -----------------------------------------------------<
        return add_engineered(batch_features), batch_labels
    return _input_fn

Example usage in TF.estimator:


train_spec = tf.estimator.TrainSpec(input_fn = read_dataset(
                                                filename = train_file,
                                                mode = tf.estimator.ModeKeys.TRAIN,
                                                batch_size = 128), 
                                      max_steps = num_train_steps)
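To complete the picture, a sketch of the matching evaluation spec and the train-and-evaluate driver call (my_estimator and eval_file are placeholders for your own estimator and evaluation data):

eval_spec = tf.estimator.EvalSpec(input_fn = read_dataset(
                                                filename = eval_file,
                                                mode = tf.estimator.ModeKeys.EVAL,
                                                batch_size = 128),
                                  steps = None,
                                  throttle_secs = 300)

tf.estimator.train_and_evaluate(my_estimator, train_spec, eval_spec)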

Answer by Tensorflow Support

2.0-compatible solution: this answer may already be covered by others in the thread above, but I will provide additional links that will help the community.


dataset = tf.data.experimental.make_csv_dataset(
      file_path,
      batch_size=5, # Artificially small to make examples easier to show.
      label_name=LABEL_COLUMN,
      na_value="?",
      num_epochs=1,
      ignore_errors=True, 
      **kwargs)
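In TF 2.x the resulting dataset can simply be iterated eagerly; a small sketch reusing the dataset and LABEL_COLUMN defined for the snippet above:

for feature_batch, label_batch in dataset.take(2):
    print("labels:", label_batch.numpy())
    for name, values in feature_batch.items():
        print(name, values.numpy())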

For more information, please refer to this TensorFlow tutorial.
