pandas: How to use a very large dataset in RNN TensorFlow?
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use it, you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/45298988/
How to use very large dataset in RNN TensorFlow?
Asked by afagarap
I have a very large dataset: 7.9 GB of CSV files. 80% of it shall serve as training data, and the remaining 20% as test data. When loading the training data (6.2 GB), I get a MemoryError at the 80th iteration (80th file). Here is the script I am using to load the data:
import pandas as pd
import os

col_names = ['duration', 'service', 'src_bytes', 'dest_bytes', 'count', 'same_srv_rate',
             'serror_rate', 'srv_serror_rate', 'dst_host_count', 'dst_host_srv_count',
             'dst_host_same_src_port_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate',
             'flag', 'ids_detection', 'malware_detection', 'ashula_detection', 'label', 'src_ip_add',
             'src_port_num', 'dst_ip_add', 'dst_port_num', 'start_time', 'protocol']

# create a list to store the filenames
files = []
# create a dataframe to store the contents of CSV files
df = pd.DataFrame()

# get the filenames in the specified PATH (`path` points to the dataset directory)
for (dirpath, dirnames, filenames) in os.walk(path):
    '''Append to the list the filenames under the subdirectories of the <path>'''
    files.extend(os.path.join(dirpath, filename) for filename in filenames)

# read every CSV file and append its rows to the single dataframe
for file in files:
    df = df.append(pd.read_csv(filepath_or_buffer=file, names=col_names, engine='python'))
    print('Appending file : {file}'.format(file=file))

pd.set_option('display.max_colwidth', -1)
print(df)
There are 130 files in the 6.2 GB of CSV data.
Answered by Nyps
For large datasets - and we may already count 6.2GB as large - reading all the data in at once might not be the best idea. As you are going to train your network batch by batch anyway, it is sufficient to only load the data you need for the batch which is going to be used next.
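A minimal sketch of this batch-wise idea with pandas itself, using read_csv's chunksize argument so that only one chunk of rows sits in memory at a time; the directory path and batch size below are hypothetical, and col_names refers to the column list from the question's script:

```python
import os

import pandas as pd

BATCH_SIZE = 1024                 # hypothetical number of rows per training batch
path = '/data/kyoto-csv'          # hypothetical directory holding the CSV files

# Collect all CSV file paths under the directory tree, as in the question.
csv_files = [os.path.join(dirpath, name)
             for dirpath, _, filenames in os.walk(path)
             for name in filenames]

for csv_file in csv_files:
    # chunksize turns read_csv into an iterator of DataFrames, so only
    # BATCH_SIZE rows are held in memory at any one time.
    for chunk in pd.read_csv(csv_file, names=col_names, chunksize=BATCH_SIZE):
        labels = chunk['label'].values
        features = chunk.drop('label', axis=1).values
        # feed `features` / `labels` to one training step here
```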
The TensorFlow documentation provides a good overview of how to implement a data reading pipeline; a minimal sketch follows the list below. According to the linked documentation, the stages are:
- The list of filenames
- Optional filename shuffling
- Optional epoch limit
- Filename queue
- A Reader for the file format
- A decoder for a record read by the reader
- Optional preprocessing
- Example queue
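Here is a minimal sketch of such a queue-based pipeline with the TF 1.x API; the glob pattern, batch size, and the assumption of 24 numeric columns are purely illustrative (the string columns in the question's data would need string defaults and extra parsing):

```python
import glob

import tensorflow as tf

# Stages 1-4: list of filenames, optional shuffling and epoch limit, filename queue.
filenames = glob.glob('/data/kyoto-csv/*.csv')        # hypothetical glob pattern
filename_queue = tf.train.string_input_producer(
    filenames, num_epochs=10, shuffle=True)

# Stage 5: a reader for the file format (one CSV line per read).
reader = tf.TextLineReader()
_, csv_row = reader.read(filename_queue)

# Stage 6: a decoder for the record; record_defaults fixes the column types.
record_defaults = [[0.0]] * 24
columns = tf.decode_csv(csv_row, record_defaults=record_defaults)
features = tf.stack(columns[:-1])
label = columns[-1]

# Stages 7-8: optional preprocessing, then the example queue.
feature_batch, label_batch = tf.train.shuffle_batch(
    [features, label], batch_size=128, capacity=10000, min_after_dequeue=1000)
```

In TF 1.x these ops only produce data inside a session after tf.train.start_queue_runners() has been called; the tf.data approach mentioned in the next answer has since become the preferred way to build such pipelines.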
Answered by John Scolaro
I second Nyps's answer; I just don't have enough reputation to add a comment yet. Additionally, it might be interesting for you to open Task Manager (or an equivalent tool) and observe your system's memory usage as you run this. I would guess that the error appears at the point where your RAM fills up completely.
TensorFlow supports queues, which allow you to only read portions of data at once, in order to not exhaust your memory. Examples for this are in the documentation that Nyps linked. Also, TensorFlow has recently added a new way to handle input datasets in TensorFlow Dataset docs.
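A minimal sketch of the same pipeline with the Dataset API (tf.data, available since TensorFlow 1.4); the glob pattern and the all-numeric record defaults are again assumptions:

```python
import glob

import tensorflow as tf

filenames = glob.glob('/data/kyoto-csv/*.csv')   # hypothetical glob pattern

def parse_line(line):
    # Assume 24 numeric columns purely for illustration; the real data
    # needs per-column defaults matching its actual types.
    record_defaults = [[0.0]] * 24
    columns = tf.decode_csv(line, record_defaults=record_defaults)
    return tf.stack(columns[:-1]), columns[-1]

dataset = (tf.data.TextLineDataset(filenames)
           .map(parse_line)
           .shuffle(buffer_size=10000)
           .repeat()        # iterate over the data indefinitely
           .batch(128))     # only one batch is materialized at a time

features, labels = dataset.make_one_shot_iterator().get_next()
```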
Also, I would suggest converting all your data to TensorFlow's TFRecord format, as it saves space and can speed up data access by more than 100 times compared to converting the CSV files to tensors at training time.
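A minimal sketch of such a conversion, writing each pandas row into a tf.train.Example; encoding every column as a float is an assumption made for brevity, and the string columns in the question's data would use bytes_list instead:

```python
import pandas as pd
import tensorflow as tf

def csv_to_tfrecord(csv_path, tfrecord_path, col_names):
    """Write every row of one CSV file into a TFRecord file of tf.train.Examples."""
    df = pd.read_csv(csv_path, names=col_names)
    with tf.python_io.TFRecordWriter(tfrecord_path) as writer:
        for _, row in df.iterrows():
            # Encode each column as a float feature purely for illustration.
            feature = {
                name: tf.train.Feature(
                    float_list=tf.train.FloatList(value=[float(row[name])]))
                for name in col_names
            }
            example = tf.train.Example(
                features=tf.train.Features(feature=feature))
            writer.write(example.SerializeToString())
```

At training time the TFRecord files can then be consumed with tf.data.TFRecordDataset and parsed with tf.parse_single_example, instead of re-parsing the CSVs on every run.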