pandas: How to use a very large dataset in RNN TensorFlow?
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use it, you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/45298988/
How to use very large dataset in RNN TensorFlow?
Asked by afagarap
I have a very large dataset: 7.9 GB of CSV files. 80% of it shall serve as training data, and the remaining 20% as test data. When loading the training data (6.2 GB), I get a MemoryError at the 80th iteration (80th file). Here is the script I am using to load the data:
import pandas as pd
import os

col_names = ['duration', 'service', 'src_bytes', 'dest_bytes', 'count', 'same_srv_rate',
             'serror_rate', 'srv_serror_rate', 'dst_host_count', 'dst_host_srv_count',
             'dst_host_same_src_port_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate',
             'flag', 'ids_detection', 'malware_detection', 'ashula_detection', 'label', 'src_ip_add',
             'src_port_num', 'dst_ip_add', 'dst_port_num', 'start_time', 'protocol']

# create a list to store the filenames
files = []
# create a dataframe to store the contents of CSV files
df = pd.DataFrame()

# get the filenames in the specified PATH (`path` points to the dataset directory)
for (dirpath, dirnames, filenames) in os.walk(path):
    '''Append to the list the filenames under the subdirectories of the <path>'''
    files.extend(os.path.join(dirpath, filename) for filename in filenames)

# read every CSV file and append its rows to the single dataframe
for file in files:
    df = df.append(pd.read_csv(filepath_or_buffer=file, names=col_names, engine='python'))
    print('Appending file : {file}'.format(file=file))

pd.set_option('display.max_colwidth', -1)
print(df)
There are 130 files in the 6.2 GB of CSV data.
Answered by Nyps
For large datasets - and we may already count 6.2GB as large - reading all the data in at once might not be the best idea. As you are going to train your network batch by batch anyway, it is sufficient to only load the data you need for the batch which is going to be used next.
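A minimal sketch of this batch-wise idea with pandas itself, using read_csv's chunksize argument so that only one chunk of rows sits in memory at a time; the directory path and batch size below are hypothetical, and col_names refers to the column list from the question's script:

```python
import os

import pandas as pd

BATCH_SIZE = 1024                 # hypothetical number of rows per training batch
path = '/data/kyoto-csv'          # hypothetical directory holding the CSV files

# Collect all CSV file paths under the directory tree, as in the question.
csv_files = [os.path.join(dirpath, name)
             for dirpath, _, filenames in os.walk(path)
             for name in filenames]

for csv_file in csv_files:
    # chunksize turns read_csv into an iterator of DataFrames, so only
    # BATCH_SIZE rows are held in memory at any one time.
    for chunk in pd.read_csv(csv_file, names=col_names, chunksize=BATCH_SIZE):
        labels = chunk['label'].values
        features = chunk.drop('label', axis=1).values
        # feed `features` / `labels` to one training step here
```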
The TensorFlow documentation provides a good overview of how to implement a data reading pipeline; a minimal sketch follows the list below. According to the linked documentation, the stages are:
- The list of filenames
- Optional filename shuffling
- Optional epoch limit
- Filename queue
- A Reader for the file format
- A decoder for a record read by the reader
- Optional preprocessing
- Example queue
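Here is a minimal sketch of such a queue-based pipeline with the TF 1.x API; the glob pattern, batch size, and the assumption of 24 numeric columns are purely illustrative (the string columns in the question's data would need string defaults and extra parsing):

```python
import glob

import tensorflow as tf

# Stages 1-4: list of filenames, optional shuffling and epoch limit, filename queue.
filenames = glob.glob('/data/kyoto-csv/*.csv')        # hypothetical glob pattern
filename_queue = tf.train.string_input_producer(
    filenames, num_epochs=10, shuffle=True)

# Stage 5: a reader for the file format (one CSV line per read).
reader = tf.TextLineReader()
_, csv_row = reader.read(filename_queue)

# Stage 6: a decoder for the record; record_defaults fixes the column types.
record_defaults = [[0.0]] * 24
columns = tf.decode_csv(csv_row, record_defaults=record_defaults)
features = tf.stack(columns[:-1])
label = columns[-1]

# Stages 7-8: optional preprocessing, then the example queue.
feature_batch, label_batch = tf.train.shuffle_batch(
    [features, label], batch_size=128, capacity=10000, min_after_dequeue=1000)
```

In TF 1.x these ops only produce data inside a session after tf.train.start_queue_runners() has been called; the tf.data approach mentioned in the next answer has since become the preferred way to build such pipelines.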
Answered by John Scolaro
I second Nyps's answer; I just don't have enough reputation to add a comment yet. Additionally, it might be interesting for you to open Task Manager (or an equivalent tool) and observe your system's memory usage as you run this. I would guess that the error appears at the point where your RAM fills up completely.
TensorFlow supports queues, which allow you to only read portions of data at once, in order to not exhaust your memory. Examples for this are in the documentation that Nyps linked. Also, TensorFlow has recently added a new way to handle input datasets in TensorFlow Dataset docs.
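A minimal sketch of the same pipeline with the Dataset API (tf.data, available since TensorFlow 1.4); the glob pattern and the all-numeric record defaults are again assumptions:

```python
import glob

import tensorflow as tf

filenames = glob.glob('/data/kyoto-csv/*.csv')   # hypothetical glob pattern

def parse_line(line):
    # Assume 24 numeric columns purely for illustration; the real data
    # needs per-column defaults matching its actual types.
    record_defaults = [[0.0]] * 24
    columns = tf.decode_csv(line, record_defaults=record_defaults)
    return tf.stack(columns[:-1]), columns[-1]

dataset = (tf.data.TextLineDataset(filenames)
           .map(parse_line)
           .shuffle(buffer_size=10000)
           .repeat()        # iterate over the data indefinitely
           .batch(128))     # only one batch is materialized at a time

features, labels = dataset.make_one_shot_iterator().get_next()
```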
Also, I would suggest converting all your data to TensorFlow's TFRecord format, as it saves space and can speed up data access by more than 100 times compared to converting the CSV files to tensors at training time.
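A minimal sketch of such a conversion, writing each pandas row into a tf.train.Example; encoding every column as a float is an assumption made for brevity, and the string columns in the question's data would use bytes_list instead:

```python
import pandas as pd
import tensorflow as tf

def csv_to_tfrecord(csv_path, tfrecord_path, col_names):
    """Write every row of one CSV file into a TFRecord file of tf.train.Examples."""
    df = pd.read_csv(csv_path, names=col_names)
    with tf.python_io.TFRecordWriter(tfrecord_path) as writer:
        for _, row in df.iterrows():
            # Encode each column as a float feature purely for illustration.
            feature = {
                name: tf.train.Feature(
                    float_list=tf.train.FloatList(value=[float(row[name])]))
                for name in col_names
            }
            example = tf.train.Example(
                features=tf.train.Features(feature=feature))
            writer.write(example.SerializeToString())
```

At training time the TFRecord files can then be consumed with tf.data.TFRecordDataset and parsed with tf.parse_single_example, instead of re-parsing the CSVs on every run.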