Pandas.read_csv() MemoryError

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/42931068/

Pandas.read_csv() MemoryError

python, csv, pandas, numpy, large-files

Asked by suhas

I have a 1 GB csv file with about 10,000,000 (10 million) rows. I need to iterate through the rows to get the max of a few selected rows (based on a condition). The issue is reading the csv file.

I use the Pandas package for Python. The read_csv() function throws a MemoryError while reading the csv file. 1) I have tried to split the file into chunks and read them, but then the concat() function runs into the same memory issue.

tp = pd.read_csv('capture2.csv', iterator=True, chunksize=10000,
                 dtype={'timestamp': float, 'vdd_io_soc_i': float, 'vdd_io_soc_v': float,
                        'vdd_io_plat_i': float, 'vdd_io_plat_v': float, 'vdd_ext_flash_i': float,
                        'vdd_ext_flash_v': float, 'vsys_i vsys_v': float, 'vdd_aon_dig_i': float,
                        'vdd_aon_dig_v': float, 'vdd_soc_1v8_i': float, 'vdd_soc_1v8_v': float})

df = pd.concat(tp, ignore_index=True)

I have used the dtype argument to reduce the memory footprint, but there is still no improvement.

Based on multiple blog posts, I have updated numpy and pandas to their latest versions. Still no luck.

It would be great if anyone has a solution to this issue.

Please note:

  • I have a 64-bit operating system (Windows 7)

  • I am running Python 2.7.10 (default, May 23 2015, 09:40:32) [MSC v.1500 32 bit]

  • I have 4 GB RAM.

  • Numpy latest (pip installer says latest version installed)

  • Pandas latest (pip installer says latest version installed)

Answered by Guillaume

If the file you are trying to read is too large to fit in memory as a whole, you also cannot read it in chunks and then reassemble it in memory, because in the end that needs at least as much memory.

You could try to read the file in chunks, filter out unnecessary rows in each chunk (based on the condition you are mentioning), then reassemble the remaining rows in a dataframe.

Which gives something like this:

df = pd.concat((apply_your_filter(chunk_df)
                for chunk_df in pd.read_csv('capture2.csv', iterator=True, chunksize=10000,
                                            dtype={'timestamp': float, 'vdd_io_soc_i': float, 'vdd_io_soc_v': float,
                                                   'vdd_io_plat_i': float, 'vdd_io_plat_v': float, 'vdd_ext_flash_i': float,
                                                   'vdd_ext_flash_v': float, 'vsys_i vsys_v': float, 'vdd_aon_dig_i': float,
                                                   'vdd_aon_dig_v': float, 'vdd_soc_1v8_i': float, 'vdd_soc_1v8_v': float})),
               ignore_index=True)
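
Here apply_your_filter is a placeholder for whatever condition you need. A minimal hypothetical sketch (the column name and threshold below are made up for illustration, not part of the question):

def apply_your_filter(chunk_df):
    # Hypothetical condition: keep only rows where vdd_io_soc_i exceeds some threshold.
    return chunk_df[chunk_df['vdd_io_soc_i'] > 0.5]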

And/or find the max of each chunk, then take the max of those chunk maxima.
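
A minimal sketch of that second idea, assuming the goal is the maximum of a single column; the column names come from the dtype list in the question, and the filter condition is a made-up placeholder:

import pandas as pd

max_so_far = float('-inf')
for chunk in pd.read_csv('capture2.csv', chunksize=10000):
    # Hypothetical filter; replace with the actual condition from your problem.
    selected = chunk[chunk['vdd_io_soc_i'] > 0.5]
    if not selected.empty:
        # Track a running maximum so no chunk ever needs to be kept in memory.
        max_so_far = max(max_so_far, selected['vdd_io_soc_v'].max())
print(max_so_far)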

Answered by philshem

Pandas read_csv() has a low_memory flag.

tp = pd.read_csv('capture2.csv', low_memory=True, ...)

The low_memory flag is only available if you use the C parser:

engine : {'c', 'python'}, optional

Parser engine to use. The C engine is faster while the python engine is currently more feature-complete.
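
For example (engine='c' is already the pandas default and is only spelled out here for clarity; combining it with the chunksize from the question is an assumption, not part of the original answer):

tp = pd.read_csv('capture2.csv', engine='c', low_memory=True, chunksize=10000)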

You can also use the memory_map flag:

memory_map : boolean, default False

If a filepath is provided for filepath_or_buffer, map the file object directly onto memory and access the data directly from there. Using this option can improve performance because there is no longer any I/O overhead.

source
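
A sketch combining memory_map with chunked reading and the per-chunk max idea from the other answer; whether it actually helps depends on the platform and pandas version, so treat it as something to try rather than a guaranteed fix:

tp = pd.read_csv('capture2.csv', memory_map=True, chunksize=10000)
col_max = max(chunk['vdd_io_soc_v'].max() for chunk in tp)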



P.S. use 64-bit Python - see my comment.