Pandas.read_csv() MemoryError
Original URL: http://stackoverflow.com/questions/42931068/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Asked by suhas
I have a 1 GB csv file. The file has about 10,000,000 (10 million) rows. I need to iterate through the rows to get the max of a few selected rows (based on a condition). The issue is reading the csv file.
I use the Pandas package for Python. The read_csv() function throws a MemoryError while reading the csv file. 1) I have tried to split the file into chunks and read them, but now the concat() function has a memory issue.
import pandas as pd

# Read the file in chunks of 10,000 rows, forcing every column to float.
tp = pd.read_csv('capture2.csv', iterator=True, chunksize=10000,
                 dtype={'timestamp': float, 'vdd_io_soc_i': float, 'vdd_io_soc_v': float,
                        'vdd_io_plat_i': float, 'vdd_io_plat_v': float,
                        'vdd_ext_flash_i': float, 'vdd_ext_flash_v': float,
                        'vsys_i vsys_v': float, 'vdd_aon_dig_i': float, 'vdd_aon_dig_v': float,
                        'vdd_soc_1v8_i': float, 'vdd_soc_1v8_v': float})

# Concatenating all chunks back together is where the MemoryError moves to.
df = pd.concat(tp, ignore_index=True)
I have used the dtype argument to reduce the memory footprint, but there is still no improvement.
Based on multiple blog posts, I have updated numpy and pandas to their latest versions. Still no luck.
It would be great if anyone has a solution to this issue.
Please note:
I have a 64-bit operating system (Windows 7)
I am running Python 2.7.10 (default, May 23 2015, 09:40:32) [MSC v.1500 32 bit]
I have 4 GB of RAM.
Numpy: latest (pip says the latest version is installed)
Pandas: latest (pip says the latest version is installed)
Answered by Guillaume
If the file you are trying to read is too large to fit in memory as a whole, you also cannot read it in chunks and then reassemble it in memory, because in the end that needs at least as much memory.
You could try to read the file in chunks, filter out unnecessary rows in each chunk (based on the condition you are mentioning), then reassemble the remaining rows into a dataframe.
Which gives something like this:
# Build the filtered dataframe chunk by chunk; only the filtered rows are kept in memory.
reader = pd.read_csv('capture2.csv', iterator=True, chunksize=10000,
                     dtype={'timestamp': float, 'vdd_io_soc_i': float, 'vdd_io_soc_v': float,
                            'vdd_io_plat_i': float, 'vdd_io_plat_v': float,
                            'vdd_ext_flash_i': float, 'vdd_ext_flash_v': float,
                            'vsys_i vsys_v': float, 'vdd_aon_dig_i': float, 'vdd_aon_dig_v': float,
                            'vdd_soc_1v8_i': float, 'vdd_soc_1v8_v': float})
df = pd.concat((apply_your_filter(chunk_df) for chunk_df in reader), ignore_index=True)
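apply_your_filter is just a placeholder in the answer and has to be defined before running the concat line above. A minimal sketch of what it could look like, assuming (purely as an example) that the condition keeps rows where vdd_io_soc_i exceeds some threshold:

# Hypothetical filter: the column name and threshold are assumptions, not from the question.
def apply_your_filter(chunk_df):
    return chunk_df[chunk_df['vdd_io_soc_i'] > 0.5]

Only the rows that survive the filter ever get concatenated, so peak memory depends on how selective the condition is.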
And/or find the max of each chunk, then the max of those chunk maxima.
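A rough sketch of that second idea, assuming the maximum of a single column is what's needed (the column name is an assumption):

import pandas as pd

# Keep only one number per chunk, so memory use stays bounded by the chunk size.
chunk_maxes = []
for chunk_df in pd.read_csv('capture2.csv', chunksize=10000):
    chunk_maxes.append(chunk_df['vdd_io_soc_i'].max())  # assumed column of interest

overall_max = max(chunk_maxes)
print(overall_max)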
Answered by philshem
Pandas read_csv() has a low memory flag.
tp = pd.read_csv('capture2.csv', low_memory=True, ...)
The low_memory flag is only available if you use the C parser
engine : {'c', 'python'}, optional
Parser engine to use. The C engine is faster while the python engine is currently more feature-complete.
You can also use the memory_map flag
memory_map : boolean, default False
If a filepath is provided for filepath_or_buffer, map the file object directly onto memory and access the data directly from there. Using this option can improve performance because there is no longer any I/O overhead.
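Putting the flags from this answer together might look like the sketch below; engine='c' is already the default, and low_memory/memory_map only change how the file is parsed, not the size of the resulting dataframe:

import pandas as pd

# Read with the C parser, low-memory chunked type inference, and a memory-mapped file.
df = pd.read_csv('capture2.csv', engine='c', low_memory=True, memory_map=True)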
P.S. Use 64-bit Python - see my comment.
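Since the interpreter listed in the question is a 32-bit build ([MSC v.1500 32 bit]), it can only address roughly 2 GB per process on Windows regardless of installed RAM. A quick way to check which build is running:

import struct

# Prints 32 for a 32-bit Python build, 64 for a 64-bit build.
print(struct.calcsize('P') * 8)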