Pandas.read_csv() MemoryError

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/42931068/

Pandas.read_csv() MemoryError

python, csv, pandas, numpy, large-files

Asked by suhas

I have a 1 GB csv file with about 10,000,000 (10 million) rows. I need to iterate through the rows to get the max of a few selected rows (based on a condition). The issue is reading the csv file.

I use the Pandas package for Python. The read_csv() function throws a MemoryError while reading the csv file. 1) I have tried to split the file into chunks and read them, but then the concat() function runs into the same memory issue.

tp = pd.read_csv('capture2.csv', iterator=True, chunksize=10000,
                 dtype={'timestamp': float, 'vdd_io_soc_i': float, 'vdd_io_soc_v': float,
                        'vdd_io_plat_i': float, 'vdd_io_plat_v': float, 'vdd_ext_flash_i': float,
                        'vdd_ext_flash_v': float, 'vsys_i vsys_v': float, 'vdd_aon_dig_i': float,
                        'vdd_aon_dig_v': float, 'vdd_soc_1v8_i': float, 'vdd_soc_1v8_v': float})

df = pd.concat(tp, ignore_index=True)

I have used the dtype argument to reduce the memory footprint, but there is still no improvement.

Based on multiple blog posts, I have updated numpy and pandas to their latest versions. Still no luck.

It would be great if anyone has a solution to this issue.

Please note:

  • I have a 64-bit operating system (Windows 7)

  • I am running Python 2.7.10 (default, May 23 2015, 09:40:32) [MSC v.1500 32 bit]

  • I have 4 GB RAM.

  • Numpy latest (pip installer says latest version installed)

  • Pandas latest (pip installer says latest version installed)

Answered by Guillaume

If the file you are trying to read is too large to fit in memory as a whole, you also cannot read it in chunks and then reassemble it in memory, because in the end that needs at least as much memory.

You could try to read the file in chunks, filter out unnecessary rows in each chunk (based on the condition you are mentioning), then reassemble the remaining rows in a dataframe.

Which gives something like this:

df = pd.concat((apply_your_filter(chunk_df)
                for chunk_df in pd.read_csv('capture2.csv', iterator=True, chunksize=10000,
                                            dtype={'timestamp': float, 'vdd_io_soc_i': float, 'vdd_io_soc_v': float,
                                                   'vdd_io_plat_i': float, 'vdd_io_plat_v': float, 'vdd_ext_flash_i': float,
                                                   'vdd_ext_flash_v': float, 'vsys_i vsys_v': float, 'vdd_aon_dig_i': float,
                                                   'vdd_aon_dig_v': float, 'vdd_soc_1v8_i': float, 'vdd_soc_1v8_v': float})),
               ignore_index=True)
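
Here apply_your_filter is a placeholder for whatever condition you need. A minimal hypothetical sketch (the column name and threshold below are made up for illustration, not part of the question):

def apply_your_filter(chunk_df):
    # Hypothetical condition: keep only rows where vdd_io_soc_i exceeds some threshold.
    return chunk_df[chunk_df['vdd_io_soc_i'] > 0.5]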

And/or find the max of each chunk, then take the max of those chunk maxima.
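
A minimal sketch of that second idea, assuming the goal is the maximum of a single column; the column names come from the dtype list in the question, and the filter condition is a made-up placeholder:

import pandas as pd

max_so_far = float('-inf')
for chunk in pd.read_csv('capture2.csv', chunksize=10000):
    # Hypothetical filter; replace with the actual condition from your problem.
    selected = chunk[chunk['vdd_io_soc_i'] > 0.5]
    if not selected.empty:
        # Track a running maximum so no chunk ever needs to be kept in memory.
        max_so_far = max(max_so_far, selected['vdd_io_soc_v'].max())
print(max_so_far)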

Answered by philshem

Pandas read_csv() has a low_memory flag.

tp = pd.read_csv('capture2.csv', low_memory=True, ...)

The low_memory flag is only available if you use the C parser:

engine : {'c', 'python'}, optional

Parser engine to use. The C engine is faster while the python engine is currently more feature-complete.
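
For example (engine='c' is already the pandas default and is only spelled out here for clarity; combining it with the chunksize from the question is an assumption, not part of the original answer):

tp = pd.read_csv('capture2.csv', engine='c', low_memory=True, chunksize=10000)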

You can also use the memory_map flag:

memory_map : boolean, default False

If a filepath is provided for filepath_or_buffer, map the file object directly onto memory and access the data directly from there. Using this option can improve performance because there is no longer any I/O overhead.

source
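
A sketch combining memory_map with chunked reading and the per-chunk max idea from the other answer; whether it actually helps depends on the platform and pandas version, so treat it as something to try rather than a guaranteed fix:

tp = pd.read_csv('capture2.csv', memory_map=True, chunksize=10000)
col_max = max(chunk['vdd_io_soc_v'].max() for chunk in tp)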



P.S. use 64-bit Python - see my comment.