Memory error when using pandas read_csv
Note: this page is a mirror of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/17557074/
Asked by Anne
I am trying to do something fairly simple, reading a large csv file into a pandas dataframe.
data = pandas.read_csv(filepath, header = 0, sep = DELIMITER,skiprows = 2)
The code either fails with a MemoryError, or just never finishes.
Memory usage in Task Manager stopped at 506 MB, and after 5 minutes of no change and no CPU activity from the process, I stopped it.
I am using pandas version 0.11.0.
I am aware that there used to be a memory problem with the file parser, but according to http://wesmckinney.com/blog/?p=543 this should have been fixed.
The file I am trying to read is 366 MB; the code above works if I cut the file down to something short (25 MB).
It has also happened that I got a pop-up telling me that it can't write to address 0x1e0baf93...
Stacktrace:
Traceback (most recent call last):
File "F:\QA ALM\Python\new WIM data\new WIM data\new_WIM_data.py", line 25, in
<module>
wimdata = pandas.read_csv(filepath, header = 0, sep = DELIMITER,skiprows = 2
)
File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\io\parsers.py"
, line 401, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\io\parsers.py"
, line 216, in _read
return parser.read()
File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\io\parsers.py"
, line 643, in read
df = DataFrame(col_dict, columns=columns, index=index)
File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\frame.py"
, line 394, in __init__
mgr = self._init_dict(data, index, columns, dtype=dtype)
File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\frame.py"
, line 525, in _init_dict
dtype=dtype)
File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\frame.py"
, line 5338, in _arrays_to_mgr
return create_block_manager_from_arrays(arrays, arr_names, axes)
File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\internals
.py", line 1820, in create_block_manager_from_arrays
blocks = form_blocks(arrays, names, axes)
File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\internals
.py", line 1872, in form_blocks
float_blocks = _multi_blockify(float_items, items)
File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\internals
.py", line 1930, in _multi_blockify
block_items, values = _stack_arrays(list(tup_block), ref_items, dtype)
File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\internals
.py", line 1962, in _stack_arrays
stacked = np.empty(shape, dtype=dtype)
MemoryError
Press any key to continue . . .
A bit of background - I am trying to convince people that Python can do the same as R. For this I am trying to replicate an R script that does
data <- read.table(paste(INPUTDIR,config[i,]$TOEXTRACT,sep=""), HASHEADER, DELIMITER,skip=2,fill=TRUE)
R not only manages to read the above file just fine, it even reads several of these files in a for loop (and then does some stuff with the data). If Python really does have a problem with files of that size, I might be fighting a losing battle...
Answer by LetMeSOThat4U
Although this is more of a workaround than a fix, I'd try converting that CSV to JSON (should be trivial) and using the read_json method instead - I've been writing and reading sizable JSON dataframes (100s of MB) in pandas this way without any problem at all.
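The answer does not show the conversion itself; a minimal sketch of one way to do it (not from the original answer; file names and delimiter are hypothetical, and the conversion streams through the standard csv/json modules so the whole file never has to fit in memory):

import csv
import json

import pandas as pd

csv_path, json_path = "big_file.csv", "big_file.json"

# Stream the CSV row by row into line-delimited JSON.
with open(csv_path, newline="") as src, open(json_path, "w") as dst:
    reader = csv.DictReader(src, delimiter=";")
    for row in reader:
        dst.write(json.dumps(row) + "\n")

# Read the converted file back with pandas (needs a pandas version that supports lines=True).
df = pd.read_json(json_path, orient="records", lines=True)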
Answer by Oleksandr
There is no error for Pandas 0.12.0 and NumPy 1.8.0.
I have managed to create a big DataFrame, save it to a csv file, and then successfully read it back. Please see the example here. The size of the file is 554 MB (it even worked for a 1.1 GB file, which took longer; to generate the 1.1 GB file, use a frequency of 30 seconds). Though I have 4 GB of RAM available.
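The linked example itself is not reproduced on this page, so here is only a rough sketch of that kind of round-trip test (the date range, column count, and file name are assumptions, chosen so the CSV ends up in the few-hundred-MB range):

import numpy as np
import pandas as pd

# ~2.6 million rows at 30-second frequency; the resulting CSV is on the order of 500 MB.
index = pd.date_range("2000-01-01", "2002-07-01", freq="30s")
df = pd.DataFrame(np.random.randn(len(index), 10), index=index)

df.to_csv("big_test.csv")                        # write a large csv file
df2 = pd.read_csv("big_test.csv",                # ...and read it back in one go
                  index_col=0, parse_dates=True)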
My suggestion is to try updating Pandas. Another thing that could be useful is to try running your script from the command line, because for R you are not using Visual Studio (this was already suggested in the comments to your question), hence it has more resources available.
Answer by SebastianNeubauer
I encountered this issue as well when I was running in a virtual machine, or somewhere else where the memory is strictly limited. It has nothing to do with pandas or numpy or csv; it will always happen if you try to use more memory than you are allowed to, and not even only in Python.
The only chance you have is what you already tried: try to chop the big thing down into smaller pieces which fit into memory.
If you ever asked yourself what MapReduce is all about, you just found out by yourself: MapReduce would distribute the chunks over many machines, whereas you process the chunks on one machine, one after another.
What you found out with the concatenation of the chunk files might indeed be an issue; maybe some copies are needed in that operation. In the end this may save you in your current situation, but if your csv gets a little bit larger you might run into that wall again...
It could also be that pandas is smart enough to only actually load the individual data chunks into memory when you do something with them, like concatenating them into a big df?
Several things you can try:
- Don't load all the data at once, but split it into pieces
- As far as I know, hdf5 is able to do these chunks automatically and only loads the part your program is currently working on (see the sketch after this list)
- Check whether the types are ok: a string '0.111111' needs more memory than a float
- Think about what you actually need: if the address is there as a string, you might not need it for numerical analysis...
- A database can help with accessing and loading only the parts you actually need (e.g. only the 1% active users)
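As promised above, a minimal sketch of the hdf5 idea (file names, separator, and chunk size are hypothetical; it needs PyTables installed for pandas' HDF5 support, and it is only an illustration, not the answerer's code):

import pandas as pd

# One-off conversion: stream the CSV into an HDF5 store in chunks.
with pd.HDFStore("data.h5", mode="w") as store:
    for chunk in pd.read_csv("big_file.csv", sep=";", header=0, skiprows=2,
                             chunksize=100000):
        # String columns of varying width may need min_itemsize here.
        store.append("table", chunk, data_columns=True)

# Later: load only the slice you actually need instead of the whole file.
subset = pd.read_hdf("data.h5", "table", where="index < 500000")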
Answer by Tarik
I use Pandas on my Linux box and faced many memory leaks that only got resolved by upgrading Pandas to the latest version after cloning it from GitHub.
Answer by firelynx
Windows memory limitation
Memory errors happen a lot with Python when using the 32-bit version on Windows. This is because 32-bit processes only get 2 GB of memory to play with by default.
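If you are not sure which build you are running, a quick check (standard library only):

import struct

# Prints 32 for a 32-bit interpreter, 64 for a 64-bit one.
print(struct.calcsize("P") * 8, "bit Python")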
Tricks for lowering memory usage
If you are not using 32-bit Python on Windows but are looking to improve your memory efficiency while reading csv files, there is a trick.
The pandas.read_csv function takes an option called dtype. This lets pandas know what types exist inside your csv data.
How this works
By default, pandas will try to guess what dtypes your csv file has. This is a very heavy operation because while it is determining the dtype, it has to keep all raw data as objects (strings) in memory.
Example
Let's say your csv looks like this:
name, age, birthday
Alice, 30, 1985-01-01
Bob, 35, 1980-01-01
Charlie, 25, 1990-01-01
This example is of course no problem to read into memory, but it's just an example.
If pandas were to read the above csv file without any dtype option, the age would be stored as strings in memory until pandas has read enough lines of the csv file to make a qualified guess.
I think the default in pandas is to read 1,000,000 rows before guessing the dtype.
Solution
By specifying dtype={'age':int} as an option to .read_csv(), you let pandas know that age should be interpreted as a number. This saves you lots of memory.
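A self-contained sketch of that (not part of the original answer): the example CSV from above is inlined via StringIO just so the snippet runs on its own, and skipinitialspace handles the spaces after the commas.

import io

import pandas as pd

csv_text = """name, age, birthday
Alice, 30, 1985-01-01
Bob, 35, 1980-01-01
Charlie, 25, 1990-01-01
"""

df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True,
                 dtype={"age": int}, parse_dates=["birthday"])
print(df.dtypes)  # age is int64 right away, no guessing from strings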
Problem with corrupt data
However, if your csv file would be corrupted, like this:
name, age, birthday
Alice, 30, 1985-01-01
Bob, 35, 1980-01-01
Charlie, 25, 1990-01-01
Dennis, 40+, None-Ur-Bz
Then specifying dtype={'age':int} will break the .read_csv() command, because it cannot cast "40+" to int. So sanitize your data carefully!
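The answer stops at "sanitize your data"; one common way to do that (not part of the original answer) is to read the dirty column as strings and coerce it afterwards, turning unparseable values like "40+" into NaN:

import io

import pandas as pd

csv_text = """name, age, birthday
Alice, 30, 1985-01-01
Dennis, 40+, None-Ur-Bz
"""

df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True, dtype={"age": str})
df["age"] = pd.to_numeric(df["age"], errors="coerce")  # "40+" becomes NaN
print(df)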
Here you can see how the memory usage of a pandas dataframe is a lot higher when floats are kept as strings:
Try it yourself
import resource  # Unix-only; ru_maxrss is reported in kilobytes on Linux

import numpy as np
import pandas as pd

# Run each variant in a fresh interpreter, since ru_maxrss reports the peak usage so far.
df = pd.DataFrame(np.random.choice(['1.0', '0.6666667', '150000.1'], (100000, 10)))
print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)  # 224544 (~224 MB)

df = pd.DataFrame(np.random.choice([1.0, 0.6666667, 150000.1], (100000, 10)))
print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)  # 79560 (~79 MB)
Answer by mooseman
I had the same memory problem with a simple read of a tab-delimited text file around 1 GB in size (over 5.5 million records), and this solved the memory problem:
df = pd.read_csv(myfile,sep='\t') # didn't work, memory error
df = pd.read_csv(myfile,sep='\t',low_memory=False) # worked fine and in less than 30 seconds
Spyder 3.2.3 Python 2.7.13 64bits
Answer by muTheTechie
I tried chunksize while reading a big CSV file:
reader = pd.read_csv(filePath,chunksize=1000000,low_memory=False,header=0)
The result is an iterator of chunks. We can iterate over the reader and write/append each chunk to a new csv, or perform any other operation on it:
for chunk in reader:
    print(chunk.columns)
    print("Chunk -> File process")
    with open(destination, 'a') as f:
        chunk.to_csv(f, header=False, sep='\t', index=False)
    print("Chunk appended to the file")