Memory error with large data sets for pandas.concat and numpy.append

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/19590966/



Tags: python, python-2.7, numpy, pandas

Asked by Vidac

I am facing a problem where I have to generate large DataFrames in a loop (50 iterations, each computing two 2000 x 800 pandas DataFrames). I would like to keep the results in memory in a bigger DataFrame or in a dictionary-like structure. When using pandas.concat, I get a memory error at some point in the loop. The same happens when using numpy.append to store the results in a dictionary of numpy arrays rather than in a DataFrame. In both cases, I still have a lot of available memory (several GB). Is this too much data for pandas or numpy to process? Are there more memory-efficient ways to store my data without saving it to disk?


As an example, the following script fails as soon as nbIds is greater than 376:


import pandas as pd
import numpy as np

nbIds = 376
dataids = range(nbIds)
dataCollection1 = []
dataCollection2 = []
for bs in range(50):
    # Each iteration generates two 2000 x nbIds DataFrames of uniform random numbers
    newData1 = pd.DataFrame(np.reshape(np.random.uniform(size=2000 * len(dataids)),
                                       (2000, len(dataids))))
    dataCollection1.append(newData1)
    newData2 = pd.DataFrame(np.reshape(np.random.uniform(size=2000 * len(dataids)),
                                       (2000, len(dataids))))
    dataCollection2.append(newData2)
dataCollection1 = pd.concat(dataCollection1).reset_index(drop=True)
dataCollection2 = pd.concat(dataCollection2).reset_index(drop=True)

The code below fails when nbIds is 665 or higher:


import pandas as pd
import numpy as np

nbIds = 665
dataids = range(nbIds)
dataCollection1 = dict((i, np.array([])) for i in dataids)
dataCollection2 = dict((i, np.array([])) for i in dataids)
for bs in range(50):
    newData1 = np.reshape(np.random.uniform(size=2000 * len(dataids)),
                          (2000, len(dataids)))
    newData1 = pd.DataFrame(newData1)
    newData2 = np.reshape(np.random.uniform(size=2000 * len(dataids)),
                          (2000, len(dataids)))
    newData2 = pd.DataFrame(newData2)
    # Append each column to the per-id numpy array held in the dictionaries
    for i in dataids:
        dataCollection1[i] = np.append(dataCollection1[i], np.array(newData1[i]))
        dataCollection2[i] = np.append(dataCollection2[i], np.array(newData2[i]))

I do need to compute both DataFrames every time, and for each element i of dataids I need to obtain a pandas Series or a numpy array containing the 50 * 2000 numbers generated for i. Ideally, I need to be able to run this with nbIds equal to 800 or more. Is there a straightforward way of doing this?

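For reference, once the first snippet's concatenation succeeds, the per-id data can be read straight off the combined DataFrame. The sketch below reuses the variable names from that snippet and only illustrates the access pattern:

# dataCollection1 now has 50 * 2000 = 100000 rows and one column per id;
# column i holds every number generated for id i across all iterations.
series_for_i = dataCollection1[0]        # pandas Series of length 100000
array_for_i = dataCollection1[0].values  # the same data as a numpy array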

I am using 32-bit Python 2.7.5 with pandas 0.12.0 and numpy 1.7.1.


Thank you very much for your help!


Accepted answer by Vidac

As suggested by usethedeathstar, Boud and Jeff in the comments, switching to 64-bit Python does the trick.
If losing precision is not an issue, using the float32 data type, as Jeff suggested, also increases the amount of data that can be processed in a 32-bit environment.

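To make the float32 idea concrete, here is a sketch of how the question's first snippet could be adapted; it is only an illustration (variable names follow the question) and not code taken from the answers:

import pandas as pd
import numpy as np

nbIds = 376
dataids = range(nbIds)
dataCollection1 = []
for bs in range(50):
    # Generating the random block as float32 halves the memory needed
    # compared to the default float64.
    block = np.random.uniform(size=2000 * len(dataids)).astype('float32')
    dataCollection1.append(pd.DataFrame(block.reshape(2000, len(dataids))))
dataCollection1 = pd.concat(dataCollection1).reset_index(drop=True)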

Answer by Jeff

This is essentially what you are doing. Note that from a memory perspective it doesn't make much difference whether you convert to DataFrames before or after.


But you can specify dtype='float32' to effectively halve your memory usage.


In [45]: np.concatenate([ np.random.uniform(size=2000 * 1000).astype('float32').reshape(2000,1000) for i in xrange(50) ]).nbytes
Out[45]: 400000000

In [46]: np.concatenate([ np.random.uniform(size=2000 * 1000).reshape(2000,1000) for i in xrange(50) ]).nbytes
Out[46]: 800000000

In [47]: DataFrame(np.concatenate([ np.random.uniform(size=2000 * 1000).reshape(2000,1000) for i in xrange(50) ]))
Out[47]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Columns: 1000 entries, 0 to 999
dtypes: float64(1000)

Answer by tk.

A straightforward (but hard-drive-backed) way would be to simply use shelve, a dict stored on disk: http://docs.python.org/2/library/shelve.html

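A minimal sketch of that idea, assuming one DataFrame per iteration is stored under a string key (the file name results.db and the per-iteration key scheme are just placeholders, not part of the original answer):

import shelve
import numpy as np
import pandas as pd

db = shelve.open('results.db')  # dict-like object backed by the hard drive
for bs in range(50):
    newData1 = pd.DataFrame(np.random.uniform(size=(2000, 800)).astype('float32'))
    db[str(bs)] = newData1      # the DataFrame is pickled to disk, freeing RAM
db.close()

db = shelve.open('results.db')
firstBlock = db['0']            # read back a single 2000 x 800 DataFrame
db.close()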