Memory error with large data sets for pandas.concat and numpy.append

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/19590966/



Tags: python, python-2.7, numpy, pandas

Asked by Vidac

I am facing a problem where I have to generate large DataFrames in a loop (50 iterations, each computing two 2000 x 800 pandas DataFrames). I would like to keep the results in memory in a bigger DataFrame or in a dictionary-like structure. When using pandas.concat, I get a memory error at some point in the loop. The same happens when using numpy.append to store the results in a dictionary of numpy arrays rather than in a DataFrame. In both cases, I still have a lot of available memory (several GB). Is this too much data for pandas or numpy to process? Are there more memory-efficient ways to store my data without saving it to disk?


As an example, the following script fails as soon as nbIds is greater than 376:


import pandas as pd
import numpy as np

nbIds = 376
dataids = range(nbIds)
dataCollection1 = []
dataCollection2 = []
for bs in range(50):
    # Each iteration generates two 2000 x nbIds DataFrames of uniform random numbers
    newData1 = pd.DataFrame(np.reshape(np.random.uniform(size=2000 * len(dataids)),
                                       (2000, len(dataids))))
    dataCollection1.append(newData1)
    newData2 = pd.DataFrame(np.reshape(np.random.uniform(size=2000 * len(dataids)),
                                       (2000, len(dataids))))
    dataCollection2.append(newData2)
dataCollection1 = pd.concat(dataCollection1).reset_index(drop=True)
dataCollection2 = pd.concat(dataCollection2).reset_index(drop=True)

The code below fails when nbIds is 665 or higher:


import pandas as pd
import numpy as np

nbIds = 665
dataids = range(nbIds)
dataCollection1 = dict((i, np.array([])) for i in dataids)
dataCollection2 = dict((i, np.array([])) for i in dataids)
for bs in range(50):
    newData1 = np.reshape(np.random.uniform(size=2000 * len(dataids)),
                          (2000, len(dataids)))
    newData1 = pd.DataFrame(newData1)
    newData2 = np.reshape(np.random.uniform(size=2000 * len(dataids)),
                          (2000, len(dataids)))
    newData2 = pd.DataFrame(newData2)
    # Append each column to the per-id numpy array held in the dictionaries
    for i in dataids:
        dataCollection1[i] = np.append(dataCollection1[i], np.array(newData1[i]))
        dataCollection2[i] = np.append(dataCollection2[i], np.array(newData2[i]))

I do need to compute both DataFrames every time, and for each element i of dataids I need to obtain a pandas Series or a numpy array containing the 50 * 2000 numbers generated for i. Ideally, I need to be able to run this with nbIds equal to 800 or more. Is there a straightforward way of doing this?

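For reference, once the first snippet's concatenation succeeds, the per-id data can be read straight off the combined DataFrame. The sketch below reuses the variable names from that snippet and only illustrates the access pattern:

# dataCollection1 now has 50 * 2000 = 100000 rows and one column per id;
# column i holds every number generated for id i across all iterations.
series_for_i = dataCollection1[0]        # pandas Series of length 100000
array_for_i = dataCollection1[0].values  # the same data as a numpy array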

I am using 32-bit Python 2.7.5 with pandas 0.12.0 and numpy 1.7.1.


Thank you very much for your help!


Accepted answer by Vidac

As suggested by usethedeathstar, Boud and Jeff in the comments, switching to 64-bit Python does the trick.
If losing precision is not an issue, using the float32 data type, as Jeff suggested, also increases the amount of data that can be processed in a 32-bit environment.

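To make the float32 idea concrete, here is a sketch of how the question's first snippet could be adapted; it is only an illustration (variable names follow the question) and not code taken from the answers:

import pandas as pd
import numpy as np

nbIds = 376
dataids = range(nbIds)
dataCollection1 = []
for bs in range(50):
    # Generating the random block as float32 halves the memory needed
    # compared to the default float64.
    block = np.random.uniform(size=2000 * len(dataids)).astype('float32')
    dataCollection1.append(pd.DataFrame(block.reshape(2000, len(dataids))))
dataCollection1 = pd.concat(dataCollection1).reset_index(drop=True)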

Answer by Jeff

This is essentially what you are doing. Note that from a memory perspective it doesn't make much difference whether you convert to DataFrames before or after.


But you can specify dtype='float32' to effectively halve your memory usage.


In [45]: np.concatenate([ np.random.uniform(size=2000 * 1000).astype('float32').reshape(2000,1000) for i in xrange(50) ]).nbytes
Out[45]: 400000000

In [46]: np.concatenate([ np.random.uniform(size=2000 * 1000).reshape(2000,1000) for i in xrange(50) ]).nbytes
Out[46]: 800000000

In [47]: DataFrame(np.concatenate([ np.random.uniform(size=2000 * 1000).reshape(2000,1000) for i in xrange(50) ]))
Out[47]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Columns: 1000 entries, 0 to 999
dtypes: float64(1000)

Answer by tk.

A straightforward (but hard-drive-backed) way would be to simply use shelve, a dict stored on disk: http://docs.python.org/2/library/shelve.html

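A minimal sketch of that idea, assuming one DataFrame per iteration is stored under a string key (the file name results.db and the per-iteration key scheme are just placeholders, not part of the original answer):

import shelve
import numpy as np
import pandas as pd

db = shelve.open('results.db')  # dict-like object backed by the hard drive
for bs in range(50):
    newData1 = pd.DataFrame(np.random.uniform(size=(2000, 800)).astype('float32'))
    db[str(bs)] = newData1      # the DataFrame is pickled to disk, freeing RAM
db.close()

db = shelve.open('results.db')
firstBlock = db['0']            # read back a single 2000 x 800 DataFrame
db.close()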