Python Pandas MemoryError
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA terms, include the original URL, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/23956390/
Python Pandas MemoryError
Asked by wuha
I have these packages installed:
python: 2.7.3.final.0
python-bits: 64
OS: Linux
machine: x86_64
processor: x86_64
byteorder: little
pandas: 0.13.1
This is the dataframe info:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 421570 entries, 2010-02-05 00:00:00 to 2012-10-26 00:00:00
Data columns (total 5 columns):
Store 421570 non-null int64
Dept 421570 non-null int64
Weekly_Sales 421570 non-null float64
IsHoliday 421570 non-null bool
Date_Str 421570 non-null object
dtypes: bool(1), float64(1), int64(2), object(1)
This is a sample of what the data looks like:
Store,Dept,Date,Weekly_Sales,IsHoliday
1,1,2010-02-05,24924.5,FALSE
1,1,2010-02-12,46039.49,TRUE
1,1,2010-02-19,41595.55,FALSE
1,1,2010-02-26,19403.54,FALSE
1,1,2010-03-05,21827.9,FALSE
1,1,2010-03-12,21043.39,FALSE
1,1,2010-03-19,22136.64,FALSE
1,1,2010-03-26,26229.21,FALSE
1,1,2010-04-02,57258.43,FALSE
I load the file and index it as follows:
df_train = pd.read_csv('train.csv')
df_train['Date_Str'] = df_train['Date']  # keep the original date string around
df_train['Date'] = pd.to_datetime(df_train['Date'])
df_train = df_train.set_index(['Date'])  # dates repeat across Store/Dept pairs, so this index is non-unique
When I perform the following operation on a 400K-row file,
df_train['_id'] = df_train['Store'].astype(str) + '_' + df_train['Dept'].astype(str) + '_' + df_train['Date_Str'].astype(str)
or
df_train['try'] = df_train['Store'] * df_train['Dept']
it causes an error:
Traceback (most recent call last):
File "rock.py", line 85, in <module>
rock.pandasTest()
File "rock.py", line 31, in pandasTest
df_train['_id'] = df_train['Store'].astype(str) +'_' + df_train['Dept'].astype('str')
File "/usr/local/lib/python2.7/dist-packages/pandas-0.13.1-py2.7-linux-x86_64.egg/pandas/core/ops.py", line 480, in wrapper
return_indexers=True)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.13.1-py2.7-linux-x86_64.egg/pandas/tseries/index.py", line 976, in join
return_indexers=return_indexers)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.13.1-py2.7-linux-x86_64.egg/pandas/core/index.py", line 1304, in join
return_indexers=return_indexers)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.13.1-py2.7-linux-x86_64.egg/pandas/core/index.py", line 1345, in _join_non_unique
how=how, sort=True)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.13.1-py2.7-linux-x86_64.egg/pandas/tools/merge.py", line 465, in _get_join_indexers
return join_func(left_group_key, right_group_key, max_groups)
File "join.pyx", line 152, in pandas.algos.full_outer_join (pandas/algos.c:34716)
MemoryError
However, it works fine with a small file.
Accepted answer by joris
I can also reproduce this on 0.13.1, but the issue does not occur on 0.12 or on 0.14 (released yesterday), so it seems to be a bug in 0.13.
So try upgrading your pandas version: on 0.14 the vectorized way is much faster than the apply (5s vs >1min on my machine) and uses less peak memory (200MB vs 980MB, with %memit).
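For example, a typical upgrade and version check (assuming pandas was installed with pip; adapt the command to your package manager otherwise):

pip install --upgrade pandas
python -c "import pandas; print(pandas.__version__)"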
Using your sample data repeated 50000 times (leading to a df of 450k rows), and using the apply_id function of @jsalonen:
In [23]: pd.__version__
Out[23]: '0.14.0'
In [24]: %timeit df_train['Store'].astype(str) +'_' + df_train['Dept'].astype(str)+'_'+ df_train['Date_Str'].astype(str)
1 loops, best of 3: 5.42 s per loop
In [25]: %timeit df_train.apply(apply_id, 1)
1 loops, best of 3: 1min 11s per loop
In [26]: %load_ext memory_profiler
In [27]: %memit df_train['Store'].astype(str) +'_' + df_train['Dept'].astype(str)+'_'+ df_train['Date_Str'].astype(str)
peak memory: 201.75 MiB, increment: 0.01 MiB
In [28]: %memit df_train.apply(apply_id, 1)
peak memory: 982.56 MiB, increment: 780.79 MiB
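For reference, a rough sketch of how such a test frame can be built (the file name sample.csv and the exact construction are assumptions; the answer only states that the sample rows were repeated 50000 times):

import pandas as pd

# Tile the nine sample rows from the question 50000 times (~450k rows)
sample = pd.read_csv('sample.csv')
df_train = pd.concat([sample] * 50000, ignore_index=True)
df_train['Date_Str'] = df_train['Date']
df_train['Date'] = pd.to_datetime(df_train['Date'])
df_train = df_train.set_index(['Date'])  # non-unique DatetimeIndex, as in the question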
Answered by jsalonen
Try generating the _id field with a DataFrame.apply call:
def apply_id(x):
    # Build the row's id from its Store, Dept and date string
    x['_id'] = "{}_{}_{}".format(x['Store'], x['Dept'], x['Date_Str'])
    return x

# The second argument (axis=1) applies the function row by row
df_train = df_train.apply(apply_id, 1)
When using apply, the id generation is performed row by row, which keeps the memory-allocation overhead minimal.
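If upgrading is not an option, one possible workaround (a sketch, not from the original answers) is to build the column without going through the Series alignment shown in the traceback, for example by assigning a plain Python list, since list assignment is positional and performs no index join:

# Build the ids in plain Python; assigning a list does not align on the index
df_train['_id'] = ['{}_{}_{}'.format(s, d, t)
                   for s, d, t in zip(df_train['Store'],
                                      df_train['Dept'],
                                      df_train['Date_Str'])]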

