Pandas - Explanation on apply function being slow
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license, include the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/38697404/
Asked by linpingta
The apply function seems to work very slowly on a large DataFrame (about 1-3 million rows).
I have checked related questions here, like Speed up Pandas apply function and Counting within pandas apply() function, and it seems the best way to speed it up is not to use the apply function at all :)
In my case, I have two kinds of tasks to do with the apply function.
First: apply with a lookup-dict query
def f(p_id, p_dict):
    return p_dict[p_dict['ID'] == p_id]['value']

p_dict = DataFrame(...)  # another DataFrame that works like a lookup table
df = df.apply(f, args=(p_dict,))
Second: apply with groupby
def f(week_id, min_week_num, p_dict):
    return p_dict[(week_id - min_week_num < p_dict['WEEK']) & (p_dict['WEEK'] < week_id)].iloc[:, 2].mean()

f_partial = partial(f, min_week_num=min_week_num, p_dict=p_dict)
df = map(f_partial, df['WEEK'])
I guess the first case could be done with a DataFrame join, but I am not sure about the resource cost of such a join on a large dataset.
My questions are:
- Is there any way to substitute apply in the two cases above?
- Why is apply so slow? For the dict-lookup case, I think it should be O(N); it shouldn't cost that much even if N is 1 million.
Accepted answer by andrew
Concerning your first question, I can't say exactly why this instance is slow. But generally, apply does not take advantage of vectorization. Also, apply returns a new Series or DataFrame object, so with a very large DataFrame you have considerable IO overhead (I cannot guarantee this is the case 100% of the time, since Pandas has loads of internal implementation optimizations).
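To illustrate what "does not take advantage of vectorization" means, here is a minimal sketch (my addition, not from the original answer): both lines below compute the same result, but apply invokes a Python-level function once per element, while the arithmetic form runs as a single compiled operation.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(1_000_000))

# One Python function call per element - slow on large Series.
slow = s.apply(lambda x: x * 2 + 1)

# One vectorized expression over the whole Series - fast.
fast = s * 2 + 1

assert slow.equals(fast)
```

Timing the two with %timeit on your own data is the quickest way to see the gap.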
For your first method, I assume you are trying to fill a 'value' column in df using p_dict as a lookup table. It is about 1000x faster to use pd.merge:
import string, sys
import numpy as np
import pandas as pd
##
# Part 1 - filling a column by a lookup table
##
def f1(col, p_dict):
    return [p_dict[p_dict['ID'] == s]['value'].values[0] for s in col]
# Testing
n_size = 1000
np.random.seed(997)
p_dict = pd.DataFrame({'ID': [s for s in string.ascii_uppercase], 'value': np.random.randint(0,n_size, 26)})
df = pd.DataFrame({'p_id': [string.ascii_uppercase[i] for i in np.random.randint(0,26, n_size)]})
# Apply the f1 method as posted
%timeit -n1 -r5 temp = df.apply(f1, args=(p_dict,))
>>> 1 loops, best of 5: 832 ms per loop
# Using merge
np.random.seed(997)
df = pd.DataFrame({'p_id': [string.ascii_uppercase[i] for i in np.random.randint(0,26, n_size)]})
%timeit -n1 -r5 temp = pd.merge(df, p_dict, how='inner', left_on='p_id', right_on='ID', copy=False)
>>> 1000 loops, best of 5: 826 μs per loop
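Another common substitute for the lookup case (my addition, a hedged sketch rather than part of the original answer): convert the lookup table to a Series indexed by ID and use Series.map, which avoids both the per-row apply and the merge machinery, and keeps the original row order of df.

```python
import string
import numpy as np
import pandas as pd

np.random.seed(997)
n_size = 1000
p_dict = pd.DataFrame({'ID': list(string.ascii_uppercase),
                       'value': np.random.randint(0, n_size, 26)})
df = pd.DataFrame({'p_id': [string.ascii_uppercase[i]
                            for i in np.random.randint(0, 26, n_size)]})

# Build a Series indexed by ID, then map each p_id to its value.
lookup = p_dict.set_index('ID')['value']
df['value'] = df['p_id'].map(lookup)

assert df['value'].notna().all()
```

Unlike merge, this adds the column in place and requires no column renaming afterwards.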
Concerning the second task, we can quickly add a new column to p_dict that calculates a mean over a time window starting at min_week_num and ending at the week for that row of p_dict. This requires that p_dict is sorted in ascending order along the WEEK column. Then you can use pd.merge again.
I am assuming that min_week_num is 0 in the following example, but you could easily modify rolling_growing_mean to take a different value. The rolling_growing_mean method runs in O(n) since it performs a fixed number of operations per iteration.
n_size = 1000
np.random.seed(997)
p_dict = pd.DataFrame({'WEEK': range(52), 'value': np.random.randint(0, 1000, 52)})
df = pd.DataFrame({'WEEK': np.random.randint(0, 52, n_size)})
def rolling_growing_mean(values):
    out = np.empty(len(values))
    out[0] = values[0]
    # Time window for taking the mean grows by one row each step
    for i, v in enumerate(values[1:]):
        out[i+1] = np.true_divide(out[i]*(i+1) + v, i+2)
    return out
p_dict['Means'] = rolling_growing_mean(p_dict['value'])
df_merged = pd.merge(df, p_dict, how='inner', left_on='WEEK', right_on='WEEK')
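As a sanity check (my addition, not part of the original answer): with min_week_num = 0, the growing-window mean computed by rolling_growing_mean at row i is the mean of value[0..i], which is exactly what pandas' built-in expanding mean computes, so the two should agree up to floating-point tolerance.

```python
import numpy as np
import pandas as pd

np.random.seed(997)
p_dict = pd.DataFrame({'WEEK': range(52),
                       'value': np.random.randint(0, 1000, 52)})

def rolling_growing_mean(values):
    out = np.empty(len(values))
    out[0] = values[0]
    for i, v in enumerate(values[1:]):
        out[i + 1] = np.true_divide(out[i] * (i + 1) + v, i + 2)
    return out

manual = rolling_growing_mean(p_dict['value'].values)
builtin = p_dict['value'].expanding().mean().values

assert np.allclose(manual, builtin)
```

The built-in expanding().mean() is also a drop-in replacement here if you do not need a custom min_week_num.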