Pandas - 关于应用功能缓慢的解释

Disclaimer: this page is a Chinese/English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must follow the same CC BY-SA license, link to the original, and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/38697404/

Date: 2020-09-14 01:43:00  Source: igfitidea

Pandas - Explanation on apply function being slow

Tags: python, pandas

Asked by linpingta

The apply function seems to work very slowly with a large dataframe (about 1-3 million rows).


I have checked related questions here, such as "Speed up Pandas apply function" and "Counting within pandas apply() function"; it seems the best way to speed it up is not to use the apply function :)


For my case, I have two kinds of tasks to do with the apply function.


First: apply with lookup dict query


def f(p_id, p_dict):
    return p_dict[p_dict['ID'] == p_id]['value']

p_dict = DataFrame(...)  # another DataFrame used as a lookup table
df = df.apply(f, args=(p_dict,))

Second: apply with groupby


def f(week_id, min_week_num, p_dict):
    return p_dict[(week_id - min_week_num < p_dict['WEEK']) & (p_dict['WEEK'] < week_id)].iloc[:, 2].mean()

f_partial = partial(f, min_week_num=min_week_num, p_dict=p_dict)
df['value'] = list(map(f_partial, df['WEEK']))

I guess the first case could be done with a dataframe join, but I am not sure about the resource cost of such a join on a large dataset.


My question is:


  1. Is there any way to substitute apply in the two above cases?
  2. Why is apply so slow? For the dict lookup case, I think it should be O(N), it shouldn't cost that much even if N is 1 million.

Accepted answer by andrew

Concerning your first question, I can't say exactly why this instance is slow. But generally, apply does not take advantage of vectorization. Also, apply returns a new Series or DataFrame object, so with a very large DataFrame, you have considerable IO overhead (I cannot guarantee this is the case 100% of the time, since Pandas has loads of internal implementation optimizations).

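To make the cost concrete: `p_dict[p_dict['ID'] == p_id]` builds a boolean mask over every row of p_dict on each call, so N lookups against an M-row table cost O(N*M), not O(N). A minimal sketch (the data here is made up for illustration) comparing the per-row mask with a hash-based `Series.map` lookup, which really is O(1) per row:

```python
import pandas as pd

p_dict = pd.DataFrame({'ID': list('ABC'), 'value': [10, 20, 30]})
df = pd.DataFrame({'p_id': ['B', 'A', 'C', 'B']})

# Per-row boolean mask: scans all of p_dict on every call -> O(N*M) overall
slow = df['p_id'].apply(lambda s: p_dict.loc[p_dict['ID'] == s, 'value'].iat[0])

# Hash-based lookup: builds the index once, then O(1) per row
lookup = pd.Series(p_dict['value'].values, index=p_dict['ID'])
fast = df['p_id'].map(lookup)

assert slow.tolist() == fast.tolist() == [20, 10, 30, 20]
```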

For your first method, I assume you are trying to fill a 'value' column in df using p_dict as a lookup table. It is about 1000x faster to use pd.merge:


import string, sys

import numpy as np
import pandas as pd

##
# Part 1 - filling a column by a lookup table
##
def f1(col, p_dict):
    return [p_dict[p_dict['ID'] == s]['value'].values[0] for s in col]

# Testing
n_size = 1000
np.random.seed(997)
p_dict = pd.DataFrame({'ID': [s for s in string.ascii_uppercase], 'value': np.random.randint(0,n_size, 26)})
df = pd.DataFrame({'p_id': [string.ascii_uppercase[i] for i in np.random.randint(0,26, n_size)]})

# Apply the f1 method  as posted
%timeit -n1 -r5 temp = df.apply(f1, args=(p_dict,))
>>> 1 loops, best of 5: 832 ms per loop

# Using merge
np.random.seed(997)
df = pd.DataFrame({'p_id': [string.ascii_uppercase[i] for i in np.random.randint(0,26, n_size)]})
%timeit -n1 -r5 temp = pd.merge(df, p_dict, how='inner', left_on='p_id', right_on='ID', copy=False)

>>> 1000 loops, best of 5: 826 μs per loop
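One caveat worth adding (my note, not from the original answer): `how='inner'` drops any row of df whose p_id has no match in p_dict, and the merged result is not guaranteed to preserve df's row order. If every row must survive, a left join keeps unmatched rows as NaN:

```python
import pandas as pd

p_dict = pd.DataFrame({'ID': ['A', 'B'], 'value': [1, 2]})
df = pd.DataFrame({'p_id': ['A', 'Z', 'B']})

# how='left' keeps every row of df; IDs missing from p_dict become NaN
out = pd.merge(df, p_dict, how='left', left_on='p_id', right_on='ID')
print(out['value'].tolist())
```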

Concerning the second task, we can quickly add a new column to p_dict that calculates a mean where the time window starts at min_week_num and ends at the week for that row of p_dict. This requires that p_dict is sorted in ascending order along the WEEK column. Then you can use pd.merge again.


I am assuming that min_week_num is 0 in the following example. But you could easily modify rolling_growing_mean to take a different value. The rolling_growing_mean method runs in O(n) since it performs a fixed number of operations per iteration.


n_size = 1000
np.random.seed(997)
p_dict = pd.DataFrame({'WEEK': range(52), 'value': np.random.randint(0, 1000, 52)})
df = pd.DataFrame({'WEEK': np.random.randint(0, 52, n_size)})

def rolling_growing_mean(values):
    out = np.empty(len(values))
    out[0] = values[0]
    # Time window for taking mean grows each step
    for i, v in enumerate(values[1:]):
        out[i+1] = np.true_divide(out[i]*(i+1) + v, i+2)
    return out

p_dict['Means'] = rolling_growing_mean(p_dict['value'])

df_merged = pd.merge(df, p_dict, how='inner', left_on='WEEK', right_on='WEEK')
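As a sanity check (my addition, assuming min_week_num is 0 as above), pandas' built-in `expanding().mean()` computes the same growing-window mean as rolling_growing_mean:

```python
import numpy as np
import pandas as pd

np.random.seed(997)
p_dict = pd.DataFrame({'WEEK': range(52), 'value': np.random.randint(0, 1000, 52)})

def rolling_growing_mean(values):
    out = np.empty(len(values))
    out[0] = values[0]
    # Time window for taking the mean grows by one element each step
    for i, v in enumerate(values[1:]):
        out[i + 1] = np.true_divide(out[i] * (i + 1) + v, i + 2)
    return out

manual = rolling_growing_mean(p_dict['value'].values)
builtin = p_dict['value'].expanding().mean().values
assert np.allclose(manual, builtin)
```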