Pandas - Explanation on apply function being slow
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license, include the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/38697404/
Asked by linpingta
The apply function seems to work very slowly on a large DataFrame (about 1-3 million rows).
I have checked related questions here, like Speed up Pandas apply function and Counting within pandas apply() function, and it seems the best way to speed it up is not to use the apply function at all :)
In my case, I have two kinds of tasks to do with the apply function.
First: apply with a lookup-dict query
def f(p_id, p_dict):
    return p_dict[p_dict['ID'] == p_id]['value']

p_dict = DataFrame(...)  # another DataFrame that works like a lookup table
df = df.apply(f, args=(p_dict,))
Second: apply with groupby
def f(week_id, min_week_num, p_dict):
    return p_dict[(week_id - min_week_num < p_dict['WEEK']) & (p_dict['WEEK'] < week_id)].iloc[:, 2].mean()

f_partial = partial(f, min_week_num=min_week_num, p_dict=p_dict)
df = map(f_partial, df['WEEK'])
I guess the first case could be done with a DataFrame join, but I am not sure about the resource cost of such a join on a large dataset.
My questions are:
- Is there any way to substitute apply in the two cases above?
- Why is apply so slow? For the dict-lookup case, I think it should be O(N); it shouldn't cost that much even if N is 1 million.
Accepted answer by andrew
Concerning your first question, I can't say exactly why this instance is slow. But generally, apply does not take advantage of vectorization. Also, apply returns a new Series or DataFrame object, so with a very large DataFrame you have considerable IO overhead (I cannot guarantee this is the case 100% of the time, since Pandas has loads of internal implementation optimizations).
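To illustrate what "does not take advantage of vectorization" means, here is a minimal sketch (my addition, not from the original answer): both lines below compute the same result, but apply invokes a Python-level function once per element, while the arithmetic form runs as a single compiled operation.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(1_000_000))

# One Python function call per element - slow on large Series.
slow = s.apply(lambda x: x * 2 + 1)

# One vectorized expression over the whole Series - fast.
fast = s * 2 + 1

assert slow.equals(fast)
```

Timing the two with %timeit on your own data is the quickest way to see the gap.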
For your first method, I assume you are trying to fill a 'value' column in df using p_dict as a lookup table. It is about 1000x faster to use pd.merge:
import string, sys
import numpy as np
import pandas as pd
##
# Part 1 - filling a column by a lookup table
##
def f1(col, p_dict):
    return [p_dict[p_dict['ID'] == s]['value'].values[0] for s in col]
# Testing
n_size = 1000
np.random.seed(997)
p_dict = pd.DataFrame({'ID': [s for s in string.ascii_uppercase], 'value': np.random.randint(0,n_size, 26)})
df = pd.DataFrame({'p_id': [string.ascii_uppercase[i] for i in np.random.randint(0,26, n_size)]})
# Apply the f1 method as posted
%timeit -n1 -r5 temp = df.apply(f1, args=(p_dict,))
>>> 1 loops, best of 5: 832 ms per loop
# Using merge
np.random.seed(997)
df = pd.DataFrame({'p_id': [string.ascii_uppercase[i] for i in np.random.randint(0,26, n_size)]})
%timeit -n1 -r5 temp = pd.merge(df, p_dict, how='inner', left_on='p_id', right_on='ID', copy=False)
>>> 1000 loops, best of 5: 826 μs per loop
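Another common substitute for the lookup case (my addition, a hedged sketch rather than part of the original answer): convert the lookup table to a Series indexed by ID and use Series.map, which avoids both the per-row apply and the merge machinery, and keeps the original row order of df.

```python
import string
import numpy as np
import pandas as pd

np.random.seed(997)
n_size = 1000
p_dict = pd.DataFrame({'ID': list(string.ascii_uppercase),
                       'value': np.random.randint(0, n_size, 26)})
df = pd.DataFrame({'p_id': [string.ascii_uppercase[i]
                            for i in np.random.randint(0, 26, n_size)]})

# Build a Series indexed by ID, then map each p_id to its value.
lookup = p_dict.set_index('ID')['value']
df['value'] = df['p_id'].map(lookup)

assert df['value'].notna().all()
```

Unlike merge, this adds the column in place and requires no column renaming afterwards.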
Concerning the second task, we can quickly add a new column to p_dict that calculates a mean over a time window starting at min_week_num and ending at the week for that row of p_dict. This requires that p_dict is sorted in ascending order along the WEEK column. Then you can use pd.merge again.
I am assuming that min_week_num is 0 in the following example, but you could easily modify rolling_growing_mean to take a different value. The rolling_growing_mean method runs in O(n) since it performs a fixed number of operations per iteration.
n_size = 1000
np.random.seed(997)
p_dict = pd.DataFrame({'WEEK': range(52), 'value': np.random.randint(0, 1000, 52)})
df = pd.DataFrame({'WEEK': np.random.randint(0, 52, n_size)})
def rolling_growing_mean(values):
    out = np.empty(len(values))
    out[0] = values[0]
    # Time window for taking the mean grows by one row each step
    for i, v in enumerate(values[1:]):
        out[i+1] = np.true_divide(out[i]*(i+1) + v, i+2)
    return out
p_dict['Means'] = rolling_growing_mean(p_dict['value'])
df_merged = pd.merge(df, p_dict, how='inner', left_on='WEEK', right_on='WEEK')
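As a sanity check (my addition, not part of the original answer): with min_week_num = 0, the growing-window mean computed by rolling_growing_mean at row i is the mean of value[0..i], which is exactly what pandas' built-in expanding mean computes, so the two should agree up to floating-point tolerance.

```python
import numpy as np
import pandas as pd

np.random.seed(997)
p_dict = pd.DataFrame({'WEEK': range(52),
                       'value': np.random.randint(0, 1000, 52)})

def rolling_growing_mean(values):
    out = np.empty(len(values))
    out[0] = values[0]
    for i, v in enumerate(values[1:]):
        out[i + 1] = np.true_divide(out[i] * (i + 1) + v, i + 2)
    return out

manual = rolling_growing_mean(p_dict['value'].values)
builtin = p_dict['value'].expanding().mean().values

assert np.allclose(manual, builtin)
```

The built-in expanding().mean() is also a drop-in replacement here if you do not need a custom min_week_num.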