Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/22072943/

Date: 2020-09-13 21:44:52. Source: igfitidea

Faster way to transform group with mean value in Pandas

Tags: python, numpy, pandas

Asked by YXD

I have a Pandas dataframe where I am trying to replace the values in each group by the mean of the group. On my machine, the line df["signal"].groupby(g).transform(np.mean) takes about 10 seconds to run with N and N_TRANSITIONS set to the numbers below.

Is there any faster way to achieve the same result?

import pandas as pd
import numpy as np
from time import time

np.random.seed(0)

N = 120000
N_TRANSITIONS = 1400

# generate groups
transition_points = np.random.permutation(np.arange(N))[:N_TRANSITIONS]
transition_points.sort()
transitions = np.zeros((N,), dtype=bool)  # np.bool was removed in newer NumPy; the builtin bool works everywhere
transitions[transition_points] = True
g = transitions.cumsum()

df = pd.DataFrame({ "signal" : np.random.rand(N)})

# here is my bottleneck for large N
tic = time()
result = df["signal"].groupby(g).transform(np.mean)
toc = time()
print(toc - tic)

Answered by Jeff

Current method, using transform

In [44]: grp = df["signal"].groupby(g)

In [45]: result2 = df["signal"].groupby(g).transform(np.mean)

In [47]: %timeit df["signal"].groupby(g).transform(np.mean)
1 loops, best of 3: 535 ms per loop
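
As a side note to the transform timing above, here is a minimal sketch comparing a Python callable with the string alias "mean", which transform also accepts. The toy data and the names signal, by_callable and by_alias are assumptions for illustration, and a lambda stands in for np.mean. No timings are claimed here; whether the alias is faster depends on your pandas version, so measure on your own data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 12 values in 4 groups of 3 rows each
rng = np.random.default_rng(0)
signal = pd.Series(rng.random(12), name="signal")
g = np.repeat(np.arange(4), 3)

# Callable form: the function is applied to each group and the scalar is broadcast back
by_callable = signal.groupby(g).transform(lambda x: x.mean())

# String alias form: asks pandas for its built-in group mean
by_alias = signal.groupby(g).transform("mean")

print(np.allclose(by_callable.values, by_alias.values))
```

Both calls return a Series with the same index as the input, so either can be assigned straight back to a column.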

Using 'broadcasting' of the results

In [43]: result = pd.concat([ pd.Series([r]*len(grp.groups[i])) for i, r in enumerate(grp.mean().values) ],ignore_index=True)

In [42]: %timeit pd.concat([ pd.Series([r]*len(grp.groups[i])) for i, r in enumerate(grp.mean().values) ],ignore_index=True)
10 loops, best of 3: 119 ms per loop

In [46]: result.equals(result2)
Out[46]: True

I think you might need to set the index on the broadcast result (it happens to work here because df has a default index):

result = pd.concat([ pd.Series([r]*len(grp.groups[i])) for i, r in enumerate(grp.mean().values) ],ignore_index=True)
result.index = df.index
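
To illustrate why the index assignment matters, here is a small sketch with a non-default index. The five-row frame and its labels 10 to 14 are made up for this example. It also relies on the group labels being 0, 1, ... so that enumerate lines up with the keys of grp.groups, which is an assumption carried over from the snippet above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a non-default index
df = pd.DataFrame({"signal": [1.0, 3.0, 2.0, 4.0, 6.0]}, index=[10, 11, 12, 13, 14])
g = np.array([0, 0, 1, 1, 1])  # group labels 0 and 1, already contiguous

grp = df["signal"].groupby(g)

# Broadcast each group mean over the group's length; the result has a fresh 0..4 index
result = pd.concat(
    [pd.Series([r] * len(grp.groups[i])) for i, r in enumerate(grp.mean().values)],
    ignore_index=True,
)

# Re-attach the original labels so the result aligns with df
result.index = df.index

expected = grp.transform("mean")
print(result.index.equals(expected.index))
```

Without the last assignment, result would be labelled 0 to 4 and would not align with df when assigned back as a column.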

Answered by YXD

Inspired by Jeff's answer. This is the fastest method on my machine:

pd.Series(np.repeat(grp.mean().values, grp.count().values))
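
For anyone who wants to try the one-liner above end to end, here is a self-contained sketch on a smaller version of the data from the question (N = 1000 and N_TRANSITIONS = 20 are scaled-down values chosen for illustration). Two things differ from the one-liner and are my assumptions: size() replaces count() so that NaN values cannot shorten a group, and the original index is passed explicitly. The approach only works when each group's rows are contiguous and in sorted label order, as they are for a cumsum-based g.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
N = 1000            # scaled down from the question for a quick check
N_TRANSITIONS = 20

# Build contiguous, non-decreasing group labels as in the question
transition_points = np.sort(np.random.permutation(N)[:N_TRANSITIONS])
transitions = np.zeros(N, dtype=bool)
transitions[transition_points] = True
g = transitions.cumsum()

df = pd.DataFrame({"signal": np.random.rand(N)})
grp = df["signal"].groupby(g)

# Repeat each group mean once per row of that group.
# size() counts every row; count() would skip NaN entries.
fast = pd.Series(np.repeat(grp.mean().values, grp.size().values), index=df.index)

# Reference result from transform, for comparison
reference = grp.transform("mean")
print(np.allclose(fast.values, reference.values))
```

If the group labels are not contiguous, np.repeat would lay the means out in the wrong order, so transform is the safer choice in that case.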