Pandas groupby+transform on 50 million rows is taking 3 hours

Declaration: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must likewise follow CC BY-SA, link the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/31890613/
Asked by Vipin
I am using the pandas module. In my DataFrame the 3 fields are account, month and salary.
account  month   salary
1        201501  10000
2        201506  20000
2        201506  20000
3        201508  30000
3        201508  30000
3        201506  10000
3        201506  10000
3        201506  10000
3        201506  10000
I am doing a groupby on account and month and converting salary to a percentage of the salary of the group it belongs to.
MyDataFrame['salary'] = MyDataFrame.groupby(['account', 'month'])['salary'].transform(lambda x: x / x.sum())
Now MyDataFrame becomes like the table below:
account  month   salary
1        201501  1
2        201506  .5
2        201506  .5
3        201508  .5
3        201508  .5
3        201506  .25
3        201506  .25
3        201506  .25
3        201506  .25
The problem is: this operation on 50 million such rows is taking 3 hours. I executed the groupby separately and it is fast, taking only 5 seconds. I think it is the transform that is taking a long time here. Is there any way to improve performance?
Update: To provide more clarity, here is an example: some account holder received a salary of 2000 in June and 8000 in July, so his proportion becomes .2 for June and .8 for July. My purpose is to calculate this proportion.
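Worked through, the arithmetic is 2000 / (2000 + 8000) = 0.2 and 8000 / 10000 = 0.8; note that, taken literally, this example groups by account alone, since the proportion compares months within one account. A minimal sketch of that calculation (the account id 7 is made up for illustration):

import pandas as pd

# Hypothetical two-row frame reproducing the June/July example above.
df = pd.DataFrame({'account': [7, 7],
                   'month':   [201506, 201507],
                   'salary':  [2000, 8000]})

# Each month's share of the account's total salary.
df['proportion'] = df['salary'] / df.groupby('account')['salary'].transform('sum')
print(df)  # proportion: 0.2 for 201506, 0.8 for 201507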
Accepted answer by Jeff
Well, you need to be more explicit and show exactly what you are doing. This is something pandas excels at.
Note for @Uri Goren: this is a constant-memory process and only has 1 group in memory at a time. This will scale linearly with the number of groups. Sorting is also unnecessary. (The session below assumes the usual imports: import numpy as np, import pandas as pd, and from pandas import DataFrame.)
In [20]: np.random.seed(1234)
In [21]: ngroups = 1000
In [22]: nrows = 50000000
In [23]: dates = pd.date_range('20000101',freq='MS',periods=ngroups)
In [24]: df = DataFrame({'account' : np.random.randint(0,ngroups,size=nrows),
                         'date' : dates.take(np.random.randint(0,ngroups,size=nrows)),
                         'values' : np.random.randn(nrows) })
In [25]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50000000 entries, 0 to 49999999
Data columns (total 3 columns):
account    int64
date       datetime64[ns]
values     float64
dtypes: datetime64[ns](1), float64(1), int64(1)
memory usage: 1.5 GB
In [26]: df.head()
Out[26]:
   account       date    values
0      815 2048-02-01 -0.412587
1      723 2023-01-01 -0.098131
2      294 2020-11-01 -2.899752
3       53 2058-02-01 -0.469925
4      204 2080-11-01  1.389950
In [27]: %timeit df.groupby(['account','date']).sum()
1 loops, best of 3: 8.08 s per loop
If you want to transform the output, then do it like this:
In [37]: g = df.groupby(['account','date'])['values']
In [38]: result = 100*df['values']/g.transform('sum')
In [41]: result.head()
Out[41]:
0 4.688957
1 -2.340621
2 -80.042089
3 -13.813078
4 -70.857014
dtype: float64
In [43]: len(result)
Out[43]: 50000000
In [42]: %timeit 100*df['values']/g.transform('sum')
1 loops, best of 3: 30.9 s per loop
This takes a bit longer. But again, it should be a relatively fast operation.
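The speedup over the question's code comes from passing the named aggregation 'sum' to transform, which runs on pandas' cythonized path, whereas transform(lambda x: x/x.sum()) invokes a Python function once per group. Applied back to the question's frame, a sketch of the same fix (assuming MyDataFrame and its columns from the question) would be:

# Sketch: percent-of-group via the fast named-aggregation path.
# MyDataFrame, 'account', 'month' and 'salary' are taken from the question.
g = MyDataFrame.groupby(['account', 'month'])['salary']
MyDataFrame['salary'] = MyDataFrame['salary'] / g.transform('sum')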
Answered by Uri Goren
I would use a different approach. First, sort:
# DataFrame.sort was removed in pandas 0.20; sort_values is the current API
MyDataFrame.sort_values(['account', 'month'], inplace=True)
Then iterate and sum:
(account, month) = ('', '')   # sentinel values that match no real group
salary = 0.0
res = []
for index, row in MyDataFrame.iterrows():
    if (row['account'], row['month']) == (account, month):
        salary += row['salary']
    else:
        if account != '':                   # skip the initial sentinel group
            res.append([account, month, salary])
        salary = row['salary']              # start the new group with this row's salary
        (account, month) = (row['account'], row['month'])
if account != '':
    res.append([account, month, salary])    # flush the final group
df = pd.DataFrame(res, columns=['account', 'month', 'salary'])
This way, pandas doesn't need to hold the grouped data in memory.
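One further caveat with this answer: iterrows builds a full Series object for every row, which at 50 million rows is itself a large cost. If you do iterate, itertuples yields lightweight namedtuples instead; a sketch of the same single-pass accumulation (assuming the sorted MyDataFrame from above) is:

import pandas as pd

# Same one-group-at-a-time accumulation, using itertuples instead of iterrows.
res = []
account, month, salary = None, None, 0.0   # no group seen yet
for row in MyDataFrame.itertuples(index=False):
    if (row.account, row.month) == (account, month):
        salary += row.salary
    else:
        if account is not None:            # flush the previous group
            res.append([account, month, salary])
        account, month, salary = row.account, row.month, row.salary
if account is not None:
    res.append([account, month, salary])   # flush the final group
df = pd.DataFrame(res, columns=['account', 'month', 'salary'])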

