Pandas groupby+transform on 50 million rows is taking 3 hours

Declaration: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must likewise follow CC BY-SA, link the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/31890613/
Asked by Vipin
I am using the pandas module. In my DataFrame the 3 fields are account, month and salary.
account  month   salary
1        201501  10000
2        201506  20000
2        201506  20000
3        201508  30000
3        201508  30000
3        201506  10000
3        201506  10000
3        201506  10000
3        201506  10000
I am doing a groupby on account and month and converting salary to a percentage of the salary of the group it belongs to.
MyDataFrame['salary'] = MyDataFrame.groupby(['account', 'month'])['salary'].transform(lambda x: x / x.sum())
Now MyDataFrame becomes like the table below:
account  month   salary
1        201501  1
2        201506  .5
2        201506  .5
3        201508  .5
3        201508  .5
3        201506  .25
3        201506  .25
3        201506  .25
3        201506  .25
The problem is: this operation on 50 million such rows is taking 3 hours. I executed the groupby separately and it is fast, taking only 5 seconds. I think it is the transform that is taking a long time here. Is there any way to improve performance?
Update: To provide more clarity, here is an example: some account holder received a salary of 2000 in June and 8000 in July, so his proportion becomes .2 for June and .8 for July. My purpose is to calculate this proportion.
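Worked through, the arithmetic is 2000 / (2000 + 8000) = 0.2 and 8000 / 10000 = 0.8; note that, taken literally, this example groups by account alone, since the proportion compares months within one account. A minimal sketch of that calculation (the account id 7 is made up for illustration):

import pandas as pd

# Hypothetical two-row frame reproducing the June/July example above.
df = pd.DataFrame({'account': [7, 7],
                   'month':   [201506, 201507],
                   'salary':  [2000, 8000]})

# Each month's share of the account's total salary.
df['proportion'] = df['salary'] / df.groupby('account')['salary'].transform('sum')
print(df)  # proportion: 0.2 for 201506, 0.8 for 201507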
Accepted answer by Jeff
Well, you need to be more explicit and show exactly what you are doing. This is something pandas excels at.
Note for @Uri Goren: this is a constant-memory process and only has 1 group in memory at a time. This will scale linearly with the number of groups. Sorting is also unnecessary. (The session below assumes the usual imports: import numpy as np, import pandas as pd, and from pandas import DataFrame.)
In [20]: np.random.seed(1234)
In [21]: ngroups = 1000
In [22]: nrows = 50000000
In [23]: dates = pd.date_range('20000101',freq='MS',periods=ngroups)
In [24]: df = DataFrame({'account' : np.random.randint(0,ngroups,size=nrows),
                         'date' : dates.take(np.random.randint(0,ngroups,size=nrows)),
                         'values' : np.random.randn(nrows) })
In [25]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50000000 entries, 0 to 49999999
Data columns (total 3 columns):
account    int64
date       datetime64[ns]
values     float64
dtypes: datetime64[ns](1), float64(1), int64(1)
memory usage: 1.5 GB
In [26]: df.head()
Out[26]:
   account       date    values
0      815 2048-02-01 -0.412587
1      723 2023-01-01 -0.098131
2      294 2020-11-01 -2.899752
3       53 2058-02-01 -0.469925
4      204 2080-11-01  1.389950
In [27]: %timeit df.groupby(['account','date']).sum()
1 loops, best of 3: 8.08 s per loop
If you want to transform the output, then do it like this:
In [37]: g = df.groupby(['account','date'])['values']
In [38]: result = 100*df['values']/g.transform('sum')
In [41]: result.head()
Out[41]:
0 4.688957
1 -2.340621
2 -80.042089
3 -13.813078
4 -70.857014
dtype: float64
In [43]: len(result)
Out[43]: 50000000
In [42]: %timeit 100*df['values']/g.transform('sum')
1 loops, best of 3: 30.9 s per loop
This takes a bit longer. But again, it should be a relatively fast operation.
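The speedup over the question's code comes from passing the named aggregation 'sum' to transform, which runs on pandas' cythonized path, whereas transform(lambda x: x/x.sum()) invokes a Python function once per group. Applied back to the question's frame, a sketch of the same fix (assuming MyDataFrame and its columns from the question) would be:

# Sketch: percent-of-group via the fast named-aggregation path.
# MyDataFrame, 'account', 'month' and 'salary' are taken from the question.
g = MyDataFrame.groupby(['account', 'month'])['salary']
MyDataFrame['salary'] = MyDataFrame['salary'] / g.transform('sum')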
Answered by Uri Goren
I would use a different approach. First, sort:
# DataFrame.sort was removed in pandas 0.20; sort_values is the current API
MyDataFrame.sort_values(['account', 'month'], inplace=True)
Then iterate and sum:
(account, month) = ('', '')   # sentinel values that match no real group
salary = 0.0
res = []
for index, row in MyDataFrame.iterrows():
    if (row['account'], row['month']) == (account, month):
        salary += row['salary']
    else:
        if account != '':                   # skip the initial sentinel group
            res.append([account, month, salary])
        salary = row['salary']              # start the new group with this row's salary
        (account, month) = (row['account'], row['month'])
if account != '':
    res.append([account, month, salary])    # flush the final group
df = pd.DataFrame(res, columns=['account', 'month', 'salary'])
This way, pandas doesn't need to hold the grouped data in memory.
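One further caveat with this answer: iterrows builds a full Series object for every row, which at 50 million rows is itself a large cost. If you do iterate, itertuples yields lightweight namedtuples instead; a sketch of the same single-pass accumulation (assuming the sorted MyDataFrame from above) is:

import pandas as pd

# Same one-group-at-a-time accumulation, using itertuples instead of iterrows.
res = []
account, month, salary = None, None, 0.0   # no group seen yet
for row in MyDataFrame.itertuples(index=False):
    if (row.account, row.month) == (account, month):
        salary += row.salary
    else:
        if account is not None:            # flush the previous group
            res.append([account, month, salary])
        account, month, salary = row.account, row.month, row.salary
if account is not None:
    res.append([account, month, salary])   # flush the final group
df = pd.DataFrame(res, columns=['account', 'month', 'salary'])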

