pandas: sorting observations within groupby groups

Disclaimer: This page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original URL: http://stackoverflow.com/questions/36073984/

Date: 2020-09-14 00:53:50  Source: igfitidea

Tags: python, pandas

Asked by Dmitry B.

According to the answer to "pandas groupby sort within groups", in order to sort observations within each group one needs to do a second groupby on the results of the first groupby. Why is a second groupby needed? I would have assumed that observations are already arranged into groups after running the first groupby, and that all that would be needed is a way to enumerate those groups (and run apply with order).

Answered by tvashtar

Because once you apply a function after a groupby, the results are combined back into a normal ungrouped data frame. Using groupby and a groupby method like sort should be thought of as a Split-Apply-Combine operation.

The groupby splits the original data frame and the method is applied to each group, but then the results are combined again implicitly.
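
A minimal sketch of that split-apply-combine behaviour, assuming a small made-up data frame (the data is hypothetical; the column names follow the snippets in this thread):

```python
import pandas as pd

# Hypothetical data; column names follow the snippets in this thread.
df = pd.DataFrame({
    "job": ["sales", "sales", "sales", "market", "market"],
    "source": ["A", "A", "B", "A", "B"],
    "count": [1, 2, 4, 5, 3],
})

grouped = df.groupby(["job", "source"])      # split: a lazy GroupBy object
combined = grouped.agg({"count": "sum"})     # apply + combine

# The combined result is a plain DataFrame, not a GroupBy object, so any
# further per-group step (such as head) needs a new groupby call.
print(type(grouped).__name__)
print(type(combined).__name__)
```

The grouping itself is not kept in the result; only the group keys survive, as the index of the combined frame.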

In that other question, they could have reversed the operations (sorting first) and then would not have needed two groupbys. They could do:

# DataFrame.sort was removed in pandas 0.20; sort_values is its replacement
df.sort_values(['job', 'count'], ascending=False).groupby('job').head(3)
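
A runnable sketch of the sort-first approach, assuming hypothetical data with the same column names. It calls sort_values, since DataFrame.sort is not available in current pandas:

```python
import pandas as pd

# Hypothetical data: four rows for "sales", two for "market".
df = pd.DataFrame({
    "job": ["sales", "sales", "sales", "sales", "market", "market"],
    "source": ["A", "B", "C", "D", "A", "B"],
    "count": [2, 4, 1, 3, 5, 3],
})

# Sort once over the whole frame, then take the first 3 rows of each job.
# head() keeps rows in the order of the sorted frame, so each group's rows
# stay in descending count order.
top3 = (
    df.sort_values(["job", "count"], ascending=False)
      .groupby("job")
      .head(3)
)
print(top3)
```

Only one groupby appears here because the sort is applied to the ungrouped frame before any grouping takes place.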

Answered by Istopopoki

They need a second groupby in that case because, on top of sorting, they want to keep only the top 3 rows of each group.

If you just need to sort after a groupby, you can do:

df_res = df.groupby(['job', 'source']).agg({'count': 'sum'}).sort_values(['job', 'count'], ascending=False)

One groupby is enough.

And if you want to keep the 3 rows with the highest count for each group, then you can group again and use the head() function:

df_res.groupby('job').head(3)
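
Putting both steps together, a self-contained sketch under the assumption of a small made-up data frame. Sorting and grouping by the index level name 'job' assumes a reasonably recent pandas (0.23 or later):

```python
import pandas as pd

# Hypothetical data with a repeated (job, source) pair so that the
# aggregation has something to sum.
df = pd.DataFrame({
    "job": ["sales", "sales", "sales", "sales", "sales", "market", "market"],
    "source": ["A", "A", "B", "C", "D", "A", "B"],
    "count": [1, 2, 4, 1, 2, 5, 3],
})

# First groupby: total count per (job, source), then sort the combined
# frame by job and count, both descending.
df_res = (
    df.groupby(["job", "source"])
      .agg({"count": "sum"})
      .sort_values(["job", "count"], ascending=False)
)

# Second groupby: keep the 3 highest-count rows within each job.
top3 = df_res.groupby("job").head(3)
print(top3)
```

The second groupby is needed only for the per-job truncation; the sorting itself is done on the combined frame.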