Python Pandas: Is Order Preserved When Using groupby() and agg()?

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me), citing the original StackOverflow URL: http://stackoverflow.com/questions/26456125/

Date: 2020-09-13 22:35:33. Source: igfitidea

Tags: python, pandas, aggregate

Asked by BringMyCakeBack

I've frequently used pandas' agg() function to run summary statistics on every column of a DataFrame. For example, here's how you would produce the mean and standard deviation:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['group1', 'group1', 'group2', 'group2', 'group3', 'group3'],
                   'B': [10, 12, 10, 25, 10, 12],
                   'C': [100, 102, 100, 250, 100, 102]})

>>> df
[output]
        A   B    C
0  group1  10  100
1  group1  12  102
2  group2  10  100
3  group2  25  250
4  group3  10  100
5  group3  12  102
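
The aggregation call for the mean and standard deviation is not shown in the text. A minimal sketch of what it might look like, assuming the string aliases 'mean' and 'std' (the original may have passed NumPy functions instead) and using summary as an illustrative variable name:

```python
import pandas as pd

df = pd.DataFrame({'A': ['group1', 'group1', 'group2', 'group2', 'group3', 'group3'],
                   'B': [10, 12, 10, 25, 10, 12],
                   'C': [100, 102, 100, 250, 100, 102]})

# Mean and sample standard deviation of every non-grouping column, per group.
# The result has two column levels: (column name, statistic).
summary = df.groupby('A').agg(['mean', 'std'])
print(summary)
```

Neither statistic depends on the order of the rows inside a group, which is the point the next paragraph makes.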

In both of those cases, the order in which individual rows are sent to the agg function does not matter. But consider the following example, which also picks out the second row of each group:

df.groupby('A').agg([np.mean, lambda x: x.iloc[1] ])

[output]

           B             C         
        mean <lambda> mean <lambda>
A                                  
group1  11.0       12  101      102
group2  17.5       25  175      250
group3  11.0       12  101      102

In this case the lambda functions as intended, outputting the second row in each group. However, I have not been able to find anything in the pandas documentation that implies that this is guaranteed to be true in all cases. I want to use agg() along with a weighted average function, so I want to be sure that the rows that come into the function will be in the same order as they appear in the original data frame.

Does anyone know, ideally from the docs or the pandas source code, whether this is guaranteed to be the case?
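
For reference, a minimal sketch of the kind of weighted average the question has in mind. It assumes, purely for illustration, that column C holds the weights for column B. The helper name weighted_mean is hypothetical, not a pandas function. Because the weights are looked up by index label, this version gives the same result regardless of the order in which rows reach the function:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['group1', 'group1', 'group2', 'group2', 'group3', 'group3'],
                   'B': [10, 12, 10, 25, 10, 12],
                   'C': [100, 102, 100, 250, 100, 102]})

def weighted_mean(values):
    """Weighted mean of one group's values (hypothetical helper).

    Assumes the weights live in column 'C' of the outer DataFrame and
    fetches them by index label, so each value meets its own weight.
    """
    weights = df.loc[values.index, 'C']
    return np.average(values.to_numpy(), weights=weights.to_numpy())

# Each group's 'B' values arrive as a Series that keeps the original index.
result = df.groupby('A')['B'].agg(weighted_mean)
print(result)
```

If the weights were instead passed as a separate positional array, the calculation would rely on the within-group row order that the question asks about.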

Accepted answer by Jeff

See this enhancement issue

The short answer is yes, the groupby will preserve the orderings as passed in. You can prove this by using your example like this:

In [20]: df.sort_index(ascending=False).groupby('A').agg([np.mean, lambda x: x.iloc[1] ])
Out[20]: 
           B             C         
        mean <lambda> mean <lambda>
A                                  
group1  11.0       10  101      100
group2  17.5       10  175      100
group3  11.0       10  101      100
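
The same behavior can be made visible another way. This is a small sketch offered as a check, not a proof: it collects each group's values into a list, so the order in which rows reached the aggregation function can be read off directly. The names reversed_df and collected are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': ['group1', 'group1', 'group2', 'group2', 'group3', 'group3'],
                   'B': [10, 12, 10, 25, 10, 12],
                   'C': [100, 102, 100, 250, 100, 102]})

# Reverse the row order before grouping, as in the example above.
reversed_df = df.sort_index(ascending=False)

# Gather each group's 'B' values in the order the group received them.
collected = reversed_df.groupby('A')['B'].agg(list)
print(collected)
```

Each list comes out in the reversed order of the input frame, while the group labels themselves are still sorted.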

This is NOT true for resample, however, as it requires a monotonic index (it WILL work with a non-monotonic index, but will sort it first).

There is a sort= flag to groupby, but this relates to the sorting of the groups themselves and not the observations within a group.

FYI: df.groupby('A').nth(1) is a safe way to get the 2nd value of a group (as your method above will fail if a group has < 2 elements).
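
A short sketch of the nth(1) suggestion, using a hypothetical frame (small) in which one group has only a single row. The group without a second row is simply left out of the result. Note that the index of the result differs between pandas versions, so the example reads the values from column B only.

```python
import pandas as pd

# Hypothetical data: 'group2' has just one row, so it has no second element.
small = pd.DataFrame({'A': ['group1', 'group1', 'group2'],
                      'B': [10, 12, 7]})

# nth(1) picks the second row of each group and skips groups that are too short.
second = small.groupby('A').nth(1)
second_values = second['B'].tolist()
print(second_values)
```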

Answer by Uwe Mayer

The pandas 0.19.1 docs say "groupby preserves the order of rows within each group", so this is guaranteed behavior.

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html

Answer by Dima Lituiev

To also preserve the order in which the groups themselves appear, you'll need to pass .groupby(..., sort=False). In your case the grouping column is already sorted, so it does not make a difference, but in general one must use the sort=False flag:

 df.groupby('A', sort=False).agg([np.mean, lambda x: x.iloc[1] ])
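
To illustrate what the sort flag does and does not change, here is a minimal sketch with a hypothetical frame whose keys are not already sorted. The names frame, key, and value are made up for this example.

```python
import pandas as pd

# Hypothetical data where the group keys appear in the order 'b', 'a'.
frame = pd.DataFrame({'key': ['b', 'a', 'b', 'a'],
                      'value': [1, 2, 3, 4]})

# Default sort=True: group keys come out sorted.
sorted_groups = frame.groupby('key', sort=True)['value'].agg(list)

# sort=False: group keys come out in order of first appearance.
unsorted_groups = frame.groupby('key', sort=False)['value'].agg(list)

print(sorted_groups)
print(unsorted_groups)
```

Only the order of the group labels differs between the two results. The values inside each group keep their original row order in both.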

Answer by Jigidi Sarnath

Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html

The API accepts sort as an argument.

The description of the sort argument reads as follows:

sort : bool, default True
    Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.

Thus, it is clear that groupby does preserve the order of rows within each group.

Answer by TinaW

Even easier:

  import numpy as np
  import pandas as pd
  pd.pivot_table(df, index='A', aggfunc=np.mean)

output:

           B    C
A                
group1  11.0  101
group2  17.5  175
group3  11.0  101