Note: this page is a side-by-side Chinese and English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/32059397/

Date: 2020-08-19 10:56:25  Source: igfitidea

pandas groupby without turning grouped by column into index

Tags: python, pandas, dataframe

Asked by Mohamed Ali JAMAOUI

The default behavior of pandas groupby is to turn the group-by columns into the index and remove them from the dataframe's list of columns. For instance, say I have a DataFrame with these columns:

col1|col2|col3|col4

If I apply a groupby on columns col2 and col3 this way:

df.groupby(['col2','col3']).sum()

The resulting dataframe no longer has ['col2', 'col3'] in its list of columns. They are automatically turned into the index of the resulting dataframe.
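
The behavior described here can be reproduced with a minimal sketch. The data values below are hypothetical and chosen only for illustration; the column names follow the question.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the question.
df = pd.DataFrame({
    "col1": [1.0, 2.0, 3.0],
    "col2": ["a", "a", "b"],
    "col3": ["x", "x", "y"],
    "col4": [10, 20, 30],
})

grouped = df.groupby(["col2", "col3"]).sum()

print(list(grouped.columns))      # the grouping keys are no longer columns
print(list(grouped.index.names))  # they are now the levels of the index
print(list(df.columns))           # the original df itself is unchanged
```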

My question is how can I perform groupby on a column and yet keep that column in the dataframe?

Accepted answer by user2034412

df.groupby(['col2','col3'], as_index=False).sum()

Answer by Boudewijn Aasman

Another way to do this would be:

df.groupby(['col2', 'col3']).sum().reset_index()

Answer by set92

Not sure, but I think the right answer would be

df = df.groupby(['col2','col3']).sum()  # assign the result; groupby does not modify df in place
df = df.reset_index()

At least, that is what I do all the time to avoid dataframes with a MultiIndex.
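
As a rough illustration of the MultiIndex point, the sketch below checks the index type before and after reset_index(). The toy data and the names grouped and flat are assumptions made for this example.

```python
import pandas as pd

# Hypothetical toy data for illustration.
df = pd.DataFrame({
    "col2": ["a", "a", "b"],
    "col3": ["x", "x", "y"],
    "col4": [10, 20, 30],
})

grouped = df.groupby(["col2", "col3"]).sum()  # two keys give a two-level index
flat = grouped.reset_index()                  # index levels move back to columns

print(isinstance(grouped.index, pd.MultiIndex))
print(isinstance(flat.index, pd.MultiIndex))
print(list(flat.columns))
```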

Answer by Mohamed Ali JAMAOUI

The following somewhat detailed answer is added to help those who are still unsure which variant of the answers to use.

First, the two suggested solutions to this problem are:

  • Solution 1: df.groupby(['col2', 'col3'], as_index=False).sum()
  • Solution 2: df.groupby(['col2', 'col3']).sum().reset_index()

Both give the expected result.



Solution 1:

As explained in the documentation, as_index=False asks for SQL-style grouped output, which effectively tells pandas to keep the grouped-by columns as regular columns in the output.

as_index: bool, default True

For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output.

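
A small sketch of what the as_index parameter changes, assuming a hypothetical two-column frame (the names key, value, with_index and sql_style are made up for this illustration):

```python
import pandas as pd

# Hypothetical toy data, not the frame used in the answer.
df = pd.DataFrame({"key": ["a", "b", "a"], "value": [1, 2, 3]})

with_index = df.groupby("key").sum()                 # as_index=True (the default)
sql_style = df.groupby("key", as_index=False).sum()  # keys stay as a column

print(with_index.index.name)    # the group label is the index
print(list(sql_style.columns))  # the group label is a regular column
print(list(sql_style.index))    # default integer index
```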

Example:

Given the following DataFrame:

  col1  col2      col3      col4
0    A     1  0.502130  0.959404
1    A     3  0.335416  0.087215
2    B     2  0.067308  0.084595
3    B     4  0.454158  0.723124
4    B     4  0.323326  0.895858
5    C     2  0.672375  0.356736
6    C     5  0.929655  0.371913
7    D     5  0.212634  0.540736
8    D     5  0.471418  0.268270
9    E     1  0.061270  0.739610
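
To follow along, the frame above can be rebuilt by hand. This is only a sketch: the answer does not say how df was created, and the values here are the six-decimal roundings shown in the printout.

```python
import pandas as pd

# Values copied from the printed frame (rounded to six decimals there).
df = pd.DataFrame({
    "col1": ["A", "A", "B", "B", "B", "C", "C", "D", "D", "E"],
    "col2": [1, 3, 2, 4, 4, 2, 5, 5, 5, 1],
    "col3": [0.502130, 0.335416, 0.067308, 0.454158, 0.323326,
             0.672375, 0.929655, 0.212634, 0.471418, 0.061270],
    "col4": [0.959404, 0.087215, 0.084595, 0.723124, 0.895858,
             0.356736, 0.371913, 0.540736, 0.268270, 0.739610],
})

print(df)
```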

Applying the first solution gives:

>>> df.groupby(["col1", "col2"], as_index=False).sum()

  col1  col2      col3      col4
0    A     1  0.502130  0.959404
1    A     3  0.335416  0.087215
2    B     2  0.067308  0.084595
3    B     4  0.777483  1.618982
4    C     2  0.672375  0.356736
5    C     5  0.929655  0.371913
6    D     5  0.684052  0.809006
7    E     1  0.061270  0.739610

Here the groupby columns are preserved as regular columns.



Solution 2:

To understand the second solution, let's look at the output of the previous command with as_index=True, which is the default behavior of pandas.DataFrame.groupby (check the documentation):

>>> df.groupby(["col1", "col2"], as_index=True).sum()
               col3      col4
col1 col2                    
A    1     0.502130  0.959404
     3     0.335416  0.087215
B    2     0.067308  0.084595
     4     0.777483  1.618982
C    2     0.672375  0.356736
     5     0.929655  0.371913
D    5     0.684052  0.809006
E    1     0.061270  0.739610

As you can see, the groupby keys become the index of the dataframe. Using pandas.DataFrame.reset_index (check the documentation), we can turn the index levels back into columns and use a default index, which leads to the same result as in the previous step:

>>> df.groupby(['col1', 'col2']).sum().reset_index()
  col1  col2      col3      col4
0    A     1  0.502130  0.959404
1    A     3  0.335416  0.087215
2    B     2  0.067308  0.084595
3    B     4  0.777483  1.618982
4    C     2  0.672375  0.356736
5    C     5  0.929655  0.371913
6    D     5  0.684052  0.809006
7    E     1  0.061270  0.739610
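
One way to confirm that the two solutions agree is to compare the resulting frames directly. The sketch below uses a subset of the rows from the example frame and pandas.testing.assert_frame_equal; the variable names solution_1 and solution_2 are made up for this check.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# A few rows taken from the example frame above.
df = pd.DataFrame({
    "col1": ["A", "B", "B", "D", "D"],
    "col2": [1, 4, 4, 5, 5],
    "col3": [0.502130, 0.454158, 0.323326, 0.212634, 0.471418],
    "col4": [0.959404, 0.723124, 0.895858, 0.540736, 0.268270],
})

solution_1 = df.groupby(["col1", "col2"], as_index=False).sum()
solution_2 = df.groupby(["col1", "col2"]).sum().reset_index()

# Raises an AssertionError if the two frames differ.
assert_frame_equal(solution_1, solution_2)
print(solution_1)
```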


Benchmark

Notice that since the first solution achieves the requirement in 1 step versus 2 steps in the second solution, the former is slightly faster:

%timeit df.groupby(["col1", "col2"], as_index=False).sum()
3.38 ms ± 21.2 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df.groupby(["col1", "col2"]).sum().reset_index()
3.9 ms ± 365 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
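
The %timeit lines above are IPython magics. Outside IPython, a comparable measurement can be sketched with the standard timeit module. The small frame and the repetition count n below are assumptions, and the timings will vary with machine, pandas version, and data size.

```python
import timeit

import pandas as pd

# Hypothetical small frame; the real benchmark used the example df.
df = pd.DataFrame({
    "col1": ["A", "B", "B", "D", "D"],
    "col2": [1, 4, 4, 5, 5],
    "col3": [0.502130, 0.454158, 0.323326, 0.212634, 0.471418],
})

n = 100  # number of calls per measurement (an arbitrary choice)

t1 = timeit.timeit(
    lambda: df.groupby(["col1", "col2"], as_index=False).sum(), number=n
)
t2 = timeit.timeit(
    lambda: df.groupby(["col1", "col2"]).sum().reset_index(), number=n
)

print(f"as_index=False: {t1 / n * 1e3:.3f} ms per call")
print(f"reset_index():  {t2 / n * 1e3:.3f} ms per call")
```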