pandas groupby without turning grouped by column into index
Original question: http://stackoverflow.com/questions/32059397/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep it under the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Asked by Mohamed Ali JAMAOUI
The default behavior of pandas groupby is to turn the group-by columns into the index and remove them from the dataframe's list of columns. For instance, say I have a DataFrame with these columns:
col1|col2|col3|col4
If I apply a groupby on columns col2 and col3 this way:
df.groupby(['col2','col3']).sum()
The resulting dataframe no longer has ['col2','col3'] in its list of columns. They are automatically turned into the index of the result.
My question is: how can I perform a groupby on a column and still keep that column in the dataframe?
Accepted answer by user2034412
df.groupby(['col2','col3'], as_index=False).sum()
Answer by Boudewijn Aasman
Another way to do this would be:
df.groupby(['col2', 'col3']).sum().reset_index()
Answer by set92
Not sure, but I think the right answer would be:
df = df.groupby(['col2','col3']).sum()  # groupby returns a new object, so keep the result
df = df.reset_index()
At least, that is what I do all the time to avoid dataframes with a multi-index.
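As a rough illustration of the multi-index point, the sketch below groups on two keys and then flattens the result. The toy values are hypothetical and not from the question; only the column names follow it. Since groupby returns a new object rather than modifying df, the sketch keeps the aggregated result in its own variable.

```python
import pandas as pd

# Hypothetical toy data; only the column names follow the question.
df = pd.DataFrame({
    "col1": [10, 20, 30],
    "col2": [1, 1, 2],
    "col3": ["x", "x", "y"],
    "col4": [0.5, 1.5, 2.5],
})

# Default behavior: the two group keys become a MultiIndex.
grouped = df.groupby(["col2", "col3"]).sum()
print(isinstance(grouped.index, pd.MultiIndex))

# reset_index moves the index levels back into ordinary columns.
flat = grouped.reset_index()
print(list(flat.columns))
```

After reset_index, the former index levels appear as the leading columns and the frame gets a default integer index.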
Answer by Mohamed Ali JAMAOUI
The following, somewhat more detailed answer is added to help those who are still unsure which variant of the answers to use.
First, the two suggested solutions to this problem are:
- Solution 1:
df.groupby(['col2', 'col3'], as_index=False).sum()
- Solution 2:
df.groupby(['col2', 'col3']).sum().reset_index()
Both give the expected result.
Solution 1:
As explained in the documentation, as_index=False asks for SQL-style grouped output, which effectively tells pandas to keep the group-by columns as regular columns in the output it prepares.
as_index: bool, default True
For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively "SQL-style" grouped output.
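To make the quoted parameter description concrete, here is a minimal sketch comparing where the group keys end up under each setting. The data is a hypothetical toy frame, not the one used in this answer.

```python
import pandas as pd

# Hypothetical toy data, not taken from the answer.
df = pd.DataFrame({
    "col1": ["A", "A", "B"],
    "col2": [1, 1, 2],
    "col3": [0.5, 0.25, 1.0],
})

# Default (as_index=True): the group keys become index levels.
as_index_true = df.groupby(["col1", "col2"]).sum()
# as_index=False: the group keys stay as ordinary columns.
as_index_false = df.groupby(["col1", "col2"], as_index=False).sum()

print(list(as_index_true.index.names), list(as_index_true.columns))
print(list(as_index_false.index.names), list(as_index_false.columns))
```

With the default, the keys show up in index.names and are missing from columns. With as_index=False, the index is unnamed and the keys are listed among the columns.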
Example:
Given the following DataFrame:
  col1  col2      col3      col4
0    A     1  0.502130  0.959404
1    A     3  0.335416  0.087215
2    B     2  0.067308  0.084595
3    B     4  0.454158  0.723124
4    B     4  0.323326  0.895858
5    C     2  0.672375  0.356736
6    C     5  0.929655  0.371913
7    D     5  0.212634  0.540736
8    D     5  0.471418  0.268270
9    E     1  0.061270  0.739610
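For anyone who wants to follow along, the table above can be rebuilt roughly like this. The values are copied from the table as printed, so they are rounded to six decimals, and sums computed from them may differ from the outputs shown in this answer in the last digit.

```python
import pandas as pd

# Values copied from the table above (rounded to six decimals as printed).
df = pd.DataFrame({
    "col1": ["A", "A", "B", "B", "B", "C", "C", "D", "D", "E"],
    "col2": [1, 3, 2, 4, 4, 2, 5, 5, 5, 1],
    "col3": [0.502130, 0.335416, 0.067308, 0.454158, 0.323326,
             0.672375, 0.929655, 0.212634, 0.471418, 0.061270],
    "col4": [0.959404, 0.087215, 0.084595, 0.723124, 0.895858,
             0.356736, 0.371913, 0.540736, 0.268270, 0.739610],
})

print(df)
```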
Applying the first solution gives:
>>> df.groupby(["col1", "col2"], as_index=False).sum()
  col1  col2      col3      col4
0    A     1  0.502130  0.959404
1    A     3  0.335416  0.087215
2    B     2  0.067308  0.084595
3    B     4  0.777483  1.618982
4    C     2  0.672375  0.356736
5    C     5  0.929655  0.371913
6    D     5  0.684052  0.809006
7    E     1  0.061270  0.739610
Here the groupby columns are preserved as regular columns.
Solution 2:
To understand the second solution, let's look at the output of the previous command with as_index=True, which is the default behavior of pandas.DataFrame.groupby (check documentation):
>>> df.groupby(["col1", "col2"], as_index=True).sum()
               col3      col4
col1 col2
A    1     0.502130  0.959404
     3     0.335416  0.087215
B    2     0.067308  0.084595
     4     0.777483  1.618982
C    2     0.672375  0.356736
     5     0.929655  0.371913
D    5     0.684052  0.809006
E    1     0.061270  0.739610
As you can see, the groupby keys become the index of the dataframe. Using pandas.DataFrame.reset_index (check documentation), we can move the index levels back into columns and use a default index, which leads to the same result as in the previous step:
>>> df.groupby(['col1', 'col2']).sum().reset_index()
  col1  col2      col3      col4
0    A     1  0.502130  0.959404
1    A     3  0.335416  0.087215
2    B     2  0.067308  0.084595
3    B     4  0.777483  1.618982
4    C     2  0.672375  0.356736
5    C     5  0.929655  0.371913
6    D     5  0.684052  0.809006
7    E     1  0.061270  0.739610
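One way to confirm that the two solutions agree is to compare their results directly instead of by eye. The sketch below does this on a few rows taken from the example (a subset, to keep it short) using pandas.testing.assert_frame_equal, which raises an AssertionError if the frames differ.

```python
import pandas as pd

# A subset of the example rows, values as printed in the table.
df = pd.DataFrame({
    "col1": ["A", "B", "B", "D", "D"],
    "col2": [1, 4, 4, 5, 5],
    "col3": [0.502130, 0.454158, 0.323326, 0.212634, 0.471418],
    "col4": [0.959404, 0.723124, 0.895858, 0.540736, 0.268270],
})

solution_1 = df.groupby(["col1", "col2"], as_index=False).sum()
solution_2 = df.groupby(["col1", "col2"]).sum().reset_index()

# Raises AssertionError if values, columns, or index differ.
pd.testing.assert_frame_equal(solution_1, solution_2)
print(list(solution_1.columns))
```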
Benchmark
Notice that since the first solution achieves the requirement in one step versus two steps in the second solution, the former is slightly faster in this measurement:
%timeit df.groupby(["col1", "col2"], as_index=False).sum()
3.38 ms ± 21.2 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.groupby(["col1", "col2"]).sum().reset_index()
3.9 ms ± 365 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
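The %timeit lines above need IPython or Jupyter. Outside of those, a similar comparison can be sketched with the standard-library timeit module. The data here is hypothetical, and the function names solution_1 and solution_2 and the repeat count are arbitrary choices for illustration. Absolute timings depend on the machine, pandas version, and data size, so no expected numbers are given.

```python
import timeit

import pandas as pd

# Hypothetical data with the same column layout as the example.
df = pd.DataFrame({
    "col1": ["A", "A", "B", "B", "C"] * 20,
    "col2": [1, 3, 2, 4, 4] * 20,
    "col3": [0.5, 0.3, 0.1, 0.4, 0.3] * 20,
    "col4": [0.9, 0.1, 0.1, 0.7, 0.9] * 20,
})


def solution_1():
    return df.groupby(["col1", "col2"], as_index=False).sum()


def solution_2():
    return df.groupby(["col1", "col2"]).sum().reset_index()


runs = 200  # arbitrary repeat count for this sketch
for name, func in [("as_index=False", solution_1), ("reset_index", solution_2)]:
    total_seconds = timeit.timeit(func, number=runs)
    print(f"{name}: {total_seconds / runs * 1e3:.3f} ms per call")
```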