Python pandas groupby without turning the grouped-by column into index

Question

Asked by Mohamed Ali JAMAOUI

By default, pandas groupby turns the group-by columns into the index and removes them from the dataframe's list of columns. For instance, say I have a dataframe with these columns:

col1|col2|col3|col4

If I apply a groupby on columns col2 and col3 this way:

df.groupby(['col2','col3']).sum()

The resulting dataframe no longer has ['col2', 'col3'] in its list of columns. They are automatically turned into the index of the resulting dataframe.
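The behavior described above can be reproduced with a small sketch. The sample values are made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "col1": [10, 20, 30, 40],
    "col2": ["a", "a", "b", "b"],
    "col3": ["x", "x", "y", "z"],
    "col4": [1.0, 2.0, 3.0, 4.0],
})

result = df.groupby(["col2", "col3"]).sum()

print(list(result.columns))      # the grouping keys are not among the columns
print(list(result.index.names))  # they are the levels of the index instead
```

Note that df itself is unchanged; it is the aggregated result that carries the keys in its index.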

My question is how can I perform groupby on a column and yet keep that column in the dataframe?

Answer 1

Accepted answer by user2034412

df.groupby(['col2','col3'], as_index=False).sum()

Answer 2

Answered by Boudewijn Aasman

Another way to do this would be:

df.groupby(['col2', 'col3']).sum().reset_index()

Answer 3

Answered by set92

Not sure, but I think the right answer would be:

df = df.groupby(['col2','col3']).sum()
df = df.reset_index()

At least that is what I do all the time to avoid dataframes with a MultiIndex.
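A minimal sketch of this two-step pattern, using made-up data. Both groupby(...).sum() and reset_index() return new objects, so the aggregated result has to be bound to a name before its index is reset. The names grouped and flat are chosen here for illustration.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "col2": ["a", "a", "b"],
    "col3": ["x", "x", "y"],
    "col4": [1, 2, 3],
})

grouped = df.groupby(["col2", "col3"]).sum()  # keys live in a MultiIndex
flat = grouped.reset_index()                  # new frame; keys are columns again

print(isinstance(grouped.index, pd.MultiIndex))
print(list(flat.columns))
```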

Answer 4

Answered by Mohamed Ali JAMAOUI

The following, somewhat detailed, answer is added to help those who are still unsure which of the suggested variants to use.

First, the two suggested solutions to this problem are:

Solution 1: df.groupby(['col2', 'col3'], as_index=False).sum()
Solution 2: df.groupby(['col2', 'col3']).sum().reset_index()

Both give the expected result.

Solution 1:

As explained in the documentation, as_index=False asks for "SQL-style" grouped output, which effectively tells pandas to keep the group-by columns as regular columns in the output.

as_index: bool, default True
For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output.

Example:

Given the following DataFrame:

  col1  col2      col3      col4
0    A     1  0.502130  0.959404
1    A     3  0.335416  0.087215
2    B     2  0.067308  0.084595
3    B     4  0.454158  0.723124
4    B     4  0.323326  0.895858
5    C     2  0.672375  0.356736
6    C     5  0.929655  0.371913
7    D     5  0.212634  0.540736
8    D     5  0.471418  0.268270
9    E     1  0.061270  0.739610
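To follow along, the frame above can be rebuilt as sketched here. The values are the rounded ones from the printed table, so they are an approximation of whatever data the original session used.

```python
import pandas as pd

# Values copied from the printed table (rounded to six decimals).
df = pd.DataFrame({
    "col1": ["A", "A", "B", "B", "B", "C", "C", "D", "D", "E"],
    "col2": [1, 3, 2, 4, 4, 2, 5, 5, 5, 1],
    "col3": [0.502130, 0.335416, 0.067308, 0.454158, 0.323326,
             0.672375, 0.929655, 0.212634, 0.471418, 0.061270],
    "col4": [0.959404, 0.087215, 0.084595, 0.723124, 0.895858,
             0.356736, 0.371913, 0.540736, 0.268270, 0.739610],
})

print(df)
```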

Applying the first solution gives:

>>> df.groupby(["col1", "col2"], as_index=False).sum()

  col1  col2      col3      col4
0    A     1  0.502130  0.959404
1    A     3  0.335416  0.087215
2    B     2  0.067308  0.084595
3    B     4  0.777483  1.618982
4    C     2  0.672375  0.356736
5    C     5  0.929655  0.371913
6    D     5  0.684052  0.809006
7    E     1  0.061270  0.739610

Here the groupby columns are preserved as regular columns.

Solution 2:

To understand the second solution, let's look at the output of the previous command with as_index=True, which is the default behavior of pandas.DataFrame.groupby (see the documentation):

>>> df.groupby(["col1", "col2"], as_index=True).sum()
               col3      col4
col1 col2                    
A    1     0.502130  0.959404
     3     0.335416  0.087215
B    2     0.067308  0.084595
     4     0.777483  1.618982
C    2     0.672375  0.356736
     5     0.929655  0.371913
D    5     0.684052  0.809006
E    1     0.061270  0.739610

As you can see, the groupby keys become the index of the dataframe. Using pandas.DataFrame.reset_index (see the documentation), we can turn the index levels back into columns and get a default integer index. This leads to the same result as in the previous step:

>>> df.groupby(['col1', 'col2']).sum().reset_index()
  col1  col2      col3      col4
0    A     1  0.502130  0.959404
1    A     3  0.335416  0.087215
2    B     2  0.067308  0.084595
3    B     4  0.777483  1.618982
4    C     2  0.672375  0.356736
5    C     5  0.929655  0.371913
6    D     5  0.684052  0.809006
7    E     1  0.061270  0.739610
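The equivalence of the two results can also be checked programmatically. This is a sketch on a small made-up frame, using pandas.testing.assert_frame_equal, which raises an AssertionError if the two frames differ.

```python
import pandas as pd

# Hypothetical sample data with the same column layout as the example.
df = pd.DataFrame({
    "col1": ["A", "A", "B", "B", "B"],
    "col2": [1, 3, 2, 4, 4],
    "col3": [0.5, 0.25, 0.125, 1.0, 2.0],
    "col4": [1.0, 2.0, 3.0, 4.0, 5.0],
})

sol1 = df.groupby(["col1", "col2"], as_index=False).sum()
sol2 = df.groupby(["col1", "col2"]).sum().reset_index()

# Passes silently when both frames have the same columns, dtypes and values.
pd.testing.assert_frame_equal(sol1, sol2)
print(list(sol1.columns))
```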

Benchmark

Since the first solution meets the requirement in one step versus two steps in the second, it is slightly faster in the timings below:

%timeit df.groupby(["col1", "col2"], as_index=False).sum()
3.38 ms ± 21.2 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df.groupby(["col1", "col2"]).sum().reset_index()
3.9 ms ± 365 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
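Outside IPython, a comparable measurement can be sketched with the standard-library timeit module. The frame and the helper function names below are assumptions for illustration, and the absolute numbers will vary with the machine, the pandas version and the size of the data.

```python
import timeit

import pandas as pd

# Hypothetical sample data; substitute your own frame for a realistic comparison.
df = pd.DataFrame({
    "col1": ["A", "A", "B", "B", "B", "C"],
    "col2": [1, 3, 2, 4, 4, 2],
    "col3": [0.5, 0.25, 0.125, 1.0, 2.0, 4.0],
})


def with_as_index_false():
    return df.groupby(["col1", "col2"], as_index=False).sum()


def with_reset_index():
    return df.groupby(["col1", "col2"]).sum().reset_index()


runs = 200
for func in (with_as_index_false, with_reset_index):
    seconds = timeit.timeit(func, number=runs)
    print(f"{func.__name__}: {seconds / runs * 1e3:.3f} ms per call")
```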
