Keep other columns when doing groupby

Notice: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/23394476/

Time: 2020-08-19 02:55:10  Source: igfitidea

python pandas aggregate pandas-groupby

Asked by PointXIV

I'm using groupby on a pandas dataframe to drop all rows that don't have the minimum of a specific column. Something like this:

df1 = df.groupby("item", as_index=False)["diff"].min()

However, if I have more than those two columns, the other columns (e.g. otherstuff in my example) get dropped. Can I keep those columns using groupby, or am I going to have to find a different way to drop the rows?

My data looks like:

   item  diff  otherstuff
0     1     2           1
1     1     1           2
2     1     3           7
3     2    -1           0
4     2     1           3
5     2     4           9
6     2    -6           2
7     3     0           0
8     3     2           9

and should end up like:

   item  diff  otherstuff
0     1     1           2
1     2    -6           2
2     3     0           0

but what I'm getting is:

   item  diff
0     1     1
1     2    -6
2     3     0

I've been looking through the documentation and can't find anything. I tried:

df1 = df.groupby(["item", "otherstuff"], as_index=False)["diff"].min()

df1 = df.groupby("item", as_index=False)["diff"].min()["otherstuff"]

df1 = df.groupby("item", as_index=False)["otherstuff", "diff"].min()

But none of those work (I realized with the last one that the syntax is meant for aggregating after a group is created).

Accepted answer by DSM

Method #1: use idxmin() to get the indices of the elements of minimum diff, and then select those:

>>> df.loc[df.groupby("item")["diff"].idxmin()]
   item  diff  otherstuff
1     1     1           2
6     2    -6           2
7     3     0           0

[3 rows x 3 columns]
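
Method #1 can be written out step by step as the following minimal sketch. It rebuilds the question's sample data by hand; the variable names min_labels and result are only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "item": [1, 1, 1, 2, 2, 2, 2, 3, 3],
    "diff": [2, 1, 3, -1, 1, 4, -6, 0, 2],
    "otherstuff": [1, 2, 7, 0, 3, 9, 2, 0, 9],
})

# For each item, idxmin() gives the index label of the row with the smallest diff.
min_labels = df.groupby("item")["diff"].idxmin()

# .loc then pulls the complete rows, so every column survives.
result = df.loc[min_labels]
print(result)
```

If several rows in a group share the minimum, idxmin() returns only the first of them, so this selects exactly one row per item.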

Method #2: sort by diff, and then take the first element in each item group:

>>> df.sort_values("diff").groupby("item", as_index=False).first()
   item  diff  otherstuff
0     1     1           2
1     2    -6           2
2     3     0           0

[3 rows x 3 columns]

Note that the resulting indices are different even though the row content is the same.
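
The index difference between the two methods can be checked directly. This is a small sketch on the question's sample data; by_idxmin, by_first, and same_rows are illustrative names.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "item": [1, 1, 1, 2, 2, 2, 2, 3, 3],
    "diff": [2, 1, 3, -1, 1, 4, -6, 0, 2],
    "otherstuff": [1, 2, 7, 0, 3, 9, 2, 0, 9],
})

by_idxmin = df.loc[df.groupby("item")["diff"].idxmin()]
by_first = df.sort_values("diff").groupby("item", as_index=False).first()

print(by_idxmin.index.tolist())  # original row labels: [1, 6, 7]
print(by_first.index.tolist())   # fresh labels: [0, 1, 2]

# The row values themselves match once the index is ignored.
same_rows = by_idxmin.to_numpy().tolist() == by_first.to_numpy().tolist()
print(same_rows)
```

One caveat worth checking against your own data: by default first() takes the first non-null value in each column, so when a group contains missing values the returned row can combine values from different rows.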

Answer by citynorman

The above answer works great if there is only one minimum per group, or if you only want one. In my case there could be multiple minimums and I wanted all rows equal to the minimum, which .idxmin() doesn't give you. This worked:

import pandas as pd

def filter_group(dfg, col):
    # keep every row of the group whose value equals the group's minimum
    return dfg[dfg[col] == dfg[col].min()]

df = pd.DataFrame({'g': ['a'] * 6 + ['b'] * 6, 'v1': (list(range(3)) + list(range(3))) * 2, 'v2': range(12)})
df.groupby('g', group_keys=False).apply(lambda x: filter_group(x, 'v1'))

As an aside, .filter() is also relevant to this question but didn't work for me.
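
A likely reason .filter() does not help here is that GroupBy.filter() decides per group, not per row: the function must return a single True/False, and the whole group is kept or dropped. The sketch below illustrates this on small hypothetical data (not the data from the answer) and contrasts it with a row-level mask.

```python
import pandas as pd

# Hypothetical data: group "a" has minimum v1 == 0, group "b" has minimum v1 == 1.
df = pd.DataFrame({
    "g": ["a", "a", "a", "b", "b", "b"],
    "v1": [0, 1, 2, 1, 2, 3],
    "v2": range(6),
})

# filter() calls the function once per group and keeps or drops the group as a whole.
kept = df.groupby("g").filter(lambda grp: grp["v1"].min() == 0)
print(kept)  # all three rows of group "a", none of group "b"

# Selecting rows inside each group needs a mask with one entry per row instead.
mask = df["v1"] == df.groupby("g")["v1"].transform("min")
min_rows = df[mask]
print(min_rows)
```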

Answer by Nic Scozzaro

If you know that all of your "items" have more than one record, you can sort and then use duplicated:

df.sort_values(by='diff').duplicated(subset='item', keep='first')
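
Note that duplicated() returns a boolean Series rather than rows: it is True for every row whose item has already appeared earlier in the sorted frame. One way to turn that into the wanted rows, sketched here on the question's sample data with illustrative variable names, is to negate the mask and index with it.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "item": [1, 1, 1, 2, 2, 2, 2, 3, 3],
    "diff": [2, 1, 3, -1, 1, 4, -6, 0, 2],
    "otherstuff": [1, 2, 7, 0, 3, 9, 2, 0, 9],
})

df_sorted = df.sort_values(by="diff")

# True for every row that is NOT the first (smallest-diff) row of its item.
is_later_row = df_sorted.duplicated(subset="item", keep="first")

# Keep only the first row per item, i.e. the row with the minimum diff.
result = df_sorted[~is_later_row]
print(result)
```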

Answer by jezrael

You can use DataFrame.sort_values with DataFrame.drop_duplicates:

df = df.sort_values(by='diff').drop_duplicates(subset='item')
print (df)
   item  diff  otherstuff
6     2    -6           2
7     3     0           0
1     1     1           2
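
The output above is ordered by diff, not by the original row order. If the original order matters, a sort_index() call at the end restores it. A minimal sketch on the question's sample data:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "item": [1, 1, 1, 2, 2, 2, 2, 3, 3],
    "diff": [2, 1, 3, -1, 1, 4, -6, 0, 2],
    "otherstuff": [1, 2, 7, 0, 3, 9, 2, 0, 9],
})

result = (
    df.sort_values(by="diff")
      .drop_duplicates(subset="item")  # keep="first" is the default
      .sort_index()                    # back to the original row order
)
print(result)
```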

If there may be multiple minimal values per group and you want all of the minimum rows, use boolean indexing with transform to get the minimal value per group:

print (df)
   item  diff  otherstuff
0     1     2           1
1     1     1           2 <-multiple min
2     1     1           7 <-multiple min
3     2    -1           0
4     2     1           3
5     2     4           9
6     2    -6           2
7     3     0           0
8     3     2           9

print (df.groupby("item")["diff"].transform('min'))
0    1
1    1
2    1
3   -6
4   -6
5   -6
6   -6
7    0
8    0
Name: diff, dtype: int64

df = df[df.groupby("item")["diff"].transform('min') == df['diff']]
print (df)
   item  diff  otherstuff
1     1     1           2
2     1     1           7
6     2    -6           2
7     3     0           0