Keep other columns when doing groupby
Original question: http://stackoverflow.com/questions/23394476/
Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep it under the same license, cite the original URL and authors, and attribute it to the original authors on Stack Overflow (not to this page).
Asked by PointXIV
I'm using groupby on a pandas DataFrame to drop all rows that don't have the minimum of a specific column. Something like this:
df1 = df.groupby("item", as_index=False)["diff"].min()
However, if I have more than those two columns, the other columns (e.g. otherstuff in my example) get dropped. Can I keep those columns using groupby, or am I going to have to find a different way to drop the rows?
My data looks like:
   item  diff  otherstuff
0     1     2           1
1     1     1           2
2     1     3           7
3     2    -1           0
4     2     1           3
5     2     4           9
6     2    -6           2
7     3     0           0
8     3     2           9
and should end up like:
   item  diff  otherstuff
0     1     1           2
1     2    -6           2
2     3     0           0
but what I'm getting is:
   item  diff
0     1     1
1     2    -6
2     3     0
I've been looking through the documentation and can't find anything. I tried:
df1 = df.groupby(["item", "otherstuff"], as_index=False)["diff"].min()
df1 = df.groupby("item", as_index=False)["diff"].min()["otherstuff"]
df1 = df.groupby("item", as_index=False)["otherstuff", "diff"].min()
But none of those work (with the last one I realized that this syntax is meant for aggregating after a group is created).
Accepted answer by DSM
Method #1: use idxmin() to get the indices of the elements of minimum diff, and then select those:
>>> df.loc[df.groupby("item")["diff"].idxmin()]
   item  diff  otherstuff
1     1     1           2
6     2    -6           2
7     3     0           0
[3 rows x 3 columns]
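Method #1 can be reproduced end to end with the sample data from the question. This is a minimal sketch; the intermediate name min_labels is only illustrative and does not appear in the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "item": [1, 1, 1, 2, 2, 2, 2, 3, 3],
    "diff": [2, 1, 3, -1, 1, 4, -6, 0, 2],
    "otherstuff": [1, 2, 7, 0, 3, 9, 2, 0, 9],
})

# For each item, idxmin() returns the index label of the row with the smallest diff.
min_labels = df.groupby("item")["diff"].idxmin()

# .loc selects those rows with all of their columns intact.
result = df.loc[min_labels]
print(result)
```

Because the rows are looked up by label, this approach assumes the index labels of df are unique; with a duplicated index, .loc would return every row sharing a selected label.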
Method #2: sort by diff, and then take the first element in each item group:
>>> df.sort_values("diff").groupby("item", as_index=False).first()
   item  diff  otherstuff
0     1     1           2
1     2    -6           2
2     3     0           0
[3 rows x 3 columns]
Note that the resulting indices are different even though the row content is the same: Method #1 keeps the original row labels, while Method #2 produces a fresh index.
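The index difference between the two methods can be checked directly. The sketch below rebuilds the question's data and compares both results; the variable names by_idxmin and by_first are illustrative only.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "item": [1, 1, 1, 2, 2, 2, 2, 3, 3],
    "diff": [2, 1, 3, -1, 1, 4, -6, 0, 2],
    "otherstuff": [1, 2, 7, 0, 3, 9, 2, 0, 9],
})

# Method #1: look up the rows holding each group's minimum by label.
by_idxmin = df.loc[df.groupby("item")["diff"].idxmin()]

# Method #2: sort so the minimum comes first, then take the first row per group.
by_first = df.sort_values("diff").groupby("item", as_index=False).first()

print(by_idxmin.index.tolist())  # labels taken from the original frame
print(by_first.index.tolist())   # new positional index

# Dropping the old labels makes the row contents directly comparable.
same_rows = by_idxmin.reset_index(drop=True).values.tolist() == by_first.values.tolist()
print(same_rows)
```

One thing to keep in mind with Method #2: first() takes the first non-null value of each column separately, so if the other columns can contain NaN, the result may mix values from different rows. In that case something like groupby("item").head(1) on the sorted frame may be a safer way to keep whole rows.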
Answer by citynorman
The above answer works great if there is only one minimum per group, or if you only want one. In my case there could be multiple minima and I wanted all rows equal to the minimum, which .idxmin() doesn't give you. This worked:
import pandas as pd

def filter_group(dfg, col):
    # Keep every row whose value in col equals the group's minimum.
    return dfg[dfg[col] == dfg[col].min()]

df = pd.DataFrame({'g': ['a'] * 6 + ['b'] * 6, 'v1': (list(range(3)) + list(range(3))) * 2, 'v2': range(12)})
df.groupby('g', group_keys=False).apply(lambda x: filter_group(x, 'v1'))
As an aside, .filter() is also relevant to this question but didn't work for me.
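The answer does not say why .filter() failed. One possible explanation is that GroupBy.filter evaluates a single boolean per group and then keeps or drops the whole group, so it cannot pick individual rows. The sketch below illustrates that behaviour on the same sample frame; the predicate is an assumption chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({'g': ['a'] * 6 + ['b'] * 6,
                   'v1': (list(range(3)) + list(range(3))) * 2,
                   'v2': range(12)})

# filter() gets one True/False per group, so every row of a passing group survives.
whole_groups = df.groupby('g').filter(lambda x: x['v1'].min() == 0)

# Selecting individual rows needs a per-row mask instead, here built with transform.
min_rows = df[df['v1'] == df.groupby('g')['v1'].transform('min')]

print(len(whole_groups))
print(min_rows)
```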
Answer by Nic Scozzaro
If you know that all of your "items" have more than one record, you can sort and then use duplicated:
df.sort_values(by='diff').duplicated(subset='item', keep='first')
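On its own, the expression above returns a boolean Series rather than rows: after sorting, the minimum-diff row of each item is marked False and every later row of that item is marked True. A minimal sketch of how the mask might be used to select the rows, with the names sorted_df and is_dup introduced here for illustration:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "item": [1, 1, 1, 2, 2, 2, 2, 3, 3],
    "diff": [2, 1, 3, -1, 1, 4, -6, 0, 2],
    "otherstuff": [1, 2, 7, 0, 3, 9, 2, 0, 9],
})

sorted_df = df.sort_values(by="diff")

# True for every row that repeats an item already seen earlier in the sorted order.
is_dup = sorted_df.duplicated(subset="item", keep="first")

# Inverting the mask keeps only the first (smallest diff) row of each item.
result = sorted_df[~is_dup]
print(result)
```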
Answer by jezrael
You can use DataFrame.sort_values with DataFrame.drop_duplicates:
df = df.sort_values(by='diff').drop_duplicates(subset='item')
print (df)
   item  diff  otherstuff
6     2    -6           2
7     3     0           0
1     1     1           2
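The output above comes back ordered by diff, and the snippet overwrites df. If the original row order matters, one option (a sketch, not part of the original answer) is to store the result under a new name and call sort_index() afterwards:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "item": [1, 1, 1, 2, 2, 2, 2, 3, 3],
    "diff": [2, 1, 3, -1, 1, 4, -6, 0, 2],
    "otherstuff": [1, 2, 7, 0, 3, 9, 2, 0, 9],
})

# drop_duplicates keeps the first row per item by default,
# which after sorting is the row with the smallest diff.
result = (
    df.sort_values(by="diff")
      .drop_duplicates(subset="item")
      .sort_index()  # restore the original row order
)
print(result)
```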
If there can be multiple minimal values per group and you want all of the minimum rows, use boolean indexing with transform to get the minimal value per group:
print (df)
   item  diff  otherstuff
0     1     2           1
1     1     1           2 <-multiple min
2     1     1           7 <-multiple min
3     2    -1           0
4     2     1           3
5     2     4           9
6     2    -6           2
7     3     0           0
8     3     2           9
print (df.groupby("item")["diff"].transform('min'))
0    1
1    1
2    1
3   -6
4   -6
5   -6
6   -6
7    0
8    0
Name: diff, dtype: int64
df = df[df.groupby("item")["diff"].transform('min') == df['diff']]
print (df)
   item  diff  otherstuff
1     1     1           2
2     1     1           7
6     2    -6           2
7     3     0           0
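The difference between keeping every tied minimum and keeping one row per group can be seen side by side. The sketch below rebuilds the tied data shown above; group_min, all_min_rows and one_per_group are illustrative names.

```python
import pandas as pd

# Data from this answer: item 1 has two rows tied at the minimum diff of 1.
df = pd.DataFrame({
    "item": [1, 1, 1, 2, 2, 2, 2, 3, 3],
    "diff": [2, 1, 1, -1, 1, 4, -6, 0, 2],
    "otherstuff": [1, 2, 7, 0, 3, 9, 2, 0, 9],
})

# transform('min') broadcasts each group's minimum back onto every row of the group.
group_min = df.groupby("item")["diff"].transform("min")

# The boolean mask keeps every row that ties for its group's minimum.
all_min_rows = df[df["diff"] == group_min]

# idxmin(), by contrast, returns only the first matching label per group.
one_per_group = df.loc[df.groupby("item")["diff"].idxmin()]

print(all_min_rows)
print(one_per_group)
```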