Python: group a pandas dataframe and select the latest row in each group
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/41525911/
group by pandas dataframe and select latest in each group
Asked by DevEx
How can I group the values of a pandas dataframe and select the latest (by date) row from each group?
For example, given a dataframe sorted by date:
    id  product        date
0  220     6647  2014-09-01
1  220     6647  2014-09-03
2  220     6647  2014-10-16
3  826     3380  2014-11-11
4  826     3380  2014-12-09
5  826     3380  2015-05-19
6  901     4555  2014-09-01
7  901     4555  2014-10-05
8  901     4555  2014-11-01
Grouping by id or product and selecting the latest row in each group gives:
    id  product        date
2  220     6647  2014-10-16
5  826     3380  2015-05-19
8  901     4555  2014-11-01
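To try the answers below, the example frame can be rebuilt roughly as follows. This is only a sketch: the question does not say how the data was loaded, so parsing the date column with pd.to_datetime is an assumption.

```python
import pandas as pd

# Rebuild the example data from the question.
df = pd.DataFrame({
    "id": [220, 220, 220, 826, 826, 826, 901, 901, 901],
    "product": [6647, 6647, 6647, 3380, 3380, 3380, 4555, 4555, 4555],
    "date": ["2014-09-01", "2014-09-03", "2014-10-16",
             "2014-11-11", "2014-12-09", "2015-05-19",
             "2014-09-01", "2014-10-05", "2014-11-01"],
})

# Assumption: dates are parsed to datetime64 so that "latest"
# means chronologically latest rather than last in string order.
df["date"] = pd.to_datetime(df["date"])
print(df.dtypes)
```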
Accepted answer by piRSquared
Use idxmax in groupby and slice df with loc:
df.loc[df.groupby('id').date.idxmax()]
    id  product        date
2  220     6647  2014-10-16
5  826     3380  2015-05-19
8  901     4555  2014-11-01
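A minimal sketch of why this works, using a small, deliberately unsorted frame (the data here is hypothetical, not from the question): idxmax returns index labels, so the frame does not have to be sorted first.

```python
import pandas as pd

# Hypothetical, deliberately unsorted data: no pre-sorting is needed.
df = pd.DataFrame({
    "id": [220, 826, 220, 826],
    "product": [6647, 3380, 6647, 3380],
    "date": pd.to_datetime(
        ["2014-10-16", "2014-11-11", "2014-09-01", "2015-05-19"]
    ),
})

# For each id, idxmax gives the index label of the row with the largest date.
latest_labels = df.groupby("id")["date"].idxmax()

# .loc then selects those rows by label.
latest = df.loc[latest_labels]
print(latest)
```

Because .loc selects by label, this relies on the index labels being unique. If they may repeat, calling reset_index(drop=True) first is one way to make the lookup unambiguous.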
Answer by ade1e
You can also use tail with groupby to get the last n values of each group:
df.sort_values('date').groupby('id').tail(1)
    id  product        date
2  220     6647  2014-10-16
8  901     4555  2014-11-01
5  826     3380  2015-05-19
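The same pattern extends to more than one row per group. The sketch below wraps it in a helper, latest_n, which is a hypothetical name introduced only for illustration, and runs it on a small made-up frame.

```python
import pandas as pd

# Hypothetical sample data, smaller than the frame in the question.
df = pd.DataFrame({
    "id": [220, 220, 220, 826, 826],
    "date": pd.to_datetime(
        ["2014-09-01", "2014-09-03", "2014-10-16", "2014-11-11", "2014-12-09"]
    ),
})

def latest_n(frame, n):
    """Return the n most recent rows of each id (hypothetical helper)."""
    # Sort by date first so that "last n rows" means "latest n rows".
    return frame.sort_values("date").groupby("id").tail(n)

last_two = latest_n(df, 2)
print(last_two)
```

Unlike an aggregation, tail keeps the original index labels and returns rows in the order of the sorted frame, which is why the output in this answer is ordered by date rather than by id.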
Answer by Damien Marlier
I had a similar problem and ended up using drop_duplicates rather than groupby.
It seems to run significantly faster on large datasets than the other methods suggested above.
df.sort_values(by="date").drop_duplicates(subset=["id"], keep="last")
    id  product        date
2  220     6647  2014-10-16
8  901     4555  2014-11-01
5  826     3380  2015-05-19
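As a quick sanity check, the sketch below compares the drop_duplicates approach with the idxmax approach on a small hypothetical frame. It only checks that both select the same rows; it makes no claim about speed.

```python
import pandas as pd

# Hypothetical, unsorted sample data with unique dates within each id.
df = pd.DataFrame({
    "id": [901, 220, 826, 220, 901, 826],
    "date": pd.to_datetime(
        ["2014-09-01", "2014-09-03", "2014-11-11",
         "2014-10-16", "2014-11-01", "2015-05-19"]
    ),
})

# Sort by date, then keep only the last (latest) row seen for each id.
via_drop = df.sort_values(by="date").drop_duplicates(subset=["id"], keep="last")

# Same selection through groupby + idxmax, for comparison.
via_idxmax = df.loc[df.groupby("id")["date"].idxmax()]

# Row order differs between the two, so align on the index before comparing.
same = via_drop.sort_index().equals(via_idxmax.sort_index())
print(same)
```

If two rows in a group share the same latest date, the two approaches may pick different rows, so the comparison assumes dates are unique within each id.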
Answer by Sandu Ursu
Given a dataframe sorted by date, you can obtain what you ask for in a number of ways:
Like this:
df.groupby(['id','product']).last()
like this:
df.groupby(['id','product']).nth(-1)
or like this:
df.groupby(['id','product']).max()
If you don't want id and product to appear as the index, use groupby(['id', 'product'], as_index=False).
Alternatively, use:
df.groupby(['id','product']).tail(1)
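One caveat worth illustrating: in the calls above, date is the only column left after grouping by id and product, so last() and max() agree on sorted data. With extra columns they can differ, because max() takes the maximum of each column independently. The sketch below assumes a hypothetical price column that is not in the question.

```python
import pandas as pd

# Hypothetical data, already sorted by date within each id.
df = pd.DataFrame({
    "id": [220, 220, 826, 826],
    "price": [9.0, 5.0, 3.0, 4.0],  # hypothetical extra column
    "date": pd.to_datetime(
        ["2014-09-01", "2014-10-16", "2014-11-11", "2015-05-19"]
    ),
})

# as_index=False keeps id as a regular column instead of the index.
by_last = df.groupby("id", as_index=False).last()

# max() is computed per column, so it can combine values from different rows.
by_max = df.groupby("id", as_index=False).max()

print(by_last)
print(by_max)
```

Here by_max pairs the price from the first row of id 220 with the date from its second row, a combination that never occurs in the data. Note also that last() takes the last non-null value of each column, so with missing values it too may mix rows.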
Answer by Kristin Q
To use .tail() as an aggregation method and keep your grouping intact:
df.sort_values('date').groupby('id').apply(lambda x: x.tail(1))
        id  product        date
id
220 2  220     6647  2014-10-16
826 5  826     3380  2015-05-19
901 8  901     4555  2014-11-01