Pandas group by on one column with max date on another column python
Original question: http://stackoverflow.com/questions/48754049/
Notice: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL and authors, and attribute it to the original authors (not me): StackOverflow
Asked by Anurag Rawat
I have a dataframe with the following data:
invoice_no  dealer  billing_change_previous_month  date
110         1       0                              2016-12-31
100         1       -41981                         2017-01-30
5505        2       0                              2017-01-30
5635        2       58730                          2016-12-31
I want only one row per dealer: the one with the maximum date. The desired output should look like this:
invoice_no  dealer  billing_change_previous_month  date
100         1       -41981                         2017-01-30
5505        2       0                              2017-01-30
Each dealer should appear only once, with its maximum date. Thanks in advance for your help.
Accepted answer by Vaishali
You can use boolean indexing with groupby and transform:
df_new = df[df.groupby('dealer').date.transform('max') == df['date']]
   invoice_no  dealer  billing_change_previous_month        date
1         100       1                         -41981  2017-01-30
2        5505       2                              0  2017-01-30
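To see why the one-liner works, it can help to look at the intermediate mask. The sketch below rebuilds the question's sample data and splits the expression into steps; the names max_per_row and mask are illustrative and not part of the answer, and the date column is assumed to be parsed with pd.to_datetime.

```python
import pandas as pd

# Sample data from the question (dates parsed as datetimes: an assumption).
df = pd.DataFrame({
    'invoice_no': [110, 100, 5505, 5635],
    'dealer': [1, 1, 2, 2],
    'billing_change_previous_month': [0, -41981, 0, 58730],
    'date': pd.to_datetime(['2016-12-31', '2017-01-30',
                            '2017-01-30', '2016-12-31']),
})

# transform('max') returns a Series aligned with df's index:
# every row receives the max date of its own dealer group.
max_per_row = df.groupby('dealer')['date'].transform('max')

# Rows whose own date equals their group's max are the ones to keep.
mask = df['date'] == max_per_row
df_new = df[mask]
print(df_new)
```

Because the mask is computed row by row, every row that shares a dealer's max date is kept, so ties produce more than one row for that dealer.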
The same approach works if there are more than two dealers:
import pandas as pd
df = pd.DataFrame({'invoice_no': [110, 100, 5505, 5635, 10000, 10001], 'dealer': [1, 1, 2, 2, 3, 3],
                   'billing_change_previous_month': [0, -41981, 0, 58730, 9000, 100],
                   'date': ['2016-12-31', '2017-01-30', '2017-01-30', '2016-12-31', '2019-12-31', '2020-01-31']})
df['date'] = pd.to_datetime(df['date'])  # compare real dates, not strings
df[df.groupby('dealer').date.transform('max') == df['date']]
   invoice_no  dealer  billing_change_previous_month        date
1         100       1                         -41981  2017-01-30
2        5505       2                              0  2017-01-30
5       10001       3                            100  2020-01-31
Answer by 3novak
Tack 1
Sort by dealer and by date, then use drop_duplicates. This method is blind to the issue that surfaces in Tack 2 below, since it can never return more than one record per dealer: if a dealer's max date occurs on several rows, only one of them is kept. This may or may not be an issue for you depending on your data and your use case.
df.sort_values(['dealer', 'date'], inplace=True)
# Deduplicate on dealer only; after the ascending sort, keep='last' retains the latest date.
df.drop_duplicates('dealer', keep='last', inplace=True)
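As a rough illustration of the one-record-per-dealer behaviour described above, the sketch below uses hypothetical data in which dealer 1 has two invoices on its latest date. It deduplicates on dealer alone with keep='last' and avoids inplace; both are choices made for this sketch, not something the answer prescribes.

```python
import pandas as pd

# Hypothetical data: dealer 1 has two invoices (100 and 101) on its max date.
df = pd.DataFrame({
    'invoice_no': [110, 100, 101, 5505, 5635],
    'dealer': [1, 1, 1, 2, 2],
    'date': pd.to_datetime(['2016-12-31', '2017-01-30', '2017-01-30',
                            '2017-01-30', '2016-12-31']),
})

# Sort so the latest date comes last within each dealer,
# then keep only the last row per dealer.
latest = (df.sort_values(['dealer', 'date'])
            .drop_duplicates('dealer', keep='last'))
print(latest)
```

Exactly one row per dealer comes back. Which of the two tied invoices survives for dealer 1 depends on the sort order, so it is safer not to rely on it; add a tie-breaking column to the sort if that matters.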
Tack 2
This is a worse way to do it, using a groupby and a merge. Use groupby to find the max date for each dealer, then merge that result back onto the original dataframe. The how='inner' parameter keeps only those dealer and date combinations that appear in the groupby result, which holds the maximum date for each dealer.
However, please note that this will return multiple records per dealer if the max date is duplicated in the original table. You might need to use drop_duplicates depending on your data and your use case.
df.merge(df.groupby('dealer')['date'].max().reset_index(),
on=['dealer', 'date'], how='inner')
   invoice_no  dealer  billing_change_previous_month        date
0         100       1                         -41981  2017-01-30
1        5505       2                              0  2017-01-30
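The caveat about duplicated max dates can be reproduced with a small sketch. The data below is hypothetical (dealer 1 has two invoices on its latest date), and the names max_dates, merged and one_per_dealer are illustrative only.

```python
import pandas as pd

# Hypothetical data: dealer 1 has two invoices (100 and 101) on its max date.
df = pd.DataFrame({
    'invoice_no': [110, 100, 101, 5505, 5635],
    'dealer': [1, 1, 1, 2, 2],
    'date': pd.to_datetime(['2016-12-31', '2017-01-30', '2017-01-30',
                            '2017-01-30', '2016-12-31']),
})

# One row per dealer holding that dealer's max date.
max_dates = df.groupby('dealer')['date'].max().reset_index()

# The inner merge keeps every original row matching a (dealer, max date) pair,
# so dealer 1 appears twice here.
merged = df.merge(max_dates, on=['dealer', 'date'], how='inner')
print(merged)

# If one row per dealer is required, drop the extra tied rows afterwards.
one_per_dealer = merged.drop_duplicates('dealer')
print(one_per_dealer)
```

Whether to keep the tied rows or collapse them is a data question: keeping them preserves every invoice issued on the latest date, while drop_duplicates picks one of them.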
Answer by Rufat
A more correct solution is given here: https://stackoverflow.com/a/41531127/9913319
df.sort_values('date').groupby('dealer').tail(1)
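As a quick check of this one-liner, the sketch below runs it on the question's sample data (dates are assumed to be parsed with pd.to_datetime; the name result is illustrative).

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'invoice_no': [110, 100, 5505, 5635],
    'dealer': [1, 1, 2, 2],
    'billing_change_previous_month': [0, -41981, 0, 58730],
    'date': pd.to_datetime(['2016-12-31', '2017-01-30',
                            '2017-01-30', '2016-12-31']),
})

# After sorting by date, the last row of each dealer group
# is that dealer's row with the latest date.
result = df.sort_values('date').groupby('dealer').tail(1)
print(result)
```

Like Tack 1, this returns exactly one row per dealer even when the max date is tied, and the original index labels are preserved in the result.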