Disclaimer: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/48754049/

Time: 2020-09-14 05:10:03  Source: igfitidea

Pandas group by on one column with max date on another column python

python-2.7, pandas

Asked by Anurag Rawat

I have a DataFrame with the following data:

invoice_no  dealer  billing_change_previous_month        date
       110       1                              0  2016-12-31
       100       1                         -41981  2017-01-30
      5505       2                              0  2017-01-30
      5635       2                          58730  2016-12-31

I want to keep only one row per dealer: the one with the maximum date. The desired output should look like this:

invoice_no  dealer  billing_change_previous_month        date
       100       1                         -41981  2017-01-30
      5505       2                              0  2017-01-30

Each dealer should appear only once, with its maximum date. Thanks in advance for your help.

Accepted answer by Vaishali

You can use boolean indexing with groupby and transform:

df_new = df[df.groupby('dealer').date.transform('max') == df['date']]

   invoice_no  dealer  billing_change_previous_month        date
1         100       1                         -41981  2017-01-30
2        5505       2                              0  2017-01-30
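
To see why this boolean indexing works, it can help to look at the intermediate values. The sketch below rebuilds the question's data and splits the one-liner into steps; the names `per_dealer_max` and `mask` are introduced here for illustration and do not appear in the answer.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'invoice_no': [110, 100, 5505, 5635],
    'dealer': [1, 1, 2, 2],
    'billing_change_previous_month': [0, -41981, 0, 58730],
    'date': pd.to_datetime(['2016-12-31', '2017-01-30',
                            '2017-01-30', '2016-12-31']),
})

# transform('max') returns a Series aligned with df's index:
# every row receives the maximum date of its own dealer group
per_dealer_max = df.groupby('dealer')['date'].transform('max')

# True only on rows whose date equals their group's maximum
mask = per_dealer_max == df['date']

# Boolean indexing keeps the rows where the mask is True
df_new = df[mask]
print(df_new)
```

Because the transformed Series has the same index as `df`, the comparison is row by row and no merge is needed.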

The same approach works when there are more than two dealers:

import pandas as pd

df = pd.DataFrame({
    'invoice_no': [110, 100, 5505, 5635, 10000, 10001],
    'dealer': [1, 1, 2, 2, 3, 3],
    'billing_change_previous_month': [0, -41981, 0, 58730, 9000, 100],
    'date': ['2016-12-31', '2017-01-30', '2017-01-30', '2016-12-31', '2019-12-31', '2020-01-31'],
})

# Parse the strings as datetimes so that 'max' compares dates
df['date'] = pd.to_datetime(df['date'])
# Keep the rows whose date equals the per-dealer maximum
df[df.groupby('dealer').date.transform('max') == df['date']]


   invoice_no  dealer  billing_change_previous_month        date
1         100       1                         -41981  2017-01-30
2        5505       2                              0  2017-01-30
5       10001       3                            100  2020-01-31

Answer by 3novak

Tack 1

Sort by dealer and by date, then use drop_duplicates on dealer alone, keeping the last row of each dealer. This avoids the issue that surfaces in Tack 2 below, because this method can never return more than one record per dealer. Depending on your data and use case, that may or may not matter to you.

df.sort_values(['dealer', 'date'], inplace=True)
df.drop_duplicates(subset='dealer', keep='last', inplace=True)
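
A minimal sketch of the sort-then-deduplicate idea on the question's data, assuming duplicates are dropped on `dealer` alone and the last (latest) row of each dealer is kept. It works on copies instead of using `inplace=True`; the names `df_sorted` and `latest` are illustrative only.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'invoice_no': [110, 100, 5505, 5635],
    'dealer': [1, 1, 2, 2],
    'billing_change_previous_month': [0, -41981, 0, 58730],
    'date': pd.to_datetime(['2016-12-31', '2017-01-30',
                            '2017-01-30', '2016-12-31']),
})

# Sort so that, within each dealer, the latest date comes last
df_sorted = df.sort_values(['dealer', 'date'])

# Keep only the last row per dealer, i.e. the one with the maximum date
latest = df_sorted.drop_duplicates(subset='dealer', keep='last')
print(latest)
```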

Tack 2

This is a worse way to do it, using a groupby and a merge. Use groupby to find the max date for each dealer, then merge the result back onto the original frame. The how='inner' parameter keeps only those dealer and date combinations that appear in the grouped result, which holds the maximum date for each dealer.

However, note that this returns multiple records per dealer if the max date is duplicated in the original table. You might need to use drop_duplicates, depending on your data and your use case.

df.merge(df.groupby('dealer')['date'].max().reset_index(), 
                             on=['dealer', 'date'], how='inner')

   invoice_no  dealer  billing_change_previous_month        date
0         100       1                         -41981  2017-01-30
1        5505       2                              0  2017-01-30
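
The caveat about duplicated max dates can be reproduced with a small sketch. It uses the question's data plus one hypothetical row (invoice 120, not in the original question) that ties dealer 1's maximum date; the names `max_dates`, `merged`, and `deduped` are illustrative.

```python
import pandas as pd

# Question's data plus one HYPOTHETICAL row (invoice 120) that ties
# dealer 1's maximum date
df = pd.DataFrame({
    'invoice_no': [110, 100, 120, 5505, 5635],
    'dealer': [1, 1, 1, 2, 2],
    'billing_change_previous_month': [0, -41981, 500, 0, 58730],
    'date': pd.to_datetime(['2016-12-31', '2017-01-30', '2017-01-30',
                            '2017-01-30', '2016-12-31']),
})

# Maximum date per dealer, as a two-column frame (dealer, date)
max_dates = df.groupby('dealer')['date'].max().reset_index()

# The inner merge keeps every row matching a (dealer, max date) pair,
# so dealer 1 comes back twice because of the tie
merged = df.merge(max_dates, on=['dealer', 'date'], how='inner')
print(merged)

# One possible tie-break: keep a single row per dealer
deduped = merged.drop_duplicates(subset='dealer', keep='first')
print(deduped)
```

Which tied row should survive is a data question; `keep='first'` is just one arbitrary choice.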

Answer by Rufat

A more correct solution is given at https://stackoverflow.com/a/41531127/9913319:

df.sort_values('date').groupby('dealer').tail(1)
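
A sketch of this one-liner, again on the question's data plus one hypothetical tied row (invoice 120, not in the original question). Since `tail(1)` takes exactly one row from each group, the result has one row per dealer even when the maximum date is tied; which of the tied rows is returned depends on the sort order.

```python
import pandas as pd

# Question's data plus one HYPOTHETICAL row (invoice 120) that ties
# dealer 1's maximum date
df = pd.DataFrame({
    'invoice_no': [110, 100, 120, 5505, 5635],
    'dealer': [1, 1, 1, 2, 2],
    'billing_change_previous_month': [0, -41981, 500, 0, 58730],
    'date': pd.to_datetime(['2016-12-31', '2017-01-30', '2017-01-30',
                            '2017-01-30', '2016-12-31']),
})

# Sort by date, then take the last (latest) row of each dealer group
result = df.sort_values('date').groupby('dealer').tail(1)
print(result)
```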