Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/37105609/

Time: 2020-09-14 01:11:59  Source: igfitidea

pandas: drop duplicates in groupby 'date'

Tags: python, pandas, duplicates, unique, pandas-groupby

Asked by Michael Perdue

In the dataframe below, I would like to eliminate the duplicate cid values so that the output from df.groupby('date').cid.size() matches the output from df.groupby('date').cid.nunique().

I have looked at this post but it does not seem to have a solid solution to the problem.

import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/108michael/ms_thesis/master/crsp.dime.mpl.df')

df.groupby('date').cid.size()

date
2005       7
2006     237
2007    3610
2008    1318
2009    2664
2010     997
2011    6390
2012    2904
2013    7875
2014    3979

df.groupby('date').cid.nunique()

date
2005      3
2006     10
2007    227
2008     52
2009    142
2010     57
2011    219
2012     99
2013    238
2014    146
Name: cid, dtype: int64

Things I tried:

  1. df.groupby([df['date']]).drop_duplicates(cols='cid') gives this error: AttributeError: Cannot access callable attribute 'drop_duplicates' of 'DataFrameGroupBy' objects, try using the 'apply' method
  2. df.groupby(('date').drop_duplicates('cid')) gives this error: AttributeError: 'str' object has no attribute 'drop_duplicates'
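
The first error message points to the 'apply' method. A minimal sketch of that route follows, using a small made-up DataFrame rather than the data from the question (the values and the names unique_cids and sizes are assumptions for illustration). Selecting the cid column before calling apply is a choice made here so the result does not depend on how a given pandas version treats the grouping column; it keeps only cid, not the full rows.

```python
import pandas as pd

# Hypothetical sample data, not the dataset from the question
df = pd.DataFrame({
    "date": [2005, 2005, 2005, 2006, 2006],
    "cid": ["a", "a", "b", "c", "c"],
    "amount": [1, 2, 3, 4, 5],
})

# GroupBy objects do not expose drop_duplicates, so this raises AttributeError
try:
    df.groupby("date").drop_duplicates("cid")
    raised = False
except AttributeError:
    raised = True

# Following the hint in the error message: run drop_duplicates on each group.
# The result is a Series of cid values indexed by (date, original row label).
unique_cids = df.groupby("date")["cid"].apply(lambda s: s.drop_duplicates())

# Count the remaining cid values per date
sizes = unique_cids.groupby(level="date").size()
print(sizes)
```

The second attempt fails for a different reason: the parentheses are misplaced, so drop_duplicates is called on the string 'date' instead of on a pandas object.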

Answered by ayhan

You don't need groupby to drop duplicates based on a few columns; you can specify a subset of columns instead:

df2 = df.drop_duplicates(["date", "cid"])
df2.groupby('date').cid.size()
Out[99]: 
date
2005      3
2006     10
2007    227
2008     52
2009    142
2010     57
2011    219
2012     99
2013    238
2014    146
dtype: int64
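
To check the idea without downloading the CSV, here is a small self-contained sketch on made-up data (the DataFrame contents and the amount column are assumptions for illustration only). It shows that after dropping duplicates on the ("date", "cid") pair, the per-date size equals the per-date nunique of the original frame.

```python
import pandas as pd

# Hypothetical sample data, not the dataset from the question
df = pd.DataFrame({
    "date": [2005, 2005, 2005, 2006, 2006, 2006],
    "cid": ["a", "a", "b", "c", "c", "c"],
    "amount": [10, 20, 30, 40, 50, 60],
})

# Keep the first row for each (date, cid) pair; other columns are retained
df2 = df.drop_duplicates(subset=["date", "cid"])

# Row count per date after de-duplication
sizes = df2.groupby("date").cid.size()

# Distinct cid count per date in the original frame
uniques = df.groupby("date").cid.nunique()

print(sizes.equals(uniques))
```

By default drop_duplicates keeps the first occurrence of each pair; passing keep='last' would retain the last one instead, which matters only if the other columns differ between duplicate rows.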