pandas: How to remove duplicates from a dataframe?

Notice: this page is a Chinese-English parallel translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/35433422/

Time: 2020-09-14 00:42:36  Source: igfitidea

How to remove duplicates from a dataframe?

python pandas

Asked by Peter

My current df looks like this:

    IDnumber     Subid      Subsubid     Date        Originaldataindicator
      a            1          x           2006           NaN
      a            1          x           2007           NaN
      a            1          x           2008           NaN
      a            1          x           2008           1

The Originaldataindicator column exists because some of these observations were created so that every IDnumber has all three years, while others existed in the original dataset. What I want to achieve is to drop the duplicates and preferably keep the original data. Note that the row carrying the Originaldataindicator will not always be the last observation. To solve this, I first sort on IDnumber, Date, and Originaldataindicator.

However, when I use:

  df=df.drop_duplicates(subset=['IDnumber', 'Subid', 'Subsubid', 'Date'])

Nothing happens and I still observe the duplicate.

  df=df.drop_duplicates(subset=['IDnumber', 'Subid', 'Subsubid', 'Date'], inplace=True)

gives me an empty dataframe.

Am I misinterpreting what drop_duplicates does?

Just to avoid confusion, this is what I want:

    IDnumber     Subid      Subsubid     Date        Originaldataindicator
      a            1          x           2006           NaN
      a            1          x           2007           NaN
      a            1          x           2008           1

The data includes thousands of these IDs.

Accepted answer by jezrael

I think you need groupby and sort_values, and then the keep='first' parameter of drop_duplicates:

print(df)
  IDnumber  Subid Subsubid  Date  Originaldataindicator
0        a      1        x  2006                    NaN
1        a      1        x  2007                    NaN
2        a      1        x  2008                    NaN
3        a      1        x  2008                      1
4        a      1        x  2008                    NaN

df = (df.groupby(['IDnumber', 'Subid', 'Subsubid', 'Date'])
        .apply(lambda x: x.sort_values('Originaldataindicator'))
        .reset_index(drop=True))
print(df)
  IDnumber  Subid Subsubid  Date  Originaldataindicator
0        a      1        x  2006                    NaN
1        a      1        x  2007                    NaN
2        a      1        x  2008                      1
3        a      1        x  2008                    NaN
4        a      1        x  2008                    NaN

df1 = df.drop_duplicates(subset=['IDnumber', 'Subid', 'Subsubid', 'Date'], keep='first')
print(df1)
  IDnumber  Subid Subsubid  Date  Originaldataindicator
0        a      1        x  2006                    NaN
1        a      1        x  2007                    NaN
2        a      1        x  2008                      1

Or use inplace:

df.drop_duplicates(subset=['IDnumber', 'Subid', 'Subsubid', 'Date'], keep='first', inplace=True)
print(df)
  IDnumber  Subid Subsubid  Date  Originaldataindicator
0        a      1        x  2006                    NaN
1        a      1        x  2007                    NaN
2        a      1        x  2008                      1
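A note on why the question's second attempt produced an "empty" result: as far as the pandas documentation describes it, drop_duplicates with inplace=True modifies the frame and returns None, so assigning that return value back to df replaces the frame with None. The sketch below illustrates this on a small frame rebuilt from the sorted example above; the variable names keys and result are only illustrative.

```python
import numpy as np
import pandas as pd

# Small frame mirroring the sorted example above (values re-typed by hand).
keys = ['IDnumber', 'Subid', 'Subsubid', 'Date']
df = pd.DataFrame({
    'IDnumber': ['a', 'a', 'a', 'a'],
    'Subid': [1, 1, 1, 1],
    'Subsubid': ['x', 'x', 'x', 'x'],
    'Date': [2006, 2007, 2008, 2008],
    'Originaldataindicator': [np.nan, np.nan, 1, np.nan],
})

# With inplace=True the frame itself is changed and the call returns None,
# so the return value should not be assigned back to df.
result = df.drop_duplicates(subset=keys, keep='first', inplace=True)

print(result)   # the return value of the in-place call
print(len(df))  # number of rows left in df
```

So either assign the result of the call without inplace, or call it with inplace=True and do not assign, but not both.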

If the column Originaldataindicator has multiple non-null values per group, use duplicated (possibly with all of the columns IDnumber, Subid, Subsubid, Date instead of Date alone) together with isnull:

print(df)
  IDnumber  Subid Subsubid  Date  Originaldataindicator
0        a      1        x  2006                    NaN
1        a      1        x  2007                    NaN
2        a      1        x  2008                    NaN
3        a      1        x  2008                      1
4        a      1        x  2008                      1

print(df[~(df.duplicated('Date', keep=False) & pd.isnull(df['Originaldataindicator']))])
  IDnumber  Subid Subsubid  Date  Originaldataindicator
0        a      1        x  2006                    NaN
1        a      1        x  2007                    NaN
3        a      1        x  2008                      1
4        a      1        x  2008                      1

Explaining conditions:

print(df.duplicated('Date', keep=False))
0    False
1    False
2     True
3     True
4     True
dtype: bool

print(pd.isnull(df['Originaldataindicator']))
0     True
1     True
2     True
3    False
4    False
Name: Originaldataindicator, dtype: bool

print(~(df.duplicated('Date', keep=False) & pd.isnull(df['Originaldataindicator'])))
0     True
1     True
2    False
3     True
4     True
dtype: bool
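Since the question mentions thousands of IDs, checking duplicates on Date alone would probably flag rows from different IDs that merely share a year. A minimal sketch of the same mask built on all four key columns follows; the second ID 'b' and the names keys, is_dup, is_created and filtered are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

keys = ['IDnumber', 'Subid', 'Subsubid', 'Date']
df = pd.DataFrame({
    'IDnumber': ['a', 'a', 'a', 'a', 'a', 'b'],   # 'b' is a hypothetical second ID
    'Subid': [1] * 6,
    'Subsubid': ['x'] * 6,
    'Date': [2006, 2007, 2008, 2008, 2008, 2006],
    'Originaldataindicator': [np.nan, np.nan, np.nan, 1, 1, np.nan],
})

# A row is flagged only if its full key combination occurs more than once.
is_dup = df.duplicated(subset=keys, keep=False)
# Rows without an indicator are the ones that were created, not original.
is_created = df['Originaldataindicator'].isnull()

# Drop created rows that collide with another row on the full key.
filtered = df[~(is_dup & is_created)]
print(filtered)
```

With Date alone, the 2006 rows of 'a' and 'b' would count as duplicates of each other and both would be dropped. One caveat of this mask: if every row of a duplicated key group has a null indicator, the whole group is removed.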

Answer by Ami Tavory

Consider this:

df = pd.DataFrame({'a': [1, 2, 3, 3, 3], 'b': [1, 2, None, 1, None]})

Then

>>> df.sort_values(by=['a', 'b']).groupby(df.a).first()[['b']].reset_index()
    a   b
0   1   1
1   2   2
2   3   1

This sorts the items first by a, then by b (thus pushing the None values in each group last), then selects the first item per group.

I believe you can modify this to the specifics of your problem.
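One possible adaptation to the question's columns is sketched below. It relies on sort_values placing NaN last by default, and it swaps groupby(...).first() for drop_duplicates: as far as I know, first() returns the first non-null value of each column separately, which could combine values from different rows once there are several value columns, whereas drop_duplicates keeps whole rows. The names keys and deduped are illustrative only.

```python
import numpy as np
import pandas as pd

keys = ['IDnumber', 'Subid', 'Subsubid', 'Date']
df = pd.DataFrame({
    'IDnumber': ['a', 'a', 'a', 'a'],
    'Subid': [1, 1, 1, 1],
    'Subsubid': ['x', 'x', 'x', 'x'],
    'Date': [2006, 2007, 2008, 2008],
    'Originaldataindicator': [np.nan, np.nan, np.nan, 1],
})

deduped = (
    # Sort by the keys, then by the indicator; NaN goes last by default,
    # so an original row (indicator set) leads its key group.
    df.sort_values(keys + ['Originaldataindicator'])
      # Keep only the leading row of each key group.
      .drop_duplicates(subset=keys, keep='first')
      .reset_index(drop=True)
)
print(deduped)
```

This keeps one row per key combination and prefers the row with a non-null indicator when one exists.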
