pandas: how to drop duplicates from a DataFrame?

Question

Asked by Peter

My current df looks like this:

    IDnumber     Subid      Subsubid     Date        Originaldataindicator
      a            1          x           2006           NaN
      a            1          x           2007           NaN
      a            1          x           2008           NaN
      a            1          x           2008           1

The Originaldataindicator column exists because some of these observations were created so that every IDnumber has all three years, while others existed in the original dataset. What I want to achieve is to drop the duplicates and preferably keep the original data. Note that the row with Originaldataindicator set will not always be the last observation. To handle this, I first sort on IDnumber, Date and Originaldataindicator.

However when I use:

  df=df.drop_duplicates(subset=['IDnumber', 'Subid', 'Subsubid', 'Date'])

Nothing happens and I still observe the duplicate.

  df=df.drop_duplicates(subset=['IDnumber', 'Subid', 'Subsubid', 'Date'], inplace=True)

gives me an empty dataframe.

Am I misinterpreting what drop_duplicates does?

Just to avoid confusion, this is what I want:

    IDnumber     Subid      Subsubid     Date        Originaldataindicator
      a            1          x           2006           NaN
      a            1          x           2007           NaN
      a            1          x           2008           1

The data includes thousands of these IDs.

Answer 1

Accepted answer by jezrael

I think you need groupby and sort_values, and then the keep='first' parameter of drop_duplicates:

print(df)
  IDnumber  Subid Subsubid  Date  Originaldataindicator
0        a      1        x  2006                    NaN
1        a      1        x  2007                    NaN
2        a      1        x  2008                    NaN
3        a      1        x  2008                      1
4        a      1        x  2008                    NaN

# Sort inside each group so that non-NaN indicators come first (NaN sorts last by default)
df = (df.groupby(['IDnumber', 'Subid', 'Subsubid', 'Date'])
        .apply(lambda x: x.sort_values('Originaldataindicator'))
        .reset_index(drop=True))
print(df)
  IDnumber  Subid Subsubid  Date  Originaldataindicator
0        a      1        x  2006                    NaN
1        a      1        x  2007                    NaN
2        a      1        x  2008                      1
3        a      1        x  2008                    NaN
4        a      1        x  2008                    NaN

df1=df.drop_duplicates(subset=['IDnumber', 'Subid', 'Subsubid', 'Date'], keep='first')
print(df1)
  IDnumber  Subid Subsubid  Date  Originaldataindicator
0        a      1        x  2006                    NaN
1        a      1        x  2007                    NaN
2        a      1        x  2008                      1
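
The same ordering can also be reached without groupby, by sorting on the key columns plus the indicator in one call. The following is a minimal sketch under that assumption, not part of the original answer; the names keys and result are made up for illustration, and the sample frame just mirrors the one printed above. As far as I know, recent pandas releases have deprecated applying a function to the grouping columns in groupby.apply, so the plain sort may be the safer route there.

```python
import numpy as np
import pandas as pd

# Sample data mirroring the frame shown above
df = pd.DataFrame({
    'IDnumber': ['a'] * 5,
    'Subid': [1] * 5,
    'Subsubid': ['x'] * 5,
    'Date': [2006, 2007, 2008, 2008, 2008],
    'Originaldataindicator': [np.nan, np.nan, np.nan, 1, np.nan],
})

# Hypothetical helper name: the columns that define a duplicate
keys = ['IDnumber', 'Subid', 'Subsubid', 'Date']

result = (
    # NaN is placed last within each key group, so original rows come first
    df.sort_values(keys + ['Originaldataindicator'], na_position='last')
      # keep the first row of each key group, i.e. the original one if present
      .drop_duplicates(subset=keys, keep='first')
      .reset_index(drop=True)
)
print(result)
```

This keeps one row per key combination and prefers the row whose indicator is not NaN.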

Or use inplace:

df.drop_duplicates(subset=['IDnumber','Subid','Subsubid','Date'], keep='first', inplace=True)
print(df)
  IDnumber  Subid Subsubid  Date  Originaldataindicator
0        a      1        x  2006                    NaN
1        a      1        x  2007                    NaN
2        a      1        x  2008                      1
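
One detail worth spelling out: with inplace=True, drop_duplicates modifies the frame and returns None, so the result should not be assigned back. Assigning it, as in df = df.drop_duplicates(..., inplace=True), rebinds df to None, which fits the "empty" result described in the question. A small sketch with made-up sample data (the variable returned is only for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'Date': [2006, 2007, 2008, 2008],
    'Originaldataindicator': [None, None, 1, None],
})

# With inplace=True the method returns None and changes df itself
returned = df.drop_duplicates(subset=['Date'], keep='first', inplace=True)

print(returned)  # None
print(len(df))   # 3
```

So use either df = df.drop_duplicates(...) or df.drop_duplicates(..., inplace=True), but not both at once.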

If the Originaldataindicator column can have several non-null values within one group, use duplicated (you may want to pass all of the columns IDnumber, Subid, Subsubid, Date to it) together with isnull:

print(df)
  IDnumber  Subid Subsubid  Date  Originaldataindicator
0        a      1        x  2006                    NaN
1        a      1        x  2007                    NaN
2        a      1        x  2008                    NaN
3        a      1        x  2008                      1
4        a      1        x  2008                      1

print(df[~(df.duplicated('Date', keep=False) & pd.isnull(df['Originaldataindicator']))])
  IDnumber  Subid Subsubid  Date  Originaldataindicator
0        a      1        x  2006                    NaN
1        a      1        x  2007                    NaN
3        a      1        x  2008                      1
4        a      1        x  2008                      1

Explaining conditions:

print(df.duplicated('Date', keep=False))
0    False
1    False
2     True
3     True
4     True
dtype: bool

print (pd.isnull(df['Originaldataindicator']))
0     True
1     True
2     True
3    False
4    False
Name: Originaldataindicator, dtype: bool

print(~(df.duplicated('Date', keep=False) & pd.isnull(df['Originaldataindicator'])))
0     True
1     True
2    False
3     True
4     True
dtype: bool
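
Putting the two conditions together with all four key columns instead of only Date might look like the sketch below. The names keys, dup, missing and filtered are made up for illustration, and the data mirrors the second sample frame above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'IDnumber': ['a'] * 5,
    'Subid': [1] * 5,
    'Subsubid': ['x'] * 5,
    'Date': [2006, 2007, 2008, 2008, 2008],
    'Originaldataindicator': [np.nan, np.nan, np.nan, 1, 1],
})

keys = ['IDnumber', 'Subid', 'Subsubid', 'Date']

# True for every row whose key combination occurs more than once
dup = df.duplicated(keys, keep=False)
# True for rows that were not in the original dataset
missing = df['Originaldataindicator'].isnull()

# Drop rows that are both duplicated and missing the indicator
filtered = df[~(dup & missing)]
print(filtered)
```

Note that this mask removes every NaN row of a duplicated group, so a group whose duplicates are all NaN would disappear entirely; if that can occur in the data, the sort plus drop_duplicates approach is probably the better fit.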

Answer 2

Answer by Ami Tavory

Consider this:

df = pd.DataFrame({'a': [1, 2, 3, 3, 3], 'b': [1, 2, None, 1, None]})

Then

>>> df.sort_values(by=['a', 'b']).groupby(df.a).first()[['b']].reset_index()
    a   b
0   1   1
1   2   2
2   3   1

This sorts the items first by a, then by b (thus pushing the None values in each group last), then selects the first item per group.

I believe you can modify this to the specifics of your problem.
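
One possible adaptation to the columns from the question, as a rough sketch rather than a definitive version (keys and result are illustrative names, and the sample data is copied from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'IDnumber': ['a'] * 4,
    'Subid': [1] * 4,
    'Subsubid': ['x'] * 4,
    'Date': [2006, 2007, 2008, 2008],
    'Originaldataindicator': [np.nan, np.nan, np.nan, 1],
})

keys = ['IDnumber', 'Subid', 'Subsubid', 'Date']

result = (
    df.sort_values(keys + ['Originaldataindicator'])
      # as_index=False keeps the key columns as regular columns
      .groupby(keys, as_index=False)
      # first() takes the first non-null value per column in each group
      .first()
)
print(result)
```

Because first() skips nulls column by column, it can combine values from different rows when there are several non-key columns; with a single indicator column, as here, that is not an issue.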
