pandas: How to remove duplicates from a dataframe?
Original question: http://stackoverflow.com/questions/35433422/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by Peter
My current df looks like this:
IDnumber Subid Subsubid Date Originaldataindicator
a 1 x 2006 NaN
a 1 x 2007 NaN
a 1 x 2008 NaN
a 1 x 2008 1
The Originaldataindicator column exists because some of these observations were created so that every IDnumber has all three years, while others existed in the original dataset. What I want to achieve is to drop the duplicates and preferably keep the original data. Note that the row with Originaldataindicator set will not always be the last observation. To solve this, I first sort on IDnumber, Date and Originaldataindicator.
However when I use:
df=df.drop_duplicates(subset=['IDnumber', 'Subid', 'Subsubid', 'Date'])
Nothing happens and I still observe the duplicate.
df=df.drop_duplicates(subset=['IDnumber', 'Subid', 'Subsubid', 'Date'], inplace=True)
gives me an empty dataframe.
Am I misinterpreting what drop_duplicates does?
Just to avoid confusion, this is what I want:
IDnumber Subid Subsubid Date Originaldataindicator
a 1 x 2006 NaN
a 1 x 2007 NaN
a 1 x 2008 1
The data includes thousands of these IDs.
Accepted answer by jezrael
I think you need groupby and sort_values, and then use the parameter keep='first' of drop_duplicates:
print(df)
IDnumber Subid Subsubid Date Originaldataindicator
0 a 1 x 2006 NaN
1 a 1 x 2007 NaN
2 a 1 x 2008 NaN
3 a 1 x 2008 1
4 a 1 x 2008 NaN
# Sort within each group so that non-NaN indicators come first (NaN sorts last)
df = (df.groupby(['IDnumber', 'Subid', 'Subsubid', 'Date'])
        .apply(lambda x: x.sort_values('Originaldataindicator'))
        .reset_index(drop=True))
print(df)
IDnumber Subid Subsubid Date Originaldataindicator
0 a 1 x 2006 NaN
1 a 1 x 2007 NaN
2 a 1 x 2008 1
3 a 1 x 2008 NaN
4 a 1 x 2008 NaN
df1=df.drop_duplicates(subset=['IDnumber', 'Subid', 'Subsubid', 'Date'], keep='first')
print(df1)
IDnumber Subid Subsubid Date Originaldataindicator
0 a 1 x 2006 NaN
1 a 1 x 2007 NaN
2 a 1 x 2008 1
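The sort-then-deduplicate idea above can also be sketched without groupby. This is a minimal, self-contained version that assumes sample data modelled on the question; the names keys, ordered and deduped are illustrative only. It relies on sort_values placing NaN last by default, so the row with an indicator comes first within each key combination.

```python
import numpy as np
import pandas as pd

# Sample data modelled on the question; row 3 is the "original" observation.
df = pd.DataFrame({
    'IDnumber': ['a'] * 5,
    'Subid': [1] * 5,
    'Subsubid': ['x'] * 5,
    'Date': [2006, 2007, 2008, 2008, 2008],
    'Originaldataindicator': [np.nan, np.nan, np.nan, 1, np.nan],
})

keys = ['IDnumber', 'Subid', 'Subsubid', 'Date']

# Sort so that, within each key combination, non-NaN indicators come first
# (na_position='last' is the default, spelled out here for clarity).
ordered = df.sort_values(keys + ['Originaldataindicator'], na_position='last')

# Keep the first row per key combination, i.e. the original observation if any.
deduped = ordered.drop_duplicates(subset=keys, keep='first').reset_index(drop=True)
print(deduped)
```

The result has one row per (IDnumber, Subid, Subsubid, Date), and the 2008 row is the one whose indicator is set.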
Or use inplace:
df.drop_duplicates(subset=['IDnumber','Subid','Subsubid','Date'], keep='first', inplace=True)
print(df)
IDnumber Subid Subsubid Date Originaldataindicator
0 a 1 x 2006 NaN
1 a 1 x 2007 NaN
2 a 1 x 2008 1
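Note that the call above does not assign the result back to df. A likely explanation for the question's second attempt is that drop_duplicates returns None when inplace=True, so writing df = df.drop_duplicates(..., inplace=True) replaces the frame with None. A small sketch on made-up data (the variable result is only for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Date': [2006, 2007, 2008, 2008],
                   'Originaldataindicator': [None, None, 1, None]})

# With inplace=True the method modifies df and returns None,
# so assigning the result back would overwrite df with None.
result = df.drop_duplicates(subset=['Date'], keep='first', inplace=True)
print(result)   # None
print(len(df))  # 3
```

So either assign the return value without inplace, or use inplace=True without assigning, but not both.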
If the column Originaldataindicator has multiple values, use duplicated (you may want to pass all of the columns IDnumber, Subid, Subsubid, Date rather than only Date) and isnull:
print(df)
IDnumber Subid Subsubid Date Originaldataindicator
0 a 1 x 2006 NaN
1 a 1 x 2007 NaN
2 a 1 x 2008 NaN
3 a 1 x 2008 1
4 a 1 x 2008 1
print(df[~(df.duplicated('Date', keep=False) & pd.isnull(df['Originaldataindicator']))])
IDnumber Subid Subsubid Date Originaldataindicator
0 a 1 x 2006 NaN
1 a 1 x 2007 NaN
3 a 1 x 2008 1
4 a 1 x 2008 1
Explaining conditions:
print(df.duplicated('Date', keep=False))
0 False
1 False
2 True
3 True
4 True
dtype: bool
print(pd.isnull(df['Originaldataindicator']))
0 True
1 True
2 True
3 False
4 False
Name: Originaldataindicator, dtype: bool
print(~(df.duplicated('Date', keep=False) & pd.isnull(df['Originaldataindicator'])))
0 True
1 True
2 False
3 True
4 True
dtype: bool
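The same mask can be built on all four key columns instead of Date alone, as suggested above. The following is a minimal sketch on sample data modelled on the question; keys, is_dup, is_missing and filtered are illustrative names, not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'IDnumber': ['a'] * 5,
    'Subid': [1] * 5,
    'Subsubid': ['x'] * 5,
    'Date': [2006, 2007, 2008, 2008, 2008],
    'Originaldataindicator': [np.nan, np.nan, np.nan, 1, 1],
})
keys = ['IDnumber', 'Subid', 'Subsubid', 'Date']

# True for every row whose key combination occurs more than once.
is_dup = df.duplicated(subset=keys, keep=False)
# True for rows that were created rather than taken from the original data.
is_missing = df['Originaldataindicator'].isnull()

# Drop only rows that are both duplicated and lack an indicator.
filtered = df[~(is_dup & is_missing)]
print(filtered)
```

One caveat of this mask: if a key combination is duplicated and none of its rows has an indicator, all of those rows are dropped.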
Answer by Ami Tavory
Consider this:
df = pd.DataFrame({'a': [1, 2, 3, 3, 3], 'b': [1, 2, None, 1, None]})
Then
>>> df.sort_values(by=['a', 'b']).groupby(df.a).first()[['b']].reset_index()
a b
0 1 1
1 2 2
2 3 1
This sorts the items first by a, then by b (thus pushing the None values in each group last), then selects the first item per group.
I believe you can modify this to the specifics of your problem.
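As one possible adaptation to the columns in the question, the sort-and-take-first idea might look like the sketch below. The sample data and the names keys and result are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'IDnumber': ['a'] * 4,
    'Subid': [1] * 4,
    'Subsubid': ['x'] * 4,
    'Date': [2006, 2007, 2008, 2008],
    'Originaldataindicator': [np.nan, np.nan, np.nan, 1],
})
keys = ['IDnumber', 'Subid', 'Subsubid', 'Date']

# Sort by the keys and the indicator (NaN last), then take the first entry per group.
result = (df.sort_values(keys + ['Originaldataindicator'])
            .groupby(keys, as_index=False)
            .first())
print(result)
```

Note that GroupBy.first takes the first non-null value in each column, so it works column by column rather than selecting whole rows. With a single non-key column, as here, the two coincide.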