Original URL: http://stackoverflow.com/questions/44481768/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Remove duplicate rows from Pandas dataframe where only some columns have the same value
Asked by beta
I have a pandas dataframe as follows:
A B C
1 2 x
1 2 y
3 4 z
3 5 x
I want only one row to remain out of each group of rows that share the same values in specific columns. In the example above, I mean columns A and B. In other words, if a combination of values in columns A and B occurs more than once in the dataframe, only one of those rows should remain (which one does not matter).
FWIW: the maximum number of so-called duplicate rows (that is, rows where columns A and B are the same) is 2.
The result should look like this:
A B C
1 2 x
3 4 z
3 5 x
or
A B C
1 2 y
3 4 z
3 5 x
Answered by jezrael
Use drop_duplicates with the parameter subset; to keep the last row of each set of duplicates instead of the first, add keep='last':
df1 = df.drop_duplicates(subset=['A','B'])
# same as
# df1 = df.drop_duplicates(subset=['A','B'], keep='first')
print(df1)
   A  B  C
0  1  2  x
2  3  4  z
3  3  5  x
df2 = df.drop_duplicates(subset=['A','B'], keep='last')
print(df2)
   A  B  C
1  1  2  y
2  3  4  z
3  3  5  x
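For anyone who wants to run the answer end to end, here is a minimal self-contained sketch. The DataFrame construction is rebuilt from the table in the question, and the variable names first, last and first_reset are made up for illustration. The reset_index step is an optional extra that is not part of the original answer; it only renumbers the index after rows have been dropped.

```python
import pandas as pd

# Rebuild the example frame from the question (values taken from its table).
df = pd.DataFrame({
    "A": [1, 1, 3, 3],
    "B": [2, 2, 4, 5],
    "C": ["x", "y", "z", "x"],
})

# Keep the first row of each (A, B) combination; keep='first' is the default.
first = df.drop_duplicates(subset=["A", "B"])

# Keep the last row of each (A, B) combination instead.
last = df.drop_duplicates(subset=["A", "B"], keep="last")

# Optional extra (not in the original answer): renumber the index 0..n-1.
first_reset = first.reset_index(drop=True)

print(first)
print(last)
print(first_reset)
```

Note that drop_duplicates returns a new DataFrame and keeps the original index labels of the surviving rows, which is why the outputs above show index values such as 0, 2, 3 rather than 0, 1, 2.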