Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/44481768/

Time: 2020-09-14 03:45:46  Source: igfitidea

Remove duplicate rows from Pandas dataframe where only some columns have the same value

python, pandas, dataframe, duplicates

Asked by beta

I have a pandas dataframe as follows:

A   B   C
1   2   x
1   2   y
3   4   z
3   5   x

I want only one row to remain from each group of rows that share the same values in specific columns, in this example columns A and B. In other words, if a combination of values in columns A and B occurs more than once in the dataframe, only one of those rows should remain (which one does not matter).

FWIW: the maximum number of such duplicate rows (that is, rows where columns A and B are the same) is 2.

The result should look like this:

A   B   C
1   2   x
3   4   z
3   5   x

or

A   B   C
1   2   y
3   4   z
3   5   x

Answered by jezrael

Use drop_duplicates with the subset parameter. To keep the last row of each duplicate group instead of the first, add keep='last':

df1 = df.drop_duplicates(subset=['A', 'B'])
# same as
# df1 = df.drop_duplicates(subset=['A', 'B'], keep='first')
print(df1)
   A  B  C
0  1  2  x
2  3  4  z
3  3  5  x
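
For comparison, the default keep='first' behaviour can also be written as a boolean mask built with DataFrame.duplicated. This is only a sketch: the DataFrame construction below is an assumption rebuilt from the question's sample table, and the name df1_alt is made up for illustration.

```python
import pandas as pd

# Sample frame rebuilt from the question's table (assumed construction).
df = pd.DataFrame({
    "A": [1, 1, 3, 3],
    "B": [2, 2, 4, 5],
    "C": ["x", "y", "z", "x"],
})

# duplicated() marks every repeat of an (A, B) pair after its first occurrence.
mask = df.duplicated(subset=["A", "B"])  # keep='first' is the default

# Negating the mask keeps the first row of each (A, B) pair.
df1_alt = df[~mask]
print(df1_alt)
```

The mask form can be handy when the duplicate rows themselves need to be inspected (df[mask]) before they are dropped.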


df2 = df.drop_duplicates(subset=['A', 'B'], keep='last')
print(df2)
   A  B  C
1  1  2  y
2  3  4  z
3  3  5  x
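
The snippets above assume df already exists. A self-contained sketch that can be run as-is might look like the following. The DataFrame construction is an assumption based on the question's sample table, and the variable names (first, last, unique_only, first_renumbered) are illustrative rather than taken from the original answer. It also shows keep=False, which drops every row of a duplicated group, and reset_index for renumbering the result.

```python
import pandas as pd

# Sample frame rebuilt from the question's table (assumed construction).
df = pd.DataFrame({
    "A": [1, 1, 3, 3],
    "B": [2, 2, 4, 5],
    "C": ["x", "y", "z", "x"],
})

# Keep the first row of each (A, B) pair (the default).
first = df.drop_duplicates(subset=["A", "B"])

# Keep the last row of each (A, B) pair instead.
last = df.drop_duplicates(subset=["A", "B"], keep="last")

# keep=False drops every row whose (A, B) pair occurs more than once.
unique_only = df.drop_duplicates(subset=["A", "B"], keep=False)

# The original index labels are preserved; reset_index renumbers from 0.
first_renumbered = first.reset_index(drop=True)

print(first)
print(last)
print(unique_only)
print(first_renumbered)
```

Since the question says it does not matter which of the duplicate rows survives, either keep='first' or keep='last' satisfies it; keep=False does not, because it removes both rows with A=1, B=2.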