Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license and attribute it to the original authors, including the original URL and author information. Original Stack Overflow question: http://stackoverflow.com/questions/45497835/

How to drop duplicates based on two or more subsets criteria in Pandas data-frame

Tags: python, pandas, dataframe, pandas-groupby

Asked by logic8

Let's say this is my DataFrame:

import pandas as pd

df = pd.DataFrame({'bio': ['1', '1', '1', '4'],
                   'center': ['one', 'one', 'two', 'three'],
                   'outcome': ['f', 't', 'f', 'f']})

It looks like this:

  bio center outcome
0   1    one       f
1   1    one       t
2   1    two       f
3   4  three       f

I want to drop row 1 because it has the same bio and center as row 0. I want to keep row 2 because it has the same bio as row 0 but a different center.

Something like the following is what I am trying to do, but it doesn't work because drop_duplicates does not accept its arguments this way:

df.drop_duplicates(subset = 'bio' & subset = 'center' )

Any suggestions?

Edit: changed df a bit to fit the example in the accepted answer.

Answered by Gustavo Bezerra

Your syntax is wrong. subset takes a list of column labels:

df.drop_duplicates(subset=['bio', 'center'])

Note that df.drop_duplicates() with no arguments compares every column, so with the data above it would keep all four rows (rows 0 and 1 differ in outcome).

The subset call returns the following:

  bio center outcome
0   1    one       f
2   1    two       f
3   4  three       f
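
For anyone who wants to reproduce that output, here is a minimal runnable sketch using the data from the question. The variable name deduped is only illustrative. It deduplicates on bio and center alone, then shows that comparing all columns removes nothing for this particular data.

```python
import pandas as pd

df = pd.DataFrame({
    'bio': ['1', '1', '1', '4'],
    'center': ['one', 'one', 'two', 'three'],
    'outcome': ['f', 't', 'f', 'f'],
})

# Rows count as duplicates only when both bio and center match;
# by default the first occurrence (row 0) is kept and row 1 is dropped.
deduped = df.drop_duplicates(subset=['bio', 'center'])
print(deduped)

# With no subset, every column is compared. Rows 0 and 1 differ in
# outcome, so nothing is treated as a duplicate here.
print(len(df.drop_duplicates()))
```

Note that drop_duplicates returns a new DataFrame and leaves df unchanged, so assign the result if you want to keep it.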

Take a look at the df.drop_duplicates documentation for syntax details. subset should be a sequence of column labels.
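
As a follow-up sketch (not part of the original answer), the keep parameter controls which of the matching rows survives, and df.duplicated with the same subset shows which rows would be dropped. The variable names below are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'bio': ['1', '1', '1', '4'],
    'center': ['one', 'one', 'two', 'three'],
    'outcome': ['f', 't', 'f', 'f'],
})

cols = ['bio', 'center']

# Boolean mask marking rows that repeat an earlier (bio, center) pair.
mask = df.duplicated(subset=cols)
print(mask.tolist())

# keep='last' retains the last row of each duplicate group instead of the first.
keep_last = df.drop_duplicates(subset=cols, keep='last')
print(keep_last)

# keep=False drops every row that belongs to a duplicate group.
keep_none = df.drop_duplicates(subset=cols, keep=False)
print(keep_none)
```

Checking the mask first is a cheap way to confirm that the chosen columns identify exactly the rows you expect before anything is removed.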