How to drop duplicates based on two or more subset criteria in a Pandas data frame
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): Stack Overflow
Original source: http://stackoverflow.com/questions/45497835/
Asked by logic8
Let's say this is my data frame:
import pandas as pd

df = pd.DataFrame({'bio': ['1', '1', '1', '4'],
                   'center': ['one', 'one', 'two', 'three'],
                   'outcome': ['f', 't', 'f', 'f']})
It looks like this:
  bio center outcome
0   1    one       f
1   1    one       t
2   1    two       f
3   4  three       f
I want to drop row 1 because it has the same bio and center as row 0. I want to keep row 2 because it has the same bio as row 0 but a different center.
Something like this won't work, given how drop_duplicates takes its arguments, but it is what I am trying to do:
df.drop_duplicates(subset = 'bio' & subset = 'center' )
Any suggestions?
Edit: changed df a bit to fit the example in the correct answer.
Answered by Gustavo Bezerra
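Your syntax is wrong. Pass the columns to compare as a single list:
df.drop_duplicates(subset=['bio', 'center'])
With the data frame above, this returns the following:
  bio center outcome
0   1    one       f
2   1    two       f
3   4  three       f
Note that calling df.drop_duplicates() with no arguments, or with subset=['bio', 'center', 'outcome'], compares every column. A row is then dropped only when all three values match an earlier row, so with the data frame above row 1 would be kept, because its outcome differs from row 0.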
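As a quick check, here is a minimal sketch that runs drop_duplicates on the data frame exactly as it appears in the question. The variable names by_bio_center and by_all_columns are illustrative and not from the original post. With this data, restricting the comparison to bio and center drops row 1, whereas comparing all columns keeps all four rows because row 1 differs from row 0 in outcome.

```python
import pandas as pd

# Data frame copied from the question.
df = pd.DataFrame({'bio': ['1', '1', '1', '4'],
                   'center': ['one', 'one', 'two', 'three'],
                   'outcome': ['f', 't', 'f', 'f']})

# Compare only 'bio' and 'center': row 1 repeats row 0 on both, so it is dropped.
by_bio_center = df.drop_duplicates(subset=['bio', 'center'])

# Compare all columns (the default): row 1 differs in 'outcome', so nothing is dropped.
by_all_columns = df.drop_duplicates()

print(by_bio_center)
print(len(by_all_columns))
```

Both calls return a new data frame and leave df unchanged.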
Take a look at the df.drop_duplicates documentation for syntax details. subset should be a column label or a sequence of column labels.
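Besides subset, drop_duplicates also accepts a keep parameter, and the companion method df.duplicated returns the underlying boolean mask. The sketch below reuses the question's data to show how these might be combined. The variable names mask, keep_last and keep_none are illustrative only.

```python
import pandas as pd

# Data frame copied from the question.
df = pd.DataFrame({'bio': ['1', '1', '1', '4'],
                   'center': ['one', 'one', 'two', 'three'],
                   'outcome': ['f', 't', 'f', 'f']})

# True for each row whose ('bio', 'center') pair already appeared in an earlier row.
mask = df.duplicated(subset=['bio', 'center'])

# keep='first' is the default; keep='last' retains the last occurrence instead.
keep_last = df.drop_duplicates(subset=['bio', 'center'], keep='last')

# keep=False drops every row that belongs to a duplicated group.
keep_none = df.drop_duplicates(subset=['bio', 'center'], keep=False)

print(mask.tolist())
print(keep_last)
print(keep_none)
```

Inspecting the mask first can be a convenient way to confirm which rows would be removed before actually dropping them.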