pandas - How to drop unique rows in a pandas dataframe?
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/44888858/
How to drop unique rows in a pandas dataframe?
Asked by toto_tico
I am stuck with a seemingly easy problem: dropping unique rows in a pandas dataframe. Basically, the opposite of drop_duplicates().
Let's say this is my data:
A B C
0 foo 0 A
1 foo 1 A
2 foo 1 B
3 bar 1 A
I would like to drop the rows where A and B are unique, i.e. I would like to keep only rows 1 and 2.
I tried the following:
import pandas as pd

# Load DataFrame
df = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
uniques = df[['A', 'B']].drop_duplicates()
duplicates = df[~df.index.isin(uniques.index)]
But I only get row 2, as rows 0, 1, and 3 are in the uniques!
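To see why, it helps to inspect the intermediate uniques frame from the attempt above (a quick check for illustration, assuming the df defined above and pandas' default keep='first'):
# drop_duplicates keeps the first row of each (A, B) group by default,
# so index 1 survives in uniques and only index 2 gets filtered out
print (uniques)
A B
0 foo 0
1 foo 1
3 bar 1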
Answered by jezrael
Solutions for selecting all duplicated rows:
You can use duplicated with a subset and the parameter keep=False to select all duplicates:
df = df[df.duplicated(subset=['A','B'], keep=False)]
print (df)
A B C
1 foo 1 A
2 foo 1 B
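For reference, this is the boolean mask that duplicated produces here (a quick check, assuming the same df as above):
print (df.duplicated(subset=['A','B'], keep=False))
0 False
1 True
2 True
3 False
dtype: bool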
Solution with transform:
df = df[df.groupby(['A', 'B'])['A'].transform('size') > 1]
print (df)
A B C
1 foo 1 A
2 foo 1 B
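For intuition, transform('size') broadcasts each group's row count back onto df's original index, so the comparison lines up row by row (a quick illustration under the same assumptions):
# per-row group sizes, aligned with df's index
print (df.groupby(['A', 'B'])['A'].transform('size'))
0 1
1 2
2 2
3 1
Name: A, dtype: int64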
Slightly modified solutions for selecting all unique rows:
# invert the boolean mask with ~
df = df[~df.duplicated(subset=['A','B'], keep=False)]
print (df)
A B C
0 foo 0 A
3 bar 1 A
df = df[df.groupby(['A', 'B'])['A'].transform('size') == 1]
print (df)
A B C
0 foo 0 A
3 bar 1 A
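The unique-row masks are exact complements of the duplicated-row masks above, so the two selections partition df. A small sanity check for illustration (mask is just a throwaway name, same df assumed):
# every row ends up in exactly one of the two selections
mask = df.duplicated(subset=['A','B'], keep=False)
print (len(df[mask]) + len(df[~mask]) == len(df))
True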
Answered by toto_tico
I came up with a solution using groupby:
groupped = df.groupby(['A', 'B']).size().reset_index().rename(columns={0: 'count'})
uniques = groupped[groupped['count'] == 1]
# compare the (A, B) pairs themselves, not positions: groupped has its own 0..n-1 index, not df's
unique_pairs = list(zip(uniques['A'], uniques['B']))
duplicates = df[~df.set_index(['A', 'B']).index.isin(unique_pairs)]
Duplicates now has the proper result:
A B C
1 foo 1 A
2 foo 1 B
Also, my original attempt in the question can be fixed by simply adding keep=False in the drop_duplicates method:
# Load Dataframe
df = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
uniques = df[['A', 'B']].drop_duplicates(keep=False)
duplicates = df[~df.index.isin(uniques.index)]
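With keep=False, uniques keeps df's original index labels (0 and 3 here), so the index-based filter now picks out the intended rows (a quick check under the same assumptions):
print (uniques)
A B
0 foo 0
3 bar 1
print (duplicates)
A B C
1 foo 1 A
2 foo 1 B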
Please prefer @jezrael's answer, I think it is the safest(?), as I am relying on pandas indexes here.