pandas: find duplicate rows in a pandas dataframe
Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): Stack Overflow
Original source: http://stackoverflow.com/questions/47180983/
Asked by gabboshow
I am trying to find duplicate rows in a pandas dataframe.
import pandas as pd
df = pd.DataFrame(data=[[1,2],[3,4],[1,2],[1,4],[1,2]], columns=['col1','col2'])
df
Out[15]:
   col1  col2
0     1     2
1     3     4
2     1     2
3     1     4
4     1     2
duplicate_bool = df.duplicated(subset=['col1','col2'], keep='first')
duplicate = df.loc[duplicate_bool]
duplicate
Out[16]:
   col1  col2
2     1     2
4     1     2
Is there a way to add a column referring to the index of the first duplicate (the one kept)?
duplicate
Out[16]:
   col1  col2  index_original
2     1     2               0
4     1     2               0
Note: df could be very large in my case.
Answered by cs95
Use groupby, create a new column of indexes, and then call duplicated:
df['index_original'] = df.groupby(['col1', 'col2']).col1.transform('idxmin')
df[df.duplicated(subset=['col1','col2'], keep='first')]
   col1  col2  index_original
2     1     2               0
4     1     2               0
Details
I group by the first two columns and then call transform + idxmin to get the first index of each group.
df.groupby(['col1', 'col2']).col1.transform('idxmin')
0 0
1 1
2 0
3 3
4 0
Name: col1, dtype: int64
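The idxmin step works here because col1 is itself one of the grouping keys: every value inside a group is equal, and idxmin returns the label of the first occurrence of the minimum. As a rough sketch of an equivalent that does not depend on that tie-breaking, one could group the index labels directly and take the first one per group. The names first_idx and via_idxmin are only illustrative and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame(
    data=[[1, 2], [3, 4], [1, 2], [1, 4], [1, 2]],
    columns=['col1', 'col2'],
)

# Group the index labels themselves by the two key columns and
# broadcast the first label of each group back to every row.
first_idx = (
    df.index.to_series()
    .groupby([df['col1'], df['col2']])
    .transform('first')
)

# The idxmin-based transform from the answer, for comparison.
via_idxmin = df.groupby(['col1', 'col2']).col1.transform('idxmin')

print(first_idx.tolist())
print(via_idxmin.tolist())
```

On this sample frame both approaches should give the same labels for every row.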
duplicated gives me a boolean mask of the rows I want to keep:
df.duplicated(subset=['col1','col2'], keep='first')
0 False
1 False
2 True
3 False
4 True
dtype: bool
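Which rows get flagged depends on the keep argument of duplicated. A small sketch of the three options on the same sample data, in case a different notion of "the one kept" is needed (the variable names are just for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    data=[[1, 2], [3, 4], [1, 2], [1, 4], [1, 2]],
    columns=['col1', 'col2'],
)
keys = ['col1', 'col2']

# keep='first': flag every repeat except the first occurrence.
mask_first = df.duplicated(subset=keys, keep='first')
# keep='last': flag every repeat except the last occurrence.
mask_last = df.duplicated(subset=keys, keep='last')
# keep=False: flag every row that belongs to a duplicated group.
mask_all = df.duplicated(subset=keys, keep=False)

print(mask_first.tolist())
print(mask_last.tolist())
print(mask_all.tolist())
```

With keep='first' the original row at index 0 is left out of the mask, which is why a separate column is needed to point back to it.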
The rest is just boolean indexing.
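Putting the pieces together, here is a minimal end-to-end sketch. The helper name duplicates_with_original is hypothetical (it is not a pandas function and not from the original answer). It assumes subset is a non-empty list of column names and that the key columns contain no missing values. It copies only the duplicated rows rather than adding a column to the full frame, which may matter when df is large.

```python
import pandas as pd


def duplicates_with_original(df, subset):
    """Return duplicated rows with the index label of their first occurrence.

    Hypothetical helper for illustration only.
    """
    # First index label per group, broadcast to every row. Any key column
    # works for idxmin because its values are constant within a group.
    # sort=False skips sorting the group keys, which is not needed here.
    first_idx = df.groupby(subset, sort=False)[subset[0]].transform('idxmin')

    # Rows that repeat an earlier row on the key columns.
    mask = df.duplicated(subset=subset, keep='first')

    # Copy only the duplicated rows so the caller's frame is not modified.
    out = df[mask].copy()
    out['index_original'] = first_idx[mask]
    return out


df = pd.DataFrame(
    data=[[1, 2], [3, 4], [1, 2], [1, 4], [1, 2]],
    columns=['col1', 'col2'],
)
result = duplicates_with_original(df, ['col1', 'col2'])
print(result)
```

On the sample data this should return rows 2 and 4, each pointing back to index 0, while df itself keeps only its two original columns.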