pandas: find duplicate rows in a pandas DataFrame

Source: Stack Overflow, original question at http://stackoverflow.com/questions/47180983/. This content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors.


Tags: python, pandas, dataframe, duplicates

Asked by gabboshow

I am trying to find duplicate rows in a pandas DataFrame.

import pandas as pd

df = pd.DataFrame(data=[[1, 2], [3, 4], [1, 2], [1, 4], [1, 2]], columns=['col1', 'col2'])

df
Out[15]: 
   col1  col2
0     1     2
1     3     4
2     1     2
3     1     4
4     1     2

duplicate_bool = df.duplicated(subset=['col1', 'col2'], keep='first')
duplicate = df.loc[duplicate_bool]

duplicate
Out[16]: 
   col1  col2
2     1     2
4     1     2

Is there a way to add a column referring to the index of the first duplicate (the one kept)? The desired result is:

duplicate
Out[16]: 
   col1  col2  index_original
2     1     2               0
4     1     2               0

Note: df could be very large in my case.

Answered by cs95

Use groupby to create a new column holding the first index of each group, and then call duplicated:

df['index_original'] = df.groupby(['col1', 'col2']).col1.transform('idxmin')    
df[df.duplicated(subset=['col1','col2'], keep='first')]

   col1  col2  index_original
2     1     2               0
4     1     2               0


Details

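I groupby the first two columns and then call transform + idxmin to get the first index of each group. This works because col1 is one of the grouping keys, so it is constant within each group, and idxmin returns the index label of the first occurrence of the minimum.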

df.groupby(['col1', 'col2']).col1.transform('idxmin') 

0    0
1    1
2    0
3    3
4    0
Name: col1, dtype: int64
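
One point worth checking is that idxmin returns index labels, not row positions. The sketch below reuses the question's data with a hypothetical non-default integer index (10 to 14, an assumption made purely for illustration) so the two cannot be confused:

```python
import pandas as pd

# Same data as in the question, but with a hypothetical non-default index
# (an assumption for illustration) so labels differ from positions.
df = pd.DataFrame(
    data=[[1, 2], [3, 4], [1, 2], [1, 4], [1, 2]],
    columns=['col1', 'col2'],
    index=[10, 11, 12, 13, 14],
)

# col1 is constant inside each (col1, col2) group, so idxmin resolves the
# tie by returning the label of the first row in the group.
first_label = df.groupby(['col1', 'col2'])['col1'].transform('idxmin')
print(first_label.tolist())  # [10, 11, 10, 13, 10]
```

If positions rather than labels are needed, resetting the index before grouping is one way to get them.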

duplicated gives me a boolean mask of the rows I want to keep:

df.duplicated(subset=['col1','col2'], keep='first')

0    False
1    False
2     True
3    False
4     True
dtype: bool
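
The keep argument decides which occurrence is left unflagged. As a small sketch on the question's data, the three accepted values compare as follows:

```python
import pandas as pd

df = pd.DataFrame(
    data=[[1, 2], [3, 4], [1, 2], [1, 4], [1, 2]],
    columns=['col1', 'col2'],
)

# keep='first': every occurrence except the first is flagged.
mask_first = df.duplicated(subset=['col1', 'col2'], keep='first')
# keep='last': every occurrence except the last is flagged.
mask_last = df.duplicated(subset=['col1', 'col2'], keep='last')
# keep=False: every row that belongs to a duplicated group is flagged.
mask_all = df.duplicated(subset=['col1', 'col2'], keep=False)

print(mask_first.tolist())  # [False, False, True, False, True]
print(mask_last.tolist())   # [True, False, True, False, False]
print(mask_all.tolist())    # [True, False, True, False, True]
```

With keep='first', the unflagged row of each group is the one whose index the idxmin step reports, which is why the two calls pair up.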

The rest is just boolean indexing.
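
The steps above can be combined into a single helper that leaves the input frame unmodified. This is a minimal sketch, not the answer's own code: the function name find_duplicates_with_original and its parameters are hypothetical. It also assumes the first subset column is numeric and that the grouping columns contain no NaN, since groupby drops NaN keys by default.

```python
import pandas as pd


def find_duplicates_with_original(df, subset):
    """Return duplicated rows plus the index label of their first occurrence.

    Hypothetical helper for illustration. Assumes `subset` is a non-empty
    list of column names, the first of which is numeric, and that the
    grouping columns contain no NaN.
    """
    # Index label of the first row in each group; subset[0] is constant
    # within a group, so idxmin returns the group's first label.
    first_idx = df.groupby(subset)[subset[0]].transform('idxmin')
    # Rows that repeat an earlier row on the subset columns.
    mask = df.duplicated(subset=subset, keep='first')
    # assign() returns a new frame, so the caller's df is not mutated.
    return df.assign(index_original=first_idx)[mask]


df = pd.DataFrame(
    data=[[1, 2], [3, 4], [1, 2], [1, 4], [1, 2]],
    columns=['col1', 'col2'],
)
result = find_duplicates_with_original(df, ['col1', 'col2'])
print(result)
```

Using assign instead of writing df['index_original'] directly is a design choice for this sketch: the original frame stays free of the helper column, at the cost of one extra copy.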