pandas: Find duplicate rows in a pandas DataFrame

Question

Asked by gabboshow

I am trying to find duplicate rows in a pandas DataFrame.

import pandas as pd

df = pd.DataFrame(data=[[1, 2], [3, 4], [1, 2], [1, 4], [1, 2]], columns=['col1', 'col2'])

df
Out[15]: 
   col1  col2
0     1     2
1     3     4
2     1     2
3     1     4
4     1     2

duplicate_bool = df.duplicated(subset=['col1','col2'], keep='first')
duplicate = df.loc[duplicate_bool]

duplicate
Out[16]: 
   col1  col2
2     1     2
4     1     2

Is there a way to add a column holding the index of the first occurrence (the one that is kept), like this?

duplicate
Out[16]: 
   col1  col2  index_original
2     1     2               0
4     1     2               0

Note: df could be very large in my case.

Answer 1

Answered by cs95

Use groupby to create a new column holding the first index of each group, and then filter with duplicated:

df['index_original'] = df.groupby(['col1', 'col2']).col1.transform('idxmin')    
df[df.duplicated(subset=['col1','col2'], keep='first')]

   col1  col2  index_original
2     1     2               0
4     1     2               0

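Details

I group by the first two columns and then call transform with idxmin to get the first index of each group. Because col1 is one of the grouping keys, it is constant within each group, and idxmin returns the label of the first occurrence of the minimum, which is the group's first row.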

df.groupby(['col1', 'col2']).col1.transform('idxmin') 

0    0
1    1
2    0
3    3
4    0
Name: col1, dtype: int64
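
A minimal sketch, assuming the question's df, of an alternative that groups the index labels themselves rather than relying on the values of a data column. The variable names via_idxmin and via_first are illustrative and do not come from the answer.

```python
import pandas as pd

df = pd.DataFrame(
    data=[[1, 2], [3, 4], [1, 2], [1, 4], [1, 2]],
    columns=["col1", "col2"],
)

# Approach from the answer: label of the minimum col1 value per group.
via_idxmin = df.groupby(["col1", "col2"])["col1"].transform("idxmin")

# Alternative sketch: group the index labels by the key columns and
# broadcast the first label of each group back to every row.
via_first = (
    df.index.to_series()
    .groupby([df["col1"], df["col2"]])
    .transform("first")
)

print(via_idxmin.tolist())
print(via_first.tolist())
```

On this data both variants should give the same labels, so either can feed the index_original column.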

duplicated gives me a boolean mask of the rows I want to keep:

df.duplicated(subset=['col1','col2'], keep='first')

0    False
1    False
2     True
3    False
4     True
dtype: bool
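
As a hedged illustration of how the keep argument changes this mask, the sketch below compares the three documented settings on the question's df. The names keys, mask_first, mask_last and mask_all are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame(
    data=[[1, 2], [3, 4], [1, 2], [1, 4], [1, 2]],
    columns=["col1", "col2"],
)
keys = ["col1", "col2"]

# keep='first': every occurrence except the first is flagged.
mask_first = df.duplicated(subset=keys, keep="first")
# keep='last': every occurrence except the last is flagged.
mask_last = df.duplicated(subset=keys, keep="last")
# keep=False: all members of a duplicated group are flagged.
mask_all = df.duplicated(subset=keys, keep=False)

print(mask_first.tolist())  # [False, False, True, False, True]
print(mask_last.tolist())   # [True, False, True, False, False]
print(mask_all.tolist())    # [True, False, True, False, True]
```

With keep='first', the unflagged row of each group is exactly the one whose index ends up in index_original.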

The rest is just boolean indexing.
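
Putting the pieces together, here is a minimal end-to-end sketch that leaves the original frame unmodified. The helper name duplicates_with_original_index and its new_col parameter are hypothetical, introduced only for illustration; the logic is the groupby + idxmin + duplicated combination described above.

```python
import pandas as pd


def duplicates_with_original_index(df, subset, new_col="index_original"):
    """Return the duplicated rows plus the index label of the first occurrence.

    Hypothetical helper for illustration; not part of pandas.
    """
    # subset[0] is a grouping key, so it is constant within each group and
    # idxmin yields the label of the group's first row.
    first_idx = df.groupby(subset)[subset[0]].transform("idxmin")
    # True for every repeat after the first occurrence.
    mask = df.duplicated(subset=subset, keep="first")
    # assign() returns a new frame, so the caller's df is left untouched.
    return df.loc[mask].assign(**{new_col: first_idx[mask]})


df = pd.DataFrame(
    data=[[1, 2], [3, 4], [1, 2], [1, 4], [1, 2]],
    columns=["col1", "col2"],
)
result = duplicates_with_original_index(df, ["col1", "col2"])
print(result)
```

Returning a new frame instead of adding a column to df is a design choice for this sketch; the answer's version writes index_original onto df directly.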

