How to identify the first occurrence of duplicate rows in a Python pandas DataFrame

Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors. Original question: http://stackoverflow.com/questions/14952957/


Tags: python-2.7, dataframe, pandas

Asked by user1652054

I have a pandas DataFrame with duplicate values for a set of columns. For example:

import pandas as pd
df = pd.DataFrame({'Column1': {0: 1, 1: 2, 2: 3}, 'Column2': {0: 'ABC', 1: 'XYZ', 2: 'ABC'}, 'Column3': {0: 'DEF', 1: 'DEF', 2: 'DEF'}, 'Column4': {0: 10, 1: 40, 2: 10}})

In [2]: df
Out[2]:
   Column1 Column2 Column3  Column4
0        1     ABC     DEF       10
1        2     XYZ     DEF       40
2        3     ABC     DEF       10

Rows (1) and (3) are the same in Column2, Column3 and Column4. Essentially, row (3) is a duplicate of row (1).

I am looking for the following output:

Is_Duplicate: whether the row is a duplicate or not (this can be accomplished by using the "duplicated" method on the DataFrame columns Column2, Column3 and Column4).

Dup_Index: the original index of the row that the duplicate repeats.

In [3]: df
Out[3]: 
   Column1 Column2 Column3  Column4  Is_Duplicate  Dup_Index
0        1     ABC     DEF       10         False          0
1        2     XYZ     DEF       40         False          1
2        3     ABC     DEF       10          True          0

Answered by Andy Hayden

For the first column, there is a DataFrame method, duplicated:

In [11]: df.duplicated(['Column2', 'Column3', 'Column4'])
Out[11]: 
0    False
1    False
2     True

In [12]: df['is_duplicated'] = df.duplicated(['Column2', 'Column3', 'Column4'])
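
As a self-contained illustration of what duplicated flags, here is a minimal sketch. It rebuilds the question's data with plain lists, and the names key_cols, later_repeats and all_members are only illustrative. It assumes a pandas version in which duplicated accepts the keep argument.

```python
import pandas as pd

# Same data as in the question, written with plain lists.
df = pd.DataFrame({
    'Column1': [1, 2, 3],
    'Column2': ['ABC', 'XYZ', 'ABC'],
    'Column3': ['DEF', 'DEF', 'DEF'],
    'Column4': [10, 40, 10],
})
key_cols = ['Column2', 'Column3', 'Column4']  # illustrative name

# Default keep='first': the first occurrence is not flagged, later repeats are.
later_repeats = df.duplicated(subset=key_cols)

# keep=False flags every member of a duplicated group, including the first one.
all_members = df.duplicated(subset=key_cols, keep=False)

print(later_repeats.tolist())
print(all_members.tolist())
```

The default behaviour matches the Is_Duplicate column requested in the question, since only the later repeat (row 2) is marked.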

For the second column, you could try something like this:

In [13]: g = df.groupby(['Column2', 'Column3', 'Column4'])

In [14]: df1 = df.set_index(['Column2', 'Column3', 'Column4'])

In [15]: df1.index.map(lambda ind: g.indices[ind][0])
Out[15]: array([0, 1, 0])

In [16]: df['dup_index'] = df1.index.map(lambda ind: g.indices[ind][0])

In [17]: df
Out[17]: 
   Column1 Column2 Column3  Column4 is_duplicated  dup_index
0        1     ABC     DEF       10         False          0
1        2     XYZ     DEF       40         False          1
2        3     ABC     DEF       10          True          0
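
The same dup_index column can probably also be obtained without the intermediate df1, by grouping the index labels themselves and broadcasting the first label of each group. This is only a sketch of an alternative, not the answer's method. The names key_cols and labels are illustrative, and rows with missing values in the key columns are not handled here.

```python
import pandas as pd

df = pd.DataFrame({
    'Column1': [1, 2, 3],
    'Column2': ['ABC', 'XYZ', 'ABC'],
    'Column3': ['DEF', 'DEF', 'DEF'],
    'Column4': [10, 40, 10],
})
key_cols = ['Column2', 'Column3', 'Column4']  # illustrative name

df['is_duplicated'] = df.duplicated(subset=key_cols)

# A Series whose values are the index labels, so that 'first' returns
# the label of the first row in each group of identical key values.
labels = df.index.to_series()
df['dup_index'] = labels.groupby([df[c] for c in key_cols]).transform('first')

print(df)
```

Because transform returns a result aligned with the original index, the new column can be assigned directly.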

Answered by Boud

Let's say your DataFrame is stored in df.

You can use groupby to get the non-duplicated rows of your DataFrame. Here we have to leave out Column1, which is not part of the data being compared:

df_nodup = df.groupby(by=['Column2', 'Column3', 'Column4']).first()

You can then merge this new DataFrame with the original one by using the merge function:

df = df.merge(df_nodup, left_on=['Column2', 'Column3', 'Column4'], right_index=True, suffixes=('', '_dupindex'))

Finally, you can use the Column1_dupindex column that the merge added to the DataFrame to derive the required columns with a simple comparison:

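# A row is a duplicate when its Column1 differs from the Column1 of its group's first row
df['Is_Duplicate'] = df['Column1']!=df['Column1_dupindex']
# Dup_Index stays None for first occurrences; duplicates get the first row's Column1 value
df['Dup_Index'] = None
df['Dup_Index'] = df['Dup_Index'].where(df['Column1_dupindex']==df['Column1'], df['Column1_dupindex'])
# Drop the helper column added by the merge
del df['Column1_dupindex']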
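
To see the steps of this answer working together, here is a minimal end-to-end sketch on the question's data. The names key_cols and merged are illustrative, and the sort_index call is an added precaution so that the row order does not depend on how the merge orders its result. Note that in this approach Dup_Index holds the Column1 value of the first occurrence rather than its index label, so it assumes Column1 identifies rows uniquely.

```python
import pandas as pd

df = pd.DataFrame({
    'Column1': [1, 2, 3],
    'Column2': ['ABC', 'XYZ', 'ABC'],
    'Column3': ['DEF', 'DEF', 'DEF'],
    'Column4': [10, 40, 10],
})
key_cols = ['Column2', 'Column3', 'Column4']  # illustrative name

# First row of each group, indexed by the key columns; only Column1 remains as a column.
df_nodup = df.groupby(by=key_cols).first()

# Attach the first row's Column1 to every row as Column1_dupindex.
merged = df.merge(df_nodup, left_on=key_cols, right_index=True,
                  suffixes=('', '_dupindex')).sort_index()

merged['Is_Duplicate'] = merged['Column1'] != merged['Column1_dupindex']
merged['Dup_Index'] = None
merged['Dup_Index'] = merged['Dup_Index'].where(
    merged['Column1_dupindex'] == merged['Column1'], merged['Column1_dupindex'])
del merged['Column1_dupindex']

print(merged)
```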