How to identify the first occurrence of duplicate rows in a Python pandas DataFrame

Question

Asked by user1652054

I have a pandas DataFrame with duplicate values for a set of columns. For example:

import pandas as pd

df = pd.DataFrame({'Column1': {0: 1, 1: 2, 2: 3}, 'Column2': {0: 'ABC', 1: 'XYZ', 2: 'ABC'}, 'Column3': {0: 'DEF', 1: 'DEF', 2: 'DEF'}, 'Column4': {0: 10, 1: 40, 2: 10}})

In [2]: df
Out[2]: 
   Column1 Column2 Column3  Column4
0        1     ABC     DEF       10
1        2     XYZ     DEF       40
2        3     ABC     DEF       10

Rows (1) and (3) are the same in Column2, Column3 and Column4. Essentially, row (3) is a duplicate of row (1).

I am looking for the following output:

Is_Duplicate: whether the row is a duplicate or not (this can be accomplished by using the "duplicated" method on the DataFrame columns Column2, Column3 and Column4).

Dup_Index: the index of the original row that this row duplicates (the row's own index if it is not a duplicate).

In [3]: df
Out[3]: 
   Column1 Column2 Column3  Column4  Is_Duplicate  Dup_Index
0        1     ABC     DEF       10         False          0
1        2     XYZ     DEF       40         False          1
2        3     ABC     DEF       10          True          0

Answer 1

Answered by Andy Hayden

There is a DataFrame method, duplicated, for the first column:

In [11]: df.duplicated(['Column2', 'Column3', 'Column4'])
Out[11]: 
0    False
1    False
2     True

In [12]: df['is_duplicated'] = df.duplicated(['Column2', 'Column3', 'Column4'])
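
As a side note, duplicated also accepts a keep argument that controls which members of a duplicated group get flagged. The following is a minimal sketch on the question's data; the variable names key_cols, later_only and whole_group are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "Column1": [1, 2, 3],
    "Column2": ["ABC", "XYZ", "ABC"],
    "Column3": ["DEF", "DEF", "DEF"],
    "Column4": [10, 40, 10],
})
key_cols = ["Column2", "Column3", "Column4"]  # illustrative name

# Default keep="first": the first occurrence is not flagged, later ones are.
later_only = df.duplicated(subset=key_cols)

# keep=False: every member of a duplicated group is flagged, including the first.
whole_group = df.duplicated(subset=key_cols, keep=False)

print(later_only.tolist())   # [False, False, True]
print(whole_group.tolist())  # [True, False, True]
```

The default is what the question asks for, since only row 2 should be marked as a duplicate.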

For the second column, you could try something like this:

In [13]: g = df.groupby(['Column2', 'Column3', 'Column4'])

In [14]: df1 = df.set_index(['Column2', 'Column3', 'Column4'])

In [15]: df1.index.map(lambda ind: g.indices[ind][0])
Out[15]: array([0, 1, 0])

In [16]: df['dup_index'] = df1.index.map(lambda ind: g.indices[ind][0])

In [17]: df
Out[17]: 
   Column1 Column2 Column3  Column4 is_duplicated  dup_index
0        1     ABC     DEF       10         False          0
1        2     XYZ     DEF       40         False          1
2        3     ABC     DEF       10          True          0
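
The lookup above goes through g.indices, which, as far as I can tell, stores integer positions rather than index labels, so it matches the index only when the DataFrame has the default 0..n-1 index. A possible alternative, sketched here under that caveat and not taken from the original answer, is to group the index itself and broadcast the first label of each group with transform. The names key_cols and first_label are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Column1": [1, 2, 3],
    "Column2": ["ABC", "XYZ", "ABC"],
    "Column3": ["DEF", "DEF", "DEF"],
    "Column4": [10, 40, 10],
})
key_cols = ["Column2", "Column3", "Column4"]  # illustrative name

df["is_duplicated"] = df.duplicated(subset=key_cols)

# Turn the index into a Series, group it by the key columns, and give every
# row the first index label seen in its group.
first_label = (
    df.index.to_series()
      .groupby([df[c] for c in key_cols])
      .transform("first")
)
df["dup_index"] = first_label

print(df)
```

Because the values come from the index labels themselves, this sketch should also work when the index is not a plain range. Rows with missing values in the key columns would need separate handling, since groupby drops NaN keys by default.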

Answer 2

Answered by Boud

Let's say your dataframe is stored in df.

You can use groupby to get the non-duplicated rows of your dataframe. Here we have to ignore Column1, which is not part of the duplicated data:

df_nodup = df.groupby(by=['Column2', 'Column3', 'Column4']).first()

You can then merge this new dataframe with the original one by using the merge method:

df = df.merge(df_nodup, left_on=['Column2', 'Column3', 'Column4'], right_index=True, suffixes=('', '_dupindex'))

Finally, you can use the _dupindex column merged into the dataframe to compute the needed columns. Note that Column1_dupindex holds the Column1 value of the first occurrence in each group, so Dup_Index ends up as that value for duplicates and None otherwise:

df['Is_Duplicate'] = df['Column1']!=df['Column1_dupindex']
df['Dup_Index'] = None
df['Dup_Index'] = df['Dup_Index'].where(df['Column1_dupindex']==df['Column1'], df['Column1_dupindex'])
del df['Column1_dupindex']
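
If the goal is the index label of the first occurrence rather than its Column1 value, the same merge idea can be adapted. This is a hedged variant, not the code from the answer above: it assumes the index is unnamed (so reset_index produces a column called "index") and uses drop_duplicates in place of groupby(...).first(). The names key_cols, firsts and out are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Column1": [1, 2, 3],
    "Column2": ["ABC", "XYZ", "ABC"],
    "Column3": ["DEF", "DEF", "DEF"],
    "Column4": [10, 40, 10],
})
key_cols = ["Column2", "Column3", "Column4"]  # illustrative name

# One row per distinct key combination, carrying the index label of its
# first occurrence (drop_duplicates keeps the first row by default).
firsts = (
    df.reset_index()
      .drop_duplicates(subset=key_cols)
      .rename(columns={"index": "Dup_Index"})[key_cols + ["Dup_Index"]]
)

# A left merge keeps every original row and attaches the first-occurrence label.
out = df.merge(firsts, on=key_cols, how="left")

# A row is a duplicate when the label it points to is not its own.
# This comparison assumes df has the default 0..n-1 index, which merge recreates.
out["Is_Duplicate"] = out["Dup_Index"].to_numpy() != out.index.to_numpy()

print(out)
```

On the question's data this gives Dup_Index values 0, 1, 0 and flags only the last row as a duplicate.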
