How to identify the first occurrence of duplicate rows in a Python pandas DataFrame
Original question: http://stackoverflow.com/questions/14952957/
Warning: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL, and attribute it to the original authors (not me): Stack Overflow
Asked by user1652054
I have a pandas DataFrame with duplicate values for a set of columns. For example:
import pandas as pd
df = pd.DataFrame({'Column1': {0: 1, 1: 2, 2: 3}, 'Column2': {0: 'ABC', 1: 'XYZ', 2: 'ABC'}, 'Column3': {0: 'DEF', 1: 'DEF', 2: 'DEF'}, 'Column4': {0: 10, 1: 40, 2: 10}})
In [2]: df
Out[2]:
   Column1 Column2 Column3  Column4
0        1     ABC     DEF       10
1        2     XYZ     DEF       40
2        3     ABC     DEF       10
Rows (1) and (3), i.e. index 0 and index 2, are the same in Column2, Column3 and Column4. Essentially, row (3) is a duplicate of row (1).
I am looking for the following output:
Is_Duplicate: whether the row is a duplicate or not [can be accomplished by using the "duplicated" method on the DataFrame columns (Column2, Column3 and Column4)]
Dup_Index: the index of the original row, i.e. the first occurrence that the duplicate repeats.
In [3]: df
Out[3]:
   Column1 Column2 Column3  Column4  Is_Duplicate  Dup_Index
0        1     ABC     DEF       10         False          0
1        2     XYZ     DEF       40         False          1
2        3     ABC     DEF       10          True          0
Answered by Andy Hayden
There is a DataFrame method, duplicated, that gives you the first column:
In [11]: df.duplicated(['Column2', 'Column3', 'Column4'])
Out[11]:
0    False
1    False
2     True
dtype: bool
In [12]: df['is_duplicated'] = df.duplicated(['Column2', 'Column3', 'Column4'])
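As a small illustration of how duplicated decides which rows to flag, the sketch below compares the three values of its keep parameter on the question's data. The variable names (keys, flag_first, flag_last, flag_all) are only illustrative and do not come from the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'Column1': [1, 2, 3],
    'Column2': ['ABC', 'XYZ', 'ABC'],
    'Column3': ['DEF', 'DEF', 'DEF'],
    'Column4': [10, 40, 10],
})
keys = ['Column2', 'Column3', 'Column4']  # columns that define a duplicate

# keep='first' (the default): every occurrence after the first is flagged.
flag_first = df.duplicated(subset=keys, keep='first')
# keep='last': every occurrence before the last is flagged.
flag_last = df.duplicated(subset=keys, keep='last')
# keep=False: all members of a duplicated group are flagged.
flag_all = df.duplicated(subset=keys, keep=False)

print(pd.DataFrame({'first': flag_first, 'last': flag_last, 'all': flag_all}))
```

The default, keep='first', is the one that matches the Is_Duplicate column requested in the question.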
To do the second you could try something like this:
In [13]: g = df.groupby(['Column2', 'Column3', 'Column4'])
In [14]: df1 = df.set_index(['Column2', 'Column3', 'Column4'])
In [15]: df1.index.map(lambda ind: g.indices[ind][0])
Out[15]: array([0, 1, 0])
In [16]: df['dup_index'] = df1.index.map(lambda ind: g.indices[ind][0])
In [17]: df
Out[17]:
   Column1 Column2 Column3  Column4  is_duplicated  dup_index
0        1     ABC     DEF       10          False          0
1        2     XYZ     DEF       40          False          1
2        3     ABC     DEF       10           True          0
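The same two columns can probably be produced without going through g.indices, by letting groupby broadcast the first index label of each group back onto its rows. This is a minimal sketch rather than the answer's own method: the temporary column name _idx and the variable keys are assumptions made for illustration, and it assumes the key columns contain no missing values (groupby drops NaN keys by default).

```python
import pandas as pd

df = pd.DataFrame({
    'Column1': [1, 2, 3],
    'Column2': ['ABC', 'XYZ', 'ABC'],
    'Column3': ['DEF', 'DEF', 'DEF'],
    'Column4': [10, 40, 10],
})
keys = ['Column2', 'Column3', 'Column4']  # columns that define a duplicate

# Flag every occurrence after the first one within each key group.
df['Is_Duplicate'] = df.duplicated(subset=keys)

# Copy the index into a temporary column (hypothetical name '_idx'), then
# broadcast the first index label of each group to all rows of that group.
df['Dup_Index'] = (
    df.assign(_idx=df.index)
      .groupby(keys)['_idx']
      .transform('first')
)

print(df)
```

Unlike g.indices, which holds row positions, this version returns index labels, so it should also behave sensibly when the index is not a plain 0..n-1 range.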
Answered by Boud
Let's say your DataFrame is stored in df.
You can use groupby to get the non-duplicated rows of your DataFrame. Here we have to ignore Column1, which is not part of the data:
df_nodup = df.groupby(by=['Column2', 'Column3', 'Column4']).first()
You can then merge this new DataFrame with the original one by using the merge function:
df = df.merge(df_nodup, left_on=['Column2', 'Column3', 'Column4'], right_index=True, suffixes=('', '_dupindex'))
Finally, you can use the Column1_dupindex column added by the merge, which holds the Column1 value of the first row in each group, to compute the required columns:
df['Is_Duplicate'] = df['Column1']!=df['Column1_dupindex']
df['Dup_Index'] = None
df['Dup_Index'] = df['Dup_Index'].where(df['Column1_dupindex']==df['Column1'], df['Column1_dupindex'])
del df['Column1_dupindex']
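Note that the steps above identify the original row by its Column1 value rather than by its index. If the index label is wanted instead, as in the question's expected output, a variant of the same merge idea could look like the sketch below. It swaps groupby(...).first() for drop_duplicates, and the names keys, first_rows and result are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Column1': [1, 2, 3],
    'Column2': ['ABC', 'XYZ', 'ABC'],
    'Column3': ['DEF', 'DEF', 'DEF'],
    'Column4': [10, 40, 10],
})
keys = ['Column2', 'Column3', 'Column4']  # columns that define a duplicate

# One row per distinct key combination, carrying the index label of its
# first occurrence in a new Dup_Index column.
first_rows = (
    df.assign(Dup_Index=df.index)
      .drop_duplicates(subset=keys)[keys + ['Dup_Index']]
)

# A left merge keeps every row of df in its original order; first_rows has
# unique keys, so the row count does not change.
result = df.merge(first_rows, on=keys, how='left')
result.index = df.index  # merging on columns resets the index, so restore it

# A row is a duplicate when its own label differs from the first occurrence.
result['Is_Duplicate'] = result['Dup_Index'].to_numpy() != result.index.to_numpy()

print(result)
```

Here every row gets a Dup_Index, with non-duplicated rows pointing at themselves.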

