Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/14657241/

Date: 2020-08-18 12:05:02  Source: igfitidea

How do I get a list of all the duplicate items using pandas in python?

Tags: python, pandas, duplicates

Asked by BigHandsome

I have a list of items that likely has some export issues. I would like to get a list of the duplicate items so I can manually compare them. When I try to use the pandas duplicated method, it only returns the first duplicate. Is there a way to get all of the duplicates and not just the first one?

A small subsection of my dataset looks like this:

ID,ENROLLMENT_DATE,TRAINER_MANAGING,TRAINER_OPERATOR,FIRST_VISIT_DATE
1536D,12-Feb-12,"06DA1B3-Lebanon NH",,15-Feb-12
F15D,18-May-12,"06405B2-Lebanon NH",,25-Jul-12
8096,8-Aug-12,"0643D38-Hanover NH","0643D38-Hanover NH",25-Jun-12
A036,1-Apr-12,"06CB8CF-Hanover NH","06CB8CF-Hanover NH",9-Aug-12
8944,19-Feb-12,"06D26AD-Hanover NH",,4-Feb-12
1004E,8-Jun-12,"06388B2-Lebanon NH",,24-Dec-11
11795,3-Jul-12,"0649597-White River VT","0649597-White River VT",30-Mar-12
30D7,11-Nov-12,"06D95A3-Hanover NH","06D95A3-Hanover NH",30-Nov-11
3AE2,21-Feb-12,"06405B2-Lebanon NH",,26-Oct-12
B0FE,17-Feb-12,"06D1B9D-Hartland VT",,16-Feb-12
127A1,11-Dec-11,"064456E-Hanover NH","064456E-Hanover NH",11-Nov-12
161FF,20-Feb-12,"0643D38-Hanover NH","0643D38-Hanover NH",3-Jul-12
A036,30-Nov-11,"063B208-Randolph VT","063B208-Randolph VT",
475B,25-Sep-12,"06D26AD-Hanover NH",,5-Nov-12
151A3,7-Mar-12,"06388B2-Lebanon NH",,16-Nov-12
CA62,3-Jan-12,,,
D31B,18-Dec-11,"06405B2-Lebanon NH",,9-Jan-12
20F5,8-Jul-12,"0669C50-Randolph VT",,3-Feb-12
8096,19-Dec-11,"0649597-White River VT","0649597-White River VT",9-Apr-12
14E48,1-Aug-12,"06D3206-Hanover NH",,
177F8,20-Aug-12,"063B208-Randolph VT","063B208-Randolph VT",5-May-12
553E,11-Oct-12,"06D95A3-Hanover NH","06D95A3-Hanover NH",8-Mar-12
12D5F,18-Jul-12,"0649597-White River VT","0649597-White River VT",2-Nov-12
C6DC,13-Apr-12,"06388B2-Lebanon NH",,
11795,27-Feb-12,"0643D38-Hanover NH","0643D38-Hanover NH",19-Jun-12
17B43,11-Aug-12,,,22-Oct-12
A036,11-Aug-12,"06D3206-Hanover NH",,19-Jun-12

My code looks like this currently:

df_bigdata_duplicates = df_bigdata[df_bigdata.duplicated(cols='ID')]

There are a couple of duplicate items. But when I use the above code, I only get the first item. In the API reference, I see how I can get the last item, but I would like to have all of them so I can visually inspect them to see why I am getting the discrepancy. So, in this example I would like to get all three A036 entries, both 11795 entries, and any other duplicated entries, instead of just the first one. Any help is most appreciated.

Accepted answer by DSM

Method #1: print all rows where the ID is one of the IDs in duplicated:

>>> import pandas as pd
>>> df = pd.read_csv("dup.csv")
>>> ids = df["ID"]
>>> df[ids.isin(ids[ids.duplicated()])].sort("ID")
       ID ENROLLMENT_DATE        TRAINER_MANAGING        TRAINER_OPERATOR FIRST_VISIT_DATE
24  11795       27-Feb-12      0643D38-Hanover NH      0643D38-Hanover NH        19-Jun-12
6   11795        3-Jul-12  0649597-White River VT  0649597-White River VT        30-Mar-12
18   8096       19-Dec-11  0649597-White River VT  0649597-White River VT         9-Apr-12
2    8096        8-Aug-12      0643D38-Hanover NH      0643D38-Hanover NH        25-Jun-12
12   A036       30-Nov-11     063B208-Randolph VT     063B208-Randolph VT              NaN
3    A036        1-Apr-12      06CB8CF-Hanover NH      06CB8CF-Hanover NH         9-Aug-12
26   A036       11-Aug-12      06D3206-Hanover NH                     NaN        19-Jun-12

but I couldn't think of a nice way to prevent repeating ids so many times. I prefer method #2: groupby on the ID.

>>> pd.concat(g for _, g in df.groupby("ID") if len(g) > 1)
       ID ENROLLMENT_DATE        TRAINER_MANAGING        TRAINER_OPERATOR FIRST_VISIT_DATE
6   11795        3-Jul-12  0649597-White River VT  0649597-White River VT        30-Mar-12
24  11795       27-Feb-12      0643D38-Hanover NH      0643D38-Hanover NH        19-Jun-12
2    8096        8-Aug-12      0643D38-Hanover NH      0643D38-Hanover NH        25-Jun-12
18   8096       19-Dec-11  0649597-White River VT  0649597-White River VT         9-Apr-12
3    A036        1-Apr-12      06CB8CF-Hanover NH      06CB8CF-Hanover NH         9-Aug-12
12   A036       30-Nov-11     063B208-Randolph VT     063B208-Randolph VT              NaN
26   A036       11-Aug-12      06D3206-Hanover NH                     NaN        19-Jun-12
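
Method #2 can also be written with groupby(...).filter(...), which returns the matching rows in their original order rather than grouped by ID. The following is a minimal sketch, assuming a small made-up frame in place of dup.csv; the VISIT column and its values are purely illustrative.

```python
import pandas as pd

# Hypothetical stand-in for dup.csv; only the ID column matters here.
df = pd.DataFrame({
    "ID": ["A036", "8096", "F15D", "A036", "8096", "A036"],
    "VISIT": [1, 2, 3, 4, 5, 6],
})

# Keep every row whose ID group has more than one member.
dupes = df.groupby("ID").filter(lambda g: len(g) > 1)

# Sort afterwards so rows sharing an ID sit next to each other.
# mergesort is stable, so rows with equal IDs keep their original order.
dupes_sorted = dupes.sort_values("ID", kind="mergesort")
print(dupes_sorted)
```

Whether this reads better than the pd.concat form is a matter of taste; both select the same rows.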

Answer by Oshbocker

By calling the pandas duplicated method twice, once with take_last=False and once with take_last=True, and combining the two results with an element-wise logical or, you can obtain a subset of your dataframe that includes all of the duplicates.

df_bigdata_duplicates = df_bigdata[
    df_bigdata.duplicated(cols='ID', take_last=False) |
    df_bigdata.duplicated(cols='ID', take_last=True)
]
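
In later pandas releases the cols and take_last arguments were replaced by subset and keep. A hedged sketch of the same idea with those names follows, using a made-up df_bigdata (the N column is illustrative only); it also checks that the combined mask matches keep=False.

```python
import pandas as pd

# Hypothetical data; only the ID column drives the result.
df_bigdata = pd.DataFrame({
    "ID": ["A036", "8096", "A036", "F15D", "A036"],
    "N": range(5),
})

# True for every occurrence except the first one of each ID.
not_first = df_bigdata.duplicated(subset="ID", keep="first")
# True for every occurrence except the last one of each ID.
not_last = df_bigdata.duplicated(subset="ID", keep="last")

# A row is part of a duplicate group if either mask flags it.
mask = not_first | not_last
df_bigdata_duplicates = df_bigdata[mask]
print(df_bigdata_duplicates)
```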

Answer by user666

As of pandas version 0.17, you can set keep=False in the duplicated function to get all the duplicate items.

In [1]: import pandas as pd

In [2]: df = pd.DataFrame(['a','b','c','d','a','b'])

In [3]: df
Out[3]: 
       0
    0  a
    1  b
    2  c
    3  d
    4  a
    5  b

In [4]: df[df.duplicated(keep=False)]
Out[4]: 
       0
    0  a
    1  b
    4  a
    5  b

Answer by Kelly ChowChow

df[df.duplicated(['ID'], keep=False)]

This returns all of the duplicated rows, including the first occurrence of each.

According to the documentation:

keep : {'first', 'last', False}, default 'first'

  • first : Mark duplicates as True except for the first occurrence.
  • last : Mark duplicates as True except for the last occurrence.
  • False : Mark all duplicates as True.
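
The three keep options can be compared side by side on a tiny Series. This is only an illustrative sketch; the values in ids are made up.

```python
import pandas as pd

# "a" appears three times, "b" and "c" once each.
ids = pd.Series(["a", "b", "a", "c", "a"])

keep_first = ids.duplicated(keep="first").tolist()
keep_last = ids.duplicated(keep="last").tolist()
keep_none = ids.duplicated(keep=False).tolist()

print(keep_first)  # [False, False, True, False, True]
print(keep_last)   # [True, False, True, False, False]
print(keep_none)   # [True, False, True, False, True]
```

Only keep=False flags every member of a duplicate group, which is what the question asks for.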

Answer by Hariprasad

df[df['ID'].duplicated() == True]

This worked for me

Answer by yoonghm

This may not be a solution to the question, but it illustrates how duplicated behaves with and without a column subset:

import pandas as pd

df = pd.DataFrame({
    'A': [1,1,3,4],
    'B': [2,2,5,6],
    'C': [3,4,7,6],
})

print(df)
print(df.duplicated(keep=False))
print(df.duplicated(['A','B'], keep=False))

The outputs:

   A  B  C
0  1  2  3
1  1  2  4
2  3  5  7
3  4  6  6

0    False
1    False
2    False
3    False
dtype: bool

0     True
1     True
2    False
3    False
dtype: bool

Answer by Deepak

As I am unable to comment, I am posting this as a separate answer.

To find duplicates based on more than one column, list every column name as below. It returns the duplicated rows (with the default keep='first', the first occurrence of each group is not included):

df[df[['product_uid', 'product_title', 'user']].duplicated() == True]
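
To also see the first occurrence of each multi-column duplicate, keep=False can be combined with subset. A minimal sketch, assuming made-up rows for the three column names used above (the score column is hypothetical):

```python
import pandas as pd

# Illustrative data; rows 0 and 1 match on all three key columns.
df = pd.DataFrame({
    "product_uid": [1, 1, 2, 1],
    "product_title": ["x", "x", "y", "x"],
    "user": ["u1", "u1", "u2", "u2"],
    "score": [5, 4, 3, 2],
})
cols = ["product_uid", "product_title", "user"]

# Default keep='first': only the later copies are returned.
later_only = df[df.duplicated(subset=cols)]

# keep=False: every row in a duplicate group is returned.
all_copies = df[df.duplicated(subset=cols, keep=False)]

print(later_only)
print(all_copies)
```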

Answer by PREM JILLA

df[df.duplicated(['ID'])==True].sort_values('ID')

Answer by Nafeez Quraishi

sort("ID") does not seem to work anymore; it appears to be deprecated according to the sort documentation. Use sort_values("ID") instead to sort after filtering for duplicates, as follows:

df[df.ID.duplicated(keep=False)].sort_values("ID")
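
Applied to a few rows copied from the question's sample data, the one-liner might be used as in the sketch below. Reading the CSV from an in-memory string and forcing ID to str are assumptions made here so the example is self-contained.

```python
import io
import pandas as pd

# A handful of rows taken from the question's sample data.
csv_text = '''ID,ENROLLMENT_DATE,TRAINER_MANAGING,TRAINER_OPERATOR,FIRST_VISIT_DATE
1536D,12-Feb-12,"06DA1B3-Lebanon NH",,15-Feb-12
8096,8-Aug-12,"0643D38-Hanover NH","0643D38-Hanover NH",25-Jun-12
A036,1-Apr-12,"06CB8CF-Hanover NH","06CB8CF-Hanover NH",9-Aug-12
11795,3-Jul-12,"0649597-White River VT","0649597-White River VT",30-Mar-12
A036,30-Nov-11,"063B208-Randolph VT","063B208-Randolph VT",
8096,19-Dec-11,"0649597-White River VT","0649597-White River VT",9-Apr-12
11795,27-Feb-12,"0643D38-Hanover NH","0643D38-Hanover NH",19-Jun-12
A036,11-Aug-12,"06D3206-Hanover NH",,19-Jun-12
'''

# Read ID as text so values such as 8096 and A036 compare as strings.
df = pd.read_csv(io.StringIO(csv_text), dtype={"ID": str})

# Flag every row whose ID occurs more than once, then group them by sorting.
dupes = df[df["ID"].duplicated(keep=False)].sort_values("ID", kind="mergesort")
print(dupes[["ID", "ENROLLMENT_DATE"]])
```

The stable sort keeps rows with the same ID in file order, which makes side-by-side comparison easier.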

Answer by LetzerWille

For my database, duplicated(keep=False) did not work until the column was sorted.

data.sort_values(by=['Order ID'], inplace=True)
df = data[data['Order ID'].duplicated(keep=False)]
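
One way to check whether sorting matters for a given dataset is to compare the flagged rows with and without it. The sketch below uses a made-up data frame (the Item column is hypothetical); in this toy case the same rows are flagged either way, so a difference on real data may point to another cause.

```python
import pandas as pd

# Illustrative data with repeated Order IDs in unsorted order.
data = pd.DataFrame({
    "Order ID": [3, 1, 3, 2, 1],
    "Item": ["a", "b", "c", "d", "e"],
})

# Duplicates flagged on the data as-is.
unsorted_dupes = data[data["Order ID"].duplicated(keep=False)]

# Duplicates flagged after sorting a copy by Order ID.
sorted_data = data.sort_values(by=["Order ID"])
sorted_dupes = sorted_data[sorted_data["Order ID"].duplicated(keep=False)]

# Compare the sets of row labels that were flagged.
same_rows = set(unsorted_dupes.index) == set(sorted_dupes.index)
print(same_rows)
```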