How do I get a list of all the duplicate items using pandas in python?
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/14657241/
Asked by BigHandsome
I have a list of items that likely has some export issues. I would like to get a list of the duplicate items so I can manually compare them. When I try to use the pandas duplicated method, it only returns the first duplicate. Is there a way to get all of the duplicates, and not just the first one?
A small subsection of my dataset looks like this:
ID,ENROLLMENT_DATE,TRAINER_MANAGING,TRAINER_OPERATOR,FIRST_VISIT_DATE
1536D,12-Feb-12,"06DA1B3-Lebanon NH",,15-Feb-12
F15D,18-May-12,"06405B2-Lebanon NH",,25-Jul-12
8096,8-Aug-12,"0643D38-Hanover NH","0643D38-Hanover NH",25-Jun-12
A036,1-Apr-12,"06CB8CF-Hanover NH","06CB8CF-Hanover NH",9-Aug-12
8944,19-Feb-12,"06D26AD-Hanover NH",,4-Feb-12
1004E,8-Jun-12,"06388B2-Lebanon NH",,24-Dec-11
11795,3-Jul-12,"0649597-White River VT","0649597-White River VT",30-Mar-12
30D7,11-Nov-12,"06D95A3-Hanover NH","06D95A3-Hanover NH",30-Nov-11
3AE2,21-Feb-12,"06405B2-Lebanon NH",,26-Oct-12
B0FE,17-Feb-12,"06D1B9D-Hartland VT",,16-Feb-12
127A1,11-Dec-11,"064456E-Hanover NH","064456E-Hanover NH",11-Nov-12
161FF,20-Feb-12,"0643D38-Hanover NH","0643D38-Hanover NH",3-Jul-12
A036,30-Nov-11,"063B208-Randolph VT","063B208-Randolph VT",
475B,25-Sep-12,"06D26AD-Hanover NH",,5-Nov-12
151A3,7-Mar-12,"06388B2-Lebanon NH",,16-Nov-12
CA62,3-Jan-12,,,
D31B,18-Dec-11,"06405B2-Lebanon NH",,9-Jan-12
20F5,8-Jul-12,"0669C50-Randolph VT",,3-Feb-12
8096,19-Dec-11,"0649597-White River VT","0649597-White River VT",9-Apr-12
14E48,1-Aug-12,"06D3206-Hanover NH",,
177F8,20-Aug-12,"063B208-Randolph VT","063B208-Randolph VT",5-May-12
553E,11-Oct-12,"06D95A3-Hanover NH","06D95A3-Hanover NH",8-Mar-12
12D5F,18-Jul-12,"0649597-White River VT","0649597-White River VT",2-Nov-12
C6DC,13-Apr-12,"06388B2-Lebanon NH",,
11795,27-Feb-12,"0643D38-Hanover NH","0643D38-Hanover NH",19-Jun-12
17B43,11-Aug-12,,,22-Oct-12
A036,11-Aug-12,"06D3206-Hanover NH",,19-Jun-12
My code looks like this currently:
df_bigdata_duplicates = df_bigdata[df_bigdata.duplicated(cols='ID')]
There are a couple of duplicate items. But when I use the above code, I only get the first item. In the API reference, I see how I can get the last item, but I would like to have all of them so I can visually inspect them and see why I am getting the discrepancy. So, in this example I would like to get all three A036 entries, both 11795 entries, and any other duplicated entries, instead of just the first one. Any help is most appreciated.
Accepted answer by DSM
Method #1: print all rows where the ID is one of the IDs in duplicated:
>>> import pandas as pd
>>> df = pd.read_csv("dup.csv")
>>> ids = df["ID"]
>>> df[ids.isin(ids[ids.duplicated()])].sort("ID")
ID ENROLLMENT_DATE TRAINER_MANAGING TRAINER_OPERATOR FIRST_VISIT_DATE
24 11795 27-Feb-12 0643D38-Hanover NH 0643D38-Hanover NH 19-Jun-12
6 11795 3-Jul-12 0649597-White River VT 0649597-White River VT 30-Mar-12
18 8096 19-Dec-11 0649597-White River VT 0649597-White River VT 9-Apr-12
2 8096 8-Aug-12 0643D38-Hanover NH 0643D38-Hanover NH 25-Jun-12
12 A036 30-Nov-11 063B208-Randolph VT 063B208-Randolph VT NaN
3 A036 1-Apr-12 06CB8CF-Hanover NH 06CB8CF-Hanover NH 9-Aug-12
26 A036 11-Aug-12 06D3206-Hanover NH NaN 19-Jun-12
but I couldn't think of a nice way to prevent repeating ids so many times. I prefer method #2: groupby on the ID.
>>> pd.concat(g for _, g in df.groupby("ID") if len(g) > 1)
ID ENROLLMENT_DATE TRAINER_MANAGING TRAINER_OPERATOR FIRST_VISIT_DATE
6 11795 3-Jul-12 0649597-White River VT 0649597-White River VT 30-Mar-12
24 11795 27-Feb-12 0643D38-Hanover NH 0643D38-Hanover NH 19-Jun-12
2 8096 8-Aug-12 0643D38-Hanover NH 0643D38-Hanover NH 25-Jun-12
18 8096 19-Dec-11 0649597-White River VT 0649597-White River VT 9-Apr-12
3 A036 1-Apr-12 06CB8CF-Hanover NH 06CB8CF-Hanover NH 9-Aug-12
12 A036 30-Nov-11 063B208-Randolph VT 063B208-Randolph VT NaN
26 A036 11-Aug-12 06D3206-Hanover NH NaN 19-Jun-12
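(Editor's note, not part of the original answer: DataFrame.sort was deprecated in pandas 0.17 in favor of sort_values and later removed, and duplicated gained the keep parameter in 0.17. A rough sketch of the same two methods on a current pandas, assuming the same df as above:)
>>> ids = df["ID"]
>>> df[ids.isin(ids[ids.duplicated()])].sort_values("ID")  # method #1 with the modern sort API
>>> df[ids.duplicated(keep=False)]                         # keep=False marks every occurrence at once
>>> df.groupby("ID").filter(lambda g: len(g) > 1)          # method #2 as a one-liner via groupby-filter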
Answer by Oshbocker
Using an element-wise logical or, and setting the take_last argument of the pandas duplicated method to both True and False, you can obtain a set from your dataframe that includes all of the duplicates.
df_bigdata_duplicates = df_bigdata[df_bigdata.duplicated(cols='ID', take_last=False) |
                                   df_bigdata.duplicated(cols='ID', take_last=True)]
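(Editor's note: pandas 0.17 renamed cols to subset and replaced take_last with keep, so on a current pandas the union above collapses to a single call — a sketch under that assumption:)
# keep=False marks every occurrence, replacing the take_last=True/False union
df_bigdata_duplicates = df_bigdata[df_bigdata.duplicated(subset='ID', keep=False)]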
Answer by user666
With Pandas version 0.17, you can set keep=False in the duplicated function to get all the duplicate items.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame(['a','b','c','d','a','b'])
In [3]: df
Out[3]:
0
0 a
1 b
2 c
3 d
4 a
5 b
In [4]: df[df.duplicated(keep=False)]
Out[4]:
0
0 a
1 b
4 a
5 b
Answer by Kelly ChowChow
df[df.duplicated(['ID'], keep=False)]
It will return all duplicated rows back to you.
According to the documentation:
keep : {'first', 'last', False}, default 'first'
- first : Mark duplicates as True except for the first occurrence.
- last : Mark duplicates as True except for the last occurrence.
- False : Mark all duplicates as True.
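(Editor's note: a minimal sketch, not part of the original answer, showing what the three keep options flag on a toy Series:)
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'a'])
s.duplicated(keep='first')  # False, False, True, True  -- later occurrences flagged
s.duplicated(keep='last')   # True, False, True, False  -- earlier occurrences flagged
s.duplicated(keep=False)    # True, False, True, True   -- every duplicate flagged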
Answer by Hariprasad
df[df['ID'].duplicated() == True]
This worked for me.
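(Editor's caveat: without keep=False, duplicated flags only the second and later occurrences, so the line above omits the first row of each duplicated ID. A sketch that includes the first occurrences as well:)
df[df['ID'].duplicated(keep=False)]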
Answer by yoonghm
This may not be a solution to the question, but it illustrates the behavior with examples:
import pandas as pd

df = pd.DataFrame({
    'A': [1,1,3,4],
    'B': [2,2,5,6],
    'C': [3,4,7,6],
})
print(df)
df.duplicated(keep=False)
df.duplicated(['A','B'], keep=False)
The outputs:
A B C
0 1 2 3
1 1 2 4
2 3 5 7
3 4 6 6
0 False
1 False
2 False
3 False
dtype: bool
0 True
1 True
2 False
3 False
dtype: bool
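(Editor's follow-up sketch: the second mask can be used to select the duplicated rows themselves, which here would return rows 0 and 1:)
df[df.duplicated(['A','B'], keep=False)]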
Answer by Deepak
As I am unable to comment, I am posting this as a separate answer.
To find duplicates on the basis of more than one column, list every column name as below, and it will return all the duplicated row sets:
df[df[['product_uid', 'product_title', 'user']].duplicated() == True]
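(Editor's caveat: as in the single-column case, this flags only the later occurrences. Passing the columns via subset together with keep=False returns every occurrence — a sketch reusing the answer's column names:)
df[df.duplicated(subset=['product_uid', 'product_title', 'user'], keep=False)]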
Answer by PREM JILLA
df[df.duplicated(['ID'])==True].sort_values('ID')
Answer by LetzerWille
For my database, duplicated(keep=False) did not work until the column was sorted.
data.sort_values(by=['Order ID'], inplace=True)     # sort first so duplicate rows sit next to each other
df = data[data['Order ID'].duplicated(keep=False)]  # keep every occurrence of a duplicated Order ID

