Find duplicates with groupby in Pandas

Note: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute the original authors (not this site). Original question: http://stackoverflow.com/questions/33225631/

python, pandas

Asked by GiannisIordanou

I read a csv file using Pandas. Then, I am checking to see if there are any duplicate rows in the data using the code below:

import pandas as pd

# Read the data, treating empty strings, spaces, and "-" as missing values
df = pd.read_csv("data.csv", na_values=["", " ", "-"])

print(df.shape)
# (71644, 15)

print(df.drop_duplicates().shape)
# (31171, 15)

I find that there are some duplicate rows, so I want to see which rows appear more than once:

# Group on every column so that identical rows fall into the same group
data_groups = df.groupby(df.columns.tolist())
size = data_groups.size()
size[size > 1]

Doing that I get Series([], dtype: int64).

Furthermore, I can find the duplicate rows by doing the following:

duplicates = df[(df.duplicated() == True)]

print(duplicates.shape)
# (40473, 15)
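
As a side note (an editorial aside, not from the original post): df.duplicated() flags only the second and later occurrences of each repeated row, which is consistent with the numbers above (71644 − 40473 = 31171 rows survive drop_duplicates()). A minimal sketch for also pulling in the first occurrence of every duplicated row, assuming the standard keep=False option of duplicated() in your pandas version:

# Mark every occurrence of a duplicated row, not just the second and later copies
all_duplicates = df[df.duplicated(keep=False)]
print(all_duplicates.shape)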

So df.drop_duplicates() and df[(df.duplicated() == True)] show that there are duplicate rows, but groupby doesn't.

My data consists of strings, integers, floats and NaN.

Have I misunderstood something in the functions I mention above, or is something else happening?

Answered by Parfait

Simply add reset_index() to realign the aggregates to a new dataframe.

Additionally, the size() function creates an unmarked 0 column, which you can use to filter for duplicate rows. Then, just find the length of the resultant data frame to output a count of duplicates, like the other functions: drop_duplicates(), duplicated() == True.

data_groups = df.groupby(df.columns.tolist())
size = data_groups.size().reset_index() 
size[size[0] > 1]        # DATAFRAME OF DUPLICATES

len(size[size[0] > 1])   # NUMBER OF DUPLICATES
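
For reference, here is a minimal, self-contained sketch of the same approach on invented data (the toy DataFrame and its column names are made up for illustration, not taken from the original post). One detail worth knowing for data that contains NaN, like the asker's: by default groupby drops groups whose keys contain NaN, and newer pandas versions accept dropna=False to keep them.

import pandas as pd
import numpy as np

# Invented example data: rows 0 and 2 are identical, row 3 contains a NaN
toy = pd.DataFrame({
    "name":  ["a", "b", "a", "c"],
    "value": [1.0, 2.0, 1.0, np.nan],
})

# Group on every column, count rows per group, and realign the counts into a
# regular DataFrame with reset_index(); the count column is labelled 0
counts = toy.groupby(toy.columns.tolist()).size().reset_index()

print(counts[counts[0] > 1])        # rows that appear more than once
print(len(counts[counts[0] > 1]))   # number of distinct duplicated rows

# Note: the NaN-containing row ("c", NaN) does not appear in counts at all;
# in newer pandas versions, groupby(..., dropna=False) would keep it.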