Find duplicates with groupby in Pandas
Note: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, cite the original URL, and attribute it to the original authors (not me): Stack Overflow
Original source: http://stackoverflow.com/questions/33225631/
Asked by GiannisIordanou
I read a csv file using Pandas. Then, I am checking to see if there are any duplicate rows in the data using the code below:
import pandas as pd
df = pd.read_csv("data.csv", na_values=["", " ", "-"])
print(df.shape)
# (71644, 15)
print(df.drop_duplicates().shape)
# (31171, 15)
I find that there are some duplicate rows, so I want to see which rows appear more than once:
data_groups = df.groupby(df.columns.tolist())
size = data_groups.size()
size[size > 1]
Doing that, I get an empty result: Series([], dtype: int64).
Furthermore, I can find the duplicate rows by doing the following:
duplicates = df[(df.duplicated() == True)]
print(duplicates.shape)
# (40473, 15)
So df.drop_duplicates() and df[(df.duplicated() == True)] show that there are duplicate rows, but groupby doesn't.
My data consist of strings, integers, floats and nan.
Have I misunderstood something about the functions I mention above, or is something else going on?
Answered by Parfait
Simply add reset_index() to realign the aggregates to a new dataframe.
Additionally, the size() function creates an unlabeled column named 0, which you can use to filter for duplicate rows. Then, just take the length of the resulting data frame to output a count of duplicates, like the other functions: drop_duplicates(), duplicated()==True.
data_groups = df.groupby(df.columns.tolist())
size = data_groups.size().reset_index()
size[size[0] > 1] # DATAFRAME OF DUPLICATES
len(size[size[0] > 1]) # NUMBER OF DUPLICATES
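Since the question mentions that the data contain nan, one possible explanation for the empty groupby result (an assumption, not something confirmed in the thread) is that groupby by default drops any group whose key contains NaN, while duplicated() treats NaN values as equal. Below is a minimal sketch on hypothetical toy data. The column names and the "count" label are made up for illustration, and dropna=False assumes pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are identical,
# rows 2 and 3 are identical and contain NaN.
df = pd.DataFrame({
    "name": ["a", "a", "b", "b", "c"],
    "value": [1.0, 1.0, np.nan, np.nan, 3.0],
})

cols = df.columns.tolist()

# Default behaviour: groups whose key contains NaN are dropped.
# reset_index(name="count") labels the size column instead of leaving it as 0.
default_counts = df.groupby(cols).size().reset_index(name="count")

# dropna=False (assumes pandas >= 1.1) keeps NaN keys as their own groups.
full_counts = df.groupby(cols, dropna=False).size().reset_index(name="count")

# Distinct rows that occur more than once, under each setting.
dupes_default = default_counts[default_counts["count"] > 1]
dupes_full = full_counts[full_counts["count"] > 1]

# Rows flagged by duplicated(): every repeat after the first occurrence.
n_flagged = int(df.duplicated().sum())

print(dupes_default)
print(dupes_full)
print(n_flagged)
```

In this sketch the length of the filtered frame counts distinct repeated rows, whereas duplicated() counts every extra occurrence, so the two numbers only match when summing count - 1 over the filtered groups.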