Find duplicates with groupby in Pandas

Note: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute the original authors (not this site). Original question: http://stackoverflow.com/questions/33225631/

python, pandas

Asked by GiannisIordanou

I read a csv file using Pandas. Then, I am checking to see if there are any duplicate rows in the data using the code below:

import pandas as pd

# Read the data, treating empty strings, spaces, and "-" as missing values
df = pd.read_csv("data.csv", na_values=["", " ", "-"])

print(df.shape)
# (71644, 15)

print(df.drop_duplicates().shape)
# (31171, 15)

I find that there are some duplicate rows, so I want to see which rows appear more than once:

# Group on every column so that identical rows fall into the same group
data_groups = df.groupby(df.columns.tolist())
size = data_groups.size()
size[size > 1]

Doing that I get Series([], dtype: int64).

Furthermore, I can find the duplicate rows by doing the following:

duplicates = df[(df.duplicated() == True)]

print(duplicates.shape)
# (40473, 15)
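
As a side note (an editorial aside, not from the original post): df.duplicated() flags only the second and later occurrences of each repeated row, which is consistent with the numbers above (71644 − 40473 = 31171 rows survive drop_duplicates()). A minimal sketch for also pulling in the first occurrence of every duplicated row, assuming the standard keep=False option of duplicated() in your pandas version:

# Mark every occurrence of a duplicated row, not just the second and later copies
all_duplicates = df[df.duplicated(keep=False)]
print(all_duplicates.shape)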

So df.drop_duplicates() and df[(df.duplicated() == True)] show that there are duplicate rows, but groupby doesn't.

My data consists of strings, integers, floats and NaN.

Have I misunderstood something in the functions I mention above, or is something else happening?

Answered by Parfait

Simply add reset_index() to realign the aggregates to a new dataframe.

Additionally, the size() function creates an unmarked 0 column, which you can use to filter for duplicate rows. Then, just find the length of the resultant data frame to output a count of duplicates, like the other functions: drop_duplicates(), duplicated() == True.

data_groups = df.groupby(df.columns.tolist())
size = data_groups.size().reset_index() 
size[size[0] > 1]        # DATAFRAME OF DUPLICATES

len(size[size[0] > 1])   # NUMBER OF DUPLICATES
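
For reference, here is a minimal, self-contained sketch of the same approach on invented data (the toy DataFrame and its column names are made up for illustration, not taken from the original post). One detail worth knowing for data that contains NaN, like the asker's: by default groupby drops groups whose keys contain NaN, and newer pandas versions accept dropna=False to keep them.

import pandas as pd
import numpy as np

# Invented example data: rows 0 and 2 are identical, row 3 contains a NaN
toy = pd.DataFrame({
    "name":  ["a", "b", "a", "c"],
    "value": [1.0, 2.0, 1.0, np.nan],
})

# Group on every column, count rows per group, and realign the counts into a
# regular DataFrame with reset_index(); the count column is labelled 0
counts = toy.groupby(toy.columns.tolist()).size().reset_index()

print(counts[counts[0] > 1])        # rows that appear more than once
print(len(counts[counts[0] > 1]))   # number of distinct duplicated rows

# Note: the NaN-containing row ("c", NaN) does not appear in counts at all;
# in newer pandas versions, groupby(..., dropna=False) would keep it.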