Python: check for duplicate values in a Pandas DataFrame column

Question

Asked by Jeff Mitchell

Is there a way in pandas to check whether a dataframe column has duplicate values, without actually dropping rows? I have a function that removes duplicate rows, but I only want it to run if there are actually duplicates in a specific column.

Currently I compare the number of unique values in the column to the number of rows: if there are fewer unique values than rows, then there are duplicates and the code runs.

if len(df['Student'].unique()) < len(df.index):
    # Code to remove duplicates based on Date column runs
    pass  # placeholder so the snippet parses

Is there an easier or more efficient way to check if duplicate values exist in a specific column, using pandas?

Some of the sample data I am working with (only two columns shown). If duplicates are found then another function identifies which row to keep (row with oldest date):

    Student  Date
0   Joe      December 2017
1   James    January 2018
2   Bob      April 2018
3   Joe      December 2017
4   Hyman    February 2018
5   Hyman    March 2018

Answer 1

Answered by Anton vBR

Main question

Is there a duplicate value in a column, True/False?

╔═════════╦═══════════════╗
║ Student ║ Date          ║
╠═════════╬═══════════════╣
║ Joe     ║ December 2017 ║
╠═════════╬═══════════════╣
║ Bob     ║ April 2018    ║
╠═════════╬═══════════════╣
║ Joe     ║ December 2018 ║
╚═════════╩═══════════════╝

Assuming the above dataframe (df), we can quickly check for duplicates in the Student column with:

boolean = not df["Student"].is_unique      # True (credit to @Carsten)
boolean = df['Student'].duplicated().any() # True
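
As a rough sketch of how this check could guard the de-duplication step from the question: the helper name has_duplicates is hypothetical, the data is the sample from the question, and drop_duplicates merely stands in for the asker's own removal function.

```python
import pandas as pd

# Sample data from the question (two columns only).
df = pd.DataFrame({
    "Student": ["Joe", "James", "Bob", "Joe", "Hyman", "Hyman"],
    "Date": ["December 2017", "January 2018", "April 2018",
             "December 2017", "February 2018", "March 2018"],
})


def has_duplicates(frame, column):
    """Return True if `column` holds any repeated value (hypothetical helper)."""
    return not frame[column].is_unique


if has_duplicates(df, "Student"):
    # Stand-in for the asker's own de-duplication function.
    deduped = df.drop_duplicates(subset=["Student"], keep="first")
else:
    deduped = df

print(has_duplicates(df, "Student"))
print(deduped)
```

The check itself never modifies df; rows are only dropped inside the if branch.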

Further reading and references

Above we are using one of the Pandas Series methods. The pandas DataFrame has several useful methods, two of which are:

drop_duplicates(self[, subset, keep, inplace]) - Return DataFrame with duplicate rows removed, optionally only considering certain columns.
duplicated(self[, subset, keep]) - Return boolean Series denoting duplicate rows, optionally only considering certain columns.

These methods can be applied to the DataFrame as a whole, and not just to a Series (column) as above. The equivalent would be:

boolean = df.duplicated(subset=['Student']).any() # True
# We were expecting True, as Joe can be seen twice.

However, if we are interested in the whole frame we could go ahead and do:

boolean = df.duplicated().any() # False
boolean = df.duplicated(subset=['Student','Date']).any() # False
# We were expecting False here - no duplicates row-wise
# i.e. Joe Dec 2017, Joe Dec 2018

And a final useful tip: by using the keep parameter we can usually skip a few steps and directly access what we need:

keep : {'first', 'last', False}, default 'first'

first : Drop duplicates except for the first occurrence.
last : Drop duplicates except for the last occurrence.
False : Drop all duplicates.
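
To illustrate, here is a small sketch of how keep changes the result on the three-row frame used above. The descriptions listed are phrased for drop_duplicates; for duplicated, the same option decides which occurrences are flagged as True. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "Student": ["Joe", "Bob", "Joe"],
    "Date": ["December 2017", "April 2018", "December 2018"],
})

# duplicated: which rows get flagged depends on keep.
first = df.duplicated(subset=["Student"], keep="first")
last = df.duplicated(subset=["Student"], keep="last")
every = df.duplicated(subset=["Student"], keep=False)

print(first.tolist())  # [False, False, True]
print(last.tolist())   # [True, False, False]
print(every.tolist())  # [True, False, True]

# drop_duplicates: the flagged rows are the ones removed.
kept_last = df.drop_duplicates(subset=["Student"], keep="last")
dropped_all = df.drop_duplicates(subset=["Student"], keep=False)
print(kept_last)
print(dropped_all)
```

With keep=False every member of a duplicated group is flagged, which is handy when you want to inspect all the rows involved rather than just the extras.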

Example to play around with

import pandas as pd
import io

data = '''\
Student,Date
Joe,December 2017
Bob,April 2018
Joe,December 2018'''

df = pd.read_csv(io.StringIO(data), sep=',')

# Approach 1: Simple True/False
boolean = df.duplicated(subset=['Student']).any()
print(boolean, end='\n\n') # True

# Approach 2: First store boolean array, check then remove
duplicate_in_student = df.duplicated(subset=['Student'])
if duplicate_in_student.any():
    print(df.loc[~duplicate_in_student], end='\n\n')

# Approach 3: Use drop_duplicates method
df.drop_duplicates(subset=['Student'], inplace=True)
print(df)

Returns

True

  Student           Date
0     Joe  December 2017
1     Bob     April 2018

  Student           Date
0     Joe  December 2017
1     Bob     April 2018

Answer 2

Answered by Carsten

You can use is_unique:

df['Student'].is_unique  # a DataFrame column is already a Series

# evaluates to True when the column has no duplicates

Answer 3

Answered by Katarzyna

If you want to know how many duplicates there are and what they are, use:

df.pivot_table(index=['ColumnName'], aggfunc='size')

df.pivot_table(index=['ColumnName1',.., 'ColumnNameN'], aggfunc='size')
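
As a minimal sketch, assuming the sample data from the question, the single-column form could be applied like this. Filtering for counts above 1 is an addition for illustration and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "Student": ["Joe", "James", "Bob", "Joe", "Hyman", "Hyman"],
    "Date": ["December 2017", "January 2018", "April 2018",
             "December 2017", "February 2018", "March 2018"],
})

# Number of rows per Student value.
counts = df.pivot_table(index=["Student"], aggfunc="size")
print(counts)

# Keep only the values that occur more than once.
repeated = counts[counts > 1]
print(repeated)
```

Here repeated lists Hyman and Joe with a count of 2 each; an empty result would mean the column has no duplicates.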

Answer 4

Answered by Acumenus

In addition to DataFrame.duplicated and Series.duplicated, Pandas also has DataFrame.any and Series.any.

import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv")

With Python ≥3.8, check for duplicates and access some duplicate rows:

if (duplicated := df.duplicated(keep=False)).any():
    some_duplicates = df[duplicated].sort_values(by=df.columns.to_list()).head()
    print(f"Dataframe has one or more duplicated rows, for example:\n{some_duplicates}")
