Python: check for duplicate values in a Pandas DataFrame column

Disclaimer: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/50242968/

Time: 2020-08-19 19:26:02  Source: igfitidea

Check for duplicate values in Pandas dataframe column

Tags: python, pandas, dataframe, duplicates

Asked by Jeff Mitchell

Is there a way in pandas to check if a dataframe column has duplicate values, without actually dropping rows? I have a function that removes duplicate rows, but I only want it to run if there are actually duplicates in a specific column.

Currently I compare the number of unique values in the column to the number of rows: if there are fewer unique values than rows, then there are duplicates and the code runs.

if len(df['Student'].unique()) < len(df.index):
    # Code to remove duplicates based on Date column runs
    pass  # placeholder so the snippet is valid Python

Is there an easier or more efficient way to check if duplicate values exist in a specific column, using pandas?

Some of the sample data I am working with (only two columns shown). If duplicates are found then another function identifies which row to keep (row with oldest date):

    Student  Date
0   Joe      December 2017
1   James    January 2018
2   Bob      April 2018
3   Joe      December 2017
4   Hyman    February 2018
5   Hyman    March 2018
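
As a rough sketch of the workflow described above (not the asker's actual function), the sample data can be rebuilt and de-duplicated so that the row with the oldest date is kept for each student. The helper name `keep_oldest` and the `%B %Y` date format are assumptions made for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "Student": ["Joe", "James", "Bob", "Joe", "Hyman", "Hyman"],
    "Date": ["December 2017", "January 2018", "April 2018",
             "December 2017", "February 2018", "March 2018"],
})


def keep_oldest(frame):
    """Hypothetical helper: keep the earliest-dated row per student."""
    # Assumption: dates look like "December 2017" (full month name, year).
    parsed = pd.to_datetime(frame["Date"], format="%B %Y")
    ordered = frame.assign(_parsed=parsed).sort_values("_parsed", kind="mergesort")
    deduped = ordered.drop_duplicates(subset="Student", keep="first")
    # Drop the helper column and restore the original row order.
    return deduped.drop(columns="_parsed").sort_index()


# Only run the clean-up when the Student column actually contains duplicates.
if df["Student"].duplicated().any():
    df = keep_oldest(df)

print(df)
```

Sorting by the parsed date first means `keep="first"` retains the oldest row per student; rows with identical dates keep whichever came first in the original frame.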

Answered by Anton vBR

Main question

Is there a duplicate value in a column, True/False?

╔═════════╦═══════════════╗
║ Student ║ Date          ║
╠═════════╬═══════════════╣
║ Joe     ║ December 2017 ║
╠═════════╬═══════════════╣
║ Bob     ║ April 2018    ║
╠═════════╬═══════════════╣
║ Joe     ║ December 2018 ║
╚═════════╩═══════════════╝

Assuming the above dataframe (df), we can quickly check whether the Student column contains duplicates with:

boolean = not df["Student"].is_unique      # True (credit to @Carsten)
boolean = df['Student'].duplicated().any() # True
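
A self-contained version of the two checks above might look like the following. The frame is rebuilt by hand from the table, and the variable names are illustrative rather than part of the original answer.

```python
import pandas as pd

# Rebuild the three-row frame shown in the table above.
df = pd.DataFrame({
    "Student": ["Joe", "Bob", "Joe"],
    "Date": ["December 2017", "April 2018", "December 2018"],
})

# Series.is_unique is True only when no value repeats.
student_has_dupes = not df["Student"].is_unique

# Series.duplicated() flags repeats; .any() collapses the mask to one bool.
student_has_dupes_alt = bool(df["Student"].duplicated().any())

# The Date column has three distinct values, so no duplicates there.
date_has_dupes = not df["Date"].is_unique

print(student_has_dupes, student_has_dupes_alt, date_has_dupes)
```

Either check can guard the de-duplication function from the question, so it only runs when needed.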


Further reading and references

Above we are using one of the Pandas Series methods. The pandas DataFrame has several useful methods, two of which are:

  1. drop_duplicates(self[, subset, keep, inplace]) - Return DataFrame with duplicate rows removed, optionally only considering certain columns.
  2. duplicated(self[, subset, keep]) - Return boolean Series denoting duplicate rows, optionally only considering certain columns.

These methods can be applied to the DataFrame as a whole, not just to a Series (column) as above. The equivalent would be:

boolean = df.duplicated(subset=['Student']).any() # True
# We were expecting True, as Joe can be seen twice.

However, if we are interested in the whole frame we could go ahead and do:

boolean = df.duplicated().any() # False
boolean = df.duplicated(subset=['Student','Date']).any() # False
# We were expecting False here - no duplicates row-wise
# i.e. Joe Dec 2017, Joe Dec 2018
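
To make the difference concrete, here is a small sketch that prints the boolean masks themselves instead of only their `.any()` result. It assumes the same three-row frame as above; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Student": ["Joe", "Bob", "Joe"],
    "Date": ["December 2017", "April 2018", "December 2018"],
})

# Mask considering only the Student column: the second "Joe" is flagged.
by_student = df.duplicated(subset=["Student"])

# Mask considering every column: no two rows are fully identical.
whole_row = df.duplicated()

print(by_student.tolist())
print(whole_row.tolist())

# The mask can also be used to inspect the flagged rows without dropping them.
flagged = df[by_student]
print(flagged)
```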

And a final useful tip: by using the keep parameter we can usually skip a few steps and directly access what we need:

keep : {'first', 'last', False}, default 'first'

  • first : Drop duplicates except for the first occurrence.
  • last : Drop duplicates except for the last occurrence.
  • False : Drop all duplicates.
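
The effect of each keep option can be sketched on the Student column as follows. For `duplicated`, the same option decides which occurrences are marked True rather than which are dropped. The frame and variable names are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "Student": ["Joe", "Bob", "Joe"],
    "Date": ["December 2017", "April 2018", "December 2018"],
})

# keep='first': every repeat after the first occurrence is marked.
mark_first = df["Student"].duplicated(keep="first").tolist()

# keep='last': every repeat before the last occurrence is marked.
mark_last = df["Student"].duplicated(keep="last").tolist()

# keep=False: all members of a duplicated group are marked.
mark_all = df["Student"].duplicated(keep=False).tolist()

print(mark_first, mark_last, mark_all)

# With drop_duplicates, keep=False removes every row of a duplicated group.
only_unique_students = df.drop_duplicates(subset=["Student"], keep=False)
print(only_unique_students)
```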


Example to play around with

import pandas as pd
import io

data = '''\
Student,Date
Joe,December 2017
Bob,April 2018
Joe,December 2018'''

df = pd.read_csv(io.StringIO(data), sep=',')

# Approach 1: Simple True/False
boolean = df.duplicated(subset=['Student']).any()
print(boolean, end='\n\n') # True

# Approach 2: First store boolean array, check then remove
duplicate_in_student = df.duplicated(subset=['Student'])
if duplicate_in_student.any():
    print(df.loc[~duplicate_in_student], end='\n\n')

# Approach 3: Use drop_duplicates method
df.drop_duplicates(subset=['Student'], inplace=True)
print(df)

Returns

True

  Student           Date
0     Joe  December 2017
1     Bob     April 2018

  Student           Date
0     Joe  December 2017
1     Bob     April 2018

Answered by Carsten

You can use is_unique:

pd.Series(df['Student']).is_unique

# True when the column has no duplicates

Answered by Katarzyna

If you want to know how many duplicates there are and what they are, use:

df.pivot_table(index=['ColumnName'], aggfunc='size')

df.pivot_table(index=['ColumnName1',.., 'ColumnNameN'], aggfunc='size')
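
As a hedged illustration of the same idea, the sketch below counts occurrences per value and keeps only the values that appear more than once. Note that it uses `groupby(...).size()` rather than `pivot_table`, which should give comparable per-value counts for this purpose; the frame and variable names are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Student": ["Joe", "Bob", "Joe"],
    "Date": ["December 2017", "April 2018", "December 2018"],
})

# Number of rows per distinct Student value.
counts = df.groupby("Student").size()

# Keep only the values that occur more than once.
repeated = counts[counts > 1]

print(counts)
print(repeated)
```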

Answered by Acumenus

In addition to DataFrame.duplicated and Series.duplicated, Pandas also has DataFrame.any and Series.any.

import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv")

With Python ≥3.8, check for duplicates and access some duplicate rows:

if (duplicated := df.duplicated(keep=False)).any():
    some_duplicates = df[duplicated].sort_values(by=df.columns.to_list()).head()
    print(f"Dataframe has one or more duplicated rows, for example:\n{some_duplicates}")