Python: How do I remove rows with duplicate column values in a pandas data frame?

Notice: this page is a Chinese-English side-by-side translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow original URL: http://stackoverflow.com/questions/50885093/

Time: 2020-08-19 19:37:45  Source: igfitidea

How do I remove rows with duplicate column values in a pandas data frame?

python pandas

Asked by Sayonti

I have a pandas data frame that looks like this:

'Column1'  'Column2'  'Column3'
'cat'      'bat'      'xyz'
'toy'      'flower'   'abc'
'cat'      'bat'      'lmn'

I want to detect that 'cat' and 'bat' are the same values repeated in Column1 and Column2, remove the repeated record, and keep only the first one. The resulting data frame should contain only:

'Column1'  'Column2'  'Column3'
'cat'      'bat'      'xyz'
'toy'      'flower'   'abc'

Answered by student

Use drop_duplicates with subset set to the list of columns to check for duplicates, and keep='first' to keep the first row of each group of duplicates.

If the dataframe is:

import pandas as pd

# Sample data; the values deliberately include the literal quote characters
df = pd.DataFrame({'Column1': ["'cat'", "'toy'", "'cat'"],
                   'Column2': ["'bat'", "'flower'", "'bat'"],
                   'Column3': ["'xyz'", "'abc'", "'lmn'"]})
print(df)

Result:

  Column1   Column2 Column3
0   'cat'     'bat'   'xyz'
1   'toy'  'flower'   'abc'
2   'cat'     'bat'   'lmn'

Then:

result_df = df.drop_duplicates(subset=['Column1', 'Column2'], keep='first')
print(result_df)

Result:

  Column1   Column2 Column3
0   'cat'     'bat'   'xyz'
1   'toy'  'flower'   'abc'
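
Since the question also asks how to identify the repeated rows, here is a minimal sketch using DataFrame.duplicated, which returns a boolean mask with the same subset and keep arguments. The variable names (is_repeat, dropped_rows, kept_rows) are illustrative and not from the original answer, and the sample values are written without the embedded quote characters.

```python
import pandas as pd

df = pd.DataFrame({
    "Column1": ["cat", "toy", "cat"],
    "Column2": ["bat", "flower", "bat"],
    "Column3": ["xyz", "abc", "lmn"],
})

# True for each row whose (Column1, Column2) pair already appeared in an earlier row
is_repeat = df.duplicated(subset=["Column1", "Column2"], keep="first")
print(is_repeat.tolist())  # [False, False, True]

# Rows that drop_duplicates would discard
dropped_rows = df[is_repeat]

# Inverting the mask keeps the same rows as drop_duplicates with the same arguments
kept_rows = df[~is_repeat]
print(kept_rows)
```

Inspecting the mask first can be useful when you want to review what will be removed before actually dropping anything.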

Answered by zafrin

import pandas as pd

df = pd.DataFrame({"Column1":["cat", "dog", "cat"],
                    "Column2":[1,1,1],
                    "Column3":["C","A","B"]})

df = df.drop_duplicates(subset=['Column1'], keep='first')
print(df)

Answered by Jay Dangar

In the drop_duplicates() method of DataFrame, you can provide a list of column names to eliminate duplicate records from your data.

The following tested code does this:

import pandas as pd

df = pd.DataFrame()
df.insert(loc=0,column='Column1',value=['cat',     'toy',    'cat'])
df.insert(loc=1,column='Column2',value=['bat',    'flower',  'bat'])
df.insert(loc=2,column='Column3',value=['xyz',     'abc',    'lmn'])

df = df.drop_duplicates(subset=['Column1','Column2'],keep='first')
print(df)

You can add other column names to the subset parameter as well; by default, all columns of your data are considered. The keep parameter accepts one of the following values:

  • first : Drop duplicates except for the first occurrence.
  • last : Drop duplicates except for the last occurrence.
  • False : Drop all duplicates.
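
To make the keep options concrete, here is a small sketch that applies each of them to the question's data, followed by a call with no subset to show the all-columns default. The variable names (cols, keep_first, keep_last, keep_none, all_cols) are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "Column1": ["cat", "toy", "cat"],
    "Column2": ["bat", "flower", "bat"],
    "Column3": ["xyz", "abc", "lmn"],
})
cols = ["Column1", "Column2"]

# keep='first': the earliest row of each duplicate group survives
keep_first = df.drop_duplicates(subset=cols, keep="first")
print(keep_first["Column3"].tolist())  # ['xyz', 'abc']

# keep='last': the latest row of each duplicate group survives
keep_last = df.drop_duplicates(subset=cols, keep="last")
print(keep_last["Column3"].tolist())  # ['abc', 'lmn']

# keep=False: every row that belongs to a duplicate group is removed
keep_none = df.drop_duplicates(subset=cols, keep=False)
print(keep_none["Column3"].tolist())  # ['abc']

# No subset: whole rows are compared, and no full row repeats here,
# so nothing is dropped
all_cols = df.drop_duplicates()
print(len(all_cols))  # 3
```

Note that the surviving rows keep their original index labels and order; only the choice of which duplicate to retain changes.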