Python: How to find duplicate names using pandas?

Question

Asked by Yariv

I have a pandas.DataFrame with a column called name containing strings. I would like to get a list of the names which occur more than once in the column. How do I do that?

I tried:

funcs_groups = funcs.groupby(funcs.name)
funcs_groups[(funcs_groups.count().name>1)]

But it doesn't filter out the singleton names.

Answer 1

Accepted answer by waitingkuo

If you want to find the rows with a duplicated name (excluding the first occurrence of each name), you can try this:

In [16]: import pandas as pd
In [17]: p1 = {'name': 'willy', 'age': 10}
In [18]: p2 = {'name': 'willy', 'age': 11}
In [19]: p3 = {'name': 'zoe', 'age': 10}
In [20]: df = pd.DataFrame([p1, p2, p3])

In [21]: df
Out[21]: 
   age   name
0   10  willy
1   11  willy
2   10    zoe

In [22]: df.duplicated('name')
Out[22]: 
0    False
1     True
2    False
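
To turn the boolean mask above into the list of names the question asks for, one option is to index the name column with it. This is a minimal sketch reusing the example frame from this answer; the variable names mask and dup_names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame([
    {'name': 'willy', 'age': 10},
    {'name': 'willy', 'age': 11},
    {'name': 'zoe', 'age': 10},
])

# duplicated('name') marks every repeat after the first occurrence
mask = df.duplicated('name')

# select the names on those rows; unique() collapses names repeated 3+ times
dup_names = df.loc[mask, 'name'].unique().tolist()
print(dup_names)  # ['willy']
```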

Answer 2

Answer by mkln

I had a similar problem and came across this answer.

I guess this also works:

counts = df.groupby('name').size()
# name the column 'size'; index it as df2['size'], since df2.size is the element count
df2 = counts.to_frame('size')
df2 = df2[df2['size'] > 1]

and df2.index will give you a list of the names with duplicates.
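
The intermediate DataFrame is optional: the Series returned by size() can be filtered directly. A sketch, assuming the same three-row example data as in the accepted answer:

```python
import pandas as pd

df = pd.DataFrame({'name': ['willy', 'willy', 'zoe'], 'age': [10, 11, 10]})

# size() returns a Series indexed by name, holding the row count per group
counts = df.groupby('name').size()

# keep groups with more than one row; the index holds the duplicated names
dup_names = counts[counts > 1].index.tolist()
print(dup_names)  # ['willy']
```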

Answer 3

Answer by idoda

A one-liner can be:

x.set_index('name').index.get_duplicates()

The index has a method for finding duplicates; columns do not seem to have a similar method.
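
Index.get_duplicates() is not available in recent pandas releases (as far as I know it was deprecated in 0.23 and removed in 1.0). Here is a sketch of an equivalent built on Index.duplicated(), the replacement suggested by the deprecation warning; the frame x is a made-up example.

```python
import pandas as pd

x = pd.DataFrame({'name': ['willy', 'willy', 'zoe'], 'age': [10, 11, 10]})

idx = x.set_index('name').index

# duplicated() flags repeats after the first occurrence; unique() dedupes them
dup_names = idx[idx.duplicated()].unique().tolist()
print(dup_names)  # ['willy']
```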

Answer 4

Answer by G Gopi Krishna

Another one-liner can be:

(df.name).drop_duplicates()
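
Note that drop_duplicates() keeps the first occurrence of every name, so its result lists each distinct name once, singletons included. A small sketch of the difference, on an assumed example frame:

```python
import pandas as pd

df = pd.DataFrame({'name': ['willy', 'willy', 'zoe'], 'age': [10, 11, 10]})

# drop_duplicates() removes repeats: every distinct name survives once
distinct = df['name'].drop_duplicates().tolist()
print(distinct)  # ['willy', 'zoe']

# duplicated() on the same Series picks out only the repeated names
repeated = df['name'][df['name'].duplicated()].unique().tolist()
print(repeated)  # ['willy']
```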

Answer 5

Answer by Doctor J

value_counts will give you the number of duplicates as well.

names = df.name.value_counts()
names[names > 1]

Answer 6

Answer by noddy

Most of the responses given demonstrate how to remove the duplicates, not find them.

The following will select each row in the data frame with a duplicate 'name' field. Note that this will find each instance, not just duplicates after the first occurrence. The keep argument accepts additional values that can exclude either the first or last occurrence.

df[df.duplicated(['name'], keep=False)]

The pandas reference for duplicated() can be found here.
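
A sketch of how the three keep values differ, using a made-up frame in which one name appears three times:

```python
import pandas as pd

df = pd.DataFrame({'name': ['willy', 'zoe', 'willy', 'willy']})

# keep='first' (the default): the first occurrence is not flagged
first = df.duplicated(['name'], keep='first').tolist()
# keep='last': the last occurrence is not flagged
last = df.duplicated(['name'], keep='last').tolist()
# keep=False: every row whose name occurs more than once is flagged
every = df.duplicated(['name'], keep=False).tolist()

print(first)  # [False, False, True, True]
print(last)   # [True, False, True, False]
print(every)  # [True, False, True, True]
```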
