Python: How to find duplicate names using pandas?
Notice: this page is a translation of a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/15247628/
How to find duplicate names using pandas?
Asked by Yariv
I have a pandas.DataFrame with a column called name containing strings.
I would like to get a list of the names which occur more than once in the column. How do I do that?
I tried:
funcs_groups = funcs.groupby(funcs.name)
funcs_groups[(funcs_groups.count().name>1)]
But it doesn't filter out the singleton names.
Accepted answer by waitingkuo
If you want to find the rows with a duplicated name (excluding the first occurrence of each name), you can try this:
In [16]: import pandas as pd
In [17]: p1 = {'name': 'willy', 'age': 10}
In [18]: p2 = {'name': 'willy', 'age': 11}
In [19]: p3 = {'name': 'zoe', 'age': 10}
In [20]: df = pd.DataFrame([p1, p2, p3])
In [21]: df
Out[21]:
   age   name
0   10  willy
1   11  willy
2   10    zoe
In [22]: df.duplicated('name')
Out[22]:
0    False
1     True
2    False
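The boolean mask above can be turned into the list of names the question asks for. This is a minimal sketch that reuses the three-row example frame; the variable names `mask` and `dup_names` are my own and not part of the original answer.

```python
import pandas as pd

# Same sample data as in the answer above.
df = pd.DataFrame([
    {'name': 'willy', 'age': 10},
    {'name': 'willy', 'age': 11},
    {'name': 'zoe', 'age': 10},
])

# duplicated('name') marks every repeat of a name after its first occurrence.
mask = df.duplicated('name')

# Select the name column for the marked rows and collapse repeats,
# so a name that occurs three times is still listed only once.
dup_names = df.loc[mask, 'name'].unique().tolist()
print(dup_names)
```

Calling `unique()` matters once a name occurs three or more times, because the mask then selects that name more than once.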
Answer by mkln
I had a similar problem and came across this answer.
I guess this also works:
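counts = df.groupby('name').size()  # number of rows per name
df2 = counts.to_frame('size')  # one-column DataFrame indexed by name
df2 = df2[df2['size'] > 1]  # df2.size is the DataFrame's element count, so select the column by label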
and df2.index will give you the names that have duplicates.
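The group-size idea can also be applied to the Series that `size()` returns, without building an intermediate DataFrame. This is a sketch under the assumption of a small sample frame; the data and the name `repeated` are illustrative only.

```python
import pandas as pd

# Hypothetical sample data, mirroring the example used earlier in the thread.
df = pd.DataFrame({'name': ['willy', 'willy', 'zoe'], 'age': [10, 11, 10]})

# size() returns a Series indexed by name, holding the row count per group.
counts = df.groupby('name').size()

# Boolean indexing on the Series keeps only names that occur more than once.
repeated = counts[counts > 1]
print(repeated.index.tolist())
```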
Answer by idoda
A one-liner can be:
x.set_index('name').index.get_duplicates()
The index has a method for finding duplicates; columns do not seem to have a similar method.
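As far as I know, `Index.get_duplicates()` was deprecated in pandas 0.23 and removed in pandas 1.0, so the one-liner above is likely to fail on current releases. Below is a sketch of an equivalent built on `Index.duplicated()`; the sample data and the names `idx` and `dups` are assumptions of mine.

```python
import pandas as pd

# Hypothetical sample data standing in for the frame `x` in the answer.
x = pd.DataFrame({'name': ['willy', 'willy', 'zoe'], 'age': [10, 11, 10]})

idx = x.set_index('name').index

# idx.duplicated() flags repeats after the first occurrence;
# unique() collapses names that repeat more than once.
dups = idx[idx.duplicated()].unique()
print(dups.tolist())
```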
Answer by G Gopi Krishna
Another one-liner can be:
(df.name).drop_duplicates()
Answer by Doctor J
value_counts will give you the number of duplicates as well.
names = df.name.value_counts()
names[names > 1]
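To make the result concrete, here is a small sketch with made-up data showing that the filtered Series maps each repeated name to its total number of occurrences. The sample names and the variable `repeated` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical data: 'zoe' appears three times, 'willy' twice, 'amy' once.
df = pd.DataFrame({'name': ['willy', 'willy', 'zoe', 'zoe', 'zoe', 'amy']})

# value_counts() returns a Series of occurrence counts indexed by name.
names = df['name'].value_counts()
repeated = names[names > 1]

# to_dict() maps each repeated name to how many times it occurs.
print(repeated.to_dict())
```

Bracket access (`df['name']`) is used here instead of `df.name` because it keeps working if the column name ever collides with a DataFrame attribute.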
Answer by noddy
Most of the responses given demonstrate how to remove the duplicates, not find them.
The following will select each row in the data frame with a duplicate 'name' field. Note that this will find each instance, not just duplicates after the first occurrence. The keep argument accepts other values that exclude either the first or the last occurrence.
df[df.duplicated(['name'], keep=False)]
See the pandas reference for duplicated() for details.
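The three settings of the keep argument can be compared side by side. This is a minimal sketch on assumed sample data; the variable names are mine.

```python
import pandas as pd

# Hypothetical sample data: 'willy' occurs at rows 0 and 1, 'zoe' once.
df = pd.DataFrame({'name': ['willy', 'willy', 'zoe'], 'age': [10, 11, 10]})

# keep='first' (the default): the first occurrence is not flagged.
after_first = df[df.duplicated(['name'], keep='first')]

# keep='last': the last occurrence is not flagged.
before_last = df[df.duplicated(['name'], keep='last')]

# keep=False: every row whose name occurs more than once is flagged.
all_rows = df[df.duplicated(['name'], keep=False)]

print(after_first.index.tolist())
print(before_last.index.tolist())
print(all_rows.index.tolist())
```

With keep=False, taking `all_rows['name'].unique()` yields the list of repeated names the question asks for.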

