Find all duplicate rows in a pandas dataframe
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/42903945/
Find all duplicate rows in a pandas dataframe
Asked by Nico
I would like to be able to get the indices of all the instances of a duplicated row in a dataset without knowing the name and number of columns beforehand. So assume I have this:
col
1 | 1
2 | 2
3 | 1
4 | 1
5 | 2
I'd like to be able to get [1, 3, 4] and [2, 5]. Is there any way to achieve this? It sounds really simple, but since I don't know the columns beforehand I can't do something like df[col == x...].
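For reference, a minimal sketch that rebuilds the example frame above (assuming the single column is named col and the index runs from 1 to 5, as shown in the question):
import pandas as pd

df = pd.DataFrame({'col': [1, 2, 1, 1, 2]}, index=[1, 2, 3, 4, 5])
print (df)
   col
1    1
2    2
3    1
4    1
5    2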
Answered by jezrael
First filter all the duplicated rows with duplicated, then use groupby with apply, or convert the index to_series:
# keep only rows whose value in 'col' occurs more than once
df = df[df.col.duplicated(keep=False)]
a = df.groupby('col').apply(lambda x: list(x.index))
print (a)
col
1    [1, 3, 4]
2       [2, 5]
dtype: object
# alternatively, group the index itself by the column values
a = df.index.to_series().groupby(df.col).apply(list)
print (a)
col
1    [1, 3, 4]
2       [2, 5]
dtype: object
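As a side note that is not part of the original answer, the same grouping can also be read off GroupBy.groups, which returns a dict-like mapping of each group key to its row labels:
# not from the original answer: .groups maps each group key to its index labels
print (df.groupby('col').groups)
# roughly: {1: [1, 3, 4], 2: [2, 5]} (exact repr depends on the pandas version)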
And if you need nested lists:
# .tolist() turns the resulting Series of lists into one nested list
L = df.groupby('col').apply(lambda x: list(x.index)).tolist()
print (L)
[[1, 3, 4], [2, 5]]
If you need to use only the first column, you can select it by position with iloc:
# select the first column by position, so its name never has to be spelled out
a = (df[df.iloc[:, 0].duplicated(keep=False)]
       .groupby(df.iloc[:, 0]).apply(lambda x: list(x.index)))
print (a)
col
1    [1, 3, 4]
2       [2, 5]
dtype: object
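Finally, since the question asks for something that works without knowing the column names or their number at all, here is a hedged sketch (not from the original answer) that keys on every column, so whole rows act as the duplication key; the names dupes and a are illustrative only:
# duplicated() with no subset compares entire rows, and grouping by a list of
# every column avoids hard-coding any name
dupes = df[df.duplicated(keep=False)]
a = dupes.index.to_series().groupby([dupes[c] for c in dupes.columns]).apply(list)
print (a)
# for the example above this yields the same [1, 3, 4] and [2, 5] lists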