Find indices of duplicate rows in pandas DataFrame

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/46629518/

Date: 2020-09-14 04:36:07  Source: igfitidea

Find indices of duplicate rows in pandas DataFrame

python, pandas, dataframe

Asked by Genius

What is the pandas way of finding the indices of identical rows within a given DataFrame without iterating over individual rows?


While it is possible to find all duplicated rows with `unique = df[df.duplicated()]`, then iterate over those entries with `unique.iterrows()` and extract the indices of equal entries with the help of `pd.where()`, what is the pandas way of doing it?


Example: given a DataFrame of the following structure:


  | param_a | param_b | param_c
1 | 0       | 0       | 0
2 | 0       | 2       | 1
3 | 2       | 1       | 1
4 | 0       | 2       | 1
5 | 2       | 1       | 1
6 | 0       | 0       | 0

Output:


[(1, 6), (2, 4), (3, 5)]
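For reference, the brute-force approach the question wants to avoid could be sketched like this (my own sketch, not from the post), iterating over the rows and bucketing index labels by row contents:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({'param_a': [0, 0, 2, 0, 2, 0],
                   'param_b': [0, 2, 1, 2, 1, 0],
                   'param_c': [0, 1, 1, 1, 1, 0]},
                  index=[1, 2, 3, 4, 5, 6])

# Naive baseline: bucket index labels by the tuple of row values,
# then keep only buckets that hold more than one label.
groups = {}
for label, row in df.iterrows():
    groups.setdefault(tuple(row), []).append(label)
result = [tuple(v) for v in groups.values() if len(v) > 1]
print(result)  # [(1, 6), (2, 4), (3, 5)]
```

This is linear in the number of rows but pays the per-row Python overhead of `iterrows()`, which the answers below avoid.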

Answered by jezrael

Use `duplicated` with `keep=False` to select all duplicated rows, then `groupby` by all columns and convert the index values of each group to a tuple; finally, convert the output `Series` to a `list`:


df = df[df.duplicated(keep=False)]

df = df.groupby(list(df)).apply(lambda x: tuple(x.index)).tolist()
print (df)
[(1, 6), (2, 4), (3, 5)]

If you also want to see the duplicated values:

df1 = (df.groupby(df.columns.tolist())
       .apply(lambda x: tuple(x.index))
       .reset_index(name='idx'))
print (df1)
   param_a  param_b  param_c     idx
0        0        0        0  (1, 6)
1        0        2        1  (2, 4)
2        2        1        1  (3, 5)
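A closely related variation (my own addition, not part of jezrael's answer) skips the `apply` and reads the index labels straight off the groupby object's `groups` mapping:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({'param_a': [0, 0, 2, 0, 2, 0],
                   'param_b': [0, 2, 1, 2, 1, 0],
                   'param_c': [0, 1, 1, 1, 1, 0]},
                  index=[1, 2, 3, 4, 5, 6])

# Keep only rows that appear more than once, then map each distinct
# row (the groupby key) to the tuple of its index labels.
dupes = df[df.duplicated(keep=False)]
result = [tuple(idx) for idx in dupes.groupby(list(dupes)).groups.values()]
print(result)  # [(1, 6), (2, 4), (3, 5)]
```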

Answered by Divakar

Approach #1


Here's one vectorized approach inspired by this post -

import numpy as np

def group_duplicate_index(df):
    a = df.values
    sidx = np.lexsort(a.T)
    b = a[sidx]

    m = np.concatenate(([False], (b[1:] == b[:-1]).all(1), [False] ))
    idx = np.flatnonzero(m[1:] != m[:-1])
    I = df.index[sidx].tolist()
    return [I[i:j] for i,j in zip(idx[::2],idx[1::2]+1)]
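To see why pairing `idx[::2]` with `idx[1::2]+1` recovers the duplicate runs (my own annotation, not part of the answer), it helps to print the intermediates on the sample data:

```python
import numpy as np

# Sample data from the question (index labels 1..6 become positions 0..5 here).
a = np.array([[0, 0, 0],
              [0, 2, 1],
              [2, 1, 1],
              [0, 2, 1],
              [2, 1, 1],
              [0, 0, 0]])

sidx = np.lexsort(a.T)   # order that sorts the rows lexicographically
b = a[sidx]              # equal rows are now adjacent

# m[i] is True where sorted row i equals its predecessor; padding with
# False on both ends turns every duplicate run into a False->True ...
# True->False transition, so flank positions come in start/end pairs.
m = np.concatenate(([False], (b[1:] == b[:-1]).all(1), [False]))
idx = np.flatnonzero(m[1:] != m[:-1])

print(sidx.tolist())  # [0, 5, 2, 4, 1, 3]
print(idx.tolist())   # [0, 1, 2, 3, 4, 5] -> runs bounded by (0,1), (2,3), (4,5)
```

Slicing the sorted index labels with those start/end pairs yields exactly the groups `[[1, 6], [3, 5], [2, 4]]` shown in the sample run.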

Sample run -


In [42]: df
Out[42]: 
   param_a  param_b  param_c
1        0        0        0
2        0        2        1
3        2        1        1
4        0        2        1
5        2        1        1
6        0        0        0

In [43]: group_duplicate_index(df)
Out[43]: [[1, 6], [3, 5], [2, 4]]

Approach #2


For integer-valued dataframes, we could reduce each row to a single scalar; that lets us work with a 1D array, giving us a more performant version, like so -

def group_duplicate_index_v2(df):
    a = df.values
    s = (a.max()+1)**np.arange(df.shape[1])
    sidx = a.dot(s).argsort()
    b = a[sidx]

    m = np.concatenate(([False], (b[1:] == b[:-1]).all(1), [False] ))
    idx = np.flatnonzero(m[1:] != m[:-1])
    I = df.index[sidx].tolist() 
    return [I[i:j] for i,j in zip(idx[::2],idx[1::2]+1)]
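The encoding step can be illustrated in isolation (my own illustration, not from the answer): `s` holds powers of `a.max()+1`, so `a.dot(s)` reads each row as the digits of a base-`(a.max()+1)` number. Assuming non-negative integers, identical rows collapse to identical scalars, and a single 1D argsort replaces the lexsort:

```python
import numpy as np

a = np.array([[0, 0, 0],
              [0, 2, 1],
              [2, 1, 1],
              [0, 2, 1],
              [2, 1, 1],
              [0, 0, 0]])

# Base-(max+1) positional encoding: with a.max() == 2 the weights are
# [1, 3, 9], so row (0, 2, 1) maps to 0*1 + 2*3 + 1*9 = 15.
s = (a.max() + 1) ** np.arange(a.shape[1])
codes = a.dot(s)
print(codes.tolist())  # [0, 15, 14, 15, 14, 0]
```

Note the non-negativity assumption: with negative entries, distinct rows could encode to the same scalar and be grouped incorrectly.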


Runtime test


Other approach(es) -


def groupby_app(df): # @jezrael's soln
    df = df[df.duplicated(keep=False)]
    df = df.groupby(df.columns.tolist()).apply(lambda x: tuple(x.index)).tolist()
    return df

Timings -


In [274]: df = pd.DataFrame(np.random.randint(0,10,(100000,3)))

In [275]: %timeit group_duplicate_index(df)
10 loops, best of 3: 36.1 ms per loop

In [276]: %timeit group_duplicate_index_v2(df)
100 loops, best of 3: 15 ms per loop

In [277]: %timeit groupby_app(df) # @jezrael's soln
10 loops, best of 3: 25.9 ms per loop