pandas 熊猫：删除连续的重复项

Question

提问by Thomas Johnson

What's the most efficient way to drop only consecutive duplicates in pandas?

在Pandas中只删除连续重复项的最有效方法是什么？

drop_duplicates gives this:

drop_duplicates 给出了这个：

In [3]: a = pandas.Series([1,2,2,3,2], index=[1,2,3,4,5])

In [4]: a.drop_duplicates()
Out[4]: 
1    1
2    2
4    3
dtype: int64

But I want this:

但我想要这个：

In [4]: a.something()
Out[4]: 
1    1
2    2
4    3
5    2
dtype: int64

Answer 1

回答by EdChum

Use shift:

使用shift：

a.loc[a.shift(-1) != a]

Out[3]:

1    1
3    2
4    3
5    2
dtype: int64

So the above uses boolean critieria, we compare the dataframe against the dataframe shifted by -1 rows to create the mask

所以上面使用布尔标准，我们将数据帧与移动了 -1 行的数据帧进行比较以创建掩码

Another method is to use diff:

另一种方法是使用diff：

In [82]:

a.loc[a.diff() != 0]
Out[82]:
1    1
2    2
4    3
5    2
dtype: int64

But this is slower than the original method if you have a large number of rows.

但是如果您有大量行，这比原始方法慢。

Update

更新

Thanks to Bjarke Ebert for pointing out a subtle error, I should actually use shift(1)or just shift()as the default is a period of 1, this returns the first consecutive value:

感谢 Bjarke Ebert 指出了一个微妙的错误，我应该实际使用shift(1)或者就像shift()默认值为 1 一样，这将返回第一个连续值：

In [87]:

a.loc[a.shift() != a]
Out[87]:
1    1
2    2
4    3
5    2
dtype: int64

Note the difference in index values, thanks @BjarkeEbert!

请注意索引值的差异，谢谢@BjarkeEbert！

Answer 2

回答by johnml1135

Here is an update that will make it work with multiple columns. Use ".any(axis=1)" to combine the results from each column:

这是一个更新，可以使其适用于多列。使用“.any(axis=1)”组合每一列的结果：

cols = ["col1","col2","col3"]
de_dup = a[cols].loc[(a[cols].shift() != a[cols]).any(axis=1)]

Answer 3

回答by Divakar

Since we are going for most efficient way, i.e. performance, let's use array data to leverage NumPy. We will slice one-off slices and compare, similar to shifting method discussed earlier in @EdChum's post. But with NumPy slicing we would end up with one-less array, so we need to concatenate with a Trueelement at the start to select the first element and hence we would have an implementation like so -

由于我们追求的是most efficient way性能，因此让我们使用数组数据来利用 NumPy。我们将切片一次性切片并进行比较，类似于前面讨论的移位方法@EdChum's post。但是使用 NumPy 切片我们最终会得到一个无一个数组，所以我们需要True在开始时连接一个元素来选择第一个元素，因此我们会有一个像这样的实现 -

def drop_consecutive_duplicates(a):
    ar = a.values
    return a[np.concatenate(([True],ar[:-1]!= ar[1:]))]

Sample run -

样品运行 -

In [149]: a
Out[149]: 
1    1
2    2
3    2
4    3
5    2
dtype: int64

In [150]: drop_consecutive_duplicates(a)
Out[150]: 
1    1
2    2
4    3
5    2
dtype: int64

Timings on large arrays comparing @EdChum's solution-

大型阵列的时序比较@EdChum's solution-

In [142]: a = pd.Series(np.random.randint(1,5,(1000000)))

In [143]: %timeit a.loc[a.shift() != a]
100 loops, best of 3: 12.1 ms per loop

In [144]: %timeit drop_consecutive_duplicates(a)
100 loops, best of 3: 11 ms per loop

In [145]: a = pd.Series(np.random.randint(1,5,(10000000)))

In [146]: %timeit a.loc[a.shift() != a]
10 loops, best of 3: 136 ms per loop

In [147]: %timeit drop_consecutive_duplicates(a)
10 loops, best of 3: 114 ms per loop

So, there's some improvement!

所以，有一些改进！

Get major boost for values only!

只为价值获得重大提升！

If only the values are needed, we could get major boost by simply indexing into the array data, like so -

如果只需要值，我们可以通过简单地索引到数组数据来获得重大提升，就像这样 -

def drop_consecutive_duplicates(a):
    ar = a.values
    return ar[np.concatenate(([True],ar[:-1]!= ar[1:]))]

Sample run -

样品运行 -

In [170]: a = pandas.Series([1,2,2,3,2], index=[1,2,3,4,5])

In [171]: drop_consecutive_duplicates(a)
Out[171]: array([1, 2, 3, 2])

Timings -

时间 -

In [173]: a = pd.Series(np.random.randint(1,5,(10000000)))

In [174]: %timeit a.loc[a.shift() != a]
10 loops, best of 3: 137 ms per loop

In [175]: %timeit drop_consecutive_duplicates(a)
10 loops, best of 3: 61.3 ms per loop

Answer 4

回答by Arthur D. Howland

For other Stack explorers, building off johnml1135's answer above. This will remove the next duplicate from multiple columns but not drop all of the columns. When the dataframe is sorted it will keep the first row but drop the second row if the "cols" match, even if there are more columns with non-matching information.

对于其他堆栈资源管理器，构建上面 johnml1135 的答案。这将从多个列中删除下一个重复项，但不会删除所有列。对数据框进行排序时，它会保留第一行，但如果“cols”匹配，则删除第二行，即使有更多具有不匹配信息的列。

cols = ["col1","col2","col3"]
df = df.loc[(df[cols].shift() != df[cols]).any(axis=1)]

Answer 5

回答by Antoine Collet

Here is a function that handles both pd.Seriesand pd.Dataframes. You can mask/drop, choose the axis and finaly choose to drop with 'any' or 'all' 'NaN'. It is not optimized in term of computation time, but it has the advantage to be robust and pretty clear.

这是一个处理pd.Series和的函数pd.Dataframes。您可以屏蔽/删除，选择轴并最终选择删除“任何”或“全部”“NaN”。它没有在计算时间方面进行优化，但它具有鲁棒性和非常清晰的优点。

import numpy as np
import pandas as pd

# To mask/drop successive values in pandas
def Mask_Or_Drop_Successive_Identical_Values(df, drop = False, 
                                             keep_first = True,
                                             axis = 0, how = 'all'):

    '''
    #Function built with the help of:
    # 1) https://stackoverflow.com/questions/48428173/how-to-change-consecutive-repeating-values-in-pandas-dataframe-series-to-nan-or
    # 2) https://stackoverflow.com/questions/19463985/pandas-drop-consecutive-duplicates

    Input:
    df should be a pandas.DataFrame of a a pandas.Series
    Output:
    df of ts with masked or droped values
    '''

    # Mask keeping the first occurence
    if keep_first == True:
        df = df.mask(df.shift(1) == df)
    # Mask including the first occurence
    elif keep_first == False:
        df = df.mask((df.shift(1) == df) | (df.shift(-1) == df))

    # Only mask the values (e.g. become 'NaN')
    if drop == False:
        return df

    # Drop the values (e.g. rows are deleted)
    else:
        return df.dropna(axis = axis, how = how)

Here is a test code to include in the script:

这是要包含在脚本中的测试代码：


if __name__ == "__main__":

    # With time series
    print("With time series:\n")
    ts = pd.Series([1,1,2,2,3,2,6,6,float('nan'), 6,6,float('nan'),float('nan')], 
                    index=[0,1,2,3,4,5,6,7,8,9,10,11,12])

    print("#Original ts:")    
    print(ts)

    print("\n## 1) Mask keeping the first occurence:")    
    print(Mask_Or_Drop_Successive_Identical_Values(ts, drop = False, 
                                                   keep_first = True))

    print("\n## 2) Mask including the first occurence:")    
    print(Mask_Or_Drop_Successive_Identical_Values(ts, drop = False, 
                                                   keep_first = False))

    print("\n## 3) Drop keeping the first occurence:")    
    print(Mask_Or_Drop_Successive_Identical_Values(ts, drop = True, 
                                                   keep_first = True))

    print("\n## 4) Drop including the first occurence:")        
    print(Mask_Or_Drop_Successive_Identical_Values(ts, drop = True, 
                                                   keep_first = False))


    # With dataframes
    print("With dataframe:\n")
    df = pd.DataFrame(np.random.randn(15, 3))
    df.iloc[4:9,0]=40
    df.iloc[8:15,1]=22
    df.iloc[8:12,2]=0.23

    print("#Original df:")
    print(df)

    print("\n## 5) Mask keeping the first occurence:") 
    print(Mask_Or_Drop_Successive_Identical_Values(df, drop = False, 
                                                   keep_first = True))

    print("\n## 6) Mask including the first occurence:")    
    print(Mask_Or_Drop_Successive_Identical_Values(df, drop = False, 
                                                   keep_first = False))

    print("\n## 7) Drop 'any' keeping the first occurence:")    
    print(Mask_Or_Drop_Successive_Identical_Values(df, drop = True, 
                                                   keep_first = True,
                                                   how = 'any'))

    print("\n## 8) Drop 'all' keeping the first occurence:")    
    print(Mask_Or_Drop_Successive_Identical_Values(df, drop = True, 
                                                   keep_first = True,
                                                   how = 'all'))

    print("\n## 9) Drop 'any' including the first occurence:")        
    print(Mask_Or_Drop_Successive_Identical_Values(df, drop = True, 
                                                   keep_first = False,
                                                   how = 'any'))

    print("\n## 10) Drop 'all' including the first occurence:")        
    print(Mask_Or_Drop_Successive_Identical_Values(df, drop = True, 
                                                   keep_first = False,
                                                   how = 'all'))

And here is the expected result:

这是预期的结果：

With time series:

#Original ts:
0     1.0
1     1.0
2     2.0
3     2.0
4     3.0
5     2.0
6     6.0
7     6.0
8     NaN
9     6.0
10    6.0
11    NaN
12    NaN
dtype: float64

## 1) Mask keeping the first occurence:
0     1.0
1     NaN
2     2.0
3     NaN
4     3.0
5     2.0
6     6.0
7     NaN
8     NaN
9     6.0
10    NaN
11    NaN
12    NaN
dtype: float64

## 2) Mask including the first occurence:
0     NaN
1     NaN
2     NaN
3     NaN
4     3.0
5     2.0
6     NaN
7     NaN
8     NaN
9     NaN
10    NaN
11    NaN
12    NaN
dtype: float64

## 3) Drop keeping the first occurence:
0    1.0
2    2.0
4    3.0
5    2.0
6    6.0
9    6.0
dtype: float64

## 4) Drop including the first occurence:
4    3.0
5    2.0
dtype: float64
With dataframe:

#Original df:
            0          1         2
0   -1.890137  -3.125224 -1.029065
1   -0.224712  -0.194742  1.891365
2    1.009388   0.589445  0.927405
3    0.212746  -0.392314 -0.781851
4   40.000000   1.889781 -1.394573
5   40.000000  -0.470958 -0.339213
6   40.000000   1.613524  0.271641
7   40.000000  -1.810958 -1.568372
8   40.000000  22.000000  0.230000
9   -0.296557  22.000000  0.230000
10  -0.921238  22.000000  0.230000
11  -0.170195  22.000000  0.230000
12   1.460457  22.000000 -0.295418
13   0.307825  22.000000 -0.759131
14   0.287392  22.000000  0.378315

## 5) Mask keeping the first occurence:
            0          1         2
0   -1.890137  -3.125224 -1.029065
1   -0.224712  -0.194742  1.891365
2    1.009388   0.589445  0.927405
3    0.212746  -0.392314 -0.781851
4   40.000000   1.889781 -1.394573
5         NaN  -0.470958 -0.339213
6         NaN   1.613524  0.271641
7         NaN  -1.810958 -1.568372
8         NaN  22.000000  0.230000
9   -0.296557        NaN       NaN
10  -0.921238        NaN       NaN
11  -0.170195        NaN       NaN
12   1.460457        NaN -0.295418
13   0.307825        NaN -0.759131
14   0.287392        NaN  0.378315

## 6) Mask including the first occurence:
           0         1         2
0  -1.890137 -3.125224 -1.029065
1  -0.224712 -0.194742  1.891365
2   1.009388  0.589445  0.927405
3   0.212746 -0.392314 -0.781851
4        NaN  1.889781 -1.394573
5        NaN -0.470958 -0.339213
6        NaN  1.613524  0.271641
7        NaN -1.810958 -1.568372
8        NaN       NaN       NaN
9  -0.296557       NaN       NaN
10 -0.921238       NaN       NaN
11 -0.170195       NaN       NaN
12  1.460457       NaN -0.295418
13  0.307825       NaN -0.759131
14  0.287392       NaN  0.378315

## 7) Drop 'any' keeping the first occurence:
           0         1         2
0  -1.890137 -3.125224 -1.029065
1  -0.224712 -0.194742  1.891365
2   1.009388  0.589445  0.927405
3   0.212746 -0.392314 -0.781851
4  40.000000  1.889781 -1.394573

## 8) Drop 'all' keeping the first occurence:
            0          1         2
0   -1.890137  -3.125224 -1.029065
1   -0.224712  -0.194742  1.891365
2    1.009388   0.589445  0.927405
3    0.212746  -0.392314 -0.781851
4   40.000000   1.889781 -1.394573
5         NaN  -0.470958 -0.339213
6         NaN   1.613524  0.271641
7         NaN  -1.810958 -1.568372
8         NaN  22.000000  0.230000
9   -0.296557        NaN       NaN
10  -0.921238        NaN       NaN
11  -0.170195        NaN       NaN
12   1.460457        NaN -0.295418
13   0.307825        NaN -0.759131
14   0.287392        NaN  0.378315

## 9) Drop 'any' including the first occurence:
          0         1         2
0 -1.890137 -3.125224 -1.029065
1 -0.224712 -0.194742  1.891365
2  1.009388  0.589445  0.927405
3  0.212746 -0.392314 -0.781851

## 10) Drop 'all' including the first occurence:
           0         1         2
0  -1.890137 -3.125224 -1.029065
1  -0.224712 -0.194742  1.891365
2   1.009388  0.589445  0.927405
3   0.212746 -0.392314 -0.781851
4        NaN  1.889781 -1.394573
5        NaN -0.470958 -0.339213
6        NaN  1.613524  0.271641
7        NaN -1.810958 -1.568372
9  -0.296557       NaN       NaN
10 -0.921238       NaN       NaN
11 -0.170195       NaN       NaN
12  1.460457       NaN -0.295418
13  0.307825       NaN -0.759131
14  0.287392       NaN  0.378315

pandas 熊猫：删除连续的重复项

提问by Thomas Johnson

回答by EdChum

回答by johnml1135

回答by Divakar

回答by Arthur D. Howland

回答by Antoine Collet

相关推荐

最近更新

标签

pandas 熊猫：删除连续的重复项

提问by Thomas Johnson

回答by EdChum

回答by johnml1135

回答by Divakar

回答by Arthur D. Howland

回答by Antoine Collet

相关推荐

pandas 从数据框中获取满足熊猫中条件的行

Pandas 合并错误：MemoryError

pandas 在 DataFrame 对象上使用滚动应用

pandas - 扩展 DataFrame 的索引将新行的所有列设置为 NaN？

相关推荐

最近更新

标签