Original question: http://stackoverflow.com/questions/19463985/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Pandas: Drop consecutive duplicates
Asked by Thomas Johnson
What's the most efficient way to drop only consecutive duplicates in pandas?
drop_duplicates gives this:
In [3]: a = pandas.Series([1,2,2,3,2], index=[1,2,3,4,5])
In [4]: a.drop_duplicates()
Out[4]:
1 1
2 2
4 3
dtype: int64
But I want this:
In [4]: a.something()
Out[4]:
1 1
2 2
4 3
5 2
dtype: int64
Answered by EdChum
Use shift:
a.loc[a.shift(-1) != a]
Out[3]:
1 1
3 2
4 3
5 2
dtype: int64
The above uses a boolean mask: we compare the Series against itself shifted by -1 rows and keep the rows where the two differ.
Another method is to use diff:
In [82]:
a.loc[a.diff() != 0]
Out[82]:
1 1
2 2
4 3
5 2
dtype: int64
But this is slower than the original method if you have a large number of rows.
Update
Thanks to Bjarke Ebert for pointing out a subtle error: I should actually use shift(1), or just shift() since the default period is 1. This keeps the first value of each consecutive run:
In [87]:
a.loc[a.shift() != a]
Out[87]:
1 1
2 2
4 3
5 2
dtype: int64
Note the difference in index values: index 2 is kept rather than 3. Thanks @BjarkeEbert!
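To make the mask explicit, here is a minimal sketch that rebuilds the example step by step. The variable names shifted and keep are illustrative and not part of the original answer. It also shows the same comparison on strings, where the subtraction-based diff approach does not apply.

```python
import pandas as pd

a = pd.Series([1, 2, 2, 3, 2], index=[1, 2, 3, 4, 5])

# shift() moves every value down one row; the first row becomes NaN.
shifted = a.shift()

# True wherever a value differs from the one directly above it.
# The first row is compared against NaN, so it is always kept.
keep = a != shifted
result = a.loc[keep]
print(result.index.tolist())   # [1, 2, 4, 5]
print(result.tolist())         # [1, 2, 3, 2]

# The same comparison on an object-dtype Series of strings.
s = pd.Series(["x", "x", "y", "x"], dtype=object)
print(s.loc[s.shift() != s].tolist())   # ['x', 'y', 'x']
```

One thing to keep in mind: NaN != NaN evaluates to True, so with this mask a run of consecutive NaN values is kept in full rather than collapsed.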
Answered by johnml1135
Here is an update that will make it work with multiple columns. Use ".any(axis=1)" to combine the results from each column:
cols = ["col1","col2","col3"]
de_dup = a[cols].loc[(a[cols].shift() != a[cols]).any(axis=1)]
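As an illustration of what any(axis=1) does here, the sketch below runs the same expression on a small made-up DataFrame (the data is an assumption; the names a, cols and de_dup follow the answer). A row is kept if at least one of the listed columns changed relative to the previous row, so only rows that repeat in every column are dropped.

```python
import pandas as pd

# Hypothetical data for illustration only.
a = pd.DataFrame({
    "col1": [1, 1, 1, 2],
    "col2": ["x", "x", "y", "y"],
    "col3": [0.5, 0.5, 0.5, 0.5],
})
cols = ["col1", "col2", "col3"]

# Element-wise comparison with the previous row: one boolean per cell.
changed = a[cols].shift() != a[cols]

# Keep a row when any column changed; row 1 repeats row 0 in every column.
de_dup = a[cols].loc[changed.any(axis=1)]
print(de_dup.index.tolist())   # [0, 2, 3]
```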
Answered by Divakar
Since we are after the most efficient way, i.e. performance, let's work on the underlying array data to leverage NumPy. We compare two slices offset by one element, similar to the shifting method discussed earlier in @EdChum's post. With NumPy slicing the comparison yields an array one element shorter than the input, so we concatenate a True element at the start to select the first element. The implementation looks like this:
import numpy as np

def drop_consecutive_duplicates(a):
    ar = a.values
    # Compare each element with its predecessor; always keep the first one.
    return a[np.concatenate(([True], ar[:-1] != ar[1:]))]
Sample run -
In [149]: a
Out[149]:
1 1
2 2
3 2
4 3
5 2
dtype: int64
In [150]: drop_consecutive_duplicates(a)
Out[150]:
1 1
2 2
4 3
5 2
dtype: int64
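The slice comparison can be unpacked as follows. This is a sketch, not the answer's code: the guard for empty input is my own addition (without it, the function above would build a one-element mask for a zero-length Series), and the name drop_consecutive_duplicates_safe is hypothetical.

```python
import numpy as np
import pandas as pd

def drop_consecutive_duplicates_safe(a):
    """Variant of the function above that also accepts an empty Series."""
    ar = a.values
    if ar.size == 0:
        # Nothing to compare; return the input unchanged.
        return a
    # ar[:-1] is every element but the last, ar[1:] every element but the
    # first, so comparing them tests each adjacent pair (len(ar) - 1 results).
    changed = ar[1:] != ar[:-1]
    # Prepend True so the first element is always selected.
    mask = np.concatenate(([True], changed))
    return a[mask]

a = pd.Series([1, 2, 2, 3, 2], index=[1, 2, 3, 4, 5])
out = drop_consecutive_duplicates_safe(a)
print(out.index.tolist())   # [1, 2, 4, 5]

empty = pd.Series([], dtype="int64")
print(len(drop_consecutive_duplicates_safe(empty)))   # 0
```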
Timings on large arrays, compared with @EdChum's solution:
In [142]: a = pd.Series(np.random.randint(1,5,(1000000)))
In [143]: %timeit a.loc[a.shift() != a]
100 loops, best of 3: 12.1 ms per loop
In [144]: %timeit drop_consecutive_duplicates(a)
100 loops, best of 3: 11 ms per loop
In [145]: a = pd.Series(np.random.randint(1,5,(10000000)))
In [146]: %timeit a.loc[a.shift() != a]
10 loops, best of 3: 136 ms per loop
In [147]: %timeit drop_consecutive_duplicates(a)
10 loops, best of 3: 114 ms per loop
So, there's some improvement!
Get a major boost for values only!
If only the values are needed, we can get a major boost by simply indexing into the array data instead of the Series, like so:
def drop_consecutive_duplicates(a):
    ar = a.values
    return ar[np.concatenate(([True], ar[:-1] != ar[1:]))]
Sample run -
In [170]: a = pandas.Series([1,2,2,3,2], index=[1,2,3,4,5])
In [171]: drop_consecutive_duplicates(a)
Out[171]: array([1, 2, 3, 2])
Timings -
In [173]: a = pd.Series(np.random.randint(1,5,(10000000)))
In [174]: %timeit a.loc[a.shift() != a]
10 loops, best of 3: 137 ms per loop
In [175]: %timeit drop_consecutive_duplicates(a)
10 loops, best of 3: 61.3 ms per loop
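These figures depend on hardware and library versions, so it may be worth rerunning them. Below is a sketch that uses the standard-library timeit module outside IPython. The array size, the repeat counts and the helper names shift_method and numpy_method are my own choices, and no timing figures are claimed.

```python
import timeit

import numpy as np
import pandas as pd

def shift_method(a):
    return a.loc[a.shift() != a]

def numpy_method(a):
    ar = a.values
    return a[np.concatenate(([True], ar[:-1] != ar[1:]))]

a = pd.Series(np.random.randint(1, 5, 100_000))

# Both approaches should select exactly the same rows.
same = shift_method(a).equals(numpy_method(a))
print("same result:", same)

# Report the best of several runs, in the spirit of %timeit.
for fn in (shift_method, numpy_method):
    best = min(timeit.repeat(lambda: fn(a), number=10, repeat=3)) / 10
    print(f"{fn.__name__}: {best * 1e3:.3f} ms per call")
```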
Answered by Arthur D. Howland
For other Stack Overflow readers: this builds on johnml1135's answer above. It drops a row when its values in the listed columns repeat those of the previous row, while keeping all columns of the DataFrame in the result. On a sorted DataFrame, the first row of each run is kept and the following row is dropped if the "cols" values match, even if other columns hold non-matching information.
cols = ["col1","col2","col3"]
df = df.loc[(df[cols].shift() != df[cols]).any(axis=1)]
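A small made-up example (the column name other and the data are assumptions) shows the behaviour described: duplicates are judged on cols only, yet the surviving rows keep every column.

```python
import pandas as pd

# Hypothetical data; "other" is not part of the duplicate check.
df = pd.DataFrame({
    "col1": [1, 1, 2],
    "col2": [5, 5, 5],
    "col3": [9, 9, 9],
    "other": ["a", "b", "c"],
})
cols = ["col1", "col2", "col3"]

# Row 1 repeats row 0 in all of cols, so it is dropped even though
# its "other" value differs.
df = df.loc[(df[cols].shift() != df[cols]).any(axis=1)]
print(df.index.tolist())      # [0, 2]
print(df["other"].tolist())   # ['a', 'c']
```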
Answered by Antoine Collet
Here is a function that handles both pd.Series and pd.DataFrame objects. You can mask or drop, choose the axis, and finally choose whether to drop with 'any' or 'all' NaN. It is not optimized in terms of computation time, but it has the advantage of being robust and fairly clear.
import numpy as np
import pandas as pd

# To mask/drop successive values in pandas
def Mask_Or_Drop_Successive_Identical_Values(df, drop=False,
                                             keep_first=True,
                                             axis=0, how='all'):
    '''
    Function built with the help of:
    1) https://stackoverflow.com/questions/48428173/how-to-change-consecutive-repeating-values-in-pandas-dataframe-series-to-nan-or
    2) https://stackoverflow.com/questions/19463985/pandas-drop-consecutive-duplicates

    Input:
    df should be a pandas.DataFrame or a pandas.Series
    Output:
    df or ts with masked or dropped values
    '''
    # Mask repeats, keeping the first occurrence of each run
    if keep_first:
        df = df.mask(df.shift(1) == df)
    # Mask every value of a run, including the first occurrence
    else:
        df = df.mask((df.shift(1) == df) | (df.shift(-1) == df))

    # Only mask the values (they become NaN)
    if not drop:
        return df
    # Drop the masked values (rows or columns are deleted)
    return df.dropna(axis=axis, how=how)
Here is test code to include in the script:
if __name__ == "__main__":

    # With time series
    print("With time series:\n")
    ts = pd.Series([1,1,2,2,3,2,6,6,float('nan'), 6,6,float('nan'),float('nan')],
                   index=[0,1,2,3,4,5,6,7,8,9,10,11,12])

    print("#Original ts:")
    print(ts)

    print("\n## 1) Mask keeping the first occurence:")
    print(Mask_Or_Drop_Successive_Identical_Values(ts, drop = False,
                                                   keep_first = True))

    print("\n## 2) Mask including the first occurence:")
    print(Mask_Or_Drop_Successive_Identical_Values(ts, drop = False,
                                                   keep_first = False))

    print("\n## 3) Drop keeping the first occurence:")
    print(Mask_Or_Drop_Successive_Identical_Values(ts, drop = True,
                                                   keep_first = True))

    print("\n## 4) Drop including the first occurence:")
    print(Mask_Or_Drop_Successive_Identical_Values(ts, drop = True,
                                                   keep_first = False))

    # With dataframes
    print("With dataframe:\n")
    df = pd.DataFrame(np.random.randn(15, 3))
    df.iloc[4:9,0]=40
    df.iloc[8:15,1]=22
    df.iloc[8:12,2]=0.23

    print("#Original df:")
    print(df)

    print("\n## 5) Mask keeping the first occurence:")
    print(Mask_Or_Drop_Successive_Identical_Values(df, drop = False,
                                                   keep_first = True))

    print("\n## 6) Mask including the first occurence:")
    print(Mask_Or_Drop_Successive_Identical_Values(df, drop = False,
                                                   keep_first = False))

    print("\n## 7) Drop 'any' keeping the first occurence:")
    print(Mask_Or_Drop_Successive_Identical_Values(df, drop = True,
                                                   keep_first = True,
                                                   how = 'any'))

    print("\n## 8) Drop 'all' keeping the first occurence:")
    print(Mask_Or_Drop_Successive_Identical_Values(df, drop = True,
                                                   keep_first = True,
                                                   how = 'all'))

    print("\n## 9) Drop 'any' including the first occurence:")
    print(Mask_Or_Drop_Successive_Identical_Values(df, drop = True,
                                                   keep_first = False,
                                                   how = 'any'))

    print("\n## 10) Drop 'all' including the first occurence:")
    print(Mask_Or_Drop_Successive_Identical_Values(df, drop = True,
                                                   keep_first = False,
                                                   how = 'all'))
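And here is the expected result (the DataFrame is filled with unseeded random numbers, so the non-constant values will differ from run to run):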
With time series:
#Original ts:
0 1.0
1 1.0
2 2.0
3 2.0
4 3.0
5 2.0
6 6.0
7 6.0
8 NaN
9 6.0
10 6.0
11 NaN
12 NaN
dtype: float64
## 1) Mask keeping the first occurence:
0 1.0
1 NaN
2 2.0
3 NaN
4 3.0
5 2.0
6 6.0
7 NaN
8 NaN
9 6.0
10 NaN
11 NaN
12 NaN
dtype: float64
## 2) Mask including the first occurence:
0 NaN
1 NaN
2 NaN
3 NaN
4 3.0
5 2.0
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
dtype: float64
## 3) Drop keeping the first occurence:
0 1.0
2 2.0
4 3.0
5 2.0
6 6.0
9 6.0
dtype: float64
## 4) Drop including the first occurence:
4 3.0
5 2.0
dtype: float64
With dataframe:
#Original df:
0 1 2
0 -1.890137 -3.125224 -1.029065
1 -0.224712 -0.194742 1.891365
2 1.009388 0.589445 0.927405
3 0.212746 -0.392314 -0.781851
4 40.000000 1.889781 -1.394573
5 40.000000 -0.470958 -0.339213
6 40.000000 1.613524 0.271641
7 40.000000 -1.810958 -1.568372
8 40.000000 22.000000 0.230000
9 -0.296557 22.000000 0.230000
10 -0.921238 22.000000 0.230000
11 -0.170195 22.000000 0.230000
12 1.460457 22.000000 -0.295418
13 0.307825 22.000000 -0.759131
14 0.287392 22.000000 0.378315
## 5) Mask keeping the first occurence:
0 1 2
0 -1.890137 -3.125224 -1.029065
1 -0.224712 -0.194742 1.891365
2 1.009388 0.589445 0.927405
3 0.212746 -0.392314 -0.781851
4 40.000000 1.889781 -1.394573
5 NaN -0.470958 -0.339213
6 NaN 1.613524 0.271641
7 NaN -1.810958 -1.568372
8 NaN 22.000000 0.230000
9 -0.296557 NaN NaN
10 -0.921238 NaN NaN
11 -0.170195 NaN NaN
12 1.460457 NaN -0.295418
13 0.307825 NaN -0.759131
14 0.287392 NaN 0.378315
## 6) Mask including the first occurence:
0 1 2
0 -1.890137 -3.125224 -1.029065
1 -0.224712 -0.194742 1.891365
2 1.009388 0.589445 0.927405
3 0.212746 -0.392314 -0.781851
4 NaN 1.889781 -1.394573
5 NaN -0.470958 -0.339213
6 NaN 1.613524 0.271641
7 NaN -1.810958 -1.568372
8 NaN NaN NaN
9 -0.296557 NaN NaN
10 -0.921238 NaN NaN
11 -0.170195 NaN NaN
12 1.460457 NaN -0.295418
13 0.307825 NaN -0.759131
14 0.287392 NaN 0.378315
## 7) Drop 'any' keeping the first occurence:
0 1 2
0 -1.890137 -3.125224 -1.029065
1 -0.224712 -0.194742 1.891365
2 1.009388 0.589445 0.927405
3 0.212746 -0.392314 -0.781851
4 40.000000 1.889781 -1.394573
## 8) Drop 'all' keeping the first occurence:
0 1 2
0 -1.890137 -3.125224 -1.029065
1 -0.224712 -0.194742 1.891365
2 1.009388 0.589445 0.927405
3 0.212746 -0.392314 -0.781851
4 40.000000 1.889781 -1.394573
5 NaN -0.470958 -0.339213
6 NaN 1.613524 0.271641
7 NaN -1.810958 -1.568372
8 NaN 22.000000 0.230000
9 -0.296557 NaN NaN
10 -0.921238 NaN NaN
11 -0.170195 NaN NaN
12 1.460457 NaN -0.295418
13 0.307825 NaN -0.759131
14 0.287392 NaN 0.378315
## 9) Drop 'any' including the first occurence:
0 1 2
0 -1.890137 -3.125224 -1.029065
1 -0.224712 -0.194742 1.891365
2 1.009388 0.589445 0.927405
3 0.212746 -0.392314 -0.781851
## 10) Drop 'all' including the first occurence:
0 1 2
0 -1.890137 -3.125224 -1.029065
1 -0.224712 -0.194742 1.891365
2 1.009388 0.589445 0.927405
3 0.212746 -0.392314 -0.781851
4 NaN 1.889781 -1.394573
5 NaN -0.470958 -0.339213
6 NaN 1.613524 0.271641
7 NaN -1.810958 -1.568372
9 -0.296557 NaN NaN
10 -0.921238 NaN NaN
11 -0.170195 NaN NaN
12 1.460457 NaN -0.295418
13 0.307825 NaN -0.759131
14 0.287392 NaN 0.378315
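One consequence is visible in cases 3 and 4 above: because the function marks duplicates with NaN and then calls dropna, values that were already NaN in the input (index 8, 11 and 12 of ts) are dropped as well, and integer data would be upcast to float whenever a value is masked. If that is not wanted, selecting rows with a boolean mask avoids both effects. The following is a sketch of that alternative for a Series only. It is not the author's function, the name drop_successive_keep_nan is hypothetical, and collapsing a run of NaNs to a single NaN is my own design choice.

```python
import pandas as pd

def drop_successive_keep_nan(s):
    """Drop repeats of the previous value; a run of NaNs keeps its first NaN."""
    prev = s.shift()
    same_value = s == prev                  # always False where NaN is involved
    both_nan = s.isna() & prev.isna()
    # The first row has no predecessor, so it is never a repeat.
    both_nan.iloc[:1] = False
    return s.loc[~(same_value | both_nan)]

ts = pd.Series([1, 1, 2, 2, 3, 2, 6, 6, float('nan'), 6, 6,
                float('nan'), float('nan')])
kept = drop_successive_keep_nan(ts)
print(kept.index.tolist())   # [0, 2, 4, 5, 6, 8, 9, 11]

# Integer input keeps its integer dtype because nothing is set to NaN.
ints = drop_successive_keep_nan(pd.Series([1, 1, 2]))
print(ints.tolist())         # [1, 2]
```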

