Python 比较 Pandas DataFrame 中的前一行值

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/41399538/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-20 00:53:28  来源:igfitidea点击:

Comparing previous row values in Pandas DataFrame

pythonpandasnumpybooleanshift

提问by jth359

import pandas as pd
data={'col1':[1,3,3,1,2,3,2,2]}
df=pd.DataFrame(data,columns=['col1'])
print df


         col1  
    0     1          
    1     3          
    2     3          
    3     1          
    4     2          
    5     3          
    6     2          
    7     2          

I have the following Pandas DataFrame and I want to create another column that compares the previous row of col1 to see if they are equal. What would be the best way to do this? It would be like the following DataFrame. Thanks

我有以下 Pandas DataFrame,我想创建另一列来比较 col1 的前一行以查看它们是否相等。什么是最好的方法来做到这一点?它就像下面的 DataFrame。谢谢

    col1  match  
0     1   False     
1     3   False     
2     3   True     
3     1   False     
4     2   False     
5     3   False     
6     2   False     
7     2   True     

回答by jezrael

You need eqwith shift:

你需要eqshift

df['match'] = df.col1.eq(df.col1.shift())
print (df)
   col1  match
0     1  False
1     3  False
2     3   True
3     1  False
4     2  False
5     3  False
6     2  False
7     2   True

Or instead equse ==, but it is a bit slowier in large DataFrame:

或者改为eq使用==,但在大型 DataFrame 中它会慢一点:

df['match'] = df.col1 == df.col1.shift()
print (df)
   col1  match
0     1  False
1     3  False
2     3   True
3     1  False
4     2  False
5     3  False
6     2  False
7     2   True

Timings:

时间

import pandas as pd
data={'col1':[1,3,3,1,2,3,2,2]}
df=pd.DataFrame(data,columns=['col1'])
print (df)
#[80000 rows x 1 columns]
df = pd.concat([df]*10000).reset_index(drop=True)

df['match'] = df.col1 == df.col1.shift()
df['match1'] = df.col1.eq(df.col1.shift())
print (df)

In [208]: %timeit df.col1.eq(df.col1.shift())
The slowest run took 4.83 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 933 μs per loop

In [209]: %timeit df.col1 == df.col1.shift()
1000 loops, best of 3: 1 ms per loop

回答by Divakar

Here's a NumPy arrays based approach using slicingthat lets us use the views into the input array for efficiency purposes -

这是一种基于 NumPy 数组的方法slicing,它允许我们将视图用于输入数组以提高效率 -

def comp_prev(a):
    return np.concatenate(([False],a[1:] == a[:-1]))

df['match'] = comp_prev(df.col1.values)

Sample run -

样品运行 -

In [48]: df['match'] = comp_prev(df.col1.values)

In [49]: df
Out[49]: 
   col1  match
0     1  False
1     3  False
2     3   True
3     1  False
4     2  False
5     3  False
6     2  False
7     2   True

Runtime test -

运行时测试 -

In [56]: data={'col1':[1,3,3,1,2,3,2,2]}
    ...: df0=pd.DataFrame(data,columns=['col1'])
    ...: 

#@jezrael's soln1
In [57]: df = pd.concat([df0]*10000).reset_index(drop=True)

In [58]: %timeit df['match'] = df.col1 == df.col1.shift() 
1000 loops, best of 3: 1.53 ms per loop

#@jezrael's soln2
In [59]: df = pd.concat([df0]*10000).reset_index(drop=True)

In [60]: %timeit df['match'] = df.col1.eq(df.col1.shift())
1000 loops, best of 3: 1.49 ms per loop

#@Nickil Maveli's soln1   
In [61]: df = pd.concat([df0]*10000).reset_index(drop=True)

In [64]: %timeit df['match'] = df['col1'].diff().eq(0) 
1000 loops, best of 3: 1.02 ms per loop

#@Nickil Maveli's soln2
In [65]: df = pd.concat([df0]*10000).reset_index(drop=True)

In [66]: %timeit df['match'] = np.ediff1d(df['col1'].values, to_begin=np.NaN) == 0
1000 loops, best of 3: 1.52 ms per loop

# Posted approach in this post
In [67]: df = pd.concat([df0]*10000).reset_index(drop=True)

In [68]: %timeit df['match'] = comp_prev(df.col1.values)
1000 loops, best of 3: 376 μs per loop

回答by Nickil Maveli

1) pandas approach:Use diff:

1)熊猫方法:使用diff

df['match'] = df['col1'].diff().eq(0)


2) numpy approach:Use np.ediff1d.

2) numpy 方法:使用np.ediff1d.

df['match'] = np.ediff1d(df['col1'].values, to_begin=np.NaN) == 0

Both produce:

两者都产生:

enter image description here

在此处输入图片说明

Timings:(for the same DFused by @jezrael)

时间:(DF@jezrael 使用的相同)

%timeit df.col1.eq(df.col1.shift())
1000 loops, best of 3: 731 μs per loop

%timeit df['col1'].diff().eq(0)
1000 loops, best of 3: 405 μs per loop

回答by SEDaradji

I'm surprised no one mentioned rollingmethod here. rollingcan be easily used to verify if the n-previous values are all the same or to perform any custom operations. This is certainly not as fast as using diff or shift but it can be easily adapted for larger windows:

我很惊讶这里没有人提到滚动方法。滚动可以很容易地用于验证 n-previous 值是否都相同或执行任何自定义操作。这当然不如使用 diff 或 shift 快,但它可以很容易地适应更大的窗口:

df['match'] = df['col1'].rolling(2).apply(lambda x: len(set(x)) != len(x),raw= True).replace({0 : False, 1: True})