pandas: Identifying consecutive occurrences of a value

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/37934399/

Identifying consecutive occurrences of a value

python, pandas, dataframe, itertools

Asked by Stefano Potter

I have a df like so:

Count
1
0
1
1
0
0
1
1
1
0

and I want to return a 1 in a new column if there are two or more consecutive occurrences of 1 in Count, and a 0 if there are not. So each row in the new column would get a 1 based on this criterion being met in the Count column. My desired output would then be:

Count  New_Value
1      0 
0      0
1      1
1      1
0      0
0      0
1      1
1      1 
1      1
0      0

I am thinking I may need to use itertools, but I have been reading about it and haven't come across what I need yet. I would also like to be able to use this method to count any number of consecutive occurrences, not just 2. For example, sometimes I need to count 10 consecutive occurrences; I just use 2 in the example here.

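For reference, the example column can be rebuilt from a plain Python list; a minimal setup sketch (the variable name df matches the snippets below):

import pandas as pd

# rebuild the sample data from the question
df = pd.DataFrame({'Count': [1, 0, 1, 1, 0, 0, 1, 1, 1, 0]})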

Accepted answer by Stefan

You could:

df['consecutive'] = df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count

to get:

   Count  consecutive
0      1            1
1      0            0
2      1            2
3      1            2
4      0            0
5      0            0
6      1            3
7      1            3
8      1            3
9      0            0
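
The grouping key (df.Count != df.Count.shift()).cumsum() works by flagging every row where the value changes and then numbering the runs with a cumulative sum; here is a small sketch of the intermediate steps (the variable names change, run_id and run_len are only illustrative):

change = df.Count != df.Count.shift()                  # True at the first row of each new run
run_id = change.cumsum()                               # distinct integer label per run
run_len = df.Count.groupby(run_id).transform('size')   # each run's length, broadcast back to its rows
print(pd.concat([df.Count, change, run_id, run_len], axis=1,
                keys=['Count', 'change', 'run_id', 'run_len']))

Multiplying the run length by df.Count then zeroes out the runs of 0, which is what the one-liner above does.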

From here you can, for any threshold:

threshold = 2
df['consecutive'] = (df.consecutive >= threshold).astype(int)

to get:

   Count  consecutive
0      1            0
1      0            0
2      1            1
3      1            1
4      0            0
5      0            0
6      1            1
7      1            1
8      1            1
9      0            0

or, in a single step:

(df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count >= threshold).astype(int)
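
Since the question also asks about longer runs (e.g. 10 in a row), the same idea can be wrapped in a small helper; this is just a sketch, and the function name flag_consecutive is illustrative rather than part of the original answer:

def flag_consecutive(s, value=1, threshold=2):
    """Return 1 where `value` occurs in a run of at least `threshold` rows, else 0."""
    run_len = s.groupby((s != s.shift()).cumsum()).transform('size')
    return ((s == value) & (run_len >= threshold)).astype(int)

df['New_Value'] = flag_consecutive(df.Count, threshold=2)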

In terms of efficiency, using pandas methods provides a significant speedup as the size of the problem grows:

df = pd.concat([df for _ in range(1000)])

%timeit (df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count >= threshold).astype(int)
1000 loops, best of 3: 1.47 ms per loop

compared to:

%%timeit
l = []
for k, g in groupby(df.Count):
    size = sum(1 for _ in g)
    if k == 1 and size >= 2:
        l = l + [1]*size
    else:
        l = l + [0]*size    
pd.Series(l)

10 loops, best of 3: 76.7 ms per loop

Answered by Psidom

Not sure if this is optimized, but you can give it a try:

from itertools import groupby
import pandas as pd

l = []
for k, g in groupby(df.Count):
    size = sum(1 for _ in g)           # length of this run of identical values
    if k == 1 and size >= 2:           # a run of 1s that is long enough
        l = l + [1]*size
    else:
        l = l + [0]*size

df['new_Value'] = pd.Series(l)

df

Count   new_Value
0   1   0
1   0   0
2   1   1
3   1   1
4   0   0
5   0   0
6   1   1
7   1   1
8   1   1
9   0   0
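
For intuition, itertools.groupby collapses the column into runs of identical consecutive values, which is exactly what the loop above iterates over; a quick sketch of the (value, run length) pairs for the sample data (the expected result is shown as a comment):

from itertools import groupby

runs = [(k, sum(1 for _ in g)) for k, g in groupby(df.Count)]
print(runs)  # [(1, 1), (0, 1), (1, 2), (0, 2), (1, 3), (0, 1)]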