Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/45886518/

Identify consecutive same values in Pandas Dataframe, with a Groupby

Tags: python, pandas, numpy, lambda

Asked by clg4

I have the following dataframe df:

import pandas as pd

data={'id':[1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2],
      'value':[2,2,3,2,2,2,3,3,3,3,1,4,1,1,1,4,4,1,1,1,1,1]}
df=pd.DataFrame.from_dict(data)
df
Out[8]: 
    id  value
0    1      2
1    1      2
2    1      3
3    1      2
4    1      2
5    1      2
6    1      3
7    1      3
8    1      3
9    1      3
10   2      1
11   2      4
12   2      1
13   2      1
14   2      1
15   2      4
16   2      4
17   2      1
18   2      1
19   2      1
20   2      1
21   2      1

What I need to do is identify, within each id (df.groupby('id')), where value shows the same number 3 or more times in a row.

I would like to have the following result for the above:

df
Out[12]: 
    id  value  flag
0    1      2     0
1    1      2     0
2    1      3     0
3    1      2     1
4    1      2     1
5    1      2     1
6    1      3     1
7    1      3     1
8    1      3     1
9    1      3     1
10   2      1     0
11   2      4     0
12   2      1     1
13   2      1     1
14   2      1     1
15   2      4     0
16   2      4     0
17   2      1     1
18   2      1     1
19   2      1     1
20   2      1     1
21   2      1     1

I have tried variations of groupby and lambda with pandas rolling.mean: the rolling mean is compared with 'value', and rows where the two match are flagged. This has several problems. Different values can average to the value you are trying to flag. I also can't figure out how to flag all of the rows in the rolling window that produced the initial flag: the code below identifies the 'right side' of each run, but I still need to fill in the preceding rows of the window. Here is my code:

import numpy as np

test=df.copy()
test['rma']=test.groupby('id')['value'].transform(lambda x: x.rolling(min_periods=3,window=3).mean())
test['flag']=np.where(test.rma==test.value,1,0)

And the result here:

test
Out[61]: 
    id  value       rma  flag
0    1      2       NaN     0
1    1      2       NaN     0
2    1      3  2.333333     0
3    1      2  2.333333     0
4    1      2  2.333333     0
5    1      2  2.000000     1
6    1      3  2.333333     0
7    1      3  2.666667     0
8    1      3  3.000000     1
9    1      3  3.000000     1
10   2      1       NaN     0
11   2      4       NaN     0
12   2      1  2.000000     0
13   2      1  2.000000     0
14   2      1  1.000000     1
15   2      4  2.000000     0
16   2      4  3.000000     0
17   2      1  3.000000     0
18   2      1  2.000000     0
19   2      1  1.000000     1
20   2      1  1.000000     1
21   2      1  1.000000     1
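
The false-positive problem mentioned in the question can be reproduced on a tiny series. This is only a sketch; the series `s` is made up for illustration and is not part of the original data. It has no repeated values at all, yet the rolling-mean comparison flags its last row because 3, 1 and 2 average to 2.

```python
import pandas as pd

# Hypothetical series with no run of equal values.
s = pd.Series([3, 1, 2])

# Same comparison as in the question: rolling mean of 3 vs. the current value.
rma = s.rolling(window=3, min_periods=3).mean()
false_flag = (rma == s).astype(int)

print(false_flag.tolist())
```

The last row is flagged even though nothing repeats, which is why comparing against a mean is not a reliable test for "same value three times in a row".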

Can't wait to see what I am missing! Thanks

Answered by Psidom

You can try this: 1) create an extra group variable with df.value.diff().ne(0).cumsum() to denote the value changes; 2) use transform('size') to calculate the group size and compare it with three, which gives you the flag column you need:

df['flag'] = df.value.groupby([df.id, df.value.diff().ne(0).cumsum()]).transform('size').ge(3).astype(int) 
df
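
As a self-contained check, the one-liner can be run on the sample data from the question and compared with the flag column the asker wanted. This is a sketch; the names `run_id` and `expected` are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1] * 10 + [2] * 12,
    'value': [2, 2, 3, 2, 2, 2, 3, 3, 3, 3, 1, 4, 1, 1, 1, 4, 4, 1, 1, 1, 1, 1],
})

# Label each run of equal values, then size the (id, run) groups.
run_id = df.value.diff().ne(0).cumsum()
df['flag'] = df.value.groupby([df.id, run_id]).transform('size').ge(3).astype(int)

# Desired flag column, copied from the question.
expected = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1]
print(df['flag'].tolist() == expected)
```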

Breakdown:

1) "diff is not equal to zero" (which is literally what df.value.diff().ne(0) means) gives a True condition whenever there is a value change. The first row is also True, because its diff is NaN:

df.value.diff().ne(0)
#0      True
#1     False
#2      True
#3      True
#4     False
#5     False
#6      True
#7     False
#8     False
#9     False
#10     True
#11     True
#12     True
#13    False
#14    False
#15     True
#16    False
#17     True
#18    False
#19    False
#20    False
#21    False
#Name: value, dtype: bool
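
One thing worth keeping in mind: diff() relies on subtraction, so it is meant for numeric columns. A sketch of an equivalent change detector that also handles non-numeric values is to compare each element with its shifted neighbour. The series `s` and `numeric` below are made up for illustration.

```python
import pandas as pd

# Hypothetical non-numeric series.
s = pd.Series(['a', 'a', 'b', 'b', 'b', 'a'])

# True wherever the value differs from the previous row (the first row
# is compared with NaN, so it is always True).
changed = s.ne(s.shift())
print(changed.tolist())

# On numeric data both formulations mark the same rows.
numeric = pd.Series([2, 2, 3, 2])
print(numeric.diff().ne(0).tolist() == numeric.ne(numeric.shift()).tolist())
```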

2) Then cumsum gives a non-descending sequence of ids, where each id denotes a consecutive chunk with the same value. Note that when summing boolean values, True is counted as one and False as zero:

df.value.diff().ne(0).cumsum()
#0     1
#1     1
#2     2
#3     3
#4     3
#5     3
#6     4
#7     4
#8     4
#9     4
#10    5
#11    6
#12    7
#13    7
#14    7
#15    8
#16    8
#17    9
#18    9
#19    9
#20    9
#21    9
#Name: value, dtype: int64

3) Combined with the id column, you can group the data frame, calculate the group size and get the flag column.
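
To make the last step visible, here is a sketch that keeps the intermediate group sizes instead of collapsing everything into one expression. The names `run_id` and `sizes` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1] * 10 + [2] * 12,
    'value': [2, 2, 3, 2, 2, 2, 3, 3, 3, 3, 1, 4, 1, 1, 1, 4, 4, 1, 1, 1, 1, 1],
})

run_id = df['value'].diff().ne(0).cumsum()

# Every row receives the length of the (id, run) group it belongs to.
sizes = df.groupby([df['id'], run_id])['value'].transform('size')
print(sizes.tolist())

# Rows in runs of length 3 or more get flag 1.
df['flag'] = sizes.ge(3).astype(int)
```

Because transform returns a result aligned with the original index, the sizes can be assigned or compared row by row without any merge.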

Answered by FabienP

See EDIT2 for a more robust solution

Same result, but a little bit faster:

labels = (df.value != df.value.shift()).cumsum()
df['flag'] = (labels.map(labels.value_counts()) >= 3).astype(int)

    id  value  flag
0    1      2     0
1    1      2     0
2    1      3     0
3    1      2     1
4    1      2     1
5    1      2     1
6    1      3     1
7    1      3     1
8    1      3     1
9    1      3     1
10   2      1     0
11   2      4     0
12   2      1     1
13   2      1     1
14   2      1     1
15   2      4     0
16   2      4     0
17   2      1     1
18   2      1     1
19   2      1     1
20   2      1     1
21   2      1     1

Where:

  1. df.value != df.value.shift() gives the value changes
  2. cumsum() creates "labels" for each group of same values
  3. labels.value_counts() counts the occurrences of each label
  4. labels.map(...) replaces labels by the counts computed above
  5. >= 3 creates a boolean mask on the count value
  6. astype(int) casts the booleans to int
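
The six steps above can be followed on a short series with each intermediate kept in its own variable. This is a sketch; the series `value` is a made-up excerpt and the variable names are for illustration only.

```python
import pandas as pd

# Hypothetical short series: a run of two, a single value, a run of three.
value = pd.Series([2, 2, 3, 2, 2, 2])

changed = value != value.shift()   # step 1: True where the value changes
labels = changed.cumsum()          # step 2: one label per run
counts = labels.value_counts()     # step 3: run length for each label
sizes = labels.map(counts)         # step 4: run length for each row
mask = sizes >= 3                  # step 5: boolean mask
flag = mask.astype(int)            # step 6: cast to int

print(labels.tolist())
print(sizes.tolist())
print(flag.tolist())
```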

On my machine it takes 1.03 ms on your df, compared to 2.1 ms for Psidom's approach. But mine is not a one-liner.



EDIT:

A mix between both approaches is even faster

labels = df.value.diff().ne(0).cumsum()
df['flag'] = (labels.map(labels.value_counts()) >= 3).astype(int)

Gives 911μs with your sample df.
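
Timings like these depend on the machine and the pandas version, so it may be worth measuring them yourself. Below is a sketch using the standard-library timeit module; the function names `flag_groupby` and `flag_value_counts` are made up here to wrap the two approaches, and no particular numbers should be expected.

```python
import timeit

import pandas as pd

df = pd.DataFrame({
    'id': [1] * 10 + [2] * 12,
    'value': [2, 2, 3, 2, 2, 2, 3, 3, 3, 3, 1, 4, 1, 1, 1, 4, 4, 1, 1, 1, 1, 1],
})


def flag_groupby(frame):
    # groupby + transform('size') approach
    run_id = frame.value.diff().ne(0).cumsum()
    return frame.value.groupby([frame.id, run_id]).transform('size').ge(3).astype(int)


def flag_value_counts(frame):
    # value_counts + map approach (value changes only, no id handling)
    labels = frame.value.diff().ne(0).cumsum()
    return (labels.map(labels.value_counts()) >= 3).astype(int)


repeats = 100
for fn in (flag_groupby, flag_value_counts):
    seconds = timeit.timeit(lambda: fn(df), number=repeats) / repeats
    print(f"{fn.__name__}: {seconds * 1e6:.0f} microseconds per call")
```

On this sample both functions return the same flags; they only differ when a run of equal values crosses an id boundary.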



EDIT2: corrected solution that accounts for id changes, as pointed out by @clg4

labels = (df.value.diff().ne(0) | df.id.diff().ne(0)).cumsum()
df['flag'] = (labels.map(labels.value_counts()) >= 3).astype(int)

Here ... | df.id.diff().ne(0) increments the label where the id changes.

This works even when the value stays the same across an id change (tested with value 3 at index 10) and takes 1.28 ms.

EDIT3: Better explanations

Take the case where index 10 has value 3, so that only df.id.diff().ne(0) marks a change at that row:

data={'id':[1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2],
      'value':[2,2,3,2,2,2,3,3,3,3,3,4,1,1,1,4,4,1,1,1,1,1]}
df=pd.DataFrame.from_dict(data)

df['id_diff'] = df.id.diff().ne(0).astype(int)
df['val_diff'] = df.value.diff().ne(0).astype(int)
df['diff_or'] = (df.id.diff().ne(0) | df.value.diff().ne(0)).astype(int)
df['labels'] = df['diff_or'].cumsum()

     id  value  id_diff  val_diff  diff_or  labels
 0    1      2        1         1        1       1
 1    1      2        0         0        0       1
 2    1      3        0         1        1       2
 3    1      2        0         1        1       3
 4    1      2        0         0        0       3
 5    1      2        0         0        0       3
 6    1      3        0         1        1       4
 7    1      3        0         0        0       4
 8    1      3        0         0        0       4
 9    1      3        0         0        0       4
>10   2      3        1    |    0    =   1       5 <== label increment
 11   2      4        0         1        1       6
 12   2      1        0         1        1       7
 13   2      1        0         0        0       7
 14   2      1        0         0        0       7
 15   2      4        0         1        1       8
 16   2      4        0         0        0       8
 17   2      1        0         1        1       9
 18   2      1        0         0        0       9
 19   2      1        0         0        0       9
 20   2      1        0         0        0       9
 21   2      1        0         0        0       9

The | operator is an element-wise "bitwise or", which gives True as long as one of the elements is True. So if the value does not change where the id changes, the | still reflects the id change; otherwise it changes nothing. When .cumsum() is performed, the label is incremented where the id changes, so the value 3 at index 10 is not grouped with the values 3 at indexes 6-9.
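
The effect of the extra id term can be seen on a minimal frame in which a run of equal values crosses an id boundary. This is a sketch; the four-row frame and the variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical frame: the value 3 continues across the id change at row 2.
df = pd.DataFrame({'id': [1, 1, 2, 2], 'value': [3, 3, 3, 4]})

value_only = df.value.diff().ne(0).cumsum()
id_aware = (df.value.diff().ne(0) | df.id.diff().ne(0)).cumsum()

flag_value_only = (value_only.map(value_only.value_counts()) >= 3).astype(int)
flag_id_aware = (id_aware.map(id_aware.value_counts()) >= 3).astype(int)

print(value_only.tolist(), flag_value_only.tolist())
print(id_aware.tolist(), flag_id_aware.tolist())
```

With value changes alone, the three 3s form one run and get flagged; with the id term, the run is split at the boundary and no row reaches a length of three.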

Answered by Deepan Wadhwa

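# try this simpler version
# caveat: this groups by value rather than by run, so it counts every
# occurrence of a value, whether or not the occurrences are adjacent
import pandas as pd

a = pd.Series([1,1,1,2,3,4,5,5,5,7,8,0,0,0])
b = a.groupby([a.ne(0), a]).transform('size').ge(3).astype('int')
# ge(x) <- x is the minimum number of repeated values to flag
print(b)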
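
Note that grouping by the value itself counts all occurrences of a value, not only adjacent ones. The sketch below uses a made-up alternating series to show the difference from run-based labels (the `run_id` technique comes from the other answers, not from this one).

```python
import pandas as pd

# Hypothetical series: 1 appears three times, but never twice in a row.
a = pd.Series([1, 2, 1, 2, 1])

# Grouping by value: every 1 is flagged because there are three of them.
by_value = a.groupby([a.ne(0), a]).transform('size').ge(3).astype(int)

# Grouping by run label: no run is longer than one element.
run_id = a.diff().ne(0).cumsum()
by_run = a.groupby(run_id).transform('size').ge(3).astype(int)

print(by_value.tolist())
print(by_run.tolist())
```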

Answered by Mott The Tuple

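import numpy as np
import pandas as pd

df = pd.DataFrame.from_dict(
        {'id':[1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2],
         'value':[2,2,3,2,2,2,3,3,3,3,1,4,1,1,1,4,4,1,1,1,1,1]})

# label each run of equal values (runs are not split at id changes),
# then keep only the rows whose run has at least 3 members
run_id = (df['value'].shift() != df['value']).cumsum()
df2 = df.groupby(run_id).filter(lambda x: len(x) >= 3)

df['flag'] = np.where(df.index.isin(df2.index), 1, 0)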
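
The filter approach labels runs by value only. A sketch of an id-aware variant, assuming the same column names, is to add the id column as a second grouping key. The data below is the modified sample from FabienP's answer, where index 10 has value 3.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1] * 10 + [2] * 12,
    'value': [2, 2, 3, 2, 2, 2, 3, 3, 3, 3, 3, 4, 1, 1, 1, 4, 4, 1, 1, 1, 1, 1],
})

run_id = (df['value'].shift() != df['value']).cumsum()

# Grouping by (id, run label) splits a run wherever the id changes.
kept = df.groupby([df['id'], run_id]).filter(lambda g: len(g) >= 3)

df['flag'] = np.where(df.index.isin(kept.index), 1, 0)
print(df['flag'].tolist())
```

filter keeps the original index of the surviving rows, which is what makes the isin lookup against df.index work.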