Identify consecutive same values in a Pandas Dataframe using Groupby
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use it, you must likewise follow the CC BY-SA license, cite the original URL and author information, and attribute it to the original author (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/45886518/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Identify consecutive same values in Pandas Dataframe, with a Groupby
Asked by clg4
I have the following dataframe df:
import pandas as pd

data={'id':[1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2],
      'value':[2,2,3,2,2,2,3,3,3,3,1,4,1,1,1,4,4,1,1,1,1,1]}
df=pd.DataFrame.from_dict(data)
df
Out[8]:
id value
0 1 2
1 1 2
2 1 3
3 1 2
4 1 2
5 1 2
6 1 3
7 1 3
8 1 3
9 1 3
10 2 1
11 2 4
12 2 1
13 2 1
14 2 1
15 2 4
16 2 4
17 2 1
18 2 1
19 2 1
20 2 1
21 2 1
What I need to do is identify, at the id level (df.groupby('id')), when the value shows the same number three or more times in a row.
I would like to have the following result for the above:
df
Out[12]:
id value flag
0 1 2 0
1 1 2 0
2 1 3 0
3 1 2 1
4 1 2 1
5 1 2 1
6 1 3 1
7 1 3 1
8 1 3 1
9 1 3 1
10 2 1 0
11 2 4 0
12 2 1 1
13 2 1 1
14 2 1 1
15 2 4 0
16 2 4 0
17 2 1 1
18 2 1 1
19 2 1 1
20 2 1 1
21 2 1 1
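In other words, within each id, every row that belongs to a run of three or more identical consecutive values should get flag = 1. A naive loop-based sketch of that rule (the helper name flag_runs_naive and the min_run parameter are just illustrative) would be:

import pandas as pd

def flag_runs_naive(values, min_run=3):
    # Mark every position that belongs to a run of at least min_run identical values.
    flags = [0] * len(values)
    start = 0
    for i in range(1, len(values) + 1):
        # A run ends when the value changes or the sequence ends.
        if i == len(values) or values[i] != values[start]:
            if i - start >= min_run:
                flags[start:i] = [1] * (i - start)
            start = i
    return flags

# Applied per id, this reproduces the flag column shown above.
df['flag'] = df.groupby('id')['value'].transform(
    lambda s: pd.Series(flag_runs_naive(s.tolist()), index=s.index))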
I have tried variations of groupby and lambda using pandas rolling.mean, to identify where the rolling average equals the 'value'; where they are the same, that indicates a flag. But this has several problems, including that you could have different values that average to the value you are trying to flag. Also, I can't figure out how to 'flag' all of the values in the rolling window that created the initial flag. The attempt below identifies the 'right side' of the flag, but then I need to fill in the previous values covered by the rolling window. See my code here:
import numpy as np

test=df.copy()
test['rma']=test.groupby('id')['value'].transform(lambda x: x.rolling(min_periods=3,window=3).mean())
test['flag']=np.where(test.rma==test.value,1,0)
And the result here:
test
Out[61]:
id value rma flag
0 1 2 NaN 0
1 1 2 NaN 0
2 1 3 2.333333 0
3 1 2 2.333333 0
4 1 2 2.333333 0
5 1 2 2.000000 1
6 1 3 2.333333 0
7 1 3 2.666667 0
8 1 3 3.000000 1
9 1 3 3.000000 1
10 2 1 NaN 0
11 2 4 NaN 0
12 2 1 2.000000 0
13 2 1 2.000000 0
14 2 1 1.000000 1
15 2 4 2.000000 0
16 2 4 3.000000 0
17 2 1 3.000000 0
18 2 1 2.000000 0
19 2 1 1.000000 1
20 2 1 1.000000 1
21 2 1 1.000000 1
Can't wait to see what I am missing! Thanks
Answered by Psidom
You can try this: 1) create an extra group variable with df.value.diff().ne(0).cumsum() to denote the value changes; 2) use transform('size') to calculate the group size and compare it with three; then you get the flag column you need:
df['flag'] = df.value.groupby([df.id, df.value.diff().ne(0).cumsum()]).transform('size').ge(3).astype(int)
df
Breakdown:
1) diff is not equal to zero (which is literally what df.value.diff().ne(0) means) gives the condition True whenever there is a value change:
df.value.diff().ne(0)
#0 True
#1 False
#2 True
#3 True
#4 False
#5 False
#6 True
#7 False
#8 False
#9 False
#10 True
#11 True
#12 True
#13 False
#14 False
#15 True
#16 False
#17 True
#18 False
#19 False
#20 False
#21 False
#Name: value, dtype: bool
2) Then cumsum gives a non-descending sequence of ids where each id denotes a consecutive chunk with the same value; note that when summing boolean values, True is counted as one while False is counted as zero:
df.value.diff().ne(0).cumsum()
#0 1
#1 1
#2 2
#3 3
#4 3
#5 3
#6 4
#7 4
#8 4
#9 4
#10 5
#11 6
#12 7
#13 7
#14 7
#15 8
#16 8
#17 9
#18 9
#19 9
#20 9
#21 9
#Name: value, dtype: int64
3) Combined with the id column, you can group the data frame, calculate the group size and get the flag column.
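Spelled out, step 3 is the same as the one-liner above, just split into named intermediates (assuming the df from the question):

blocks = df.value.diff().ne(0).cumsum()                      # label for each run of equal consecutive values
sizes = df.value.groupby([df.id, blocks]).transform('size')  # size of each run, computed within each id
df['flag'] = sizes.ge(3).astype(int)                         # flag runs of length >= 3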
Answered by FabienP
See EDIT2 for a more robust solution
Same result, but a little bit faster:
labels = (df.value != df.value.shift()).cumsum()
df['flag'] = (labels.map(labels.value_counts()) >= 3).astype(int)
id value flag
0 1 2 0
1 1 2 0
2 1 3 0
3 1 2 1
4 1 2 1
5 1 2 1
6 1 3 1
7 1 3 1
8 1 3 1
9 1 3 1
10 2 1 0
11 2 4 0
12 2 1 1
13 2 1 1
14 2 1 1
15 2 4 0
16 2 4 0
17 2 1 1
18 2 1 1
19 2 1 1
20 2 1 1
21 2 1 1
Where:

- df.value != df.value.shift() gives the value change
- cumsum() creates "labels" for each group of same value
- labels.value_counts() counts the occurrences of each label
- labels.map(...) replaces labels by the counts computed above
- >= 3 creates a boolean mask on the count value
- astype(int) casts the booleans to int
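The same two lines, split into named intermediates so that each step in the list above can be inspected (equivalent to the code shown earlier):

labels = (df.value != df.value.shift()).cumsum()  # one label per run of equal consecutive values
counts = labels.value_counts()                    # number of rows carrying each label
run_len = labels.map(counts)                      # run length broadcast back to every row
df['flag'] = (run_len >= 3).astype(int)           # flag rows in runs of length >= 3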
In my hands it gives 1.03 ms on your df, compared to 2.1 ms for Psidom's approach. But mine is not a one-liner.
EDIT:
A mix between both approaches is even faster
labels = df.value.diff().ne(0).cumsum()
df['flag'] = (labels.map(labels.value_counts()) >= 3).astype(int)
Gives 911μs with your sample df.
EDIT2: correct solution to account for id change, as pointed by @clg4
labels = (df.value.diff().ne(0) | df.id.diff().ne(0)).cumsum()
df['flag'] = (labels.map(labels.value_counts()) >= 3).astype(int)
Where ... | df.id.diff().ne(0) increments the label where the id changes.
This works even when the value stays the same across an id change (tested with value 3 at index 10) and takes 1.28 ms.
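As a sketch, the EDIT2 logic can also be wrapped into a small helper (the name flag_consecutive and the min_run parameter are just illustrative; it assumes rows of the same id are contiguous, as in the question):

def flag_consecutive(frame, group_col='id', value_col='value', min_run=3):
    # Start a new label whenever the value changes or the group id changes.
    labels = (frame[value_col].diff().ne(0) | frame[group_col].diff().ne(0)).cumsum()
    # Flag rows whose label occurs at least min_run times, i.e. runs of length >= min_run.
    return labels.map(labels.value_counts()).ge(min_run).astype(int)

df['flag'] = flag_consecutive(df)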
EDIT3: Better explanations
Take the case where index 10 has value 3:
data={'id':[1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2],
'value':[2,2,3,2,2,2,3,3,3,3,3,4,1,1,1,4,4,1,1,1,1,1]}
df=pd.DataFrame.from_dict(data)
df['id_diff'] = df.id.diff().ne(0).astype(int)
df['val_diff'] = df.value.diff().ne(0).astype(int)
df['diff_or'] = (df.id.diff().ne(0) | df.value.diff().ne(0)).astype(int)
df['labels'] = df['diff_or'].cumsum()
id value id_diff val_diff diff_or labels
0 1 2 1 1 1 1
1 1 2 0 0 0 1
2 1 3 0 1 1 2
3 1 2 0 1 1 3
4 1 2 0 0 0 3
5 1 2 0 0 0 3
6 1 3 0 1 1 4
7 1 3 0 0 0 4
8 1 3 0 0 0 4
9 1 3 0 0 0 4
>10 2 3 1 | 0 = 1 5 <== label increment
11 2 4 0 1 1 6
12 2 1 0 1 1 7
13 2 1 0 0 0 7
14 2 1 0 0 0 7
15 2 4 0 1 1 8
16 2 4 0 0 0 8
17 2 1 0 1 1 9
18 2 1 0 0 0 9
19 2 1 0 0 0 9
20 2 1 0 0 0 9
21 2 1 0 0 0 9
The | is the "bitwise-or" operator, which gives True as long as one of the elements is True. So if there is no diff in value where the id changes, the | reflects the id change; otherwise it changes nothing. When .cumsum() is performed, the label is incremented where the id changes, so the value 3 at index 10 is not grouped with the values 3 from indexes 6-9.
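For completeness, applying the EDIT2 two-liner to this modified frame should give flag 0 at index 10 (its run has length one) while indexes 6-9 keep flag 1:

labels = (df.value.diff().ne(0) | df.id.diff().ne(0)).cumsum()
df['flag'] = (labels.map(labels.value_counts()) >= 3).astype(int)
df.loc[9:12, ['id', 'value', 'flag']]
#     id  value  flag
# 9    1      3     1
# 10   2      3     0
# 11   2      4     0
# 12   2      1     1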
Answered by Deepan Wadhwa
# try this simpler version
import pandas as pd

a = pd.Series([1,1,1,2,3,4,5,5,5,7,8,0,0,0])
b = a.groupby([a.ne(0), a]).transform('size').ge(3).astype('int')
# ge(x) <- x is the number of consecutive repeated values
print(b)
Answered by Mott The Tuple
import numpy as np
import pandas as pd

df=pd.DataFrame.from_dict(
    {'id':[1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2],
     'value':[2,2,3,2,2,2,3,3,3,3,1,4,1,1,1,4,4,1,1,1,1,1]})

# Label each run of consecutive equal values, keep only runs of length >= 3,
# then flag the rows that survived the filter.
df2 = df.groupby((df['value'].shift() != df['value']).cumsum()).filter(lambda x: len(x) >= 3)
df['flag'] = np.where(df.index.isin(df2.index), 1, 0)
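As a quick sanity check, on this particular data (where the value changes at the id boundary) the resulting flag should match the groupby/transform('size') approach from the first answer; in general the id column would also have to be part of the grouping, as discussed in EDIT2 above:

flag_alt = df.value.groupby([df.id, df.value.diff().ne(0).cumsum()]).transform('size').ge(3).astype(int)
assert (df['flag'] == flag_alt).all()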