Python pandas: conditional rolling count

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/25119524/

Date: 2020-08-18 19:45:30  Source: igfitidea

Pandas: conditional rolling count

python, pandas

Asked by justinlevol

I have a Series that looks like the following:

   col
0  B
1  B
2  A
3  A
4  A
5  B

It's a time series, therefore the index is ordered by time.

For each row, I'd like to count how many times the value has appeared consecutively, i.e.:

Output:

   col count
0  B   1
1  B   2
2  A   1 # Value does not match previous row => reset counter to 1
3  A   2
4  A   3
5  B   1 # Value does not match previous row => reset counter to 1
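
To pin down the expected result independently of pandas, here is a minimal plain-Python sketch of the same counting rule. It uses `itertools.groupby`, which groups adjacent equal items; the function name `consecutive_counts` is hypothetical and not part of the question.

```python
from itertools import groupby


def consecutive_counts(values):
    """Return, for each item, its 1-based position within its run of equal neighbours."""
    counts = []
    # itertools.groupby only merges *adjacent* equal items, so each group is one run.
    for _, run in groupby(values):
        run_length = len(list(run))
        counts.extend(range(1, run_length + 1))
    return counts


sample = ["B", "B", "A", "A", "A", "B"]
result = consecutive_counts(sample)
print(result)  # [1, 2, 1, 2, 3, 1]
```

This is only a reference for checking answers on small inputs; it does not produce a DataFrame column by itself.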

I found 2 related questions, but I can't figure out how to "write" that information as a new column in the DataFrame, for each row (as above). Using rolling_apply does not work well.

Related:

Counting consecutive events on pandas dataframe by their index

Finding consecutive segments in a pandas data frame

Answer by chrisb

Based on the second answer you linked, assuming s is your Series.

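import pandas as pd
df = pd.DataFrame(s)
# Start a new block id whenever the value differs from the previous row.
df['block'] = (df['col'] != df['col'].shift(1)).astype(int).cumsum()
# Number the rows within each block, starting at 1.
df['count'] = df.groupby('block')['col'].transform(lambda x: range(1, len(x) + 1))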


In [88]: df
Out[88]: 
  col  block  count
0   B      1      1
1   B      1      2
2   A      2      1
3   A      2      2
4   A      2      3
5   B      3      1
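
The `block` column is the key step here, so it may help to look at the intermediate values. The following is a small sketch on the question's sample data; the Series name `col` and the variable name `changed` are assumptions made for illustration.

```python
import pandas as pd

# Sample data copied from the question; the Series name "col" is assumed.
s = pd.Series(list("BBAAAB"), name="col")
df = pd.DataFrame(s)

# True wherever a row differs from the row before it. The first row is
# compared against NaN (nothing precedes it), so it is always True.
changed = df["col"] != df["col"].shift(1)
print(changed.tolist())  # [True, False, True, False, False, True]

# Each True starts a new run, so the running total of the flags gives
# every run of equal values its own id.
df["block"] = changed.cumsum()
print(df["block"].tolist())  # [1, 1, 2, 2, 2, 3]
```

Once every run has its own id, any per-group numbering of the rows yields the consecutive count.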

Answer by ZJS

I like the answer by @chrisb but wanted to share my own solution, since some people might find it more readable and easier to adapt to similar problems.

1) Create a function that uses static variables

def rolling_count(val):
    if val == rolling_count.previous:
        # Same value as the previous call: extend the current run.
        rolling_count.count += 1
    else:
        # Different value: remember it and restart the run at 1.
        rolling_count.previous = val
        rolling_count.count = 1
    return rolling_count.count
rolling_count.count = 0  # static variable
rolling_count.previous = None  # static variable

2) Apply it to your Series after converting it to a DataFrame

df = pd.DataFrame(s)
df['count'] = df['col'].apply(rolling_count)  # new column in dataframe

Output of df:

  col  count
0   B      1
1   B      2
2   A      1
3   A      2
4   A      3
5   B      1
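
One thing to watch with function attributes is that they keep their values between calls, so applying `rolling_count` a second time continues from wherever the first pass ended unless the attributes are reset by hand. A possible variation, sketched below, keeps the state in a closure so that each pass gets a fresh counter. The factory name `make_rolling_counter` is hypothetical and not part of the answer.

```python
import pandas as pd


def make_rolling_counter():
    """Return a new counting function that carries its own private state."""
    previous = None
    count = 0

    def rolling_count(val):
        nonlocal previous, count
        if val == previous:
            count += 1        # same value as the last call: extend the run
        else:
            previous = val    # new value: restart the run at 1
            count = 1
        return count

    return rolling_count


# Sample data from the question; the column name "col" is assumed.
df = pd.DataFrame({"col": list("BBAAAB")})
df["count"] = df["col"].apply(make_rolling_counter())
# A second pass uses a brand-new counter, so it starts from scratch.
df["count_again"] = df["col"].apply(make_rolling_counter())
print(df)
```

Like the original, this relies on `apply` visiting the values in order, one at a time.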

Answer by CodeShaman

One-liner:

df['count'] = df.groupby('col').cumcount()

or

df['count'] = df.groupby('col').cumcount() + 1

if you want the counts to begin at 1.
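
Note that grouping by `col` alone numbers every occurrence of a value across the whole column, not just within a consecutive run. A quick check on the question's sample data (column name `col` assumed) shows where the result differs from the requested output:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({"col": list("BBAAAB")})

# Grouping by the value itself puts all "B" rows in one group, even though
# they are split into two separate runs.
total = df.groupby("col").cumcount() + 1
print(total.tolist())  # [1, 2, 1, 2, 3, 3]

# The question asks for this instead: the last "B" starts a new run.
wanted = [1, 2, 1, 2, 3, 1]
print(total.tolist() == wanted)  # False
```

The two only agree when each value appears in a single unbroken run.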

Answer by P.Tillmann

I think there is a nice way to combine the solutions of @chrisb and @CodeShaman (as was pointed out, CodeShaman's solution counts total occurrences rather than consecutive values).

  df['count'] = df.groupby((df['col'] != df['col'].shift(1)).cumsum()).cumcount()+1

  col  count
0   B      1
1   B      2
2   A      1
3   A      2
4   A      3
5   B      1
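
If this is needed in more than one place, the one-liner can be wrapped in a small helper that works on any Series. This is a sketch under the assumption that a plain function is acceptable; the name `consecutive_count` is hypothetical.

```python
import pandas as pd


def consecutive_count(series: pd.Series) -> pd.Series:
    """Number each row by its 1-based position within its run of equal values."""
    # A new run starts wherever the value differs from the previous row.
    run_id = (series != series.shift()).cumsum()
    # cumcount numbers the rows of each run from 0, so add 1.
    return series.groupby(run_id).cumcount() + 1


letters = pd.Series(list("BBAAAB"), name="col")
print(consecutive_count(letters).tolist())  # [1, 2, 1, 2, 3, 1]

numbers = pd.Series([1, 1, 2, 2, 2, 1])
print(consecutive_count(numbers).tolist())  # [1, 2, 1, 2, 3, 1]
```

Because NaN never compares equal to NaN, each missing value is treated as its own run of length 1.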

Answer by Benjamin Breton

If you wish to do the same thing but filter on two columns, you can use this.

def count_consecutive_items_n_cols(df, col_name_list, output_col):
    cum_sum_list = [
        (df[col_name] != df[col_name].shift(1)).cumsum().tolist() for col_name in col_name_list
    ]
    df[output_col] = df.groupby(
        ["_".join(map(str, x)) for x in zip(*cum_sum_list)]
    ).cumcount() + 1
    return df

   col_a col_b  count
0      1     B      1
1      1     B      2
2      1     A      1
3      2     A      1
4      2     A      2
5      2     B      1
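
As an alternative sketch of the same idea, the per-column change flags can be combined with a logical OR instead of joining per-column counters into strings: a new run starts whenever any of the chosen columns changes. The function name `count_consecutive_rows` is hypothetical, and unlike the function above it returns a new Series rather than modifying `df`.

```python
import pandas as pd


def count_consecutive_rows(df: pd.DataFrame, col_names: list) -> pd.Series:
    """Number each row by its 1-based position within a run where all listed columns stay the same."""
    subset = df[col_names]
    # A row starts a new run if at least one of the columns differs from the row above.
    changed = subset.ne(subset.shift()).any(axis=1)
    run_id = changed.cumsum()
    return df.groupby(run_id).cumcount() + 1


# Sample data matching the output table above.
df = pd.DataFrame({
    "col_a": [1, 1, 1, 2, 2, 2],
    "col_b": ["B", "B", "A", "A", "A", "B"],
})
df["count"] = count_consecutive_rows(df, ["col_a", "col_b"])
print(df["count"].tolist())  # [1, 2, 1, 1, 2, 1]
```

On this sample both approaches give the same counts, since a run breaks exactly when one of the columns changes.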