Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow. Original URL: http://stackoverflow.com/questions/29421356/

Time: 2020-09-13 23:09:24  Source: igfitidea

Pandas cumulative sum on column with condition

Tags: python, pandas, dataframe

Asked by rpeczykowski

I couldn't find an answer elsewhere, so I need to ask, probably because I don't know the correct name for this. (English is not my native language.)

I have a large datetime DataFrame. Time is important here. One column in df has the values [NaN, 1, -1]. I need a fast way to compute a cumulative sum that resets whenever the value changes.

Example:

    Time                 sign    desire_value
    2014-01-24 05:00:00  NaN     NaN
    2014-01-24 06:00:00  NaN     NaN
    2014-01-24 07:00:00  NaN     NaN
    2014-01-24 08:00:00  1       1
    2014-01-24 09:00:00  1       2
    2014-01-24 10:00:00  1       3
    2014-01-24 11:00:00  -1      1
    2014-01-24 12:00:00  -1      2
    2014-01-24 13:00:00  -1      3
    2014-01-24 14:00:00  -1      4
    2014-01-24 15:00:00  -1      5
    2014-01-24 16:00:00  1       1
    2014-01-24 17:00:00  1       2
    2014-01-24 18:00:00  1       3
    2014-01-24 19:00:00  -1      1
    2014-01-24 20:00:00  -1      2
    2014-01-24 21:00:00  1       1
    2014-01-24 22:00:00  1       2

I have a working solution using a function, but it is not very efficient.

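    import pandas as pd

    # Previous row's sign, used to detect where the value changes
    df['sign_1'] = df['sign'].shift(1)

    acc = 0  # length of the current run of identical signs
    for index, row in df.iterrows():
        if pd.isna(row['sign']):
            # NaN rows get no count
            df.loc[index, 'desire_value'] = None
        elif row['sign'] == row['sign_1']:
            # Same sign as the previous row: extend the run
            acc += 1
            df.loc[index, 'desire_value'] = acc
        else:
            # Sign changed: start a new run
            acc = 1
            df.loc[index, 'desire_value'] = acc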

I cannot find any array-based approach. I read that the best way to iterate efficiently in Python is to use Cython, but is there a more "Pythonic" way to solve this?

Answered by Jeff

See the last section here.

This is an itertools-like groupby:

In [86]: v = df['value'].dropna()

The grouper marks the group breakpoints (the rows where the value differs from the previous row); cumsum then turns those breakpoints into a separate label for each group:

In [87]: grouper = (v!=v.shift()).cumsum()

In [88]: grouper
Out[88]: 
3     1
4     1
5     1
6     2
7     2
8     2
9     2
10    2
11    3
12    3
13    3
14    4
15    4
16    5
17    5
Name: value, dtype: int64
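
To make the grouper step concrete, here is a minimal sketch on a small made-up Series (the data and the intermediate name `changed` are assumptions for illustration, not part of the answer). Comparing each value with its predecessor flags the start of each run, and the cumulative sum of those flags gives every run its own label.

```python
import pandas as pd

# Hypothetical sample: three runs of identical values
v = pd.Series([1, 1, -1, -1, -1, 1])

# True wherever the value differs from the previous row.
# The first row is compared with NaN, so it is always flagged.
changed = v != v.shift()

# Cumulative sum of the flags: the label increases by one at each breakpoint
grouper = changed.cumsum()

print(pd.DataFrame({"v": v, "changed": changed, "grouper": grouper}))
```

Rows that share a `grouper` label belong to the same uninterrupted run, which is what the groupby in the next step relies on.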

Then it is just a simple cumsum within each group:

In [89]: df.groupby(grouper)['value'].cumsum()
Out[89]: 
0    NaN
1    NaN
2    NaN
3      1
4      2
5      3
6     -1
7     -2
8     -3
9     -4
10    -5
11     1
12     2
13     3
14    -1
15    -2
16     1
17     2
dtype: float64

You can certainly call .abs() on the above if you do in fact want the absolute values.
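
Putting the steps together on the question's data, one possible end-to-end sketch looks like this. The DataFrame construction, the hourly index, and the names `run_id` and `alt` are assumptions made for illustration; the column is called `sign` as in the question rather than `value` as in the answer. The cumulative sum is computed on the NaN-free values and assigned back, so pandas aligns on the index and leaves the NaN rows as NaN.

```python
import numpy as np
import pandas as pd

# Rebuild the question's example column: 3 NaN rows, then alternating runs
sign = [np.nan] * 3 + [1] * 3 + [-1] * 5 + [1] * 3 + [-1] * 2 + [1] * 2
times = pd.date_range("2014-01-24 05:00:00", periods=len(sign),
                      freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"sign": sign}, index=times)

# Work on the non-NaN values only
v = df["sign"].dropna()

# One label per uninterrupted run of identical values
run_id = (v != v.shift()).cumsum()

# Cumulative sum within each run, made positive with .abs().
# Assignment aligns on the index, so the dropped NaN rows stay NaN.
df["desire_value"] = v.groupby(run_id).cumsum().abs()

# Alternative (not from the answer): count positions within each run directly
alt = v.groupby(run_id).cumcount() + 1

print(df)
```

Because every non-NaN value here is 1 or -1, `cumcount() + 1` yields the same counts as the absolute cumulative sum; if the column held other magnitudes, the two would differ and only the counting variant would give run lengths.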
