Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/15755057/

Time: 2020-09-13 20:45:12  Source: igfitidea

Using cumsum in pandas on group()

Tags: python, group-by, pandas

Asked by msteen

From a Pandas newbie: I have data that looks essentially like this -

import pandas as pd

data1 = pd.DataFrame(
    {'Dir': ['E','E','W','W','E','W','W','E'],
     'Bool': ['Y','N','Y','N','Y','N','Y','N'],
     'Data': [4,5,6,7,8,9,10,11]},
    index=pd.DatetimeIndex(['12/30/2000','12/30/2000','12/30/2000','1/2/2001',
                            '1/3/2001','1/3/2001','12/30/2000','12/30/2000']))
data1
Out[1]: 
           Bool  Data Dir
2000-12-30    Y     4   E
2000-12-30    N     5   E
2000-12-30    Y     6   W
2001-01-02    N     7   W
2001-01-03    Y     8   E
2001-01-03    N     9   W
2000-12-30    Y    10   W
2000-12-30    N    11   E

And I want to group it by multiple levels, then do a cumsum():

E.g., something like running_sum = data1.groupby(['Bool', 'Dir']).cumsum()  <- (doesn't work)

with output that would look something like:

Bool Dir Date        running_sum
N    E   2000-12-30           16
     W   2001-01-02            7
         2001-01-03           16
Y    E   2000-12-30            4
         2001-01-03           12
     W   2000-12-30           16

My "like" code is clearly not even close. I have made a number of attempts and learned many new things about how not to do this.

Thanks for any help you can give.

Answered by bdiamante

Try this:

data2 = data1.reset_index()
data3 = data2.set_index(["Bool", "Dir", "index"])   # index is the new column created by reset_index
running_sum = data3.groupby(level=[0,1,2]).sum().groupby(level=[0,1]).cumsum()

The reason you cannot simply use cumsum on data3 has to do with how your data is structured. Grouping by Bool and Dir and applying an aggregation function (sum, mean, etc.) produces a DataFrame smaller than the one you started with, because the function collapses the values that share each group key. However, cumsum is not an aggregation function: it returns a DataFrame of the same size as the one it is called on. So unless your input DataFrame is in a format where the output can be the same size after calling cumsum, it will throw an error. That's why I called sum first, which returns a DataFrame in the correct input format.
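
To make the size difference concrete, here is a minimal sketch (not part of the original answer) that rebuilds the question's data with ISO date strings and compares the length of an aggregated result with the length of a cumulative one. The variable names grouped, totals and running are illustrative only.

```python
import pandas as pd

# Same data as in the question, with dates written in ISO form.
data1 = pd.DataFrame(
    {'Dir': ['E', 'E', 'W', 'W', 'E', 'W', 'W', 'E'],
     'Bool': ['Y', 'N', 'Y', 'N', 'Y', 'N', 'Y', 'N'],
     'Data': [4, 5, 6, 7, 8, 9, 10, 11]},
    index=pd.DatetimeIndex(['2000-12-30', '2000-12-30', '2000-12-30', '2001-01-02',
                            '2001-01-03', '2001-01-03', '2000-12-30', '2000-12-30']))

grouped = data1.groupby(['Bool', 'Dir'])['Data']

# Aggregation: one value per (Bool, Dir) group.
totals = grouped.sum()

# Cumulative sum: one value per input row, in the original row order.
running = grouped.cumsum()

print(len(data1), len(totals), len(running))
```

With four distinct (Bool, Dir) pairs in eight rows, the aggregated result has four entries while the cumulative one keeps all eight.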

Sorry if I haven't explained this well enough. Maybe someone else could help me out?

Answered by Malina

As the other answer points out, you're trying to collapse identical dates into single rows, whereas the cumsum function returns a result of the same length as the original DataFrame. Stated differently, you actually want to group by [Bool, Dir, Date], calculate a sum in each group, and then take a cumsum over the rows grouped by [Bool, Dir]. The other answer is a perfectly valid solution to your specific question. Here's a one-liner variation:

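# The index must be named 'Date' to serve as a group key; rename_axis does that here.
data1.rename_axis('Date').groupby(['Bool', 'Dir', 'Date']).sum().groupby(level=[0, 1]).cumsum()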

This returns output exactly in the requested format.
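
For readers who prefer to see the two steps separately, the following is a hedged sketch of the same idea. It assumes the DatetimeIndex is given the name 'Date' when the frame is built (the question's data1 has an unnamed index), and the names daily and running_sum are chosen here for illustration.

```python
import pandas as pd

# Question's data, with the index named 'Date' (an assumption of this sketch).
data1 = pd.DataFrame(
    {'Dir': ['E', 'E', 'W', 'W', 'E', 'W', 'W', 'E'],
     'Bool': ['Y', 'N', 'Y', 'N', 'Y', 'N', 'Y', 'N'],
     'Data': [4, 5, 6, 7, 8, 9, 10, 11]},
    index=pd.DatetimeIndex(['2000-12-30', '2000-12-30', '2000-12-30', '2001-01-02',
                            '2001-01-03', '2001-01-03', '2000-12-30', '2000-12-30'],
                           name='Date'))

# Step 1: collapse rows that share (Bool, Dir, Date) into a single total.
daily = data1.groupby(['Bool', 'Dir', 'Date'])['Data'].sum()

# Step 2: running total within each (Bool, Dir) pair, in date order.
running_sum = daily.groupby(level=['Bool', 'Dir']).cumsum().rename('running_sum')

print(running_sum)
```

Selecting the Data column before summing yields a Series, which can then be renamed to match the running_sum header in the requested output.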

For those looking for a simple cumsum on a Pandas group, you can use:

data1.groupby(['Bool', 'Dir']).apply(lambda x: x['Data'].cumsum())

The cumulative sum is calculated within each group. Here's what the output looks like:

Bool  Dir            
N     E    2000-12-30     5
           2000-12-30    16
      W    2001-01-02     7
           2001-01-03    16
Y     E    2000-12-30     4
           2001-01-03    12
      W    2000-12-30     6
           2000-12-30    16
Name: Data, dtype: int64

Note that dates repeat in this output: it is a strict cumulative sum over the rows of each group identified by the Bool and Dir columns, so rows that share a date are not collapsed.
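
As a possible alternative to apply, the sketch below selects the Data column on the grouped object and calls cumsum directly, then attaches the result as a new column. This is not from the original answer; the column name running_sum and the variable out are illustrative.

```python
import pandas as pd

# Same data as in the question, with dates written in ISO form.
data1 = pd.DataFrame(
    {'Dir': ['E', 'E', 'W', 'W', 'E', 'W', 'W', 'E'],
     'Bool': ['Y', 'N', 'Y', 'N', 'Y', 'N', 'Y', 'N'],
     'Data': [4, 5, 6, 7, 8, 9, 10, 11]},
    index=pd.DatetimeIndex(['2000-12-30', '2000-12-30', '2000-12-30', '2001-01-02',
                            '2001-01-03', '2001-01-03', '2000-12-30', '2000-12-30']))

# cumsum on the grouped column returns one value per row, keeping data1's
# row order and index, so it can sit next to the original columns.
out = data1.assign(running_sum=data1.groupby(['Bool', 'Dir'])['Data'].cumsum())

print(out)
```

Here the result stays in the original row order rather than being regrouped under a (Bool, Dir) index, which can be convenient when the running total is just one more column in the frame.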
