Using cumsum in Pandas on group()

Question

Asked by msteen

From a Pandas newbie: I have data that looks essentially like this -

import pandas as pd
data1 = pd.DataFrame({'Dir': ['E','E','W','W','E','W','W','E'],
                      'Bool': ['Y','N','Y','N','Y','N','Y','N'],
                      'Data': [4,5,6,7,8,9,10,11]},
                     index=pd.DatetimeIndex(['12/30/2000','12/30/2000','12/30/2000','1/2/2001','1/3/2001','1/3/2001','12/30/2000','12/30/2000']))
data1
Out[1]: 
           Bool  Data Dir
2000-12-30    Y     4   E
2000-12-30    N     5   E
2000-12-30    Y     6   W
2001-01-02    N     7   W
2001-01-03    Y     8   E
2001-01-03    N     9   W
2000-12-30    Y    10   W
2000-12-30    N    11   E

And I want to group it by multiple levels, then do a cumsum():

E.g., something like running_sum = data1.groupby(['Bool', 'Dir']).cumsum()  <- (doesn't work)

with output that would look something like:

Bool Dir Date        running_sum
N    E   2000-12-30           16
     W   2001-01-02            7
         2001-01-03           16
Y    E   2000-12-30            4
         2001-01-03           12
     W   2000-12-30           16

My "like" code is clearly not even close. I have made a number of attempts and learned many new things about how not to do this.

Thanks for any help you can give.

Answer 1

Answered by bdiamante

Try this:

data2 = data1.reset_index()
data3 = data2.set_index(["Bool", "Dir", "index"])   # index is the new column created by reset_index
running_sum = data3.groupby(level=[0,1,2]).sum().groupby(level=[0,1]).cumsum()

The reason you cannot simply use cumsum on data3 has to do with how your data is structured. Grouping by Bool and Dir and applying an aggregation function (sum, mean, etc.) produces a DataFrame smaller than the one you started with, because the function collapses the rows of each group into a single value per group key. However, cumsum is not an aggregation function. It returns a DataFrame that is the same size as the one it is called on. So unless your input DataFrame is in a format where the output can be the same size after calling cumsum, it will throw an error. That's why I called sum first, which returns a DataFrame in the correct input format.
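
To make the size difference concrete, here is a minimal sketch using the question's data (dates written in ISO form). Selecting the Data column before calling sum and cumsum is my own choice for the illustration, not part of the answer above; it simply keeps the comparison to one numeric column.

```python
import pandas as pd

# Same data as in the question, with the dates written as ISO strings.
data1 = pd.DataFrame(
    {'Dir': ['E', 'E', 'W', 'W', 'E', 'W', 'W', 'E'],
     'Bool': ['Y', 'N', 'Y', 'N', 'Y', 'N', 'Y', 'N'],
     'Data': [4, 5, 6, 7, 8, 9, 10, 11]},
    index=pd.DatetimeIndex(['2000-12-30', '2000-12-30', '2000-12-30', '2001-01-02',
                            '2001-01-03', '2001-01-03', '2000-12-30', '2000-12-30']))

grouped = data1.groupby(['Bool', 'Dir'])['Data']

aggregated = grouped.sum()      # aggregation: one value per (Bool, Dir) group
transformed = grouped.cumsum()  # transformation: one value per input row

print(len(data1), len(aggregated), len(transformed))  # 8 4 8
```

The aggregated result has one entry for each of the four (Bool, Dir) combinations, while the cumulative sum keeps all eight rows, which is why the dates have to be summed first if each date should appear only once per group.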

Sorry if I haven't explained this well enough. Maybe someone else could help me out?

Answer 2

Answered by Malina

As the other answer points out, you're trying to collapse identical dates into single rows, whereas the cumsum function returns a series of the same length as the original DataFrame. Stated differently, you actually want to group by [Bool, Dir, Date], calculate a sum in each group, THEN return a cumsum on rows grouped by [Bool, Dir]. The other answer is a perfectly valid solution to your specific question; here's a one-liner variation:

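# data1 keeps its dates in an unnamed index, so name it 'Date' before grouping on it
data1.rename_axis('Date').groupby(['Bool', 'Dir', 'Date']).sum().groupby(level=[0, 1]).cumsum()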

This returns output exactly in the requested format.
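
A self-contained sketch of the same two-step idea, assuming the index is given the name Date when the frame is built (the question's data1 leaves it unnamed). The result name running_sum and the use of reset_index are my own choices for the illustration.

```python
import pandas as pd

# Same data as in the question, but with the DatetimeIndex named 'Date'.
dates = pd.DatetimeIndex(['2000-12-30', '2000-12-30', '2000-12-30', '2001-01-02',
                          '2001-01-03', '2001-01-03', '2000-12-30', '2000-12-30'],
                         name='Date')
data1 = pd.DataFrame(
    {'Dir': ['E', 'E', 'W', 'W', 'E', 'W', 'W', 'E'],
     'Bool': ['Y', 'N', 'Y', 'N', 'Y', 'N', 'Y', 'N'],
     'Data': [4, 5, 6, 7, 8, 9, 10, 11]},
    index=dates)

running_sum = (
    data1.reset_index()                              # turn 'Date' into a column
         .groupby(['Bool', 'Dir', 'Date'])['Data']
         .sum()                                      # step 1: collapse duplicate dates
         .groupby(level=['Bool', 'Dir'])
         .cumsum()                                   # step 2: running total per (Bool, Dir)
         .rename('running_sum')
)

print(running_sum)
```

The values come out as 16, 7, 16, 4, 12, 16, in the same (Bool, Dir, Date) order as the table in the question.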

For those looking for a simple cumsum on a Pandas group, you can use:

data1.groupby(['Bool', 'Dir']).apply(lambda x: x['Data'].cumsum())

The cumulative sum is calculated internal to each group. Here's what the output looks like:

Bool  Dir            
N     E    2000-12-30     5
           2000-12-30    16
      W    2001-01-02     7
           2001-01-03    16
Y     E    2000-12-30     4
           2001-01-03    12
      W    2000-12-30     6
           2000-12-30    16
Name: Data, dtype: int64

Note the repeated dates: this is a strict cumulative sum over the rows within each group identified by the Bool and Dir columns, so rows that share a date are not collapsed into one.
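
As a possible alternative to apply (not from the answer above), selecting the column on the groupby object and calling cumsum directly gives the same per-group running totals, returned in the original row order rather than regrouped. The column name running_sum is a hypothetical name chosen for this sketch.

```python
import pandas as pd

# Same data as in the question, with the dates written as ISO strings.
data1 = pd.DataFrame(
    {'Dir': ['E', 'E', 'W', 'W', 'E', 'W', 'W', 'E'],
     'Bool': ['Y', 'N', 'Y', 'N', 'Y', 'N', 'Y', 'N'],
     'Data': [4, 5, 6, 7, 8, 9, 10, 11]},
    index=pd.DatetimeIndex(['2000-12-30', '2000-12-30', '2000-12-30', '2001-01-02',
                            '2001-01-03', '2001-01-03', '2000-12-30', '2000-12-30']))

# Running total of Data within each (Bool, Dir) group, one value per input row.
running = data1.groupby(['Bool', 'Dir'])['Data'].cumsum()

# The index contains duplicate dates, so the values are attached by position.
data1['running_sum'] = running.to_numpy()

print(data1)
```

Because the result lines up row for row with the input, it can sit next to the original columns, which is convenient when the running total is needed alongside the raw values.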
