Source: Stack Overflow, http://stackoverflow.com/questions/37872565/
This content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors.

Reversed cumulative sum of a column in pandas.DataFrame

Tags: python, pandas, dataframe, reverse

Asked by wl2776

I have a pandas DataFrame with a boolean column, sorted by another column, and I need to calculate the reverse cumulative sum of the boolean column, that is, the number of True values from the current row to the bottom.

Example

In [13]: df = pd.DataFrame({'A': [True] * 3 + [False] * 5, 'B': np.random.rand(8) })

In [15]: df = df.sort_values('B')

In [16]: df
Out[16]:
       A         B
6  False  0.037710
2   True  0.315414
4  False  0.332480
7  False  0.445505
3  False  0.580156
1   True  0.741551
5  False  0.796944
0   True  0.817563

I need something that gives me a new column with these values:

3
3
2
2
2
2
1
1

That is, each row should contain the number of True values in that row and in the rows below it.

I've tried various methods using .iloc[::-1], but the result is not what I want.

I think I'm missing something obvious. I only started using Pandas yesterday.

Answered by unutbu

Reverse column A, take the cumsum, then reverse again:

df['C'] = df.loc[::-1, 'A'].cumsum()[::-1]


import pandas as pd
df = pd.DataFrame(
    {'A': [False, True, False, False, False, True, False, True],
     'B': [0.03771, 0.315414, 0.33248, 0.445505, 0.580156, 0.741551, 0.796944, 0.817563],},
     index=[6, 2, 4, 7, 3, 1, 5, 0])
df['C'] = df.loc[::-1, 'A'].cumsum()[::-1]
print(df)

yields

       A         B  C
6  False  0.037710  3
2   True  0.315414  3
4  False  0.332480  2
7  False  0.445505  2
3  False  0.580156  2
1   True  0.741551  2
5  False  0.796944  1
0   True  0.817563  1
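
To see why the double reversal gives the desired column, the one-liner can be split into its three steps. This is only an illustrative sketch: the intermediate names (`reversed_a`, `running`, `result`) are made up here, and `.iloc` is used for the reversals so that the slicing is explicitly positional.

```python
import pandas as pd

# Same boolean column and index as in the example above.
df = pd.DataFrame(
    {'A': [False, True, False, False, False, True, False, True]},
    index=[6, 2, 4, 7, 3, 1, 5, 0])

reversed_a = df['A'].iloc[::-1]   # bottom row first
running = reversed_a.cumsum()     # running count of True values, bottom up
result = running.iloc[::-1]       # back to the original top-to-bottom order

print(result.tolist())  # [3, 3, 2, 2, 2, 2, 1, 1]
```

Each reversal carries the index labels along with the values, so `result` has the same index order as `df`.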


Alternatively, you could count the number of Trues in column A and subtract the (shifted) cumsum:

In [113]: df['A'].sum()-df['A'].shift(1).fillna(0).cumsum()
Out[113]: 
6    3
2    3
4    2
7    2
3    2
1    2
5    1
0    1
Name: A, dtype: object
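
The identity behind this alternative can be checked on the small example. The sketch below is a variation, not the expression from the answer: it converts the flags to integers first and uses `cumsum() - flags` as the count of True values strictly above each row, which avoids the `shift`/`fillna` step and the object dtype seen in the output above. The names `flags`, `reference`, and `alternative` are illustrative.

```python
import pandas as pd

a = pd.Series([False, True, False, False, False, True, False, True],
              index=[6, 2, 4, 7, 3, 1, 5, 0], name='A')
flags = a.astype(int)

# Reference: reverse, take the cumulative sum, reverse back.
reference = flags.iloc[::-1].cumsum().iloc[::-1]

# Total number of True values minus those strictly above the current row.
alternative = flags.sum() - (flags.cumsum() - flags)

print(alternative.tolist())  # [3, 3, 2, 2, 2, 2, 1, 1]
print(alternative.tolist() == reference.tolist())
```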

But this is significantly slower. Using IPython to perform the benchmark:

In [116]: df = pd.DataFrame({'A':np.random.randint(2, size=10**5).astype(bool)})

In [117]: %timeit df['A'].sum()-df['A'].shift(1).fillna(0).cumsum()
10 loops, best of 3: 19.8 ms per loop

In [118]: %timeit df.loc[::-1, 'A'].cumsum()[::-1]
1000 loops, best of 3: 701 μs per loop
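
Outside IPython, a comparable measurement can be made with the standard-library `timeit` module. This is a rough sketch under a few assumptions: the function names `shifted` and `reversed_twice` are invented for the example, the random data is seeded differently from the session above, and the absolute timings will depend on the machine and the pandas version (recent versions may also print a FutureWarning for the `fillna` call).

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'A': rng.integers(2, size=10**5).astype(bool)})

def shifted():
    # Total count minus the shifted cumulative sum.
    return df['A'].sum() - df['A'].shift(1).fillna(0).cumsum()

def reversed_twice():
    # Reverse, cumulative sum, reverse back.
    return df.loc[::-1, 'A'].cumsum()[::-1]

for func in (shifted, reversed_twice):
    seconds = timeit.timeit(func, number=5) / 5
    print(f"{func.__name__}: {seconds * 1e3:.3f} ms per call")
```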

Answered by Ichta

Similar to unutbu's first suggestion, but without the deprecated ix:

df['C']=df.A[::-1].cumsum()
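
A single reversal is enough here because pandas aligns on index labels when a Series is assigned as a column, so the reversed result is put back in the frame's own row order. The sketch below illustrates this, assuming a unique index; the column name `C_positional` is hypothetical and only shows what happens when alignment is bypassed.

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [False, True, False, False, False, True, False, True]},
    index=[6, 2, 4, 7, 3, 1, 5, 0])

rev = df['A'].iloc[::-1].cumsum()   # index order is now 0, 5, 1, 3, 7, 4, 2, 6

# Assigning a Series aligns by label, so each value lands on its own row.
df['C'] = rev

# Assigning the raw values is positional, so the column comes out upside down.
df['C_positional'] = rev.to_numpy()

print(df['C'].tolist())             # [3, 3, 2, 2, 2, 2, 1, 1]
print(df['C_positional'].tolist())  # [1, 1, 2, 2, 2, 2, 3, 3]
```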

Answered by Merlin

This works but is slow, like @unutbu's answer. True resolves to 1. It fails on False or any other value, though.

df[2] = df.groupby('A').cumcount(ascending=False)+1
df[1] = np.where(df['A']==True,df[2],None)
df[1] = df[1].fillna(method='bfill').fillna(0)
del df[2]

#        A         B    1
# 3  False  0.277557  3.0
# 7  False  0.400751  3.0
# 6  False  0.431587  3.0
# 5  False  0.481006  3.0
# 1   True  0.534364  3.0
# 2   True  0.556378  2.0
# 0   True  0.863192  1.0
# 4  False  0.916247  0.0