Reverse cumulative sum of a column in pandas.DataFrame

Question

Asked by wl2776

I have a pandas DataFrame with a boolean column, sorted by another column, and need to calculate the reverse cumulative sum of the boolean column, that is, the number of True values from the current row to the bottom.

Example

In [13]: df = pd.DataFrame({'A': [True] * 3 + [False] * 5, 'B': np.random.rand(8) })

In [15]: df = df.sort_values('B')

In [16]: df
Out[16]:
       A         B
6  False  0.037710
2   True  0.315414
4  False  0.332480
7  False  0.445505
3  False  0.580156
1   True  0.741551
5  False  0.796944
0   True  0.817563

I need something that will give me a new column with the values 3, 3, 2, 2, 2, 2, 1, 1 (top to bottom for the rows above).

That is, for each row it should contain the number of True values on this row and the rows below.

I've tried various methods using .iloc[::-1], but the result is not what I want.

I think I'm missing something obvious. I only started using Pandas yesterday.

Answer 1

Answered by unutbu

Reverse column A, take the cumsum, then reverse again:

df['C'] = df.loc[::-1, 'A'].cumsum()[::-1]

import pandas as pd
df = pd.DataFrame(
    {'A': [False, True, False, False, False, True, False, True],
     'B': [0.03771, 0.315414, 0.33248, 0.445505, 0.580156, 0.741551, 0.796944, 0.817563],},
     index=[6, 2, 4, 7, 3, 1, 5, 0])
df['C'] = df.loc[::-1, 'A'].cumsum()[::-1]
print(df)

yields

       A         B  C
6  False  0.037710  3
2   True  0.315414  3
4  False  0.332480  2
7  False  0.445505  2
3  False  0.580156  2
1   True  0.741551  2
5  False  0.796944  1
0   True  0.817563  1
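
To see why the double reversal works, here is a minimal sketch that splits the one-liner into its three steps on the same data. The intermediate names (`reversed_a`, `running`, `reverse_cumsum`) are introduced here for illustration only, and `.iloc[::-1]` is used for the final flip to make the positional reversal explicit.

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [False, True, False, False, False, True, False, True]},
    index=[6, 2, 4, 7, 3, 1, 5, 0])

# Step 1: walk the column from the bottom row to the top row.
reversed_a = df.loc[::-1, 'A']
# Step 2: an ordinary cumulative sum now counts Trues from the bottom up.
running = reversed_a.cumsum()
# Step 3: flip back so the rows are in their original order again.
reverse_cumsum = running.iloc[::-1]

print(reverse_cumsum.tolist())  # [3, 3, 2, 2, 2, 2, 1, 1]
```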

Alternatively, you could count the number of Trues in column A and subtract the (shifted) cumsum:

In [113]: df['A'].sum()-df['A'].shift(1).fillna(0).cumsum()
Out[113]: 
6    3
2    3
4    2
7    2
3    2
1    2
5    1
0    1
Name: A, dtype: object
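
The `dtype: object` in the output above is most likely a side effect of `shift(1)` putting a NaN into a boolean column. As a hedged variant (not part of the original answer), converting to integers first and passing `fill_value=0` to `shift` should keep the result in an integer dtype. The names `a`, `trues_above` and `result` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [False, True, False, False, False, True, False, True]},
    index=[6, 2, 4, 7, 3, 1, 5, 0])

# Convert to integers first so that shifting does not upcast to object.
a = df['A'].astype('int64')
# Trues strictly above each row: shift down one row, padding the top with 0.
trues_above = a.shift(1, fill_value=0).cumsum()
# Total Trues minus those strictly above = Trues on this row and below.
result = a.sum() - trues_above

print(result.tolist())  # [3, 3, 2, 2, 2, 2, 1, 1]
print(result.dtype)
```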

But this is significantly slower. Using IPython to perform the benchmark:

In [116]: df = pd.DataFrame({'A':np.random.randint(2, size=10**5).astype(bool)})

In [117]: %timeit df['A'].sum()-df['A'].shift(1).fillna(0).cumsum()
10 loops, best of 3: 19.8 ms per loop

In [118]: %timeit df.loc[::-1, 'A'].cumsum()[::-1]
1000 loops, best of 3: 701 μs per loop
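
Outside IPython, a comparable measurement can be sketched with the standard-library `timeit` module. The function names, the fixed seed and the repeat counts below are arbitrary choices made for this sketch; timings depend on the machine and pandas version, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
df = pd.DataFrame({'A': rng.integers(2, size=10**5).astype(bool)})

def total_minus_shifted(frame):
    return frame['A'].sum() - frame['A'].shift(1).fillna(0).cumsum()

def double_reverse(frame):
    return frame.loc[::-1, 'A'].cumsum()[::-1]

# Both approaches should agree before their speed is compared.
same = (total_minus_shifted(df).astype('int64')
        == double_reverse(df).astype('int64')).all()
print('results agree:', same)

for func in (total_minus_shifted, double_reverse):
    # Best of 3 repeats of 10 calls each, reported per call.
    best = min(timeit.repeat(lambda: func(df), number=10, repeat=3)) / 10
    print(f'{func.__name__}: {best * 1e3:.3f} ms per call')
```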

Answer 2

Answered by Ichta

Similar to unutbu's first suggestion, but without the deprecated ix:

df['C']=df.A[::-1].cumsum()
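
No second reversal is needed here because pandas aligns on index labels when a Series is assigned to a DataFrame column: the reversed cumsum keeps the original labels, so each value lands back on its own row. A minimal sketch on the same data follows; the name `reversed_cumsum` is illustrative, and `.iloc[::-1]` is used to make the positional reversal explicit.

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [False, True, False, False, False, True, False, True]},
    index=[6, 2, 4, 7, 3, 1, 5, 0])

# The cumsum is computed bottom-up, so its rows are in reversed order...
reversed_cumsum = df['A'].iloc[::-1].cumsum()
print(reversed_cumsum.index.tolist())  # [0, 5, 1, 3, 7, 4, 2, 6]

# ...but column assignment matches rows by index label, not by position.
df['C'] = reversed_cumsum
print(df['C'].tolist())  # [3, 3, 2, 2, 2, 2, 1, 1]
```

Note that this relies on the index labels being unique; with duplicate labels the alignment step is not well defined, and the explicit double reversal is the safer choice.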

Answer 3

Answered by Merlin

This works but is slow, like @unutbu's answer. True resolves to 1. It fails on False or any other value, though.

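# Position of each row counted from the bottom of its A group (1-based)
df[2] = df.groupby('A').cumcount(ascending=False) + 1
# Keep the count only on True rows; False rows become NaN
df[1] = np.where(df['A'], df[2], np.nan)
# Back-fill False rows from the next True row below; trailing rows get 0
df[1] = df[1].bfill().fillna(0)
del df[2]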

#        A         B    1
# 3  False  0.277557  3.0
# 7  False  0.400751  3.0
# 6  False  0.431587  3.0
# 5  False  0.481006  3.0
# 1   True  0.534364  3.0
# 2   True  0.556378  2.0
# 0   True  0.863192  1.0
# 4  False  0.916247  0.0
