Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/34855859/

Date: 2020-08-19 15:38:00  Source: igfitidea

Is there a way in Pandas to use previous row value in dataframe.apply when previous value is also calculated in the apply?

python pandas dataframe for-loop iteration

Asked by ctrl-alt-delete

I have the following dataframe:

 Index_Date    A    B    C    D
 ===============================
 2015-01-31    10   10   NaN  10
 2015-02-01     2    3   NaN  22
 2015-02-02    10   60   NaN  280
 2015-02-03    10   100  NaN  250

Required output:

 Index_Date    A    B    C    D
 ===============================
 2015-01-31    10   10   10   10
 2015-02-01     2    3   23   22
 2015-02-02    10   60   290  280
 2015-02-03    10   100  3000 250

Column C is derived for 2015-01-31 by taking the value of D.

Then I need to use the value of C for 2015-01-31, multiply it by the value of A on 2015-02-01, and add B.

I have attempted an apply and a shift using an if else, but this gives a key error.

Accepted answer by Stefan

First, create the derived value:

df.loc[0, 'C'] = df.loc[0, 'D']

Then iterate through the remaining rows and fill the calculated values:

for i in range(1, len(df)):
    df.loc[i, 'C'] = df.loc[i-1, 'C'] * df.loc[i, 'A'] + df.loc[i, 'B']


  Index_Date   A   B    C    D
0 2015-01-31  10  10   10   10
1 2015-02-01   2   3   23   22
2 2015-02-02  10  60  290  280
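
The loop above addresses rows by the labels 0, 1, 2, ..., so it relies on a default integer index. The frame in the question appears to be indexed by date, in which case label-based lookups such as df.loc[0, 'C'] would raise a KeyError. A minimal sketch of the same loop using positional access instead, assuming a date index named Index_Date (the construction of df here is only for illustration):

```python
import numpy as np
import pandas as pd

# Illustrative frame with a date index, mirroring the question's data
idx = pd.date_range('2015-01-31', periods=4, name='Index_Date')
df = pd.DataFrame({'A': [10, 2, 10, 10],
                   'B': [10, 3, 60, 100],
                   'C': np.nan,
                   'D': [10, 22, 280, 250]}, index=idx)

# Look up column positions once so iloc can be used for rows and columns
a, b, c, d = (df.columns.get_loc(col) for col in 'ABCD')

df.iloc[0, c] = df.iloc[0, d]  # seed C with the first value of D
for i in range(1, len(df)):
    # C[i] = C[i-1] * A[i] + B[i], addressed by position rather than label
    df.iloc[i, c] = df.iloc[i - 1, c] * df.iloc[i, a] + df.iloc[i, b]

print(df)
```

Positional access keeps the i-1 arithmetic valid regardless of what the index labels are.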

Answer by Stefan

Applying the recursive function on numpy arrays will be faster than the current answer.

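import numpy as np
import pandas as pd

# Five rows holding 1..5 in every column, matching the output below
df = pd.DataFrame(np.repeat(np.arange(1, 6), 3).reshape(5, 3), columns=['A', 'B', 'D'])
new = [df.D.values[0]]  # seed C with the first value of D
for i in range(1, len(df.index)):
    new.append(new[i-1]*df.A.values[i]+df.B.values[i])
df['C'] = new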

Output

   A  B  D    C
0  1  1  1    1
1  2  2  2    4
2  3  3  3   15
3  4  4  4   64
4  5  5  5  325

Answer by kztd

Given a column of numbers:

lst = []
cols = ['A']
for a in range(100, 105):
    lst.append([a])
df = pd.DataFrame(lst, columns=cols, index=range(5))
df

    A
0   100
1   101
2   102
3   103
4   104

You can reference the previous row with shift:

df['Change'] = df.A - df.A.shift(1)
df

    A   Change
0   100 NaN
1   101 1.0
2   102 1.0
3   103 1.0
4   104 1.0
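
Note that shift only sees values that already exist in the column. For the recurrence in the question, where each C depends on the previous C, a single shift pass fills just one more row. A small sketch illustrating this, assuming the question's data with only the first C seeded (the name one_pass is illustrative):

```python
import numpy as np
import pandas as pd

# Question's data with only the first value of C known
df = pd.DataFrame({'A': [10, 2, 10, 10],
                   'B': [10, 3, 60, 100],
                   'C': [10, np.nan, np.nan, np.nan]})

# One vectorised pass: previous C times current A plus current B
one_pass = df['C'].shift(1) * df['A'] + df['B']
print(one_pass)
```

Only the second row gets a value here, because the rows after it depend on results that have not been computed yet. This is why the other answers fall back to an explicit loop.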

Answer by iipr

Although it has been a while since this question was asked, I will post my answer hoping it helps somebody.

Disclaimer: I know this solution is not standard, but I think it works well.

import pandas as pd
import numpy as np

data = np.array([[10, 2, 10, 10],
                 [10, 3, 60, 100],
                 [np.nan] * 4,
                 [10, 22, 280, 250]]).T
idx = pd.date_range('20150131', end='20150203')
df = pd.DataFrame(data=data, columns=list('ABCD'), index=idx)
df
               A    B     C    D
 =================================
 2015-01-31    10   10    NaN  10
 2015-02-01    2    3     NaN  22 
 2015-02-02    10   60    NaN  280
 2015-02-03    10   100   NaN  250

def calculate(mul, add):
    global value
    value = value * mul + add
    return value

value = df.loc['2015-01-31', 'D']
df.loc['2015-01-31', 'C'] = value
df.loc['2015-02-01':, 'C'] = df.loc['2015-02-01':].apply(lambda row: calculate(*row[['A', 'B']]), axis=1)
df
               A    B     C     D
 =================================
 2015-01-31    10   10    10    10
 2015-02-01    2    3     23    22 
 2015-02-02    10   60    290   280
 2015-02-03    10   100   3000  250

So basically we use apply from pandas with the help of a global variable that keeps track of the previously calculated value.
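
If the global variable is a concern, the same running value can be carried by itertools.accumulate from the standard library. This is an alternative sketch rather than the method of this answer; it assumes Python 3.8 or later (for the initial argument) and rebuilds the question's data for illustration:

```python
from itertools import accumulate

import pandas as pd

idx = pd.date_range('20150131', end='20150203')
df = pd.DataFrame({'A': [10, 2, 10, 10],
                   'B': [10, 3, 60, 100],
                   'D': [10, 22, 280, 250]}, index=idx)

# (A, B) pairs for every row after the first
pairs = zip(df['A'].to_numpy()[1:], df['B'].to_numpy()[1:])

# accumulate passes the previous result in as `prev`, so no global state is needed;
# `initial` seeds the sequence with the first value of D
df['C'] = list(accumulate(pairs,
                          lambda prev, ab: prev * ab[0] + ab[1],
                          initial=df['D'].iloc[0]))
print(df)
```

The accumulated list has one entry per row because the initial value accounts for the first row.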



Time comparison with a for loop:

data = np.random.random(size=(1000, 4))
idx = pd.date_range('20150131', end='20171026')
df = pd.DataFrame(data=data, columns=list('ABCD'), index=idx)
df.C = np.nan

df.loc['2015-01-31', 'C'] = df.loc['2015-01-31', 'D']

%%timeit
for i in df.loc['2015-02-01':].index.date:
    df.loc[i, 'C'] = df.loc[(i - pd.DateOffset(days=1)).date(), 'C'] * df.loc[i, 'A'] + df.loc[i, 'B']

3.2 s ± 114 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

data = np.random.random(size=(1000, 4))
idx = pd.date_range('20150131', end='20171026')
df = pd.DataFrame(data=data, columns=list('ABCD'), index=idx)
df.C = np.nan

def calculate(mul, add):
    global value
    value = value * mul + add
    return value

value = df.loc['2015-01-31', 'D']
df.loc['2015-01-31', 'C'] = value

%%timeit
df.loc['2015-02-01':, 'C'] = df.loc['2015-02-01':].apply(lambda row: calculate(*row[['A', 'B']]), axis=1)

1.82 s ± 64.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So on average the apply version takes about 0.57 times as long as the for loop (1.82 s versus 3.2 s), which makes it roughly 1.76 times faster.

Answer by jpp

numba

For recursive calculations which are not vectorisable, numba, which uses JIT-compilation and works with lower level objects, often yields large performance improvements. You need only define a regular for loop and use the decorator @njit or (for older versions) @jit(nopython=True):

For a reasonable size dataframe, this gives a ~30x performance improvement versus a regular for loop:

import numpy as np
from numba import jit

@jit(nopython=True)
def calculator_nb(a, b, d):
    res = np.empty(d.shape)
    res[0] = d[0]  # seed with the first value of D
    for i in range(1, res.shape[0]):
        res[i] = res[i-1] * a[i] + b[i]  # C[i] = C[i-1] * A[i] + B[i]
    return res

df['C'] = calculator_nb(*df[list('ABD')].values.T)

n = 10**5
df = pd.concat([df]*n, ignore_index=True)

# benchmarking on Python 3.6.0, Pandas 0.19.2, NumPy 1.11.3, Numba 0.30.1
# calculator() is same as calculator_nb() but without @jit decorator
%timeit calculator_nb(*df[list('ABD')].values.T)  # 14.1 ms per loop
%timeit calculator(*df[list('ABD')].values.T)     # 444 ms per loop
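
As a side note, this particular recurrence is linear, so it can also be expanded with cumulative products and sums in plain NumPy. The sketch below is not from the answers above: calculator_closed_form is a hypothetical helper, and it assumes A contains no zeros after the first row and that the running product neither overflows nor underflows, which limits it to short or well-scaled inputs.

```python
import numpy as np

def calculator_closed_form(a, b, d):
    """Hypothetical helper: C[i] = p[i] * (d[0] + sum_{k=1..i} b[k] / p[k]),
    where p[i] = a[1] * ... * a[i] and p[0] = 1."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Running product of A, skipping the first row (it is never used as a multiplier)
    p = np.cumprod(np.concatenate(([1.0], a[1:])))
    # Running sum of B scaled down by the product accumulated so far
    s = np.cumsum(np.concatenate(([0.0], b[1:] / p[1:])))
    return p * (d[0] + s)

# Question's data
a = np.array([10, 2, 10, 10])
b = np.array([10, 3, 60, 100])
d = np.array([10, 22, 280, 250])
result = calculator_closed_form(a, b, d)
print(result)
```

For general data the explicit loop, compiled with numba as above, stays the safer choice, because the division by the running product can lose precision or fail outright.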