Note: this page is a side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/38008390/

Date: 2020-09-14 01:27:25  Source: igfitidea

How to construct a column of a data frame recursively with pandas-python?

Tags: python, pandas, recursion, dataframe, multiple-columns

Asked by user5779223

Given a data frame df like this:

id_      val     
11111    12
12003    22
88763    19
43721    77
...

I want to add a column diff to df, where each row equals the val in that row minus the diff in the previous row, multiplied by 0.4, plus the diff in the previous row (the previous day):

diff = (val - diff_previousDay) * 0.4 + diff_previousDay

The diff in the first row equals val * 0.4 in that row. That is, the expected df should be:

id_      val     diff   
11111    12      4.8
12003    22      11.68
88763    19      14.608
43721    77      ...
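
As a quick sanity check of the rule above, the first three expected values can be reproduced in plain Python. This is only an illustrative sketch: the list vals and the name diffs are stand-ins introduced here, and the first row is seeded with val * 0.4, which is what the 4.8 in the table implies.

```python
# Hypothetical stand-in for the first three rows of the val column.
vals = [12, 22, 19]
mul = 0.4

diffs = []
for v in vals:
    if not diffs:
        # First row: no previous diff exists, so seed with val * mul.
        diffs.append(v * mul)
    else:
        # Later rows: move 40% of the way from the previous diff towards val.
        prev = diffs[-1]
        diffs.append((v - prev) * mul + prev)

print([round(d, 4) for d in diffs])
```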

I have tried:

mul = 0.4
df['diff'] = df.apply(lambda row: (row['val'] - df.loc[row.name, 'diff']) * mul + df.loc[row.name, 'diff'] if int(row.name) > 0 else row['val'] * mul, axis=1) 

But I got this error:

TypeError: ("unsupported operand type(s) for -: 'float' and 'NoneType'", 'occurred at index 1')

Do you know how to solve this problem? Thank you in advance!

Accepted answer by jezrael

You can use:

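# Assumes df has the default integer index 0..n-1, so label i-1 is the previous row.
# Seed the first row, then fill each later row from the previous row's diff.
df.loc[0, 'diff'] = df.loc[0, 'val'] * 0.4

for i in range(1, len(df)):
    df.loc[i, 'diff'] = (df.loc[i, 'val'] - df.loc[i-1, 'diff']) * 0.4 + df.loc[i-1, 'diff']

print(df)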
     id_  val     diff
0  11111   12   4.8000
1  12003   22  11.6800
2  88763   19  14.6080
3  43721   77  39.5648

The iterative nature of the calculation where the inputs depend on results of previous steps complicates vectorization. You could perhaps use apply with a function that does the same calculation as the loop, but behind the scenes this would also be a loop.
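
For this particular recurrence, which can be rewritten as diff = 0.4 * val + 0.6 * diff_previous, a loop-free sketch is possible with pandas' exponentially weighted mean. This is a different technique from the loop in this answer, not a general way to vectorize recursive columns. The zero padding is an assumption made here so that the first value comes out as val * 0.4; the names alpha, padded and smoothed are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"id_": [11111, 12003, 88763, 43721],
                   "val": [12, 22, 19, 77]})
alpha = 0.4

# With adjust=False, ewm computes y[0] = x[0] and
# y[t] = (1 - alpha) * y[t-1] + alpha * x[t].
# Prepending a 0 makes the first real row equal alpha * val[0].
padded = pd.Series(np.concatenate(([0.0], df["val"].to_numpy(dtype=float))))
smoothed = padded.ewm(alpha=alpha, adjust=False).mean()

# Drop the padding element before assigning back to the frame.
df["diff"] = smoothed.to_numpy()[1:]
print(df)
```

This only works because the update is linear with a constant coefficient; a recurrence with a different shape would still need an explicit loop.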

Answer by jpp

Recursive functions are not easily vectorisable. However, you can optimize your algorithm with numba. This should be preferable to a regular loop.

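import numpy as np
from numba import jit

@jit(nopython=True)
def foo(val):
    # val is a 1-D NumPy array; diff[i] depends on diff[i-1], so fill it in order.
    diff = np.zeros(val.shape)
    diff[0] = val[0] * 0.4
    for i in range(1, diff.shape[0]):
        diff[i] = (val[i] - diff[i-1]) * 0.4 + diff[i-1]
    return diff

# Pass the underlying NumPy array rather than the pandas Series.
df['diff'] = foo(df['val'].values)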

print(df)

     id_  val     diff
0  11111   12   4.8000
1  12003   22  11.6800
2  88763   19  14.6080
3  43721   77  39.5648

Answer by Michael Tamillow

If you are using apply in pandas, you should not reference the dataframe again inside the lambda function.

In all cases, the object you work with inside the lambda function should be 'row'.
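
A row-wise function only sees the current row, so it has no access to the previously computed diff. One way to carry that state without reaching back into the dataframe is itertools.accumulate, sketched below. This is an illustration rather than part of the answer above: the helper name step is made up here, the initial=0.0 seed is an assumption chosen so the first value equals val * 0.4, and the initial argument needs Python 3.8 or later.

```python
from itertools import accumulate

import pandas as pd

df = pd.DataFrame({"id_": [11111, 12003, 88763, 43721],
                   "val": [12, 22, 19, 77]})
mul = 0.4

def step(prev_diff, val):
    # Same update rule as in the question, with the previous diff
    # passed in explicitly instead of being looked up in df.
    return (val - prev_diff) * mul + prev_diff

# accumulate yields the seed first, then one running result per val.
running = list(accumulate(df["val"], step, initial=0.0))

# Skip the seed so the result lines up with the rows of df.
df["diff"] = running[1:]
print(df)
```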