pandas: How to construct a DataFrame column recursively with pandas (Python)?

Question

Asked by user5779223

Given a data frame df like this:

id_      val     
11111    12
12003    22
88763    19
43721    77
...

I wish to add a column diff to df, where each row equals the val in that row minus the diff in the previous row (the previous day), multiplied by 0.4, plus the diff in the previous row:

diff = (val - diff_previousDay) * 0.4 + diff_previousDay

The diff in the first row equals val * 0.4 in that row. That is, the expected df should be:

id_      val     diff   
11111    12      4.8
12003    22      11.68
88763    19      14.608
43721    77      ...

I have tried:

mul = 0.4
df['diff'] = df.apply(lambda row: (row['val'] - df.loc[row.name, 'diff']) * mul + df.loc[row.name, 'diff'] if int(row.name) > 0 else row['val'] * mul, axis=1)

But I got this error:

TypeError: ("unsupported operand type(s) for -: 'float' and 'NoneType'", 'occurred at index 1')

Do you know how to solve this problem? Thank you in advance!

Answer 1

Accepted answer by jezrael

You can use:

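# Seed the first row: there is no previous diff, so diff = val * 0.4
# (assumes the default integer index 0..n-1)
df.loc[0, 'diff'] = df.loc[0, 'val'] * 0.4

# Each later row depends on the diff already computed for the row before it
for i in range(1, len(df)):
    df.loc[i, 'diff'] = (df.loc[i, 'val'] - df.loc[i-1, 'diff']) * 0.4 + df.loc[i-1, 'diff']

print(df)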
     id_  val     diff
0  11111   12   4.8000
1  12003   22  11.6800
2  88763   19  14.6080
3  43721   77  39.5648

The iterative nature of the calculation where the inputs depend on results of previous steps complicates vectorization. You could perhaps use apply with a function that does the same calculation as the loop, but behind the scenes this would also be a loop.

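The recurrence in the loop above can be rearranged to diff = 0.4 * val + 0.6 * diff_previous, which has the form of an exponentially weighted moving average. As a possible alternative to the explicit loop (a swapped-in technique, not part of the answer above), here is a minimal sketch using pandas' `ewm(alpha=0.4, adjust=False).mean()`. It assumes a leading 0.0 is prepended so that the first row comes out as val * 0.4; the helper name `ewm_diff` is hypothetical.

```python
import pandas as pd

# Sample data taken from the question
df = pd.DataFrame({"id_": [11111, 12003, 88763, 43721],
                   "val": [12, 22, 19, 77]})


def ewm_diff(val, alpha=0.4):
    """Hypothetical helper: compute the recurrence via an exponentially weighted mean."""
    # Prepend 0.0 so the recursion effectively starts from diff = 0,
    # which makes the first real row equal to val * alpha.
    padded = pd.concat([pd.Series([0.0]), val.astype(float)], ignore_index=True)
    # With adjust=False: y[0] = x[0], y[t] = (1 - alpha) * y[t-1] + alpha * x[t]
    smoothed = padded.ewm(alpha=alpha, adjust=False).mean()
    # Drop the padding row and return plain values to avoid index alignment
    return smoothed.iloc[1:].to_numpy()


df["diff"] = ewm_diff(df["val"])
print(df)
```

This only works because the recurrence happens to be linear with a constant coefficient; for a general recursive rule, the loop or a compiled loop is likely still needed.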

Answer 2

Answer by jpp

Recursive calculations are not easily vectorisable. However, you can speed up the loop by compiling it with numba, which should be faster than a regular Python loop.

import numpy as np
from numba import jit

@jit(nopython=True)
def foo(val):
    diff = np.zeros(val.shape)
    diff[0] = val[0] * 0.4
    for i in range(1, diff.shape[0]):
        diff[i] = (val[i] - diff[i-1]) * 0.4 + diff[i-1]
    return diff

df['diff'] = foo(df['val'].values)

print(df)

     id_  val     diff
0  11111   12   4.8000
1  12003   22  11.6800
2  88763   19  14.6080
3  43721   77  39.5648

Answer 3

Answer by Michael Tamillow

If you are using apply in pandas, you should not reference the DataFrame again inside the lambda function.

In all cases, the object you work with inside the lambda function should be row.
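Since apply only hands each row to the function and assigns the results after every row has been processed, the previous row's diff has to be carried along some other way. One possible sketch, assuming Python 3.8+ for the `initial` argument, uses `itertools.accumulate` to thread the previous result through the calculation; the variable names here are illustrative, not from the answer above.

```python
from itertools import accumulate

import pandas as pd

# Sample data taken from the question
df = pd.DataFrame({"id_": [11111, 12003, 88763, 43721],
                   "val": [12, 22, 19, 77]})

mul = 0.4

# accumulate passes the previous result (prev) together with the next val.
# initial=0.0 makes the first computed value (val - 0) * mul + 0 = val * mul.
running = accumulate(df["val"],
                     lambda prev, val: (val - prev) * mul + prev,
                     initial=0.0)

# Skip the seed value 0.0 that accumulate emits first
df["diff"] = list(running)[1:]
print(df)
```

This is still a Python-level loop under the hood, so it is not expected to be faster than the explicit for loop.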
