Python pandas rolling_apply: passing two columns as input to a function

Question

Asked by h.l.m

Following on from the question "Python custom function using rolling_apply for pandas", about using rolling_apply: although I have made progress with my function, I am struggling with a function that requires two or more columns as inputs:

Creating the same setup as before

import pandas as pd
import numpy as np
import random

tmp  = pd.DataFrame(np.random.randn(2000,2)/10000, 
                    index=pd.date_range('2001-01-01',periods=2000),
                    columns=['A','B'])

But changing the function slightly to take two columns.

def gm(df,p):
    df = pd.DataFrame(df)
    v =((((df['A']+df['B'])+1).cumprod())-1)*p
    return v.iloc[-1]

It produces the following error:

pd.rolling_apply(tmp,50,lambda x: gm(x,5))

  KeyError: u'no item named A'

I think this is because the input to the lambda function is an ndarray of length 50 holding only the first column, rather than both columns. Is there a way to get both columns as inputs and use them in a rolling_apply function?

Again any help would be greatly appreciated...

Answer 1

Answered by lowtech

It looks like rolling_apply converts the input passed to the user function into an ndarray (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.stats.moments.rolling_apply.html?highlight=rolling_apply#pandas.stats.moments.rolling_apply).
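
The conversion to an ndarray can be observed directly. The following is a minimal sketch, assuming a pandas version that provides the `DataFrame.rolling(...).apply(..., raw=True)` method rather than `pd.rolling_apply`; the `probe` function and the `seen` list are hypothetical names used only for this illustration. It records the type and shape of every window handed to the function.

```python
import numpy as np
import pandas as pd

# Smaller frame than in the question, same two columns
tmp = pd.DataFrame(np.random.randn(200, 2) / 10000,
                   index=pd.date_range('2001-01-01', periods=200),
                   columns=['A', 'B'])

seen = []  # (type name, shape) of every window passed to the function

def probe(window):
    # Hypothetical helper: only records what it receives
    seen.append((type(window).__name__, window.shape))
    return 0.0

# The function is applied to each column separately
tmp.rolling(50).apply(probe, raw=True)

print(set(seen))
```

Every recorded window is a one-dimensional array taken from a single column, so there are no column labels to look up, which is consistent with the KeyError in the question.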

A workaround is to add an auxiliary column ii of row positions and roll over that column; the function gm then uses the window of positions to select the matching rows of the full DataFrame:

import pandas as pd
import numpy as np
import random

tmp = pd.DataFrame(np.random.randn(2000,2)/10000, columns=['A','B'])
tmp['date'] = pd.date_range('2001-01-01',periods=2000)
tmp['ii'] = range(len(tmp))            

def gm(ii, df, p):
    x_df = df.iloc[map(int, ii)]
    #print x_df
    v =((((x_df['A']+x_df['B'])+1).cumprod())-1)*p
    #print v
    return v.iloc[-1]

#print tmp.head()
res = pd.rolling_apply(tmp.ii, 50, lambda x: gm(x, tmp, 5))
print res
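
The code above targets Python 2 and the `pd.rolling_apply` function, which is no longer available in later pandas releases. As a rough sketch of the same idea, assuming a pandas version with the `Series.rolling(...).apply(...)` method, the auxiliary column can be rolled like this (the cast with `astype(int)` is needed because the window arrives as a float array):

```python
import numpy as np
import pandas as pd

tmp = pd.DataFrame(np.random.randn(2000, 2) / 10000, columns=['A', 'B'])
tmp['date'] = pd.date_range('2001-01-01', periods=2000)
tmp['ii'] = np.arange(len(tmp))  # auxiliary column of row positions

def gm(ii, df, p):
    # ii is a float ndarray of row positions; cast back to int for iloc
    x_df = df.iloc[ii.astype(int)]
    v = ((x_df['A'] + x_df['B'] + 1).cumprod() - 1) * p
    return v.iloc[-1]

# Roll over the positions only; gm looks up both columns itself
res = tmp['ii'].rolling(50).apply(lambda x: gm(x, tmp, 5), raw=True)
print(res.tail())
```

The first 49 entries of `res` are NaN because a full window of 50 rows is not yet available.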

Answer 2

Answered by calestini

Not sure if this is still relevant here, but with the newer rolling classes in pandas, whenever we pass raw=False to apply, we are actually passing a Series to the wrapper. That means we have access to the index of each observation and can use it to pull in further columns.

From the docs:

raw: bool, default None
False : passes each row or column as a Series to the function.
True or None : the passed function will receive ndarray objects instead. If you are just applying a NumPy reduction function this will achieve much better performance.

In this scenario, we can do the following:

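### create a func for multiple columns
### (df is the full dataframe; col2, col3 and col4 hold the names of the other columns)
def cust_func(s):
    # s is the rolling window of col1 as a Series; its index selects the same rows of df
    val_for_col2 = df.loc[s.index, col2] #.values
    val_for_col3 = df.loc[s.index, col3] #.values
    val_for_col4 = df.loc[s.index, col4] #.values

    ## apply over multiple column values
    return np.max(s) * np.min(val_for_col2) * np.max(val_for_col3) * np.mean(val_for_col4)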


### Apply to the dataframe
df.rolling('10s')['col1'].apply(cust_func, raw=False)

Note that here we can still use all the functionality of the pandas rolling class, which is particularly useful when dealing with time-based windows.

The fact that we are passing one column and using the entire dataframe feels like a hack, but it works in practice.
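
Applied to the two-column frame from the question, the raw=False approach might look like the sketch below. It assumes a pandas version whose `Rolling.apply` accepts the `raw` argument; passing the frame and the factor to `gm` through a lambda is a choice made for this illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

tmp = pd.DataFrame(np.random.randn(2000, 2) / 10000,
                   index=pd.date_range('2001-01-01', periods=2000),
                   columns=['A', 'B'])

def gm(s, df, p):
    # s is the window of column A as a Series;
    # its index picks the matching rows of column B
    b = df.loc[s.index, 'B']
    v = ((s + b + 1).cumprod() - 1) * p
    return v.iloc[-1]

# Roll over column A only and look up column B inside the function
res = tmp['A'].rolling(50).apply(lambda s: gm(s, tmp, 5), raw=False)
print(res.tail())
```

Because each window needs an index lookup, this is likely slower than a pure NumPy function with raw=True.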

Answer 3

Answered by Jeff

Here's another version of this question: Using rolling_apply on a DataFrame object. Use this if your function returns a Series.

Since yours returns a scalar, do this.

In [71]: df  = pd.DataFrame(np.random.randn(2000,2)/10000, 
                    index=pd.date_range('2001-01-01',periods=2000),
                    columns=['A','B'])

Redefine your function to return a tuple of the index label you want to use and the computed scalar value. Note that this is slightly different, as we are returning the first index of each window here (and not the last, which is what is normally returned; you could do either).

In [72]: def gm(df,p):
              v =((((df['A']+df['B'])+1).cumprod())-1)*p
              return (df.index[0],v.iloc[-1])


In [73]: Series(dict([ gm(df.iloc[i:min((i+1)+50,len(df)-1)],5) for i in xrange(len(df)-50) ]))

Out[73]: 
2001-01-01    0.000218
2001-01-02   -0.001048
2001-01-03   -0.002128
2001-01-04   -0.003590
2001-01-05   -0.004636
2001-01-06   -0.005377
2001-01-07   -0.004151
2001-01-08   -0.005155
2001-01-09   -0.004019
2001-01-10   -0.004912
2001-01-11   -0.005447
2001-01-12   -0.005258
2001-01-13   -0.004437
2001-01-14   -0.004207
2001-01-15   -0.004073
...
2006-04-20   -0.006612
2006-04-21   -0.006299
2006-04-22   -0.006320
2006-04-23   -0.005690
2006-04-24   -0.004316
2006-04-25   -0.003821
2006-04-26   -0.005102
2006-04-27   -0.004760
2006-04-28   -0.003832
2006-04-29   -0.004123
2006-04-30   -0.004241
2006-05-01   -0.004684
2006-05-02   -0.002993
2006-05-03   -0.003938
2006-05-04   -0.003528
Length: 1950
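
The In [73] line relies on Python 2 (`xrange`) and on `Series` being imported by name. A possible Python 3 rendering is sketched below; the `window` variable is introduced here for illustration. Note that this sketch slices exactly 50 rows per window, whereas the slice `i:min((i+1)+50, len(df)-1)` above covers up to 51 rows, so the number of results and their values would differ from the output shown.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(2000, 2) / 10000,
                  index=pd.date_range('2001-01-01', periods=2000),
                  columns=['A', 'B'])

def gm(df, p):
    v = ((df['A'] + df['B'] + 1).cumprod() - 1) * p
    # Label the result with the first date of the window
    return (df.index[0], v.iloc[-1])

window = 50  # assumed window length, matching the question

# One (label, value) pair per full window, collected into a Series
res = pd.Series(dict(gm(df.iloc[i:i + window], 5)
                     for i in range(len(df) - window + 1)))
print(res.head())
```

Since each window is a real DataFrame slice, `gm` can use any number of columns, at the cost of a Python-level loop over all windows.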

Answer 4

Answered by alko

All rolling_* functions work on 1-D arrays. I'm sure one can invent some workaround for passing 2-D arrays, but in your case you can simply precompute the row-wise values and roll over those:

>>> def gm(x,p):
...     return ((np.cumprod(x) - 1)*p)[-1]
...
>>> pd.rolling_apply(tmp['A']+tmp['B']+1, 50, lambda x: gm(x,5))
2001-01-01   NaN
2001-01-02   NaN
2001-01-03   NaN
2001-01-04   NaN
2001-01-05   NaN
2001-01-06   NaN
2001-01-07   NaN
2001-01-08   NaN
2001-01-09   NaN
2001-01-10   NaN
2001-01-11   NaN
2001-01-12   NaN
2001-01-13   NaN
2001-01-14   NaN
2001-01-15   NaN
...
2006-06-09   -0.000062
2006-06-10   -0.000128
2006-06-11    0.000185
2006-06-12   -0.000113
2006-06-13   -0.000962
2006-06-14   -0.001248
2006-06-15   -0.001962
2006-06-16   -0.003820
2006-06-17   -0.003412
2006-06-18   -0.002971
2006-06-19   -0.003882
2006-06-20   -0.003546
2006-06-21   -0.002226
2006-06-22   -0.002058
2006-06-23   -0.000553
Freq: D, Length: 2000
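
For pandas versions where `pd.rolling_apply` is no longer available, the same precomputation can be written with the `rolling` method. This is a minimal sketch under that assumption, reusing the setup from the question:

```python
import numpy as np
import pandas as pd

tmp = pd.DataFrame(np.random.randn(2000, 2) / 10000,
                   index=pd.date_range('2001-01-01', periods=2000),
                   columns=['A', 'B'])

def gm(x, p):
    # x is a 1-D ndarray holding one window of precomputed A + B + 1 values
    return ((np.cumprod(x) - 1) * p)[-1]

# Combine the columns row-wise first, then roll over the single result
res = (tmp['A'] + tmp['B'] + 1).rolling(50).apply(lambda x: gm(x, 5), raw=True)
print(res.tail())
```

This only works when the columns can be combined row by row before the window function is applied, as is the case for the sum A + B here.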
