Original question: http://stackoverflow.com/questions/24544805/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Python pandas: how to run multiple univariate regression by group
Question
Suppose I have a DataFrame with one column for the y variable and many columns of x variables. I would like to run multiple univariate regressions of y vs x1, y vs x2, ..., etc., and store the predictions back into the DataFrame. I also need to do this by a group variable.
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    'y': np.random.randn(20),
    'x1': np.random.randn(20),
    'x2': np.random.randn(20),
    'grp': ['a', 'b'] * 10})

def ols_res(x, y):
    return sm.OLS(y, x).fit().predict()

# This does not work: apply passes only the group DataFrame, so y is never supplied
df.groupby('grp').apply(ols_res)
The code above obviously does not work. It is not clear to me how to correctly pass the fixed y to the function while having apply iterate through the x columns (x1, x2, ...). I suspect there might be a very clever one-line solution to do this. Any ideas?
Accepted answer by JaminSore
The function you pass to apply must take a pandas.DataFrame as its first argument. Any additional keyword or positional arguments you give to apply are forwarded to the applied function. So your example would work with a small modification. Change ols_res to
def ols_res(df, xcols, ycol):
    return sm.OLS(df[ycol], df[xcols]).fit().predict()
Then you can use groupby and apply like this
df.groupby('grp').apply(ols_res, xcols=['x1', 'x2'], ycol='y')
Or
df.groupby('grp').apply(ols_res, ['x1', 'x2'], 'y')
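To see the argument forwarding on its own, here is a minimal sketch that leaves the regression out. The helper scaled_mean and its parameters col and scale are hypothetical names used only for illustration. The columns are selected explicitly before apply so that the function receives the same data whether or not the installed pandas version passes the grouping column to it.

```python
import pandas as pd

df = pd.DataFrame({
    'grp': ['a', 'b', 'a', 'b'],
    'y': [1.0, 2.0, 3.0, 4.0]})

def scaled_mean(group, col, scale=1.0):
    # group is the per-group DataFrame; col and scale are forwarded by apply
    return group[col].mean() * scale

# Extra arguments can be forwarded by keyword...
by_keyword = df.groupby('grp')[['y']].apply(scaled_mean, col='y', scale=10.0)

# ...or by position, in the order the function declares them
by_position = df.groupby('grp')[['y']].apply(scaled_mean, 'y', 10.0)

print(by_keyword)
```

Both calls give the same Series indexed by grp.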
EDIT
The above code does not run multiple univariate regressions. Instead, it runs one multivariate regression per group, with y regressed on all of the x columns at once. With (another) slight modification, however, it will.
def ols_res(df, xcols, ycol):
    return pd.DataFrame({xcol: sm.OLS(df[ycol], df[xcol]).fit().predict() for xcol in xcols})
EDIT 2
Although the above solution works, I think the following is a little more pandas-y
import statsmodels.api as sm
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'y': np.random.randn(20),
    'x1': np.random.randn(20),
    'x2': np.random.randn(20),
    'grp': ['a', 'b'] * 10})

def ols_res(x, y):
    return pd.Series(sm.OLS(y, x).fit().predict())

df.groupby('grp').apply(lambda x: x[['x1', 'x2']].apply(ols_res, y=x['y']))
For some reason, if I define ols_res() as it was originally, the resultant DataFrame doesn't have the group label in the index.

