Python pandas: How to run multiple univariate regressions by group

Question

Suppose I have a DataFrame with one column of y variable and many columns of x variables. I would like to run multiple univariate regressions of y vs x1, y vs x2, ..., etc., and store the predictions back into the DataFrame. I also need to do this by a group variable.

import statsmodels.api as sm
import pandas as pd
import numpy as np

df = pd.DataFrame({
  'y': np.random.randn(20),
  'x1': np.random.randn(20), 
  'x2': np.random.randn(20),
  'grp': ['a', 'b'] * 10})

def ols_res(x, y):
    return sm.OLS(y, x).fit().predict()

df.groupby('grp').apply(ols_res) # This does not work

The code above obviously does not work. It is not clear to me how to correctly pass the fixed y to the function while having apply iterate through the x columns (x1, x2, ...). I suspect there might be a very clever one-line solution to do this. Any ideas?

Answer 1

Accepted answer by JaminSore

The function you pass to apply must take a pandas.DataFrame as its first argument. Any additional positional or keyword arguments you pass to apply are forwarded to the applied function. So your example would work with a small modification. Change ols_res to

def ols_res(df, xcols,  ycol):
    return sm.OLS(df[ycol], df[xcols]).fit().predict()

Then you can use groupby and apply like this:

df.groupby('grp').apply(ols_res, xcols=['x1', 'x2'], ycol='y')

Or, passing the extra arguments positionally:

df.groupby('grp').apply(ols_res, ['x1', 'x2'], 'y')

EDIT

The above code does not run multiple univariate regressions. Instead, it runs one multivariate regression per group. With (another) slight modification it will, however.

def ols_res(df, xcols,  ycol):
    return pd.DataFrame({xcol : sm.OLS(df[ycol], df[xcol]).fit().predict() for xcol in xcols})

EDIT 2

Although the above solution works, I think the following is a little more pandas-y:

import statsmodels.api as sm
import pandas as pd
import numpy as np

df = pd.DataFrame({
  'y': np.random.randn(20),
  'x1': np.random.randn(20), 
  'x2': np.random.randn(20),
  'grp': ['a', 'b'] * 10})

def ols_res(x, y):
    return pd.Series(sm.OLS(y, x).fit().predict())

df.groupby('grp').apply(lambda x : x[['x1', 'x2']].apply(ols_res, y=x['y']))

For some reason, if I define ols_res() as it was originally (returning the bare result of predict() rather than wrapping it in a pd.Series), the resultant DataFrame doesn't have the group label in the index.
