Note: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/32101233/

Date: 2020-09-13 23:47:15  Source: igfitidea

Appending predicted values and residuals to pandas dataframe

Tags: python, pandas, dataframe, prediction, statsmodels

Asked by Uncle Milton

It's a useful and common practice to append predicted values and residuals from running a regression onto a dataframe as distinct columns. I'm new to pandas, and I'm having trouble performing this very simple operation. I know I'm missing something obvious. There was a very similar question asked about a year and a half ago, but it wasn't really answered.

The dataframe currently looks something like this:

y               x1           x2   
880.37          3.17         23
716.20          4.76         26
974.79          4.17         73
322.80          8.70         72
1054.25         11.45        16

All I want is to return a dataframe that has the predicted value and residual from y = x1 + x2 for each observation:

y               x1           x2       y_hat         res
880.37          3.17         23       840.27        40.10
716.20          4.76         26       752.60        -36.40
974.79          4.17         73       877.49        97.30
322.80          8.70         72       348.50        -25.70
1054.25         11.45        16       815.15        239.10

I've tried resolving this using statsmodels and pandas and haven't been able to solve it. Thanks in advance!

Answer by Josef

Here is a variation on Alexander's answer using the OLS model from statsmodels instead of the pandas ols model. We can use either the formula or the array/DataFrame interface to the models.

fittedvalues and resid are pandas Series with the correct index. predict does not return a pandas Series (in the statsmodels version used for this answer).

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.DataFrame({'x1': [3.17, 4.76, 4.17, 8.70, 11.45],
                   'x2': [23, 26, 73, 72, 16],
                   'y': [880.37, 716.20, 974.79, 322.80, 1054.25]},
                   index=np.arange(10, 20, 2))

result = smf.ols('y ~ x1 + x2', df).fit()
df['yhat'] = result.fittedvalues
df['resid'] = result.resid


result2 = sm.OLS(df['y'], sm.add_constant(df[['x1', 'x2']])).fit()
df['yhat2'] = result2.fittedvalues
df['resid2'] = result2.resid

# predict doesn't return pandas series and no index is available
df['predicted'] = result.predict(df)

print(df)

       x1  x2        y        yhat       resid       yhat2      resid2  \
10   3.17  23   880.37  923.949309  -43.579309  923.949309  -43.579309   
12   4.76  26   716.20  890.732201 -174.532201  890.732201 -174.532201   
14   4.17  73   974.79  656.155079  318.634921  656.155079  318.634921   
16   8.70  72   322.80  610.510952 -287.710952  610.510952 -287.710952   
18  11.45  16  1054.25  867.062458  187.187542  867.062458  187.187542   

     predicted  
10  923.949309  
12  890.732201  
14  656.155079  
16  610.510952  
18  867.062458  
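
The index remark above matters because pandas aligns on labels when a Series is assigned to a column, which is presumably why the example frame uses the non-default index 10, 12, ..., 18. The following is a minimal sketch of that alignment behaviour with made-up numbers; the names fitted, misaligned and by_position are illustrative and are not part of statsmodels:

```python
import numpy as np
import pandas as pd

# Toy frame with a non-default index (values are invented for illustration).
df = pd.DataFrame({'y': [1.0, 2.0, 3.0]}, index=[10, 12, 14])

# A Series that shares the frame's labels is matched by label,
# even when its rows come in a different order.
fitted = pd.Series([3.1, 1.1, 2.1], index=[14, 10, 12])
df['yhat'] = fitted

# A Series with the default 0..n-1 index shares no labels with df,
# so every row of the new column is NaN.
df['misaligned'] = pd.Series([1.1, 2.1, 3.1])

# A plain NumPy array carries no index and is assigned by position.
df['by_position'] = np.array([1.1, 2.1, 3.1])

print(df)
```

So a result that arrives as a bare array lands in the right rows only if its order matches the frame's row order, while an indexed Series is placed by label regardless of order.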

As a preview, there is an extended prediction method in the model results in statsmodels master (0.7), but the API is not yet settled:

>>> print(result.get_prediction().summary_frame())
          mean     mean_se  mean_ci_lower  mean_ci_upper  obs_ci_lower  \
10  923.949309  268.931939    -233.171432    2081.070051   -991.466820   
12  890.732201  211.945165     -21.194241    1802.658643   -887.328646   
14  656.155079  269.136102    -501.844105    1814.154263  -1259.791854   
16  610.510952  282.182030    -603.620329    1824.642233  -1339.874985   
18  867.062458  329.017262    -548.584564    2282.709481  -1214.750941   

    obs_ci_upper  
10   2839.365439  
12   2668.793048  
14   2572.102012  
16   2560.896890  
18   2948.875858  

Answer by Alexander

This should be self-explanatory.

import pandas as pd

df = pd.DataFrame({'x1': [3.17, 4.76, 4.17, 8.70, 11.45],
                   'x2': [23, 26, 73, 72, 16],
                   'y': [880.37, 716.20, 974.79, 322.80, 1054.25]})
# Note: pd.ols was removed in pandas 0.20, so this line needs an older pandas.
model = pd.ols(y=df.y, x=df.loc[:, ['x1', 'x2']])
df['y_hat'] = model.y_fitted
df['res'] = model.resid

>>> df
      x1  x2        y       y_hat         res
0   3.17  23   880.37  923.949309  -43.579309
1   4.76  26   716.20  890.732201 -174.532201
2   4.17  73   974.79  656.155079  318.634921
3   8.70  72   322.80  610.510952 -287.710952
4  11.45  16  1054.25  867.062458  187.187542
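
On pandas versions that no longer ship pd.ols (it was removed in 0.20), the same columns can be computed without it. Below is a minimal sketch using numpy.linalg.lstsq instead of the pandas model; it assumes an intercept is wanted, which matches the fitted values shown above. The names X and beta are illustrative only:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x1': [3.17, 4.76, 4.17, 8.70, 11.45],
                   'x2': [23, 26, 73, 72, 16],
                   'y': [880.37, 716.20, 974.79, 322.80, 1054.25]})

# Design matrix: a column of ones (the intercept) followed by x1 and x2.
X = np.column_stack([np.ones(len(df)), df[['x1', 'x2']].to_numpy(dtype=float)])
y = df['y'].to_numpy(dtype=float)

# Least-squares coefficients, ordered as [intercept, x1, x2].
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# X @ beta is a plain array, so it is assigned to the rows by position.
df['y_hat'] = X @ beta
df['res'] = df['y'] - df['y_hat']

print(df)
```

This only reproduces the point estimates; standard errors, confidence intervals and the other diagnostics still need a proper regression package such as statsmodels.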

Answer by Andy Kubiak

It's polite to phrase your question so that contributors can easily run your code.

import pandas as pd

y_col = [880.37, 716.20, 974.79, 322.80, 1054.25]
x1_col = [3.17, 4.76, 4.17, 8.70, 11.45]
x2_col = [23, 26, 73, 72, 16]

df = pd.DataFrame()
df['y'] = y_col
df['x1'] = x1_col
df['x2'] = x2_col

Then calling df.head() yields:

         y     x1  x2
0   880.37   3.17  23
1   716.20   4.76  26
2   974.79   4.17  73
3   322.80   8.70  72
4  1054.25  11.45  16

Now for your question: it's fairly straightforward to add columns with calculated values, though my results don't agree with your sample data:

df['y_hat'] = df['x1'] + df['x2']
df['res'] = df['y'] - df['y_hat']

For me, these yield:

         y     x1  x2  y_hat      res
0   880.37   3.17  23  26.17   854.20
1   716.20   4.76  26  30.76   685.44
2   974.79   4.17  73  77.17   897.62
3   322.80   8.70  72  80.70   242.10
4  1054.25  11.45  16  27.45  1026.80
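
The two column assignments above can also be written with DataFrame.assign, which returns a new frame and leaves df untouched. This is a minimal sketch that keeps the same literal formula y_hat = x1 + x2 (a plain sum, not a fitted regression); the name out is illustrative, and referring to y_hat inside the same assign call assumes a reasonably recent pandas (0.23 or later):

```python
import pandas as pd

df = pd.DataFrame({'y': [880.37, 716.20, 974.79, 322.80, 1054.25],
                   'x1': [3.17, 4.76, 4.17, 8.70, 11.45],
                   'x2': [23, 26, 73, 72, 16]})

# Each lambda receives the frame built so far, so 'res' can use 'y_hat'.
out = df.assign(y_hat=lambda d: d['x1'] + d['x2'],
                res=lambda d: d['y'] - d['y_hat'])

print(out)
```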

Hope this helps!
