Note: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/11495051/


Difference in Python statsmodels OLS and R's lm

Tags: python, r, pandas, rpy2, statsmodels

Asked by Skylar Saveland

I'm not sure why I'm getting slightly different results for a simple OLS, depending on whether I go through pandas' experimental rpy interface to do the regression in R or whether I use statsmodels in Python.

import pandas
from rpy2.robjects import r

from functools import partial

loadcsv = partial(pandas.DataFrame.from_csv,
                  index_col="seqn", parse_dates=False)

demoq = loadcsv("csv/DEMO.csv")
rxq = loadcsv("csv/quest/RXQ_RX.csv")

num_rx = {}
for seqn, num in rxq.rxd295.iteritems():
    try:
        val = int(num)
    except ValueError:
        val = 0
    num_rx[seqn] = val

series = pandas.Series(num_rx, name="num_rx")
demoq = demoq.join(series)

import pandas.rpy.common as com
df = com.convert_to_r_dataframe(demoq)
r.assign("demoq", df)
r('lmout <- lm(demoq$num_rx ~ demoq$ridageyr)')  # run the regression
r('print(summary(lmout))')  # print from R

From R, I get the following summary:

Call:
lm(formula = demoq$num_rx ~ demoq$ridageyr)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.9086 -0.6908 -0.2940  0.1358 15.7003 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -0.1358216  0.0241399  -5.626 1.89e-08 ***
demoq$ridageyr  0.0358161  0.0006232  57.469  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.545 on 9963 degrees of freedom
Multiple R-squared: 0.249,  Adjusted R-squared: 0.2489 
F-statistic:  3303 on 1 and 9963 DF,  p-value: < 2.2e-16

Using statsmodels.api to do the OLS:

import statsmodels.api as sm
results = sm.OLS(demoq.num_rx, demoq.ridageyr).fit()
results.summary()

The results are similar to R's output but not the same:

                  OLS Regression Results
Adj. R-squared:      0.247
Log-Likelihood:      -18488.
No. Observations:    9965       AIC:    3.698e+04
Df Residuals:        9964       BIC:    3.698e+04

               coef    std err          t      P>|t|    [95.0% Conf. Int.]
ridageyr     0.0331      0.000     82.787      0.000         0.032   0.034

The install process is a bit cumbersome, but there is an IPython notebook here that can reproduce the inconsistency.

Accepted answer by Dirk Eddelbuettel

Looks like Python does not add an intercept by default to your expression, whereas R does when you use the formula interface.

This means you did fit two different models. Try

lm( y ~ x - 1, data)

in R to exclude the intercept, or in your case and with somewhat more standard notation

lm(num_rx ~ ridageyr - 1, data=demoq)

Answer by herrfz

Note that you can still use ols from statsmodels.formula.api:

from statsmodels.formula.api import ols

results = ols('num_rx ~ ridageyr', demoq).fit()
results.summary()

I think it uses patsy in the backend to translate the formula expression, and the intercept is added automatically.
