Difference in Python statsmodels OLS and R's lm
Note: this page is a translated copy of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Original source: http://stackoverflow.com/questions/11495051/
Asked by Skylar Saveland
I'm not sure why I'm getting slightly different results for a simple OLS, depending on whether I go through pandas' experimental rpy interface to do the regression in R or whether I use statsmodels in Python.
import pandas
from rpy2.robjects import r
from functools import partial
loadcsv = partial(pandas.DataFrame.from_csv,
                  index_col="seqn", parse_dates=False)
demoq = loadcsv("csv/DEMO.csv")
rxq = loadcsv("csv/quest/RXQ_RX.csv")
num_rx = {}
for seqn, num in rxq.rxd295.iteritems():
    try:
        val = int(num)
    except ValueError:
        val = 0
    num_rx[seqn] = val
series = pandas.Series(num_rx, name="num_rx")
demoq = demoq.join(series)
import pandas.rpy.common as com
df = com.convert_to_r_dataframe(demoq)
r.assign("demoq", df)
r('lmout <- lm(demoq$num_rx ~ demoq$ridageyr)') # run the regression
r('print(summary(lmout))') # print from R
From R, I get the following summary:
Call:
lm(formula = demoq$num_rx ~ demoq$ridageyr)

Residuals:
    Min      1Q  Median      3Q     Max
-2.9086 -0.6908 -0.2940  0.1358 15.7003

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)    -0.1358216  0.0241399  -5.626 1.89e-08 ***
demoq$ridageyr  0.0358161  0.0006232  57.469  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.545 on 9963 degrees of freedom
Multiple R-squared: 0.249, Adjusted R-squared: 0.2489
F-statistic: 3303 on 1 and 9963 DF, p-value: < 2.2e-16
Using statsmodels.api to do the OLS:
import statsmodels.api as sm
results = sm.OLS(demoq.num_rx, demoq.ridageyr).fit()
results.summary()
The results are similar to R's output but not the same:
OLS Regression Results
Adj. R-squared:       0.247
Log-Likelihood:     -18488.
No. Observations:      9965    AIC:    3.698e+04
Df Residuals:          9964    BIC:    3.698e+04

             coef    std err          t      P>|t|    [95.0% Conf. Int.]
ridageyr   0.0331      0.000     82.787      0.000      0.032     0.034
The install process is a bit cumbersome, but there is an IPython notebook (linked from the original post) that can reproduce the inconsistency.
Accepted answer by Dirk Eddelbuettel
Looks like Python does not add an intercept by default to your expression, whereas R does when you use the formula interface.
This means you fit two different models. Try
lm( y ~ x - 1, data)
in R to exclude the intercept, or in your case and with somewhat more standard notation
lm(num_rx ~ ridageyr - 1, data=demoq)
Answer by herrfz
Note that you can still use ols from statsmodels.formula.api:
from statsmodels.formula.api import ols
results = ols('num_rx ~ ridageyr', demoq).fit()
results.summary()
I think it uses patsy in the backend to translate the formula expression, and the intercept is added automatically.

