Python: OLS using statsmodel.formula.api versus statsmodel.api
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/30650257/
OLS using statsmodel.formula.api versus statsmodel.api
Asked by Chetan Prabhu
Can anyone explain to me the difference between ols in statsmodel.formula.api versus ols in statsmodel.api?
Using the Advertising data from the ISLR text, I ran an ols using both, and got different results. I then compared with scikit-learn's LinearRegression.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

df = pd.read_csv("C:\...\Advertising.csv")
x1 = df.loc[:, ['TV']]
y1 = df.loc[:, ['Sales']]

print("Statsmodel.Formula.Api Method")
model1 = smf.ols(formula='Sales ~ TV', data=df).fit()
print(model1.params)

print("\nStatsmodel.Api Method")
model2 = sm.OLS(y1, x1)
results = model2.fit()
print(results.params)

print("\nSci-Kit Learn Method")
model3 = LinearRegression()
model3.fit(x1, y1)
print(model3.coef_)
print(model3.intercept_)
The output is as follows:
Statsmodel.Formula.Api Method
Intercept 7.032594
TV 0.047537
dtype: float64
Statsmodel.Api Method
TV 0.08325
dtype: float64
Sci-Kit Learn Method
[[ 0.04753664]]
[ 7.03259355]
The statsmodel.api method returns a different parameter estimate for TV than the statsmodel.formula.api and scikit-learn methods do.
What kind of ols algorithm is statsmodel.api running that would produce a different result? Does anyone have a link to documentation that could help answer this question?
Accepted answer by stellasia
The difference is due to whether or not an intercept is included:
- in statsmodels.formula.api, similarly to the R approach, a constant is automatically added to your data and an intercept is fitted
- in statsmodels.api, you have to add a constant yourself (see the documentation here). Try using add_constant from statsmodels.api:
x1 = sm.add_constant(x1)
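For concreteness, here is a minimal sketch of that fix applied to the question's variables (assuming x1 and y1 are the TV and Sales columns loaded in the question's code):

import statsmodels.api as sm

# Adding an explicit constant column lets sm.OLS estimate an intercept,
# matching smf.ols(formula='Sales ~ TV', ...) and scikit-learn's default behavior.
x1_const = sm.add_constant(x1)
results = sm.OLS(y1, x1_const).fit()
print(results.params)
# The 'const' and 'TV' estimates should now agree with the Intercept and TV
# values reported by the formula API (roughly 7.03 and 0.048).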
Answered by Brad Solomon
Came across this issue today and wanted to elaborate on @stellasia's answer because the statsmodels documentation is perhaps a bit ambiguous.
Unless you are using actual R-style string formulas when instantiating OLS, you need to add a constant (literally a column of 1s) under both statsmodels.formula.api and plain statsmodels.api. @Chetan is using R-style formatting here (formula='Sales ~ TV'), so he will not run into this subtlety, but for people with some Python knowledge but no R background this could be very confusing.
Furthermore, it doesn't matter whether you specify the hasconst parameter when building the model. (Which is kind of silly.) In other words, unless you are using R-style string formulas, hasconst is ignored even though it is supposed to
[Indicate] whether the RHS includes a user-supplied constant
because, in the footnotes
No constant is added by the model unless you are using formulas.
The example below shows that both .formula.api and .api will require a user-added column vector of 1s if not using R-style string formulas.
import numpy as np
import statsmodels.api as sm

# Generate some relational data
np.random.seed(123)
nobs = 25
x = np.random.random((nobs, 2))
x_with_ones = sm.add_constant(x, prepend=False)  # append a column of 1s
beta = [.1, .5, 1]
e = np.random.random(nobs)
y = np.dot(x_with_ones, beta) + e
Now throw x and y into Excel and run Data > Data Analysis > Regression, making sure "Constant is zero" is unchecked. You'll get the following coefficients:
Intercept 1.497761024
X Variable 1 0.012073045
X Variable 2 0.623936056
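If you don't have Excel handy, a quick cross-check with NumPy's least-squares solver (a sketch reusing the x_with_ones and y arrays generated above) recovers essentially the same numbers:

import numpy as np

# Solve the same least-squares problem directly; the coefficient order is
# [X1, X2, constant] because add_constant was called with prepend=False.
coefs, *_ = np.linalg.lstsq(x_with_ones, y, rcond=None)
print(coefs)  # approximately [0.0121, 0.6239, 1.4978]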
Now, try running this regression on x, not x_with_ones, in either statsmodels.formula.api or statsmodels.api, with hasconst set to None, True, or False. You'll see that in each of those 6 scenarios, there is no intercept returned. (There are only 2 parameters.)
import statsmodels.formula.api as smf
import statsmodels.api as sm
print('smf models')
print('-' * 10)
for hc in [None, True, False]:
model = smf.OLS(endog=y, exog=x, hasconst=hc).fit()
print(model.params)
# smf models
# ----------
# [ 1.46852293 1.8558273 ]
# [ 1.46852293 1.8558273 ]
# [ 1.46852293 1.8558273 ]
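The remaining three scenarios use plain statsmodels.api on the same constant-free x; a short sketch (assuming the x and y generated above) shows they behave identically:

print('sm models (no constant column)')
print('-' * 10)
for hc in [None, True, False]:
    model = sm.OLS(endog=y, exog=x, hasconst=hc).fit()
    print(model.params)
# Again only two parameters and no intercept, regardless of hasconst;
# the estimates should match the smf runs above.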
Now running things correctly, with a column vector of 1.0s added to x. You can use smf here, but it's really not necessary if you're not using formulas.
print('sm models')
print('-' * 10)
for hc in [None, True, False]:
model = sm.OLS(endog=y, exog=x_with_ones, hasconst=hc).fit()
print(model.params)
# sm models
# ----------
# [ 0.01207304 0.62393606 1.49776102]
# [ 0.01207304 0.62393606 1.49776102]
# [ 0.01207304 0.62393606 1.49776102]
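Tying this back to the original question: scikit-learn's LinearRegression fits an intercept by default (fit_intercept=True), which is why it agreed with the formula API rather than with the bare sm.OLS call. A sketch on the same generated data (names here are illustrative):

from sklearn.linear_model import LinearRegression

lr = LinearRegression()             # fit_intercept=True is the default
lr.fit(x, y)
print(lr.coef_, lr.intercept_)      # should match the x_with_ones statsmodels fit

lr_no_intercept = LinearRegression(fit_intercept=False)
lr_no_intercept.fit(x, y)
print(lr_no_intercept.coef_)        # should match the no-constant statsmodels fits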