Python: OLS using statsmodels.formula.api versus statsmodels.api
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/30650257/
OLS using statsmodels.formula.api versus statsmodels.api
Asked by Chetan Prabhu
Can anyone explain to me the difference between ols in statsmodels.formula.api and OLS in statsmodels.api?
Using the Advertising data from the ISLR text, I ran an ols using both, and got different results. I then compared with scikit-learn's LinearRegression.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
df = pd.read_csv("C:\...\Advertising.csv")
x1 = df.loc[:,['TV']]
y1 = df.loc[:,['Sales']]
print "Statsmodel.Formula.Api Method"
model1 = smf.ols(formula='Sales ~ TV', data=df).fit()
print model1.params
print "\nStatsmodel.Api Method"
model2 = sm.OLS(y1, x1)
results = model2.fit()
print results.params
print "\nSci-Kit Learn Method"
model3 = LinearRegression()
model3.fit(x1, y1)
print model3.coef_
print model3.intercept_
The output is as follows:
Statsmodel.Formula.Api Method
Intercept 7.032594
TV 0.047537
dtype: float64
Statsmodel.Api Method
TV 0.08325
dtype: float64
Sci-Kit Learn Method
[[ 0.04753664]]
[ 7.03259355]
The statsmodels.api method returns a different parameter for TV than the statsmodels.formula.api and scikit-learn methods do.
What kind of OLS algorithm is statsmodels.api running that would produce a different result? Does anyone have a link to documentation that could help answer this question?
Accepted answer by stellasia
The difference is due to the presence or absence of an intercept:

- in statsmodels.formula.api, similarly to the R approach, a constant is automatically added to your data and an intercept is fitted
- in statsmodels.api, you have to add a constant yourself (see the documentation here). Try using add_constant from statsmodels.api: x1 = sm.add_constant(x1)
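For example, a minimal sketch (assuming the same df, x1 and y1 from the question above) that adds the constant column so that sm.OLS reproduces the intercept and TV coefficient reported by smf.ols and scikit-learn:
import statsmodels.api as sm
x1_const = sm.add_constant(x1)   # prepends a 'const' column of 1s
model2 = sm.OLS(y1, x1_const).fit()
print(model2.params)             # should now show const ~7.0326 and TV ~0.04754, matching the other methods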
Answered by Brad Solomon
Came across this issue today and wanted to elaborate on @stellasia's answer because the statsmodels documentation is perhaps a bit ambiguous.
Unless you are using actual R-style string formulas when instantiating OLS, you need to add a constant (literally a column of 1s) under both statsmodels.formula.api and plain statsmodels.api. @Chetan is using R-style formatting here (formula='Sales ~ TV'), so he will not run into this subtlety, but for people with some Python knowledge but no R background this could be very confusing.
Furthermore, it doesn't matter whether you specify the hasconst parameter when building the model. (Which is kind of silly.) In other words, unless you are using R-style string formulas, hasconst is ignored even though it is supposed to
[Indicate] whether the RHS includes a user-supplied constant
because, in the footnotes
No constant is added by the model unless you are using formulas.
The example below shows that both .formula.api and .api will require a user-added column vector of 1s if not using R-style string formulas.
import numpy as np
import statsmodels.api as sm

# Generate some relational data
np.random.seed(123)
nobs = 25
x = np.random.random((nobs, 2))
x_with_ones = sm.add_constant(x, prepend=False)
beta = [.1, .5, 1]
e = np.random.random(nobs)
y = np.dot(x_with_ones, beta) + e
Now throw x and y into Excel and run Data > Data Analysis > Regression, making sure "Constant is zero" is unchecked. You'll get the following coefficients:
Intercept 1.497761024
X Variable 1 0.012073045
X Variable 2 0.623936056
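If you'd rather not leave Python, a plain least-squares solve gives a quick cross-check (a sketch, using numpy's lstsq on the x_with_ones and y generated above; the last coefficient is the intercept because the constant column was appended rather than prepended):
coefs, _, _, _ = np.linalg.lstsq(x_with_ones, y, rcond=None)
print(coefs)   # roughly [0.0121, 0.6239, 1.4978], matching the Excel output above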
Now, try running this regression on x, not x_with_ones, in either statsmodels.formula.api or statsmodels.api, with hasconst set to None, True, or False. You'll see that in each of those 6 scenarios, there is no intercept returned. (There are only 2 parameters.)
import statsmodels.formula.api as smf
import statsmodels.api as sm
print('smf models')
print('-' * 10)
for hc in [None, True, False]:
    model = smf.OLS(endog=y, exog=x, hasconst=hc).fit()
    print(model.params)
# smf models
# ----------
# [ 1.46852293 1.8558273 ]
# [ 1.46852293 1.8558273 ]
# [ 1.46852293 1.8558273 ]
Now run things correctly, with a column vector of 1.0s added to x. You can use smf here, but it's really not necessary if you're not using formulas.
print('sm models')
print('-' * 10)
for hc in [None, True, False]:
    model = sm.OLS(endog=y, exog=x_with_ones, hasconst=hc).fit()
    print(model.params)
# sm models
# ----------
# [ 0.01207304 0.62393606 1.49776102]
# [ 0.01207304 0.62393606 1.49776102]
# [ 0.01207304 0.62393606 1.49776102]
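For completeness, here is a hedged sketch of the formula-style equivalent on the same simulated data: wrap the arrays in a DataFrame (the column names x1, x2 and y below are made up for illustration) and let the R-style formula add the intercept automatically:
import pandas as pd

df_sim = pd.DataFrame(x, columns=['x1', 'x2'])
df_sim['y'] = y
formula_model = smf.ols('y ~ x1 + x2', data=df_sim).fit()
print(formula_model.params)
# Intercept ~1.4978, x1 ~0.0121, x2 ~0.6239, i.e. the same fit as sm.OLS with add_constant above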

