Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/30650257/

Date: 2020-08-19 08:45:40  Source: igfitidea

OLS using statsmodels.formula.api versus statsmodels.api

Tags: python, linear-regression

Asked by Chetan Prabhu

Can anyone explain to me the difference between ols in statsmodels.formula.api and OLS in statsmodels.api?

Using the Advertising data from the ISLR text, I ran OLS with both and got different results. I then compared them with scikit-learn's LinearRegression.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# Raw string so the backslashes in the Windows path are not read as escapes
df = pd.read_csv(r"C:\...\Advertising.csv")

x1 = df.loc[:, ['TV']]
y1 = df.loc[:, ['Sales']]

# Fit with the formula interface
print("Statsmodel.Formula.Api Method")
model1 = smf.ols(formula='Sales ~ TV', data=df).fit()
print(model1.params)

# Fit with the array interface, passing x1 and y1 as they are
print("\nStatsmodel.Api Method")
model2 = sm.OLS(y1, x1)
results = model2.fit()
print(results.params)

# Fit with scikit-learn for comparison
print("\nSci-Kit Learn Method")
model3 = LinearRegression()
model3.fit(x1, y1)
print(model3.coef_)
print(model3.intercept_)

The output is as follows:

Statsmodel.Formula.Api Method
Intercept    7.032594
TV           0.047537
dtype: float64

Statsmodel.Api Method
TV    0.08325
dtype: float64

Sci-Kit Learn Method
[[ 0.04753664]]
[ 7.03259355]

The statsmodels.api method returns a different parameter for TV than the statsmodels.formula.api and scikit-learn methods do.

What kind of OLS algorithm is statsmodels.api running that would produce a different result? Does anyone have a link to documentation that could help answer this question?

Accepted answer by stellasia

The difference comes down to whether an intercept is included:

  • in statsmodels.formula.api, similarly to the R approach, a constant is automatically added to your data and an intercept is fitted
  • in statsmodels.api, you have to add a constant yourself (see the documentation). Try using add_constant from statsmodels.api

    x1 = sm.add_constant(x1)
    

Answer by Brad Solomon

Came across this issue today and wanted to elaborate on @stellasia's answer because the statsmodels documentation is perhaps a bit ambiguous.

Unless you are using actual R-style string formulas when instantiating OLS, you need to add a constant (literally a column of 1s) under both statsmodels.formula.api and plain statsmodels.api. @Chetan is using R-style formatting here (formula='Sales ~ TV'), so he will not run into this subtlety, but for people with some Python knowledge but no R background this could be very confusing.

Furthermore, it doesn't matter whether you specify the hasconst parameter when building the model. (Which is kind of silly.) In other words, unless you are using R-style string formulas, hasconst has no effect on whether an intercept is fitted, even though it is supposed to

[Indicate] whether the RHS includes a user-supplied constant

because, as the footnotes say:

No constant is added by the model unless you are using formulas.

The example below shows that both .formula.api and .api will require a user-added column vector of 1s if not using R-style string formulas.

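import numpy as np
import statsmodels.api as sm

# Generate some relational data
np.random.seed(123)
nobs = 25
x = np.random.random((nobs, 2))
# Append a column of 1s as the last column (prepend=False)
x_with_ones = sm.add_constant(x, prepend=False)
beta = [.1, .5, 1]  # last entry is the intercept
e = np.random.random(nobs)
y = np.dot(x_with_ones, beta) + e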

Now throw x and y into Excel and run Data > Data Analysis > Regression, making sure "Constant is zero" is unchecked. You'll get the following coefficients:

Intercept       1.497761024
X Variable 1    0.012073045
X Variable 2    0.623936056
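
If Excel is not at hand, a plain NumPy least-squares solve is a reasonable independent cross-check. This is a sketch that regenerates the same data with the same seed; np.column_stack is used here as a stand-in for add_constant(x, prepend=False), and I would expect the result to line up with the Excel figures up to rounding, with the intercept in the last position.

```python
import numpy as np

# Regenerate the data from the example with the same seed
np.random.seed(123)
nobs = 25
x = np.random.random((nobs, 2))
# Ones go in the last column, the same layout as add_constant(x, prepend=False)
x_with_ones = np.column_stack([x, np.ones(nobs)])
beta = [.1, .5, 1]
e = np.random.random(nobs)
y = np.dot(x_with_ones, beta) + e

# Ordinary least squares via NumPy; rcond=None selects the current default cutoff
coef, *_ = np.linalg.lstsq(x_with_ones, y, rcond=None)
print(coef)  # order: X Variable 1, X Variable 2, Intercept
```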

Now, try running this regression on x, not x_with_ones, in either statsmodels.formula.api or statsmodels.api with hasconst set to None, True, or False. You'll see that in each of those 6 scenarios, there is no intercept returned. (There are only 2 parameters.)

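import statsmodels.formula.api as smf
import statsmodels.api as sm

# Note (may depend on your version): newer statsmodels releases may not expose
# the uppercase OLS class in statsmodels.formula.api; if so, use sm.OLS instead.
print('smf models')
print('-' * 10)
for hc in [None, True, False]:
    # x has no column of 1s, so only two coefficients come back
    model = smf.OLS(endog=y, exog=x, hasconst=hc).fit()
    print(model.params)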

# smf models
# ----------
# [ 1.46852293  1.8558273 ]
# [ 1.46852293  1.8558273 ]
# [ 1.46852293  1.8558273 ]

Now run things correctly, with a column vector of 1.0s added to x. You can use smf here, but it's really not necessary if you're not using formulas.

print('sm models')
print('-' * 10)
for hc in [None, True, False]:
    model = sm.OLS(endog=y, exog=x_with_ones, hasconst=hc).fit()
    print(model.params)

# sm models
# ----------
# [ 0.01207304  0.62393606  1.49776102]
# [ 0.01207304  0.62393606  1.49776102]
# [ 0.01207304  0.62393606  1.49776102]