Python: OLS using statsmodels.formula.api versus statsmodels.api

Question

Asked by Chetan Prabhu

Can anyone explain to me the difference between ols in statsmodels.formula.api and OLS in statsmodels.api?

Using the Advertising data from the ISLR text, I ran an ols using both, and got different results. I then compared with scikit-learn's LinearRegression.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# Raw string so the backslashes in the Windows path are not read as escape sequences
df = pd.read_csv(r"C:\...\Advertising.csv")

x1 = df.loc[:,['TV']]
y1 = df.loc[:,['Sales']]

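# Formula interface: an intercept is added automatically
print("Statsmodel.Formula.Api Method")
model1 = smf.ols(formula='Sales ~ TV', data=df).fit()
print(model1.params)

# Array interface: no intercept unless a constant column is supplied
print("\nStatsmodel.Api Method")
model2 = sm.OLS(y1, x1)
results = model2.fit()
print(results.params)

# scikit-learn fits an intercept by default
print("\nSci-Kit Learn Method")
model3 = LinearRegression()
model3.fit(x1, y1)
print(model3.coef_)
print(model3.intercept_)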

The output is as follows:

Statsmodel.Formula.Api Method
Intercept    7.032594
TV           0.047537
dtype: float64

Statsmodel.Api Method
TV    0.08325
dtype: float64

Sci-Kit Learn Method
[[ 0.04753664]]
[ 7.03259355]

The statsmodels.api method returns a different parameter for TV than the statsmodels.formula.api and scikit-learn methods do.

What kind of OLS algorithm is statsmodels.api running that would produce a different result? Does anyone have a link to documentation that could help answer this question?

Answer 1

Accepted answer by stellasia

The difference comes down to whether an intercept is included:

In statsmodels.formula.api, similarly to the R approach, a constant is automatically added to your data and an intercept is fitted.
In statsmodels.api, you have to add a constant yourself (see the statsmodels documentation). Try using add_constant from statsmodels.api:
```
x1 = sm.add_constant(x1)
```

Answer 2

Answered by Brad Solomon

Came across this issue today and wanted to elaborate on @stellasia's answer because the statsmodels documentation is perhaps a bit ambiguous.

Unless you are using actual R-style string formulas when instantiating OLS, you need to add a constant (literally a column of 1s) under both statsmodels.formula.api and plain statsmodels.api. @Chetan is using R-style formatting here (formula='Sales ~ TV'), so he will not run into this subtlety, but for people with some Python knowledge but no R background this could be very confusing.
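
To make "literally a column of 1s" concrete, here is a small NumPy-only sketch. It is an illustration of the idea, not what statsmodels runs internally, and the data and seed are made up. Stacking a column of ones next to x is the same thing sm.add_constant does for you.

```python
import numpy as np

# Made-up data with a true intercept of 3 and a true slope of 2
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=50)
y = 3.0 + 2.0 * x + rng.normal(scale=0.5, size=50)

# Design matrix without a constant: one column, so only a slope is estimated
X_no_const = x.reshape(-1, 1)
slope_only, *_ = np.linalg.lstsq(X_no_const, y, rcond=None)

# Design matrix with a column of ones: the first coefficient is the intercept
X_const = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X_const, y, rcond=None)

print("no constant:  ", slope_only)  # one value: a slope forced through the origin
print("with constant:", coef)        # two values: intercept, then slope
```

The least-squares solver has no notion of an intercept; it only sees columns. If no column of ones is present, no intercept can come out.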

Furthermore, it doesn't matter whether you specify the hasconst parameter when building the model. (Which is kind of silly.) In other words, unless you are using R-style string formulas, hasconst does not cause an intercept to be fitted, even though it is supposed to

[Indicate] whether the RHS includes a user-supplied constant

because, in the footnotes

No constant is added by the model unless you are using formulas.

The example below shows that both .formula.api and .api will require a user-added column vector of 1s if not using R-style string formulas.

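import numpy as np
import statsmodels.api as sm

# Generate some relational data
np.random.seed(123)
nobs = 25
x = np.random.random((nobs, 2))
# Append a column of 1s as the last column of the design matrix
x_with_ones = sm.add_constant(x, prepend=False)
beta = [.1, .5, 1]  # the last coefficient multiplies the column of 1s
e = np.random.random(nobs)
y = np.dot(x_with_ones, beta) + e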

Now throw x and y into Excel and run Data > Data Analysis > Regression, making sure "Constant is zero" is unchecked. You'll get the following coefficients:

Intercept       1.497761024
X Variable 1    0.012073045
X Variable 2    0.623936056

Now, try running this regression on x, not x_with_ones, in either statsmodels.formula.api or statsmodels.api with hasconst set to None, True, or False. You'll see that in each of those 6 scenarios, there is no intercept returned. (There are only 2 parameters.)

import statsmodels.formula.api as smf
import statsmodels.api as sm

print('smf models')
print('-' * 10)
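# Note: the uppercase smf.OLS was exposed by older statsmodels releases;
# if it is missing in your version, sm.OLS is the same class.
for hc in [None, True, False]:
    model = smf.OLS(endog=y, exog=x, hasconst=hc).fit()
    print(model.params)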

# smf models
# ----------
# [ 1.46852293  1.8558273 ]
# [ 1.46852293  1.8558273 ]
# [ 1.46852293  1.8558273 ]

Now run things correctly, with a column vector of 1.0s added to x. You can use smf here, but it's really not necessary if you're not using formulas.

print('sm models')
print('-' * 10)
for hc in [None, True, False]:
    model = sm.OLS(endog=y, exog=x_with_ones, hasconst=hc).fit()
    print(model.params)

# sm models
# ----------
# [ 0.01207304  0.62393606  1.49776102]
# [ 0.01207304  0.62393606  1.49776102]
# [ 0.01207304  0.62393606  1.49776102]
