Python OLS Regression: Scikit vs. Statsmodels?

Disclaimer: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute the original authors (not this site). Original: http://stackoverflow.com/questions/22054964/

OLS Regression: Scikit vs. Statsmodels?

python, scikit-learn, linear-regression, statsmodels

Asked by Nat Poor

Short version: I was using scikit's LinearRegression on some data, but I'm used to p-values, so I put the data into statsmodels OLS. Although the R^2 is about the same, the variable coefficients all differ by large amounts. This concerns me, since the most likely explanation is that I've made an error somewhere, and now I don't feel confident in either output (I have likely built one model incorrectly, but I don't know which one).

Longer version: Because I don't know where the issue is, I don't know exactly which details to include, and including everything is probably too much. I am also not sure about including code or data.

I am under the impression that scikit's LR and statsmodels OLS should both be doing OLS, and as far as I know OLS is OLS so the results should be the same.

For scikit's LR, the results are (statistically) the same whether I set normalize=True or normalize=False, which I find somewhat strange.

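For reference, that behavior is actually expected: OLS is equivariant to rescaling of the regressors, so coefficients fit on standardized data map back exactly to the raw-scale fit. A minimal sketch with synthetic data (all names here are illustrative, not from my dataset):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 100.0])  # very different scales
y = X @ np.array([2.0, -0.5, 0.03]) + rng.normal(size=200)

raw = LinearRegression().fit(X, y)
scaler = StandardScaler()
std = LinearRegression().fit(scaler.fit_transform(X), y)

# Undo the standardization: beta_raw = beta_std / scale,
# intercept_raw = intercept_std - beta_raw @ mean
beta_back = std.coef_ / scaler.scale_
print(np.allclose(beta_back, raw.coef_))                                       # True
print(np.isclose(std.intercept_ - beta_back @ scaler.mean_, raw.intercept_))   # True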

For statsmodels OLS, I normalize the data using StandardScaler from sklearn. I add a column of ones so it includes an intercept (since scikit's output includes an intercept). More on that here: http://statsmodels.sourceforge.net/devel/examples/generated/example_ols.html (Adding this column did not change the variable coefficients to any notable degree, and the intercept was very close to zero.) StandardScaler didn't like that my ints weren't floats, so I tried this: https://github.com/scikit-learn/scikit-learn/issues/1709 That makes the warning go away, but the results are exactly the same.

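Roughly, the preprocessing looks like this (a minimal sketch with a hypothetical integer feature matrix, not my actual data):

import numpy as np
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_int = rng.integers(1, 60, size=(100, 3))      # int features, e.g. counts and levels
y = X_int @ np.array([0.1, 0.02, -0.05]) + rng.normal(size=100)

X = StandardScaler().fit_transform(X_int.astype(float))  # float cast avoids the dtype warning
X = sm.add_constant(X)                                   # explicit intercept column

print(sm.OLS(y, X).fit().params)  # first entry is the intercept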

Granted, I'm using 5-fold CV for the sklearn approach (the R^2 values are consistent for both test and training data each time), whereas for statsmodels I just give it all the data.

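Each CV fold is fit on only part of the rows, so per-fold coefficients will generally differ from a full-data fit; a minimal sketch with synthetic data:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=500)

print("full data:", LinearRegression().fit(X, y).coef_)

# Coefficients refit on each training fold drift away from the full-data
# fit (slightly here; potentially a lot on messier, collinear data)
for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    print("fold:     ", LinearRegression().fit(X[train_idx], y[train_idx]).coef_)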

R^2 is about 0.41 for both sklearn and statsmodels (this is good for social science). This could be a good sign or just a coincidence.

The data is observations of avatars in WoW (from http://mmnet.iis.sinica.edu.tw/dl/wowah/), which I munged to make it weekly, with some different features. Originally this was a project for a data science class.

Independent variables include the number of observations in a week (int), character level (int), whether in a guild (Boolean), when seen (Booleans for weekday day, weekday eve, weekday late, and the same three for the weekend), dummies for character class (at the time of data collection there were only 8 classes in WoW, so there are 7 dummy vars and the original string categorical variable is dropped), and others.

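Roughly, the dummy coding looks like this (a minimal sketch; the column name char_class and the class list are illustrative): dropping one of the 8 levels leaves 7 dummies and avoids perfect collinearity with the intercept.

import pandas as pd

df = pd.DataFrame({"char_class": ["Warrior", "Hunter", "Rogue", "Priest",
                                  "Shaman", "Mage", "Warlock", "Druid"]})

# 8 classes -> 7 dummy columns; the dropped level becomes the baseline
dummies = pd.get_dummies(df["char_class"], prefix="class", drop_first=True)
df = pd.concat([df.drop(columns="char_class"), dummies], axis=1)
print(df.shape[1])  # 7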

The dependent variable is how many levels each character gained during that week (int).

Interestingly, some of the relative order within like variables is maintained across statsmodels and sklearn. So, the rank order of the "when seen" variables is the same even though the loadings are very different, and the rank order of the character-class dummies is the same even though, again, the loadings are very different.

I think this question is similar to this one: Difference in Python statsmodels OLS and R's lm

I am good enough at Python and stats to make a go of it, but not good enough to figure something like this out. I tried reading the sklearn docs and the statsmodels docs, but if the answer was there staring me in the face, I did not understand it.

I would love to know:

  1. Which output might be accurate? (Granted they might both be if I missed a kwarg.)
  2. If I made a mistake, what is it and how to fix it?
  3. Could I have figured this out without asking here, and if so how?

I know this question has some rather vague bits (no code, no data, no output), but I am thinking it is more about the general processes of the two packages. Sure, one seems to be more stats and one seems to be more machine learning, but they're both OLS so I don't understand why the outputs aren't the same.

(I even tried some other OLS calls to triangulate: one gave a much lower R^2, one looped for five minutes before I killed it, and one crashed.)

Thanks!

Accepted answer by Vincent

It sounds like you are not feeding the same matrix of regressors X to both procedures (but see below). Here's an example to show you which options you need to use for sklearn and statsmodels to produce identical results.

import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# Generate artificial data (2 regressors + constant)
nobs = 100 
X = np.random.random((nobs, 2)) 
X = sm.add_constant(X)
beta = [1, .1, .5] 
e = np.random.random(nobs)
y = np.dot(X, beta) + e 

# Fit with statsmodels; X already contains the constant column
sm.OLS(y, X).fit().params
>> array([ 1.4507724 ,  0.08612654,  0.60129898])

# Fit with sklearn; turn off its own intercept since X already has one
LinearRegression(fit_intercept=False).fit(X, y).coef_
>> array([ 1.4507724 ,  0.08612654,  0.60129898])

As a commenter suggested, even if you are giving both programs the same X, X may not have full column rank, and sm/sk could be taking (different) actions under the hood to make the OLS computation go through (i.e. dropping different columns).

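A quick way to check for this, continuing from the X generated above (if the rank is less than the number of columns, the two libraries may resolve the collinearity differently):

print(np.linalg.matrix_rank(X), X.shape[1])  # equal -> full column rank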

I recommend you use pandas and patsy to take care of this:

import pandas as pd
from patsy import dmatrices

dat = pd.read_csv('wow.csv')
# patsy builds y and X from the formula, adding an intercept and
# expanding categoricals into dummies in one consistent place
y, X = dmatrices('levels ~ week + character + guild', data=dat)

Alternatively, use the statsmodels formula interface:

import pandas as pd
import statsmodels.formula.api as smf

dat = pd.read_csv('wow.csv')
mod = smf.ols('levels ~ week + character + guild', data=dat).fit()

Edit: This example might be useful: http://statsmodels.sourceforge.net/devel/example_formulas.html

Answered by Palu

I just wanted to add here that sklearn does not use the OLS method for linear regression under the hood. Since sklearn comes from the data-mining/machine-learning realm, they like to use the steepest-descent gradient algorithm. This is a numerical method that is sensitive to initial conditions etc., while OLS is an analytical closed-form approach, so one should expect differences. Statsmodels comes from the classical statistics field, hence it uses the OLS technique. So there are differences between the two linear regressions from the two different libraries.

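To illustrate the general contrast between iterative gradient methods and a closed-form solution, here is a sketch comparing sklearn's explicitly gradient-based SGDRegressor with a normal-equations estimate (an illustration of the contrast in general, not a claim about what LinearRegression does internally):

import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(size=1000)

# Closed-form OLS via least squares on [1, X]
Xc = np.column_stack([np.ones(len(X)), X])
beta_closed, *_ = np.linalg.lstsq(Xc, y, rcond=None)

# Iterative gradient fit; penalty=None turns off the default L2 term,
# and the answer still depends on learning rate, tolerance, and seed
sgd = SGDRegressor(penalty=None, max_iter=10000, tol=1e-8, random_state=0).fit(X, y)

print(beta_closed)                # ~[0, 1, -2]
print(sgd.intercept_, sgd.coef_)  # close, but not bit-for-bit identical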

Answered by Sarah

If you use statsmodels, I would highly recommend using the statsmodels formula interface instead. You will get the same OLS results from the statsmodels formula interface as you would from sklearn.linear_model.LinearRegression, or R, or SAS, or Excel.

import statsmodels.formula.api as smf

# df is assumed to be a pandas DataFrame with columns 'y' and 'x'
smod = smf.ols(formula='y ~ x', data=df)
result = smod.fit()
print(result.summary())

When in doubt, please

  1. try reading the source code,
  2. try a different language as a benchmark, or
  3. try OLS from scratch, which is basic linear algebra (a sketch follows below).
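
For item 3, a minimal from-scratch OLS sketch (synthetic data) that can serve as a third point of triangulation:

import numpy as np

def ols_from_scratch(X, y):
    """Closed-form OLS: solve the least-squares problem for [1, X]."""
    Xc = np.column_stack([np.ones(len(X)), X])   # prepend intercept column
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    return beta

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))
y = 1.0 + X @ np.array([0.1, 0.5]) + rng.normal(size=100)
print(ols_from_scratch(X, y))  # roughly [1.0, 0.1, 0.5]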