Multiple linear regression in pandas statsmodels: ValueError

Source: Stack Overflow, http://stackoverflow.com/questions/29186436/. This content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors and keep it under the same license.

Tags: python, pandas

Asked by alkamid

Data: https://courses.edx.org/c4x/MITx/15.071x_2/asset/NBA_train.csv

I know how to fit these data to a multiple linear regression model using statsmodels.formula.api:

import pandas as pd
NBA = pd.read_csv("NBA_train.csv")
import statsmodels.formula.api as smf
model = smf.ols(formula="W ~ PTS + oppPTS", data=NBA).fit()
model.summary()

However, I find this R-like formula notation awkward and I'd like to use the usual pandas syntax:

import pandas as pd
NBA = pd.read_csv("NBA_train.csv")    
import statsmodels.api as sm
X = NBA['W']
y = NBA[['PTS', 'oppPTS']]
X = sm.add_constant(X)
model11 = sm.OLS(y, X).fit()
model11.summary()

Using the second method I get the following error:

ValueError: shapes (835,2) and (835,2) not aligned: 2 (dim 1) != 835 (dim 0)

Why does this happen, and how can I fix it?

Answered by unutbu

When using sm.OLS(y, X), y is the dependent variable and X holds the independent variables.

In the formula W ~ PTS + oppPTS, W is the dependent variable, and PTS and oppPTS are the independent variables.
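To make the two roles concrete, here is a minimal sketch that uses NumPy's least-squares solver as a stand-in for statsmodels. The data are synthetic (made-up numbers, not the NBA file), and the names `wins`, `pts`, and `opp_pts` are illustrative only. The point is that y is a one-dimensional vector with one value per observation, while X is a matrix with one column per regressor plus a column of ones for the intercept.

```python
import numpy as np

# Synthetic stand-in data (an assumption for illustration, not the NBA data set).
rng = np.random.default_rng(0)
n = 50
pts = rng.normal(8000, 500, n)
opp_pts = rng.normal(8000, 500, n)
wins = 41 + 0.03 * pts - 0.03 * opp_pts + rng.normal(0, 1, n)

# Dependent variable: one value per observation.
y = wins

# Independent variables: a column of ones (the intercept) plus one column per regressor.
X = np.column_stack([np.ones(n), pts, opp_pts])

# Ordinary least squares: find coef that minimizes ||X @ coef - y||.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

print(y.shape, X.shape, coef.shape)
print(coef)
```

The solver returns one coefficient per column of X, which is why the regressors, and not the response, must be the two-dimensional argument.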

Therefore, use

y = NBA['W']
X = NBA[['PTS', 'oppPTS']]

instead of

X = NBA['W']
y = NBA[['PTS', 'oppPTS']]
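The shapes in the error message follow from how pandas indexing works. The sketch below uses a toy three-row frame with made-up numbers (an assumption for illustration, not the NBA data): single brackets return a one-dimensional Series, and double brackets return a two-dimensional DataFrame.

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the question.
df = pd.DataFrame({
    "W": [50, 42, 30],
    "PTS": [8500, 8200, 7900],
    "oppPTS": [8100, 8200, 8300],
})

w = df["W"]                    # single brackets: a Series
pts = df[["PTS", "oppPTS"]]    # double brackets: a DataFrame

print(w.shape, w.ndim)         # (3,) 1
print(pts.shape, pts.ndim)     # (3, 2) 2
```

With the variables swapped, y is the two-column frame and X is the single W column plus the constant, so both end up with two columns. That is consistent with the two (835, 2) shapes reported in the ValueError.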


import pandas as pd
import statsmodels.api as sm

NBA = pd.read_csv("NBA_train.csv")    
y = NBA['W']
X = NBA[['PTS', 'oppPTS']]
X = sm.add_constant(X)
model11 = sm.OLS(y, X).fit()
model11.summary()

yields

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      W   R-squared:                       0.942
Model:                            OLS   Adj. R-squared:                  0.942
Method:                 Least Squares   F-statistic:                     6799.
Date:                Sat, 21 Mar 2015   Prob (F-statistic):               0.00
Time:                        14:58:05   Log-Likelihood:                -2118.0
No. Observations:                 835   AIC:                             4242.
Df Residuals:                     832   BIC:                             4256.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         41.3048      1.610     25.652      0.000        38.144    44.465
PTS            0.0326      0.000    109.600      0.000         0.032     0.033
oppPTS        -0.0326      0.000   -110.951      0.000        -0.033    -0.032
==============================================================================
Omnibus:                        1.026   Durbin-Watson:                   2.238
Prob(Omnibus):                  0.599   Jarque-Bera (JB):                0.984
Skew:                           0.084   Prob(JB):                        0.612
Kurtosis:                       3.009   Cond. No.                     1.80e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.8e+05. This might indicate that there are
strong multicollinearity or other numerical problems.