Original question: http://stackoverflow.com/questions/29186436/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Multiple linear regression in pandas statsmodels: ValueError
Asked by alkamid
Data: https://courses.edx.org/c4x/MITx/15.071x_2/asset/NBA_train.csv
I know how to fit these data to a multiple linear regression model using statsmodels.formula.api:
import pandas as pd
NBA = pd.read_csv("NBA_train.csv")
import statsmodels.formula.api as smf
model = smf.ols(formula="W ~ PTS + oppPTS", data=NBA).fit()
model.summary()
However, I find this R-like formula notation awkward and I'd like to use the usual pandas syntax:
import pandas as pd
NBA = pd.read_csv("NBA_train.csv")
import statsmodels.api as sm
X = NBA['W']
y = NBA[['PTS', 'oppPTS']]
X = sm.add_constant(X)
model11 = sm.OLS(y, X).fit()
model11.summary()
Using the second method I get the following error:
ValueError: shapes (835,2) and (835,2) not aligned: 2 (dim 1) != 835 (dim 0)
Why does it happen and how to fix it?
Answered by unutbu
When using sm.OLS(y, X), y is the dependent variable and X holds the independent variables.
In the formula W ~ PTS + oppPTS, W is the dependent variable, and PTS and oppPTS are the independent variables.
Therefore, use
y = NBA['W']
X = NBA[['PTS', 'oppPTS']]
instead of
X = NBA['W']
y = NBA[['PTS', 'oppPTS']]
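The shapes in the error message are consistent with the swapped roles: add_constant turns the single W column into an (835, 2) matrix, and the two-column y is also (835, 2). The sketch below reproduces that situation with NumPy alone on made-up data. It assumes, without having traced the statsmodels source, that somewhere during the fit an inner product is taken that expects one-dimensional residuals; the variable names are illustrative only.

```python
import numpy as np

# Synthetic stand-ins with the shapes from the question (not the NBA data).
rng = np.random.default_rng(0)
n = 835
wins = rng.normal(size=n)           # plays the role of NBA['W']
points = rng.normal(size=(n, 2))    # plays the role of NBA[['PTS', 'oppPTS']]

# Swapped roles, as in the question: the "design matrix" is a constant plus W,
# and the "response" has two columns.
X_swapped = np.column_stack([np.ones(n), wins])   # shape (835, 2)
y_swapped = points                                # shape (835, 2)

# The least-squares coefficients still compute, but as a 2x2 matrix...
params, *_ = np.linalg.lstsq(X_swapped, y_swapped, rcond=None)
residuals = y_swapped - X_swapped @ params        # shape (835, 2), not (835,)

# ...so an inner product written for 1-D residuals cannot be formed.
# (Assumption: this mimics the step that fails; it is not statsmodels' code.)
try:
    np.dot(residuals, residuals)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(X_swapped.shape, y_swapped.shape, residuals.shape)
print(error_message)
```

With y as a single column, the residuals are one-dimensional and the same inner product is an ordinary sum of squares.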
With that change, the complete script
import pandas as pd
import statsmodels.api as sm
NBA = pd.read_csv("NBA_train.csv")
y = NBA['W']
X = NBA[['PTS', 'oppPTS']]
X = sm.add_constant(X)
model11 = sm.OLS(y, X).fit()
model11.summary()
yields
                            OLS Regression Results
==============================================================================
Dep. Variable:                      W   R-squared:                       0.942
Model:                            OLS   Adj. R-squared:                  0.942
Method:                 Least Squares   F-statistic:                     6799.
Date:                Sat, 21 Mar 2015   Prob (F-statistic):               0.00
Time:                        14:58:05   Log-Likelihood:                -2118.0
No. Observations:                 835   AIC:                             4242.
Df Residuals:                     832   BIC:                             4256.
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         41.3048      1.610     25.652      0.000        38.144    44.465
PTS            0.0326      0.000    109.600      0.000         0.032     0.033
oppPTS        -0.0326      0.000   -110.951      0.000        -0.033    -0.032
==============================================================================
Omnibus:                        1.026   Durbin-Watson:                   2.238
Prob(Omnibus):                  0.599   Jarque-Bera (JB):                0.984
Skew:                           0.084   Prob(JB):                        0.612
Kurtosis:                       3.009   Cond. No.                     1.80e+05
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.8e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

