Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same license and attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/25676145/
Capturing high multi-collinearity in statsmodels
Asked by Amelio Vazquez-Reina
Say I fit a model in statsmodels:
mod = smf.ols('dependent ~ first_category + second_category + other', data=df).fit()
When I do mod.summary() I may see the following:
Warnings:
[1] The condition number is large, 1.59e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
Sometimes the warning is different (e.g. based on eigenvalues of the design matrix). How can I capture high multi-collinearity conditions in a variable? Is this warning stored somewhere in the model object?
Also, where can I find a description of the fields in summary()?
Accepted answer by behzad.nouri
You can detect high multi-collinearity by inspecting the eigen values of the correlation matrix. A very low eigen value shows that the data are collinear, and the corresponding eigen vector shows which variables are collinear.
If there is no collinearity in the data, you would expect that none of the eigen values are close to zero:
>>> xs = np.random.randn(100, 5) # independent variables
>>> corr = np.corrcoef(xs, rowvar=0) # correlation matrix
>>> w, v = np.linalg.eig(corr) # eigen values & eigen vectors
>>> w
array([ 1.256 , 1.1937, 0.7273, 0.9516, 0.8714])
However, if say x[4] - 2 * x[0] - 3 * x[2] = 0, then
>>> noise = np.random.randn(100) # white noise
>>> xs[:,4] = 2 * xs[:,0] + 3 * xs[:,2] + .5 * noise # collinearity
>>> corr = np.corrcoef(xs, rowvar=0)
>>> w, v = np.linalg.eig(corr)
>>> w
array([ 0.0083, 1.9569, 1.1687, 0.8681, 0.9981])
one of the eigen values (here the very first one) is close to zero. The corresponding eigen vector is:
>>> v[:,0]
array([-0.4077, 0.0059, -0.5886, 0.0018, 0.6981])
Ignoring the almost-zero coefficients, the above basically says x[0], x[2] and x[4] are collinear (as expected). If one standardizes the xs values and multiplies by this eigen vector, the result will hover around zero with small variance:
>>> std_xs = (xs - xs.mean(axis=0)) / xs.std(axis=0) # standardized values
>>> ys = std_xs.dot(v[:,0])
>>> ys.mean(), ys.var()
(0, 0.0083)
Note that ys.var() is basically the eigen value which was close to zero.
So, in order to capture high multi-collinearity, look at the eigen values of the correlation matrix.
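A related per-variable diagnostic, not mentioned in the answer above but shipped with statsmodels, is the variance inflation factor. A sketch, reusing the answer's collinear setup (the ~5-10 cutoff is a common rule of thumb, not something statsmodels enforces):

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.RandomState(1)
xs = rng.randn(100, 5)                                       # independent variables
xs[:, 4] = 2 * xs[:, 0] + 3 * xs[:, 2] + 0.5 * rng.randn(100)  # collinearity

# VIF of column j is 1 / (1 - R^2) from regressing column j on the others;
# values above roughly 5-10 are commonly taken to indicate collinearity.
vifs = [variance_inflation_factor(xs, j) for j in range(xs.shape[1])]
print(vifs)
```

Unlike a single eigen value, this tells you directly which columns are involved: here columns 0, 2 and 4 get large VIFs while the independent columns stay near 1.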
Answered by elz
Based on a similar question for R, there are some other options that may help people. I was looking for a single number that captured the collinearity; options include the determinant and the condition number of the correlation matrix.
According to one of the R answers, determinant of the correlation matrix will "range from 0 (Perfect Collinearity) to 1 (No Collinearity)". I found the bounded range helpful.
Translated example for determinant:
import numpy as np
import pandas as pd
# Create a sample random dataframe
np.random.seed(321)
x1 = np.random.rand(100)
x2 = np.random.rand(100)
x3 = np.random.rand(100)
df = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3})
# Now create a dataframe with multicollinearity
multicollinear_df = df.copy()
multicollinear_df['x3'] = multicollinear_df['x1'] + multicollinear_df['x2']
# Compute both correlation matrices
corr = np.corrcoef(df, rowvar=0)
multicollinear_corr = np.corrcoef(multicollinear_df, rowvar=0)
# Compare the determinants
print(np.linalg.det(corr))                 # 0.988532159861
print(np.linalg.det(multicollinear_corr))  # 2.97779797328e-16
And similarly, the condition number of the correlation matrix will approach infinity with perfect linear dependence.
print(np.linalg.cond(corr))                 # 1.23116253259
print(np.linalg.cond(multicollinear_corr))  # 6.19985218873e+15
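The two answers are measuring the same thing: for a symmetric positive semi-definite matrix such as a correlation matrix, the 2-norm condition number is exactly the ratio of the largest to the smallest eigen value. A quick NumPy check (my addition), reusing the collinear data from the accepted answer:

```python
import numpy as np

rng = np.random.RandomState(2)
xs = rng.randn(100, 5)
xs[:, 4] = 2 * xs[:, 0] + 3 * xs[:, 2] + 0.5 * rng.randn(100)  # collinearity
corr = np.corrcoef(xs, rowvar=0)

# np.linalg.cond uses singular values; for a symmetric PSD matrix these
# coincide with the eigen values, so cond == lambda_max / lambda_min.
w = np.linalg.eigvalsh(corr)
print(np.isclose(np.linalg.cond(corr), w.max() / w.min()))  # True
```

So a small eigen value, a near-zero determinant, and a huge condition number are three views of the same degeneracy in the correlation matrix.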