Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same license and attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/25676145/
Capturing high multi-collinearity in statsmodels
Asked by Amelio Vazquez-Reina
Say I fit a model in statsmodels:
mod = smf.ols('dependent ~ first_category + second_category + other', data=df).fit()
When I do mod.summary() I may see the following:
Warnings:
[1] The condition number is large, 1.59e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
Sometimes the warning is different (e.g. based on eigenvalues of the design matrix). How can I capture high multi-collinearity conditions in a variable? Is this warning stored somewhere in the model object?
Also, where can I find a description of the fields in summary()?
Accepted answer by behzad.nouri
You can detect high multi-collinearity by inspecting the eigen values of the correlation matrix. A very low eigen value shows that the data are collinear, and the corresponding eigen vector shows which variables are collinear.
If there is no collinearity in the data, you would expect that none of the eigen values are close to zero:
>>> xs = np.random.randn(100, 5) # independent variables
>>> corr = np.corrcoef(xs, rowvar=0) # correlation matrix
>>> w, v = np.linalg.eig(corr) # eigen values & eigen vectors
>>> w
array([ 1.256 , 1.1937, 0.7273, 0.9516, 0.8714])
However, if say x[4] - 2 * x[0] - 3 * x[2] = 0, then
>>> noise = np.random.randn(100) # white noise
>>> xs[:,4] = 2 * xs[:,0] + 3 * xs[:,2] + .5 * noise # collinearity
>>> corr = np.corrcoef(xs, rowvar=0)
>>> w, v = np.linalg.eig(corr)
>>> w
array([ 0.0083, 1.9569, 1.1687, 0.8681, 0.9981])
one of the eigen values (here the very first one) is close to zero. The corresponding eigen vector is:
>>> v[:,0]
array([-0.4077, 0.0059, -0.5886, 0.0018, 0.6981])
Ignoring the almost-zero coefficients, the above basically says x[0], x[2] and x[4] are collinear (as expected). If one standardizes the xs values and multiplies by this eigen vector, the result will hover around zero with small variance:
>>> std_xs = (xs - xs.mean(axis=0)) / xs.std(axis=0) # standardized values
>>> ys = std_xs.dot(v[:,0])
>>> ys.mean(), ys.var()
(0, 0.0083)
Note that ys.var() is basically the eigen value which was close to zero.
So, in order to capture high multi-collinearity, look at the eigen values of the correlation matrix.
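A related per-variable diagnostic, not mentioned in the answer above but shipped with statsmodels, is the variance inflation factor. A sketch, reusing the answer's collinear setup (the ~5-10 cutoff is a common rule of thumb, not something statsmodels enforces):

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.RandomState(1)
xs = rng.randn(100, 5)                                       # independent variables
xs[:, 4] = 2 * xs[:, 0] + 3 * xs[:, 2] + 0.5 * rng.randn(100)  # collinearity

# VIF of column j is 1 / (1 - R^2) from regressing column j on the others;
# values above roughly 5-10 are commonly taken to indicate collinearity.
vifs = [variance_inflation_factor(xs, j) for j in range(xs.shape[1])]
print(vifs)
```

Unlike a single eigen value, this tells you directly which columns are involved: here columns 0, 2 and 4 get large VIFs while the independent columns stay near 1.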
Answered by elz
Based on a similar question for R, there are some other options that may help people. I was looking for a single number that captured the collinearity; options include the determinant and the condition number of the correlation matrix.
According to one of the R answers, determinant of the correlation matrix will "range from 0 (Perfect Collinearity) to 1 (No Collinearity)". I found the bounded range helpful.
Translated example for determinant:
import numpy as np
import pandas as pd
# Create a sample random dataframe
np.random.seed(321)
x1 = np.random.rand(100)
x2 = np.random.rand(100)
x3 = np.random.rand(100)
df = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3})
# Now create a dataframe with multicollinearity
multicollinear_df = df.copy()
multicollinear_df['x3'] = multicollinear_df['x1'] + multicollinear_df['x2']
# Compute both correlation matrices
corr = np.corrcoef(df, rowvar=0)
multicollinear_corr = np.corrcoef(multicollinear_df, rowvar=0)
# Compare the determinants
print(np.linalg.det(corr))                 # 0.988532159861
print(np.linalg.det(multicollinear_corr))  # 2.97779797328e-16
And similarly, the condition number of the correlation matrix will approach infinity with perfect linear dependence.
print(np.linalg.cond(corr))                 # 1.23116253259
print(np.linalg.cond(multicollinear_corr))  # 6.19985218873e+15
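The two answers are measuring the same thing: for a symmetric positive semi-definite matrix such as a correlation matrix, the 2-norm condition number is exactly the ratio of the largest to the smallest eigen value. A quick NumPy check (my addition), reusing the collinear data from the accepted answer:

```python
import numpy as np

rng = np.random.RandomState(2)
xs = rng.randn(100, 5)
xs[:, 4] = 2 * xs[:, 0] + 3 * xs[:, 2] + 0.5 * rng.randn(100)  # collinearity
corr = np.corrcoef(xs, rowvar=0)

# np.linalg.cond uses singular values; for a symmetric PSD matrix these
# coincide with the eigen values, so cond == lambda_max / lambda_min.
w = np.linalg.eigvalsh(corr)
print(np.isclose(np.linalg.cond(corr), w.max() / w.min()))  # True
```

So a small eigen value, a near-zero determinant, and a huge condition number are three views of the same degeneracy in the correlation matrix.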