Python: obtain eigenvalues and eigenvectors from sklearn PCA
Disclaimer: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/31909945/
Obtain eigen values and vectors from sklearn PCA
Asked by Abhishek Bhatia
How can I get the eigenvalues and eigenvectors of the PCA application?
from sklearn.decomposition import PCA
clf=PCA(0.98,whiten=True) #conserve 98% of the variance
X_train=clf.fit_transform(X_train)
X_test=clf.transform(X_test)
I can't find it in the docs.
1.I am "not" able to comprehend the different results here.
1.我“不能”理解这里的不同结果。
Edit:
def pca_code(data):
    #raw_implementation
    var_per=.98
    data-=np.mean(data, axis=0)
    data/=np.std(data, axis=0)
    cov_mat=np.cov(data, rowvar=False)
    evals, evecs = np.linalg.eigh(cov_mat)
    idx = np.argsort(evals)[::-1]
    evecs = evecs[:,idx]
    evals = evals[idx]
    variance_retained=np.cumsum(evals)/np.sum(evals)
    index=np.argmax(variance_retained>=var_per)
    evecs = evecs[:,:index+1]
    reduced_data=np.dot(evecs.T, data.T).T
    print(evals)
    print("_"*30)
    print(evecs)
    print("_"*30)
    #using sklearn's PCA
    clf=PCA(var_per)
    X_train=data.T
    X_train=clf.fit_transform(X_train)
    print(clf.explained_variance_)
    print("_"*30)
    print(clf.components_)
    print("__"*30)
- I wish to obtain all the eigenvalues and eigenvectors instead of just the reduced set kept by the variance threshold (a minimal sketch of one way to do this follows below).
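A minimal sketch of one way to do that, assuming X stands for the original, untransformed training data (the name is illustrative, not from the question): fitting a second PCA without the variance threshold keeps every component, so components_ and explained_variance_ then hold the full set of eigenvectors and eigenvalues.
from sklearn.decomposition import PCA
# Assumption: X is the original (n_samples x n_features) training data, not yet transformed.
full_pca = PCA()   # n_components=None keeps all min(n_samples, n_features) components
full_pca.fit(X)
all_eigenvectors = full_pca.components_          # one eigenvector per row
all_eigenvalues = full_pca.explained_variance_   # one eigenvalue per component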
Accepted answer by ldirer
Your implementation
You are computing the eigenvectors of the correlation matrix, that is, the covariance matrix of the normalized variables. The line data/=np.std(data, axis=0) is not part of classic PCA; we only center the variables. So the sklearn PCA does not feature-scale the data beforehand.
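To illustrate the difference, here is a minimal sketch on made-up data (nothing here comes from the original post): PCA on the covariance matrix only centers the data, while PCA on the correlation matrix also divides each variable by its standard deviation, which generally changes both the eigenvalues and the eigenvectors.
import numpy as np

rng = np.random.RandomState(0)
data = rng.randn(200, 3) * [1.0, 5.0, 0.1]      # features on very different scales

centered = data - data.mean(axis=0)
cov_mat = np.cov(centered, rowvar=False)        # covariance matrix: what PCA without scaling diagonalizes

standardized = centered / data.std(axis=0)
corr_mat = np.cov(standardized, rowvar=False)   # approximately the correlation matrix of the data

print(np.linalg.eigvalsh(cov_mat))              # eigenvalues of the covariance matrix
print(np.linalg.eigvalsh(corr_mat))             # eigenvalues of the correlation matrix (sum close to 3)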
Apart from that you are on the right track, if we set aside the fact that the code you provided did not run ;).
You only got confused with the row/column layouts. Honestly, I think it is much easier to start with X = data.T and work only with X from there on. I added your code, 'fixed', at the end of the post.
Getting the eigenvalues
You already noted that you can get the eigenvectors using clf.components_.
So you have the principal components. They are eigenvectors of the covariance matrix $X^T X$ (with $X$ centered; the $1/n$ normalization does not change the eigenvectors).

A way to retrieve the eigenvalues from there is to apply this matrix to each principal component and project the result onto the component. Let $v_1$ be the first principal component, $\lambda_1$ the associated eigenvalue, and $C = \frac{1}{n} X^T X$ the sample covariance matrix. We have:

$$C v_1 = \lambda_1 v_1$$

and thus:

$$\lambda_1 = v_1^T C v_1 = (v_1, C v_1)$$

since $(v_1, v_1) = 1$, where $(x, y)$ denotes the scalar product of the vectors $x$ and $y$.
Back in Python you can do:
n_samples = X.shape[0]
# We center the data and compute the sample covariance matrix.
X -= np.mean(X, axis=0)
cov_matrix = np.dot(X.T, X) / n_samples
for eigenvector in pca.components_:
    print(np.dot(eigenvector.T, np.dot(cov_matrix, eigenvector)))
And you get the eigenvalue associated with the eigenvector. Well, in my tests it turned out not to work for the last couple of eigenvalues, but I'd attribute that to my lack of skill in numerical stability.
Now that's not the best way to get the eigenvalues, but it's nice to know where they come from.
The eigenvalues represent the variance in the direction of the eigenvector. So you can get them through the pca.explained_variance_ attribute:
eigenvalues = pca.explained_variance_
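As a side note, a minimal sketch of where these numbers come from, assuming a reasonably recent scikit-learn (>= 0.19, which exposes pca.singular_values_): explained_variance_ equals the squared singular values of the centered data divided by n_samples - 1, so the two quantities below should match.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(0).randn(100, 5)     # made-up data
pca = PCA().fit(X)
eigenvalues_from_svd = pca.singular_values_ ** 2 / (X.shape[0] - 1)
print(np.allclose(pca.explained_variance_, eigenvalues_from_svd))  # expected: True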
Here is a reproducible example that prints the eigenvalues you get with each method:
import numpy as np
from sklearn.decomposition import PCA
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000)
n_samples = X.shape[0]
pca = PCA()
X_transformed = pca.fit_transform(X)
# We center the data and compute the sample covariance matrix.
X_centered = X - np.mean(X, axis=0)
cov_matrix = np.dot(X_centered.T, X_centered) / n_samples
eigenvalues = pca.explained_variance_
for eigenvalue, eigenvector in zip(eigenvalues, pca.components_):
    print(np.dot(eigenvector.T, np.dot(cov_matrix, eigenvector)))
    print(eigenvalue)
Your original code, fixed
If you run it you'll see the values are consistent. They're not exactly equal because numpy and scikit-learn are not using the same algorithm here.
The main thing was that you were using the correlation matrix instead of the covariance matrix, as mentioned above. Also, you were getting the transposed eigenvectors from numpy, which made it very confusing.
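Before the fixed code, a small illustration of that layout difference on made-up data (nothing here is from the original post): np.linalg.eigh returns the eigenvectors as columns of its output, while pca.components_ stores them as rows, and the two agree up to a possible sign flip.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(50, 3)
Xc = X - X.mean(axis=0)

evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(evals)[::-1]
evecs = evecs[:, order]                  # columns are eigenvectors, sorted largest first

pca = PCA().fit(X)
for i in range(3):
    v_np = evecs[:, i]                   # i-th eigenvector: a column from eigh
    v_sk = pca.components_[i]            # i-th eigenvector: a row of components_
    print(abs(np.dot(v_np, v_sk)))       # ~1.0: same direction up to sign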
import numpy as np
from scipy.stats.mstats import zscore
from sklearn.decomposition import PCA

def pca_code(data):
    #raw_implementation
    var_per=.98
    data-=np.mean(data, axis=0)
    # data/=np.std(data, axis=0)
    cov_mat=np.cov(data, rowvar=False)
    evals, evecs = np.linalg.eigh(cov_mat)
    idx = np.argsort(evals)[::-1]
    evecs = evecs[:,idx]
    evals = evals[idx]
    variance_retained=np.cumsum(evals)/np.sum(evals)
    index=np.argmax(variance_retained>=var_per)
    evecs = evecs[:,:index+1]
    reduced_data=np.dot(evecs.T, data.T).T
    print("evals", evals)
    print("_"*30)
    print(evecs.T[1, :])
    print("_"*30)
    #using sklearn's PCA
    clf=PCA(var_per)
    X_train=data
    X_train=clf.fit_transform(X_train)
    print(clf.explained_variance_)
    print("_"*30)
    print(clf.components_[1,:])
    print("__"*30)
Hope this helps, feel free to ask for clarifications.
Answered by Lee
I used the sklearn PCA function. The attribute components_ contains the eigenvectors and explained_variance_ contains the eigenvalues. Below is my test code.
from sklearn.decomposition import PCA
import numpy as np

def main():
    data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                     [2.3, 2.7], [2, 1.6], [1, 1.1], [1.5, 1.6], [1.1, 0.9]])
    print(data)
    pca = PCA()
    pca.fit(data)
    print(pca.components_)
    print(pca.explained_variance_)

if __name__ == "__main__":
    main()