Python scikit learn pca.explained_variance_ratio_ cutoff

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/32857029/

Tags: python, scikit-learn, pca

Asked by Chubaka

When choosing the number of principal components (k), we choose k to be the smallest value so that for example, 99% of variance, is retained.

However, in Python scikit-learn, I am not 100% sure that pca.explained_variance_ratio_ = 0.99 is equal to "99% of the variance is retained". Could anyone enlighten me? Thanks.

  • The Python Scikit learn PCA manual is here

http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA

Accepted answer by Curt F.

Yes, you are nearly right. The pca.explained_variance_ratio_ attribute returns a vector of the fraction of total variance explained by each dimension. Thus pca.explained_variance_ratio_[i] gives the variance explained solely by the (i+1)-th dimension.

You probably want pca.explained_variance_ratio_.cumsum(). That will return a vector x such that x[i] gives the cumulative variance explained by the first i+1 dimensions.

import numpy as np
from sklearn.decomposition import PCA

# Toy data: 20 samples with 5 features
np.random.seed(0)
my_matrix = np.random.randn(20, 5)

# Fit PCA, keeping all 5 components
my_model = PCA(n_components=5)
my_model.fit_transform(my_matrix)

print(my_model.explained_variance_)                  # variance per component
print(my_model.explained_variance_ratio_)            # fraction of total variance per component
print(my_model.explained_variance_ratio_.cumsum())   # cumulative fraction


[ 1.50756565  1.29374452  0.97042041  0.61712667  0.31529082]
[ 0.32047581  0.27502207  0.20629036  0.13118776  0.067024  ]
[ 0.32047581  0.59549787  0.80178824  0.932976    1.        ]

So in my random toy data, if I picked k=4, I would retain 93.3% of the variance.

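If you would rather compute that cutoff than eyeball it, here is a small helper of my own (a sketch, not part of sklearn; the name smallest_k is made up) that returns the smallest k whose cumulative ratio reaches a threshold:

import numpy as np

def smallest_k(explained_variance_ratio, threshold=0.99):
    # First index where the cumulative ratio reaches the threshold,
    # converted to a 1-based component count.
    cumulative = np.cumsum(explained_variance_ratio)
    return int(np.searchsorted(cumulative, threshold)) + 1

# Using the ratios from the toy data above, 93% needs 4 components:
ratios = [0.32047581, 0.27502207, 0.20629036, 0.13118776, 0.067024]
print(smallest_k(ratios, threshold=0.93))  # -> 4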

Answered by Yannic Klem

Although this question is more than two years old, I want to provide an update. I wanted to do the same thing, and it looks like sklearn now provides this feature out of the box.

As stated in the docs:

if 0 < n_components < 1 and svd_solver == 'full', select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components

So the required code is now:

# Keep the smallest number of components that together explain at least 99% of the variance
my_model = PCA(n_components=0.99, svd_solver='full')
my_model.fit_transform(my_matrix)
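
After fitting, you can check how many components were actually kept; this check is my own addition, reusing my_matrix from the accepted answer:

print(my_model.n_components_)                    # number of components kept
print(my_model.explained_variance_ratio_.sum())  # at least 0.99 by construction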

Answered by Julian

This worked for me with even less typing in the PCA section. The rest is added for convenience. Only 'data' needs to be defined at an earlier stage.

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardize the features first so PCA is not dominated by feature scale
st = StandardScaler().fit_transform(data)

# Keep enough components to explain at least 80% of the variance
pca = PCA(0.80)
pc = pca.fit_transform(st)  # retain the transformed components in an object
pc

# pca.explained_variance_ratio_
print("Components =", pca.n_components_,
      ";\nTotal explained variance =",
      round(pca.explained_variance_ratio_.sum(), 5))