Python: How to use scikit-learn PCA for feature reduction and know which features are discarded
Disclaimer: This page is a Chinese-English translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. If you use it, you must follow the same CC BY-SA license, cite the original address and author information, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/23294616/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me): StackOverFlow
How to use scikit-learn PCA for features reduction and know which features are discarded
Asked by gc5
I am trying to run a PCA on a matrix of dimensions m x n where m is the number of features and n the number of samples.
Suppose I want to preserve the nf features with the maximum variance. With scikit-learn I am able to do it in this way:
from sklearn.decomposition import PCA
nf = 100
pca = PCA(n_components=nf)
# X is the matrix transposed (n samples on the rows, m features on the columns)
pca.fit(X)
X_new = pca.transform(X)
Now, I get a new matrix X_new that has a shape of n x nf. Is it possible to know which features have been discarded, or which ones have been retained?
Thanks
Accepted answer by eickenberg
The features that your PCA object has determined during fitting are in pca.components_. The vector space orthogonal to the one spanned by pca.components_ is discarded.
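For illustration, here is a minimal sketch (assuming the pca object fitted on X as in the question) of how to inspect pca.components_ and see which original features carry the largest weight in each retained direction:

import numpy as np

# pca.components_ has shape (nf, m): one row per retained direction,
# one column (loading) per original feature.
print(pca.components_.shape)

# For each principal component, the original feature with the largest
# absolute loading, i.e. the feature contributing most to that direction.
top_feature_per_component = np.argmax(np.abs(pca.components_), axis=1)
print(top_feature_per_component)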
Please note that PCA does not "discard" or "retain" any of your pre-defined features (encoded by the columns you specify). It mixes all of them (by weighted sums) to find orthogonal directions of maximum variance.
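To make the "weighted sums" point concrete, here is a small sketch (assuming the default whiten=False) showing that each column of the transformed data is a linear combination of the centered original features:

import numpy as np

# With whiten=False, pca.transform(X) is just the centered data projected
# onto the component directions, i.e. weighted sums of the original features.
X_manual = (X - pca.mean_) @ pca.components_.T
print(np.allclose(X_manual, X_new))  # expected: True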
If this is not the behaviour you are looking for, then PCA dimensionality reduction is not the way to go. For some simple general feature selection methods, you can take a look at sklearn.feature_selection.
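For example, here is a minimal sketch of keeping the nf original columns with the largest variance (a plain NumPy stand-in for the kind of selection sklearn.feature_selection offers), which makes "which features were kept or discarded" a direct question to answer; nf and X are the ones from the question:

import numpy as np

# Keep the nf original columns with the largest variance.
variances = X.var(axis=0)
kept_idx = np.argsort(variances)[-nf:]                         # retained feature indices
discarded_idx = np.setdiff1d(np.arange(X.shape[1]), kept_idx)  # discarded feature indices
X_selected = X[:, kept_idx]

Note that sklearn.feature_selection.VarianceThreshold does something similar but takes a variance threshold rather than a feature count, and its get_support() method reports exactly which columns were kept.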
Answer by emeth
Projecting the features onto the principal components retains the important information (the axes with maximum variance) and drops the axes with small variance. This behavior is like compression, not discarding.
X_proj is a better name than X_new, because it is the projection of X onto the principal components.
You can reconstruct the data as X_rec:
X_rec = pca.inverse_transform(X_proj) # X_proj is originally X_new
Here, X_rec is close to X, but the less important information was dropped by PCA. So we can say X_rec is denoised.
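As a small sketch (reusing the fitted pca from the question), you can measure directly how close X_rec is to X via the reconstruction error:

import numpy as np

# Project and reconstruct, then measure what was lost by dropping
# the low-variance directions.
X_proj = pca.transform(X)                 # same as X_new above
X_rec = pca.inverse_transform(X_proj)
print(np.mean((X - X_rec) ** 2))          # mean squared reconstruction error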
In my opinion, we can say that the noise is what gets discarded.
Answer by Pramod Kalipatnapu
The answer marked above is incorrect. The sklearn site clearly states that the components_ array is sorted, so it can't be used to identify the important features.
components_ : array, [n_components, n_features] Principal axes in feature space, representing the directions of maximum variance in the data. The components are sorted by explained_variance_.
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
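As a hedged illustration of that ordering (again assuming the pca object from the question), explained_variance_ratio_ shows how the sorted components rank by the variance they explain:

import numpy as np

# Components are ordered by explained variance, largest first.
print(pca.explained_variance_ratio_)
print(np.cumsum(pca.explained_variance_ratio_))  # cumulative variance retained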