How to use scikit-learn PCA for feature reduction and know which features are discarded

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute the original authors (not the translator). Original question: http://stackoverflow.com/questions/23294616/



Tags: python, machine-learning, scikit-learn, pca, feature-selection

Asked by gc5

I am trying to run a PCA on a matrix of dimensions m x n where m is the number of features and n the number of samples.


Suppose I want to preserve the nf features with the maximum variance. With scikit-learn I am able to do it in this way:


from sklearn.decomposition import PCA

nf = 100
pca = PCA(n_components=nf)
# X is the matrix transposed (n samples on the rows, m features on the columns)
pca.fit(X)

X_new = pca.transform(X)

Now, I get a new matrix X_new that has a shape of n x nf. Is it possible to know which features have been discarded, or which have been retained?


Thanks


Accepted answer by eickenberg

The features that your PCA object has determined during fitting are in pca.components_. The vector space orthogonal to the one spanned by pca.components_ is discarded.

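For example, here is a minimal sketch (assuming the pca object fitted as in the question) that inspects the loadings, i.e. how strongly each original feature contributes to each retained direction:

import numpy as np

# pca.components_ has shape (n_components, n_features); row k holds the
# weights of the original features in the k-th principal direction.
loadings = pca.components_

# Rank the original features by the absolute weight they carry in the
# first principal component.
top_for_pc1 = np.argsort(np.abs(loadings[0]))[::-1]
print("Feature indices ordered by influence on PC1:", top_for_pc1[:10])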

Please note that PCA does not "discard" or "retain" any of your pre-defined features (encoded by the columns you specify). It mixes all of them (by weighted sums) to find orthogonal directions of maximum variance.

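As a small illustration of the "weighted sums" point (a sketch assuming the default settings, i.e. whiten=False), each transformed column can be reproduced by hand from the centered data:

import numpy as np

# With whiten=False, transform(X) is (X - mean_) @ components_.T:
# every new column is a weighted sum over all the original features.
X_centered = X - pca.mean_
manual = X_centered @ pca.components_.T

assert np.allclose(manual, pca.transform(X))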

If this is not the behaviour you are looking for, then PCA dimensionality reduction is not the way to go. For some simple general feature selection methods, you can take a look at sklearn.feature_selection.

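For instance, a minimal sketch with SelectKBest (where y is a hypothetical vector of class labels, not part of the original question) shows how a selector, unlike PCA, can report exactly which original features were kept:

from sklearn.feature_selection import SelectKBest, f_classif

# Keep the nf original features that score highest against the labels y.
selector = SelectKBest(score_func=f_classif, k=nf)
X_selected = selector.fit_transform(X, y)

# Unlike PCA, the selector can report exactly what was retained.
kept = selector.get_support(indices=True)
print("Retained feature indices:", kept)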

Answered by emeth

Projecting the features onto the principal components retains the important information (the axes with maximum variance) and drops the axes with small variance. This behavior is like compression (not discarding).

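A small sketch of this idea (assuming the fitted pca object from the question): explained_variance_ratio_ quantifies how much variance the retained axes carry and how much was compressed away.

retained = pca.explained_variance_ratio_.sum()
print("Variance retained by the kept components:", round(retained, 3))
print("Variance dropped (the compressed part):", round(1 - retained, 3))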

X_proj would be a better name than X_new, because it is the projection of X onto the principal components.


You can reconstruct X_rec as


X_rec = pca.inverse_transform(X_proj) # X_proj is originally X_new

Here, X_rec is close to X, but the less important information was dropped by PCA. So we can say X_rec is denoised.

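A quick sketch to quantify "close" (assuming the X and X_proj from above); the residual is exactly the variance that lived in the dropped directions:

import numpy as np

X_rec = pca.inverse_transform(X_proj)

# The relative reconstruction error is small when the dropped axes
# carried little variance, i.e. when mostly noise was discarded.
rel_error = np.linalg.norm(X - X_rec) / np.linalg.norm(X)
print("Relative reconstruction error:", rel_error)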

In my opinion, we can say it is the noise that gets discarded.


Answered by Pramod Kalipatnapu

The answer marked above is incorrect. The sklearn docs clearly state that the components_ array is sorted, so it cannot be used to identify the important features.


components_ : array, [n_components, n_features] Principal axes in feature space, representing the directions of maximum variance in the data. The components are sorted by explained_variance_.


http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

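A minimal sketch of what that sorting means in practice (assuming the fitted pca object from the question): each row of components_ pairs with one entry of explained_variance_, in decreasing order, so the rows describe directions of variance, not individual important features.

# explained_variance_ is non-increasing, so components_[0] is the
# direction of maximum variance: a mix of features, not a feature.
for k, var in enumerate(pca.explained_variance_):
    print("Component", k, "explained variance:", var)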