Python: Recovering the feature names of explained_variance_ratio_ in PCA with sklearn
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must follow the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/22984335/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me):
StackOverFlow
Recovering features names of explained_variance_ratio_ in PCA with sklearn
Asked by mazieres
I'm trying to recover, from a PCA done with scikit-learn, which features are selected as relevant.
A classic example with the IRIS dataset.
import pandas as pd
import pylab as pl
from sklearn import datasets
from sklearn.decomposition import PCA
# load dataset
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# normalize data
df_norm = (df - df.mean()) / df.std()
# PCA
pca = PCA(n_components=2)
pca.fit_transform(df_norm.values)
print(pca.explained_variance_ratio_)
This returns
In [42]: pca.explained_variance_ratio_
Out[42]: array([ 0.72770452, 0.23030523])
How can I recover which two features account for this explained variance in the dataset? Said differently, how can I get the index of these features in iris.feature_names?
In [47]: print(iris.feature_names)
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Thanks in advance for your help.
Accepted answer by Rafa
This information is included in the pca attribute components_. As described in the documentation, pca.components_ outputs an array of shape [n_components, n_features], so to see how the components are linearly related to the different features you have to do the following:
Note: each coefficient represents the correlation between a particular component and a particular feature
import pandas as pd
import pylab as pl
from sklearn import datasets
from sklearn.decomposition import PCA
# load dataset
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# normalize data
from sklearn import preprocessing
data_scaled = pd.DataFrame(preprocessing.scale(df),columns = df.columns)
# PCA
pca = PCA(n_components=2)
pca.fit_transform(data_scaled)
# Dump components relations with features:
print(pd.DataFrame(pca.components_, columns=data_scaled.columns, index=['PC-1', 'PC-2']))
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
PC-1 0.522372 -0.263355 0.581254 0.565611
PC-2 -0.372318 -0.925556 -0.021095 -0.065416
IMPORTANT: As a side comment, note that the PCA sign does not affect its interpretation, since the sign does not affect the variance contained in each component. Only the relative signs of the features forming the PCA dimension are important. In fact, if you run the PCA code again, you might get the PCA dimensions with the signs inverted. For an intuition about this, think about a vector and its negative in 3-D space - both essentially represent the same direction in space. Check this post for further reference.
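To make the sign remark concrete, here is a minimal sketch (my own addition, not part of the original answer) reusing the pca object fitted on data_scaled above: flipping the sign of both the scores and the axes changes nothing that matters.
import numpy as np
scores = pca.transform(data_scaled)
# reconstruct the scaled data from the scores, once as-is and once with every sign flipped
recon = scores @ pca.components_ + pca.mean_
recon_flipped = (-scores) @ (-pca.components_) + pca.mean_
print(np.allclose(recon, recon_flipped))                        # True: same reconstruction
print(np.allclose(scores.var(axis=0), (-scores).var(axis=0)))   # True: same variance per component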
Answered by behzad.nouri
Edit: as others have commented, you may get the same values from the .components_ attribute.
Each principal component is a linear combination of the original variables:
PC = Beta_1*X_1 + Beta_2*X_2 + ... + Beta_n*X_n
where the X_i are the original variables and the Beta_i are the corresponding weights, or so-called coefficients.
To obtain the weights, you may simply pass an identity matrix to the transform method:
>>> import numpy as np
>>> i = np.identity(df.shape[1])  # identity matrix
>>> i
array([[ 1., 0., 0., 0.],
[ 0., 1., 0., 0.],
[ 0., 0., 1., 0.],
[ 0., 0., 0., 1.]])
>>> coef = pca.transform(i)
>>> coef
array([[ 0.5224, -0.3723],
[-0.2634, -0.9256],
[ 0.5813, -0.0211],
[ 0.5656, -0.0654]])
Each column of the coef matrix above shows the weights of the linear combination which obtains the corresponding principal component:
>>> pd.DataFrame(coef, columns=['PC-1', 'PC-2'], index=df.columns)
PC-1 PC-2
sepal length (cm) 0.522 -0.372
sepal width (cm) -0.263 -0.926
petal length (cm) 0.581 -0.021
petal width (cm) 0.566 -0.065
[4 rows x 2 columns]
For example, the above shows that the second principal component (PC-2) is mostly aligned with sepal width, which has the highest weight of 0.926 in absolute value;
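If you want to pick that dominant feature programmatically instead of by eye, a tiny sketch (my own addition, reusing coef and df from above):
>>> df.columns[np.abs(coef[:, 1]).argmax()]   # feature with the largest absolute weight on PC-2
'sepal width (cm)'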
Since the data were normalized, you can confirm that the principal components have variance 1.0, which is equivalent to each coefficient vector having norm 1.0:
>>> np.linalg.norm(coef,axis=0)
array([ 1., 1.])
One may also confirm that the principal components can be calculated as the dot product of the above coefficients and the original variables:
>>> np.allclose(df_norm.values.dot(coef), pca.fit_transform(df_norm.values))
True
Note that we need to use numpy.allclose instead of the regular equality operator, because of floating point precision errors.
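As the edit at the top of this answer says, the identity-matrix trick is just another way of reading the fitted components; a quick check (my own addition) that coef is the transpose of pca.components_, which holds here because the standardized data have essentially zero mean:
>>> np.allclose(coef, pca.components_.T)
True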
Answered by eickenberg
Given your fitted estimator pca, the components are to be found in pca.components_, which represent the directions of highest variance in the dataset.
Answered by amunnelly
The way this question is phrased reminds me of a misunderstanding of Principal Component Analysis when I was first trying to figure it out. I'd like to go through it here in the hope that others won't spend as much time on a road to nowhere as I did before the penny finally dropped.
The notion of “recovering” feature names suggests that PCA identifies those features that are most important in a dataset. That's not strictly true.
PCA, as I understand it, identifies the features with the greatest variance in a dataset, and can then use this quality of the dataset to create a smaller dataset with minimal loss of descriptive power. The advantages of a smaller dataset are that it requires less processing power and should have less noise in the data. But the features of greatest variance are not the "best" or "most important" features of a dataset, insofar as such concepts can be said to exist at all.
To bring that theory into the practicalities of @Rafa's sample code above:
# load dataset
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# normalize data
from sklearn import preprocessing
data_scaled = pd.DataFrame(preprocessing.scale(df),columns = df.columns)
# PCA
pca = PCA(n_components=2)
pca.fit_transform(data_scaled)
consider the following:
post_pca_array = pca.fit_transform(data_scaled)
print(data_scaled.shape)
(150, 4)
print(post_pca_array.shape)
(150, 2)
In this case, post_pca_array has the same 150 rows of data as data_scaled, but data_scaled's four columns have been reduced to two.
The critical point here is that the two columns – or components, to be terminologically consistent – of post_pca_array are not the two “best” columns of data_scaled. They are two new columns, determined by the algorithm behind sklearn.decomposition's PCA module. The second column, PC-2 in @Rafa's example, is informed by sepal width more than any other column, but the values in PC-2 and data_scaled['sepal width (cm)'] are not the same.
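A small numerical check of that point (my own addition, reusing post_pca_array and data_scaled from above): PC-2 is strongly driven by sepal width, but it is not a copy of that column.
import numpy as np
pc2 = post_pca_array[:, 1]
sepal_width = data_scaled['sepal width (cm)']
print(np.allclose(pc2, sepal_width))              # False: the values differ
print(abs(np.corrcoef(pc2, sepal_width)[0, 1]))   # yet the correlation is high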
As such, while it's interesting to find out how much each column in the original data contributed to the components of a post-PCA dataset, the notion of “recovering” column names is a little misleading, and certainly misled me for a long time. The only situation in which there would be a match between post-PCA and original columns would be if the number of principal components were set at the same number as columns in the original. However, there would be no point in using the same number of columns, because the data would not have changed. You would only have gone there to come back again, as it were.
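To make that last observation concrete, here is a short sketch (my own addition, not from the original answer): keeping as many components as original columns merely rotates the data, and inverse_transform takes you straight back to where you started.
import numpy as np
from sklearn.decomposition import PCA

full_pca = PCA(n_components=data_scaled.shape[1])   # keep all four components
rotated = full_pca.fit_transform(data_scaled)
print(rotated.shape)                                # (150, 4): nothing was dropped
# a pure rotation loses no information, so the scaled data come back exactly
print(np.allclose(full_pca.inverse_transform(rotated), data_scaled))   # True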
Answered by seralouk
The important features are the ones that influence the components most and thus have a large absolute value/coefficient/loading on the component.
Get the most important feature name on the PCs:
from sklearn.decomposition import PCA
import pandas as pd
import numpy as np
np.random.seed(0)
# 10 samples with 5 features
train_features = np.random.rand(10,5)
model = PCA(n_components=2).fit(train_features)
X_pc = model.transform(train_features)
# number of components
n_pcs= model.components_.shape[0]
# get the index of the most important feature on EACH component i.e. largest absolute value
# using LIST COMPREHENSION HERE
most_important = [np.abs(model.components_[i]).argmax() for i in range(n_pcs)]
initial_feature_names = ['a','b','c','d','e']
# get the names
most_important_names = [initial_feature_names[most_important[i]] for i in range(n_pcs)]
# using LIST COMPREHENSION HERE AGAIN
dic = {'PC{}'.format(i+1): most_important_names[i] for i in range(n_pcs)}
# build the dataframe
df = pd.DataFrame(sorted(dic.items()))
print(df)
This prints:
0 1
0 PC1 e
1 PC2 d
Conclusion/Explanation:
So on PC1 the feature named e is the most important, and on PC2 it is d.
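For completeness, the same recipe applied to the iris example used earlier in the thread, in a small sketch of my own; it assumes the pca object fitted on data_scaled in the accepted answer is still available, and it keeps the loading value next to each feature name:
import numpy as np
dominant = {
    'PC{}'.format(i + 1): (
        data_scaled.columns[np.abs(pca.components_[i]).argmax()],   # dominant feature name
        pca.components_[i, np.abs(pca.components_[i]).argmax()],    # its signed loading
    )
    for i in range(pca.components_.shape[0])
}
print(dominant)   # PC1 -> petal length (cm), PC2 -> sepal width (cm), matching the loadings table above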