Python: PCA for categorical features?

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/40795141/

PCA For categorical features?

python, machine-learning, scikit-learn, data-mining

Asked by vikky

In my understanding, I thought PCA could be performed only for continuous features. But while trying to understand the difference between one-hot encoding and label encoding, I came across a post at the following link:

When to use One Hot Encoding vs LabelEncoder vs DictVectorizor?

It states that one-hot encoding followed by PCA is a very good method, which basically means PCA can be applied to categorical features. Hence I am confused; please advise me on this.

Answered by Has QUIT--Anony-Mousse

I disagree with the others.

While you can use PCA on binary data (e.g. one-hot encoded data), that does not mean it is a good thing, or that it will work very well.

PCA is designed for continuous variables. It tries to minimize variance (= squared deviations). The concept of squared deviations breaks down when you have binary variables.

So yes, you can use PCA. And yes, you get an output. It even is a least-squares output - it's not as if PCA would segfault on such data. It works, but it is just much less meaningful than you'd want it to be; and supposedly less meaningful than e.g. frequent pattern mining.
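
As a rough illustration of that contrast (my own sketch, not part of the original answer; it assumes pandas, scikit-learn and the third-party mlxtend package, and the toy transactions are made up), PCA will still give you a least-squares projection of one-hot data, while a frequent-itemset miner reports patterns directly in terms of the original categories:

import pandas as pd
from sklearn.decomposition import PCA
from mlxtend.frequent_patterns import apriori

# Toy one-hot (binary) data: four transactions over three items
df = pd.DataFrame({"milk":  [1, 1, 0, 1],
                   "bread": [1, 1, 1, 0],
                   "eggs":  [0, 1, 1, 1]}).astype(bool)

# PCA runs fine and returns a least-squares projection...
pca = PCA(n_components=2).fit(df.astype(float))
print(pca.explained_variance_ratio_)        # variance captured, but the axes mix items

# ...whereas frequent pattern mining speaks the language of the categories themselves
print(apriori(df, min_support=0.5, use_colnames=True))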

Answered by joscani

MCA is a known technique for categorical data dimension reduction. In R there are many packages for MCA, and it can even be mixed with PCA in mixed-data contexts. In Python an mca library exists too. MCA applies similar maths to PCA; indeed, the French statistician used to say, "data analysis is finding the correct matrix to diagonalize".

http://gastonsanchez.com/visually-enforced/how-to/2012/10/13/MCA-in-R/
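
For what it is worth, here is a minimal numpy sketch of MCA (my own illustration, not from the linked post, with made-up toy data): MCA is essentially correspondence analysis of the one-hot ("complete disjunctive") indicator matrix, i.e. an SVD of a suitably rescaled matrix - the "correct matrix to diagonalize":

import numpy as np
from sklearn.preprocessing import OneHotEncoder

rows = [["red", "small"], ["blue", "small"], ["red", "large"], ["green", "large"]]
Z = OneHotEncoder().fit_transform(rows).toarray()     # indicator (one-hot) matrix

P = Z / Z.sum()                                       # correspondence matrix
r = P.sum(axis=1)                                     # row masses
c = P.sum(axis=0)                                     # column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))    # standardized residuals
U, s, Vt = np.linalg.svd(S, full_matrices=False)
row_coords = (U * s) / np.sqrt(r)[:, None]            # principal row coordinates
print(row_coords[:, :2])                              # first two MCA dimensions per observation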

Answered by Oleg Melnikov

The following publication shows great and meaningful results when computing PCA on categorical variables treated as simplex vertices:

Niitsuma H., Okada T. (2005) Covariance and PCA for Categorical Variables. In: Ho T.B., Cheung D., Liu H. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2005. Lecture Notes in Computer Science, vol 3518. Springer, Berlin, Heidelberg

https://doi.org/10.1007/11430919_61

It is available via https://arxiv.org/abs/0711.4452 (including as a PDF).

Answered by Ockhius

Basically, PCA finds and eliminates less informative (duplicate) information in the feature set and reduces the dimension of the feature space. In other words, imagine an N-dimensional hyperspace; PCA finds the M (M < N) directions along which the data varies most. In this way the data may be represented as M-dimensional feature vectors. Mathematically, it is essentially an eigenvalue and eigenvector computation of the feature space.
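
As a small sketch of that last statement (my own illustration; the data below is random toy data), PCA amounts to an eigen-decomposition of the feature covariance matrix, and the top-M eigenvectors span the directions along which the data varies most:

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 5))                    # toy data in an N=5-dimensional space

Xc = X - X.mean(axis=0)                     # centre each feature
cov = np.cov(Xc, rowvar=False)              # 5x5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues/eigenvectors (ascending order)

M = 2                                       # keep the M highest-variance directions
components = eigvecs[:, ::-1][:, :M]        # top-M eigenvectors
X_reduced = Xc @ components                 # data as M-dimensional feature vectors
print(X_reduced.shape)                      # (100, 2)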

So, it is not important whether the features are continuous or not.

PCA is widely used in many applications, mostly for eliminating noisy, less informative data that comes from some sensor or hardware before classification/recognition.

Edit:

Statistically speaking, categorical features can be seen as discrete random variables taking values in the interval [0, 1]. The computations of the expectation E{X} and the variance E{(X - E{X})^2} are still valid and meaningful for discrete random variables. I still stand by the applicability of PCA in the case of categorical features.

Consider a case where you would like to predict whether "it is going to rain on a given day or not". You have a categorical feature X, "Do I have to go to work on that day?", with 1 for yes and 0 for no. Clearly weather conditions do not depend on our work schedule, so P(R|X) = P(R). Assuming 5 days of work every week, we have more 1s than 0s for X in our randomly collected dataset. PCA would probably lead to dropping this low-variance dimension in your feature representation.
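
A quick numeric check of that example (illustration only; the week of data below is made up): a 5-days-out-of-7 binary feature has variance p*(1-p) = (5/7)*(2/7) ≈ 0.20, and that is the number PCA uses to judge the axis - it says nothing about predictive value:

import numpy as np

p = 5 / 7                                   # P(work day)
x = np.array([1, 1, 1, 1, 1, 0, 0])         # one week of the "work day" feature X
print(x.var(), p * (1 - p))                 # both ~ 0.2041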

At the end of the day, PCA is for dimension reduction with minimal loss of information. Intuitively, we rely on the variance of the data along a given axis to measure its usefulness for the task. I don't think there is any theoretical limitation to applying it to categorical features. Practical value depends on the application and data, which is also the case for continuous variables.

Answered by AlexG

PCA is a dimensionality reduction method that can be applied to any set of features. Here is an example using one-hot encoded (i.e. categorical) data:

from sklearn.preprocessing import OneHotEncoder

# One-hot encode three categorical columns (2, 3 and 4 categories -> 9 binary columns)
enc = OneHotEncoder()
X = enc.fit_transform([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]).toarray()

print(X)

> array([[ 1.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  1.],
         [ 0.,  1.,  0.,  1.,  0.,  1.,  0.,  0.,  0.],
         [ 1.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,  0.],
         [ 0.,  1.,  1.,  0.,  0.,  0.,  0.,  1.,  0.]])


from sklearn.decomposition import PCA

# Project the 9 binary columns onto their 3 leading principal components
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X)

print(X_pca)

> array([[-0.70710678,  0.79056942,  0.70710678],
         [ 1.14412281, -0.79056942,  0.43701602],
         [-1.14412281, -0.79056942, -0.43701602],
         [ 0.70710678,  0.79056942, -0.70710678]])

Answered by NicolasLi

I think PCA reduces the number of variables by leveraging the linear relations between them. If there is only one categorical variable, encoded as one-hot, there is no linear relation between the one-hot columns, so it cannot be reduced by PCA.

But if other variables exist, the one-hot columns may be representable through linear relations with those other variables.

So maybe it can be reduced by PCA; it depends on the relations between the variables.
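
A quick way to check this on your own data (illustration only; the DataFrame and column names below are made up) is to look at how fast the cumulative explained variance of the one-hot columns grows once the other variables are included:

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"colour": ["red", "blue", "green", "red", "blue"],
                   "size":   [1.0, 2.5, 0.7, 1.2, 2.1]})

onehot = OneHotEncoder().fit_transform(df[["colour"]]).toarray()
X = np.hstack([onehot, df[["size"]].to_numpy()])       # one-hot columns plus another variable

pca = PCA().fit(X)
print(np.cumsum(pca.explained_variance_ratio_))        # how quickly the variance accumulates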
