Using the mca package in Python (pandas)
Original question: http://stackoverflow.com/questions/48521740/
Note: this content comes from StackOverflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors and keep the same license.
Asked by Dan
I am trying to use the mca package to do multiple correspondence analysis in Python.
I am a bit confused as to how to use it. With PCA I would expect to fit some data (i.e. find the principal components for those data) and then later use the principal components I found to transform unseen data.
Based on the MCA documentation, I cannot work out how to do this last step. I also don't understand what any of the cryptically named properties and methods do (e.g. .E, .L, .K, .k, etc.).
So far, if I have a DataFrame with a column containing strings (assume this is the only column in the DF), I would do something like:
import mca
import pandas as pd
ca = mca.MCA(pd.get_dummies(df, drop_first=True))
From what I can gather, ca.fs_r(1) is the transformation of the data in df, and ca.L is supposed to be the eigenvalues (although I get a vector of 1s that is one element fewer than my number of features?).
Now, if I had some more data with the same features, say df_new, and assuming I've already converted it correctly to dummy variables, how do I find the equivalent of ca.fs_r(1) for the new data?
Accepted answer by Jan Trienes
The documentation of the mca package is not very clear in that regard. However, there are a few cues which suggest that ca.fs_r_sup(df_new) should be used to project new (unseen) data onto the factors obtained in the analysis.
- The package author refers to new data as supplementary data, which is the terminology used in the following paper: Abdi, H., & Valentin, D. (2007). Multiple correspondence analysis. Encyclopedia of measurement and statistics, 651-657.
- The package has only two functions which accept new data as the parameter DF: fs_r_sup(self, DF, N=None) and fs_c_sup(self, DF, N=None). The latter finds the column factor scores.
- The usage guide demonstrates this with a new data frame that was not used in the component analysis.
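To make the idea of projecting supplementary rows more concrete, here is a minimal NumPy sketch of the standard correspondence-analysis transition formula: a new row's coordinates are its row profile times the column factor scores, divided by the singular values. This is not the mca package's code. The helper names fit_ca and project_rows and the toy indicator matrix are made up for illustration, and the sketch ignores any eigenvalue correction the package may apply, so its numbers need not match ca.fs_r_sup exactly.

```python
import numpy as np

def fit_ca(Z, tol=1e-10):
    """Correspondence analysis of an indicator (dummy) matrix Z.

    Hypothetical helper for illustration, not part of the mca package.
    Returns row scores F, column scores G and the singular values s.
    """
    Z = np.asarray(Z, dtype=float)
    P = Z / Z.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    # Standardized residuals, then SVD
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    keep = s > tol                       # drop the trivial (zero) dimensions
    U, s, Vt = U[:, keep], s[keep], Vt[keep]
    F = (U * s) / np.sqrt(r)[:, None]    # row principal coordinates
    G = (Vt.T * s) / np.sqrt(c)[:, None] # column principal coordinates
    return F, G, s

def project_rows(Z_new, G, s):
    """Project supplementary rows: row profiles times G, scaled by 1/s."""
    Z_new = np.asarray(Z_new, dtype=float)
    profiles = Z_new / Z_new.sum(axis=1, keepdims=True)
    return profiles @ G / s

# Toy indicator matrix: columns are red, blue, green, small, large
Z = np.array([
    [1, 0, 0, 1, 0],
    [1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 1, 0, 1],
])
F, G, s = fit_ca(Z)

# An unseen observation (blue, small) with the same dummy columns
new_coords = project_rows([[0, 1, 0, 1, 0]], G, s)
print(new_coords)
```

Projecting the training rows themselves with project_rows reproduces F, which is a handy sanity check. Whichever library is used, the dummy columns of df_new have to be the same, and in the same order, as those used for fitting.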
Answer by Axois
Another option is to use the prince library, which makes it easy to use tools such as:
- Multiple correspondence analysis (MCA)
- Principal component analysis (PCA)
- Multiple factor analysis (MFA)
Start by installing it with:
pip install --user prince
Using MCA is fairly simple and takes only a couple of steps (much like sklearn's PCA). We first build our dataframe.
import pandas as pd
import prince
X = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/balloons/adult+stretch.data')
X.columns = ['Color', 'Size', 'Action', 'Age', 'Inflated']
print(X.head())
mca = prince.MCA()
# outputs
>> Color Size Action Age Inflated
0 YELLOW SMALL STRETCH ADULT T
1 YELLOW SMALL STRETCH CHILD F
2 YELLOW SMALL DIP ADULT F
3 YELLOW SMALL DIP CHILD F
4 YELLOW LARGE STRETCH ADULT T
This is followed by calling the fit and transform methods.
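mca = mca.fit(X)  # learn the factors from X
coords = mca.transform(X)  # row coordinates, comparable to ca.fs_r(1); pass a different data frame to project *another* test set, as with ca.fs_r_sup(df_new)
print(coords)  # keep the fitted estimator in mca so it can still be used for plotting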
# outputs
>> 0 1
0 0.705387 8.373126e-15
1 -0.386586 8.336230e-15
2 -0.386586 6.335675e-15
3 -0.852014 6.726393e-15
4 0.783539 -6.333333e-01
5 0.783539 -6.333333e-01
6 -0.308434 -6.333333e-01
7 -0.308434 -6.333333e-01
8 -0.773862 -6.333333e-01
9 0.783539 6.333333e-01
10 0.783539 6.333333e-01
11 -0.308434 6.333333e-01
12 -0.308434 6.333333e-01
13 -0.773862 6.333333e-01
14 0.861691 -5.893240e-15
15 0.861691 -5.893240e-15
16 -0.230282 -5.930136e-15
17 -0.230282 -7.930691e-15
18 -0.695710 -7.539973e-15
You can even plot the result, since the library builds on matplotlib.
ax = mca.plot_coordinates(
X=X,
ax=None,
figsize=(6, 6),
show_row_points=True,
row_points_size=10,
show_row_labels=False,
show_column_points=True,
column_points_size=30,
show_column_labels=False,
legend_n_cols=1
)
ax.get_figure().savefig('images/mca_coordinates.svg')