pandas: Using the mca package in Python

Question

Asked by Dan

I am trying to use the mca package to do multiple correspondence analysis in Python.

I am a bit confused as to how to use it. With PCA I would expect to fit some data (i.e. find the principal components for those data) and then later be able to use the principal components I found to transform unseen data.

Based on the MCA documentation, I cannot work out how to do this last step. I also don't understand what any of the cryptically named properties and methods do (e.g. .E, .L, .K, .k, etc.).

So far, if I have a DataFrame with a column containing strings (assume this is the only column in the DF), I would do something like

import pandas as pd
import mca

ca = mca.MCA(pd.get_dummies(df, drop_first=True))

From what I can gather,

ca.fs_r(1)

is the transformation of the data in df, and

ca.L

is supposed to be the eigenvalues (although I get a vector of 1s that is one element shorter than my number of features?).

Now, if I had some more data with the same features, say df_new, and assuming I have already converted it correctly to dummy variables, how do I find the equivalent of ca.fs_r(1) for the new data?
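The "converted correctly to dummy variables" step assumed above can be sketched as follows. This is only an illustration with made-up data: the column name color and its values are hypothetical. One assumption worth stating: the new data is encoded without drop_first and then reindexed to the training columns, because drop_first on the new data alone could drop a different category than it did on the training data.

```python
import pandas as pd

# Hypothetical training and new data with a single string column.
df = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})
df_new = pd.DataFrame({"color": ["green", "red", "blue"]})

# Encode the training data as in the question (first category dropped).
train_dummies = pd.get_dummies(df, drop_first=True, dtype=int)

# Encode the new data in full, then force the training column layout.
# Categories missing from df_new become all-zero columns; columns that
# were dropped or never seen in training are discarded.
new_dummies = pd.get_dummies(df_new, dtype=int).reindex(
    columns=train_dummies.columns, fill_value=0
)

print(new_dummies)
```

After this step, new_dummies has exactly the same columns, in the same order, as the matrix the analysis was fitted on.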

Answer 1

Accepted answer by Jan Trienes

The documentation of the mca package is not very clear in that regard. However, there are a few cues which suggest that ca.fs_r_sup(df_new) should be used to project new (unseen) data onto the factors obtained in the analysis.

The package author refers to new data as supplementary data, which is the terminology used in the following paper: Abdi, H., & Valentin, D. (2007). Multiple correspondence analysis. Encyclopedia of measurement and statistics, 651-657.
The package has only two functions which accept new data as the parameter DF: fs_r_sup(self, DF, N=None) and fs_c_sup(self, DF, N=None). The latter computes the column factor scores.
The usage guide demonstrates this with a new data frame that was not used in the component analysis.
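To make the idea of projecting supplementary rows concrete, here is a minimal NumPy sketch of plain correspondence analysis on an indicator matrix, using the standard transition formula for supplementary rows. It is not the mca package's implementation: the toy data and the helper names fit_ca and project_rows are made up for illustration, and because the package may apply corrections (for example the Benzécri correction), its numbers need not match these.

```python
import numpy as np
import pandas as pd

# Toy training data with two categorical variables (illustrative only).
train = pd.DataFrame({
    "color": ["red", "blue", "green", "blue", "red", "green"],
    "size": ["S", "L", "S", "S", "L", "S"],
})
train_dummies = pd.get_dummies(train, dtype=float)
X = train_dummies.to_numpy()  # indicator matrix, one row per observation


def fit_ca(X):
    """Correspondence analysis of an indicator matrix via SVD."""
    Z = X / X.sum()                      # correspondence matrix
    r = Z.sum(axis=1)                    # row masses
    c = Z.sum(axis=0)                    # column masses
    S = (Z - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardized residuals
    U, sing, Vt = np.linalg.svd(S, full_matrices=False)
    keep = sing > 1e-10                  # drop numerically null dimensions
    U, sing, Vt = U[:, keep], sing[keep], Vt[keep]
    F = (U * sing) / np.sqrt(r)[:, None]     # row factor scores
    G = (Vt.T * sing) / np.sqrt(c)[:, None]  # column factor scores
    return F, G, sing


def project_rows(X_new, G, sing):
    """Project supplementary rows: row profiles times G, rescaled by 1/sing."""
    profiles = X_new / X_new.sum(axis=1, keepdims=True)
    return profiles @ G / sing


F, G, sing = fit_ca(X)
eigenvalues = sing ** 2

# New observation, encoded with the same columns as the training data.
new = pd.DataFrame({"color": ["blue"], "size": ["L"]})
X_new = (
    pd.get_dummies(new, dtype=float)
    .reindex(columns=train_dummies.columns, fill_value=0.0)
    .to_numpy()
)
new_scores = project_rows(X_new, G, sing)
print(new_scores)
```

Projecting a training row this way reproduces its own factor score, which makes a handy sanity check for any supplementary projection.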

Answer 2

Answered by Axois

Another option is the prince library, which provides easy-to-use implementations of tools such as:

Multiple correspondence analysis (MCA)
Principal component analysis (PCA)
Multiple factor analysis (MFA)

You can begin by installing it with:

pip install --user prince

Using MCA is fairly simple and takes only a couple of steps (just like the sklearn PCA method). We first build our dataframe.

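import pandas as pd
import prince

# The file has no header row, so read_csv uses its first record as the header;
# that record is then overwritten by the column names assigned below.
X = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/balloons/adult+stretch.data')
X.columns = ['Color', 'Size', 'Action', 'Age', 'Inflated']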

print(X.head())

mca = prince.MCA()

# outputs
>>     Color   Size   Action    Age Inflated
   0  YELLOW  SMALL  STRETCH  ADULT        T
   1  YELLOW  SMALL  STRETCH  CHILD        F
   2  YELLOW  SMALL      DIP  ADULT        F
   3  YELLOW  SMALL      DIP  CHILD        F
   4  YELLOW  LARGE  STRETCH  ADULT        T

This is followed by calling the fit and transform methods.

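mca = mca.fit(X)  # learn the factors from X
# Row coordinates: analogous to ca.fs_r(1) here, and to ca.fs_r_sup(df_new) when given *another* test set.
# The result is stored under its own name so that mca remains the fitted model.
coords = mca.transform(X)
print(coords)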

# outputs
>>         0             1
0   0.705387  8.373126e-15
1  -0.386586  8.336230e-15
2  -0.386586  6.335675e-15
3  -0.852014  6.726393e-15
4   0.783539 -6.333333e-01
5   0.783539 -6.333333e-01
6  -0.308434 -6.333333e-01
7  -0.308434 -6.333333e-01
8  -0.773862 -6.333333e-01
9   0.783539  6.333333e-01
10  0.783539  6.333333e-01
11 -0.308434  6.333333e-01
12 -0.308434  6.333333e-01
13 -0.773862  6.333333e-01
14  0.861691 -5.893240e-15
15  0.861691 -5.893240e-15
16 -0.230282 -5.930136e-15
17 -0.230282 -7.930691e-15
18 -0.695710 -7.539973e-15

You can even plot the result, since the library integrates with matplotlib.

ax = mca.plot_coordinates(
     X=X,
     ax=None,
     figsize=(6, 6),
     show_row_points=True,
     row_points_size=10,
     show_row_labels=False,
     show_column_points=True,
     column_points_size=30,
     show_column_labels=False,
     legend_n_cols=1
     )

ax.get_figure().savefig('images/mca_coordinates.svg')
