pandas: Using the mca package in Python

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/48521740/

Date: 2020-09-14 05:06:40  Source: igfitidea

Using mca package in Python

python-3.x pandas scikit-learn pca

Asked by Dan

I am trying to use the mca package to do multiple correspondence analysis in Python.

I am a bit confused as to how to use it. With PCA I would expect to fit some data (i.e. find the principal components for those data) and then later be able to use the principal components I found to transform unseen data.
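For reference, the fit-then-transform workflow described above looks roughly like this with scikit-learn's PCA. This is only a minimal sketch: the random arrays `X_train` and `X_new` are made-up stand-ins for real data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up data: 100 training rows and 10 unseen rows, 5 features each
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
X_new = rng.normal(size=(10, 5))

pca = PCA(n_components=2)
pca.fit(X_train)               # find the principal components on the training data
Z_new = pca.transform(X_new)   # project unseen data onto those components

print(Z_new.shape)
```

The question is how to get the equivalent of the `transform` step out of the mca package.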

Based on the MCA documentation, I cannot work out how to do this last step. I also don't understand what any of the cryptically named properties and methods do (e.g. .E, .L, .K, .k, etc.).

So far, if I have a DataFrame with a column containing strings (assume this is the only column in the DF), I would do something like

import mca
import pandas as pd
# df is the DataFrame with a single column of strings described above
ca = mca.MCA(pd.get_dummies(df, drop_first=True))

From what I can gather,

ca.fs_r(1)

is the transformation of the data in df, and

ca.L

is supposed to be the eigenvalues (although I get a vector of 1s that has one element fewer than my number of features?).

Now, if I had some more data with the same features, say df_new, and assuming I've already converted it correctly to dummy variables, how do I find the equivalent of ca.fs_r(1) for the new data?
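As a side note on "converted correctly to dummy variables": one common way to make the new dummy columns line up with the training ones is to reindex against the training columns. The sketch below uses made-up frames `df` and `df_new` with a hypothetical `color` column, and omits `drop_first` for clarity.

```python
import pandas as pd

# Made-up training and new data with one categorical column
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
df_new = pd.DataFrame({"color": ["green", "purple"]})

train_dummies = pd.get_dummies(df, dtype=int)

# Align the new dummies to the training columns: categories missing from
# df_new become all-zero columns, categories unseen in training are dropped
new_dummies = pd.get_dummies(df_new, dtype=int).reindex(
    columns=train_dummies.columns, fill_value=0
)

print(new_dummies)
```

Note that a row whose category never appeared in training ("purple" here) ends up all zeros, so such rows likely need separate handling before any projection.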

Accepted answer by Jan Trienes

The documentation of the mca package is not very clear in that regard. However, there are a few cues which suggest that ca.fs_r_sup(df_new) should be used to project new (unseen) data onto the factors obtained in the analysis.

  1. The package author refers to new data as supplementary data, which is the terminology used in the following paper: Abdi, H., & Valentin, D. (2007). Multiple correspondence analysis. Encyclopedia of measurement and statistics, 651-657.
  2. The package has only two functions which accept new data as the parameter DF: fs_r_sup(self, DF, N=None) and fs_c_sup(self, DF, N=None). The latter finds the column factor scores.
  3. The usage guide demonstrates this with a new data frame which has not been used in the component analysis.
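To make the idea of projecting supplementary rows concrete, here is a NumPy sketch of the standard correspondence-analysis projection (row profile times standard column coordinates). It is not the mca package's own code, and it skips the Benzécri correction the package can apply, so the numbers will not necessarily match `ca.fs_r_sup`. The helper names `fit_ca` and `project_rows` and the toy indicator matrix are made up for illustration.

```python
import numpy as np

def fit_ca(Z, n_components=2):
    """Plain correspondence analysis of an indicator (dummy) matrix Z."""
    P = Z / Z.sum()                       # correspondence matrix
    r = P.sum(axis=1)                     # row masses
    c = P.sum(axis=0)                     # column masses
    # Standardized residuals, decomposed by SVD
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    k = n_components
    row_scores = U[:, :k] * sigma[:k] / np.sqrt(r)[:, None]  # training row factor scores
    col_std = Vt[:k].T / np.sqrt(c)[:, None]                 # standard column coordinates
    return row_scores, col_std

def project_rows(Z_new, col_std):
    """Project supplementary rows: row profiles times standard column coordinates."""
    profiles = Z_new / Z_new.sum(axis=1, keepdims=True)
    return profiles @ col_std

# Toy indicator matrix for two variables: A in {a, b, c} and B in {x, y}
Z = np.array([
    [1, 0, 0, 1, 0],
    [1, 0, 0, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 1, 0, 1],
], dtype=float)

row_scores, col_std = fit_ca(Z)

# An unseen combination (b, y), encoded with the same dummy columns
Z_new = np.array([[0, 1, 0, 0, 1]], dtype=float)
new_scores = project_rows(Z_new, col_std)
print(new_scores)
```

A useful sanity check on this sketch is that projecting the training rows as if they were supplementary reproduces their own factor scores.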

Answer by Axois

Another option is the prince library, which provides easy access to tools such as:

  1. Multiple correspondence analysis (MCA)
  2. Principal component analysis (PCA)
  3. Multiple factor analysis (MFA)

You can begin by installing it with:

pip install --user prince

Using MCA is fairly simple and takes only a couple of steps (just like the sklearn PCA method). We first build our dataframe.

import pandas as pd 
import prince

X = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/balloons/adult+stretch.data')
X.columns = ['Color', 'Size', 'Action', 'Age', 'Inflated']

print(X.head())

mca = prince.MCA()

# outputs
>>     Color   Size   Action    Age Inflated
   0  YELLOW  SMALL  STRETCH  ADULT        T
   1  YELLOW  SMALL  STRETCH  CHILD        F
   2  YELLOW  SMALL      DIP  ADULT        F
   3  YELLOW  SMALL      DIP  CHILD        F
   4  YELLOW  LARGE  STRETCH  ADULT        T

Then we call the fit and transform methods.

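mca = mca.fit(X)  # learn the factors from X
# Row coordinates of X; on *another* test set this plays the role of ca.fs_r_sup(df_new).
# Stored under a new name so that the fitted model in `mca` is not overwritten.
X_mca = mca.transform(X)
print(X_mca)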

# outputs
>>         0             1
0   0.705387  8.373126e-15
1  -0.386586  8.336230e-15
2  -0.386586  6.335675e-15
3  -0.852014  6.726393e-15
4   0.783539 -6.333333e-01
5   0.783539 -6.333333e-01
6  -0.308434 -6.333333e-01
7  -0.308434 -6.333333e-01
8  -0.773862 -6.333333e-01
9   0.783539  6.333333e-01
10  0.783539  6.333333e-01
11 -0.308434  6.333333e-01
12 -0.308434  6.333333e-01
13 -0.773862  6.333333e-01
14  0.861691 -5.893240e-15
15  0.861691 -5.893240e-15
16 -0.230282 -5.930136e-15
17 -0.230282 -7.930691e-15
18 -0.695710 -7.539973e-15

You can even plot the result, since prince integrates with the matplotlib library.

ax = mca.plot_coordinates(
     X=X,
     ax=None,
     figsize=(6, 6),
     show_row_points=True,
     row_points_size=10,
     show_row_labels=False,
     show_column_points=True,
     column_points_size=30,
     show_column_labels=False,
     legend_n_cols=1
     )

ax.get_figure().savefig('images/mca_coordinates.svg')
