Python: Correlation among multiple categorical variables (Pandas)

Question

Asked by zar3bski

I have a data set made of 22 categorical variables (unordered). I would like to visualize their correlation in a nice heatmap. Since the Pandas built-in function

DataFrame.corr(method='pearson', min_periods=1)

only implements correlation coefficients for numerical variables (Pearson, Kendall, Spearman), I have to aggregate the data myself to perform a chi-square test or something similar, and I am not sure which function to use to do it in one elegant step (rather than iterating through all the cat1*cat2 pairs). To be clear, this is what I would like to end up with (a dataframe):

         cat1  cat2  cat3  
  cat1|  coef  coef  coef  
  cat2|  coef  coef  coef
  cat3|  coef  coef  coef

Any ideas with pd.pivot_table or something in the same vein?

Thanks in advance, D.

Answer 1

Accepted answer by YOBEN_S

You can use pd.factorize

df.apply(lambda x : pd.factorize(x)[0]).corr(method='pearson', min_periods=1)
Out[32]: 
     a    c    d
a  1.0  1.0  1.0
c  1.0  1.0  1.0
d  1.0  1.0  1.0

Data input

df=pd.DataFrame({'a':['a','b','c'],'c':['a','b','c'],'d':['a','b','c']})
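For context, here is a minimal sketch of what pd.factorize returns. The toy series is hypothetical and not part of the question's data. Codes are assigned in order of first appearance (or in sorted order with sort=True), so for unordered categories the integers are arbitrary labels, and a Pearson coefficient computed on them can change with the labeling.

```python
import pandas as pd

# Toy column (hypothetical data, for illustration only).
s = pd.Series(["b", "a", "b", "c"])

# Default: codes follow the order in which each category first appears.
codes, uniques = pd.factorize(s)
print(list(codes), list(uniques))

# sort=True: codes follow the sorted order of the categories instead,
# so the same data gets a different integer encoding.
codes_sorted, uniques_sorted = pd.factorize(s, sort=True)
print(list(codes_sorted), list(uniques_sorted))
```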

Update

from scipy.stats import chisquare

df=df.apply(lambda x : pd.factorize(x)[0])+1

pd.DataFrame([chisquare(df[x].values,f_exp=df.values.T,axis=1)[0] for x in df])

Out[123]: 
     0    1    2    3
0  0.0  0.0  0.0  0.0
1  0.0  0.0  0.0  0.0
2  0.0  0.0  0.0  0.0
3  0.0  0.0  0.0  0.0

df=pd.DataFrame({'a':['a','d','c'],'c':['a','b','c'],'d':['a','b','c'],'e':['a','b','c']})

Answer 2

Answer by Shashwat Tiwary

Found a nice and clean solution in this post. It's not one step, but it provides what is required: Post on correlation for categorical variables
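The linked post is not reproduced here, so the following is only a sketch of one commonly used association measure for pairs of nominal variables, Cramér's V, and not necessarily the method that post describes. It rescales the chi-square statistic of a contingency table to the range 0 to 1, which gives the kind of coefficient matrix the question asks for. The function name cramers_v and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(x, y):
    """Cramér's V for two categorical Series (plain version, no bias correction)."""
    table = pd.crosstab(x, y)
    k = min(table.shape) - 1
    if k == 0:  # one of the columns is constant: association is undefined
        return np.nan
    # correction=False turns off Yates' continuity correction (it only affects 2x2 tables).
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    return np.sqrt(chi2 / (n * k))


# Toy data (hypothetical): cat2 partitions the rows exactly like cat1,
# while cat3 is balanced within every level of cat1.
df = pd.DataFrame({
    "cat1": ["a", "a", "b", "b", "c", "c"],
    "cat2": ["x", "x", "y", "y", "z", "z"],
    "cat3": ["u", "v", "u", "v", "u", "v"],
})

# Square matrix of coefficients, indexed by column name on both axes.
v = pd.DataFrame(
    [[cramers_v(df[a], df[b]) for b in df.columns] for a in df.columns],
    index=df.columns,
    columns=df.columns,
)
print(v.round(2))
```

A DataFrame of this shape could then be handed to a plotting function such as seaborn.heatmap to get the heatmap the question mentions.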

Answer 3

Answer by zar3bski

It turns out that the only solution I found is to iterate through all the factor*factor pairs.

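import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# every ordered (factor, factor) pair, including each factor with itself
factors_paired = [(i, j) for i in df.columns.values for j in df.columns.values]

chi2, p_values = [], []

for f in factors_paired:
    if f[0] != f[1]:
        chitest = chi2_contingency(pd.crosstab(df[f[0]], df[f[1]]))
        chi2.append(chitest[0])      # chi-square statistic
        p_values.append(chitest[1])  # p-value
    else:      # for same factor pair
        chi2.append(0)
        p_values.append(0)

n = len(df.columns)  # number of factors, instead of a hard-coded size
chi2 = np.array(chi2).reshape((n, n)) # shape it as a matrix
chi2 = pd.DataFrame(chi2, index=df.columns.values, columns=df.columns.values) # then a df for convenience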
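The loop above collects p_values as well but only reshapes chi2. As a sketch under stated assumptions, the same pairwise test can be wrapped in a function that returns both matrices and visits each unordered pair once, since the chi-square statistic does not change when the contingency table is transposed. The function name pairwise_chi2 and the toy data are hypothetical; the diagonal is left at 0 to match the convention used above.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def pairwise_chi2(df):
    """Return (chi2, p_value) DataFrames over all pairs of columns in df."""
    cols = list(df.columns)
    chi2 = pd.DataFrame(0.0, index=cols, columns=cols)
    pval = pd.DataFrame(0.0, index=cols, columns=cols)  # diagonal stays 0
    for a, b in combinations(cols, 2):
        res = chi2_contingency(pd.crosstab(df[a], df[b]))
        stat, p = res[0], res[1]
        # fill both triangles so the matrices are symmetric
        chi2.loc[a, b] = chi2.loc[b, a] = stat
        pval.loc[a, b] = pval.loc[b, a] = p
    return chi2, pval


# Toy data (hypothetical): cat1 and cat2 are perfectly associated,
# cat3 is balanced within every level of the other two.
df = pd.DataFrame({
    "cat1": ["a", "a", "b", "b", "c", "c"],
    "cat2": ["x", "x", "y", "y", "z", "z"],
    "cat3": ["u", "v", "u", "v", "u", "v"],
})

chi2_df, p_df = pairwise_chi2(df)
print(chi2_df)
print(p_df)
```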
