Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/41827716/


How to perform correlation between categorical columns

Tags: python, pandas

Question

I have a set of columns (col1, col2, col3) in dataframe df1 and another set of columns (col4, col5, col6) in dataframe df2. Assume these two dataframes have the same number of rows.

How do I generate a correlation table that does pairwise correlation between the columns of df1 and df2?

The table will look like this:

    col1 col2 col3
col4 ..   ..   ..
col5 ..   ..   ..
col6 ..   ..   ..

I tried df1.corrwith(df2), but it does not seem to generate the table as required.

I have asked a similar question here: "How to perform Correlation between two dataframes with different column names", but now I am dealing with categorical columns.

If they are not directly comparable, is there a standard way to make them comparable (like using get_dummies)? And is there a faster way to automatically process all fields (assume all are categorical) and calculate their correlation?

Answer by Ted Petrou

You are correct that pd.get_dummies would be needed to get the correlation. Below, I create some fake data with two categorical columns and then use corrwith.

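import numpy as np
import pandas as pd

# Two fake dataframes, each with two categorical columns
df = pd.DataFrame({'col1': np.random.choice(list('abcde'), 100),
                   'col2': np.random.choice(list('xyz'), 100)}, dtype='category')
df1 = pd.DataFrame({'col1': np.random.choice(list('abcde'), 100),
                    'col2': np.random.choice(list('xyz'), 100)}, dtype='category')

# One-hot encode each dataframe, then correlate the columns that share a name
dfa = pd.get_dummies(df)
dfb = pd.get_dummies(df1)
dfa.corrwith(dfb)  # no seed is set, so the values differ between runs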

col1_a   -0.057735
col1_b    0.002513
col1_c    0.137956
col1_d   -0.095050
col1_e   -0.114022
col2_x    0.022568
col2_y   -0.081699
col2_z   -0.128350
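
Note that corrwith only pairs up columns carrying the same label, so it returns one value per dummy column rather than the full df2-by-df1 table from the question. One possible way to get that table, sketched here and not part of the original answer, is to concatenate the two dummy-encoded frames, call corr, and slice out the cross block. The column names col4 and col5, the category letters, the seed, and dtype=float are assumptions made for illustration. The result is at the level of individual dummy columns (col1_a, col4_p, and so on), not one number per original column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Hypothetical data: two frames with differently named categorical columns
df1 = pd.DataFrame({'col1': rng.choice(list('abc'), 100),
                    'col2': rng.choice(list('xy'), 100)}, dtype='category')
df2 = pd.DataFrame({'col4': rng.choice(list('pq'), 100),
                    'col5': rng.choice(list('rstu'), 100)}, dtype='category')

# One-hot encode; dtype=float keeps the dummies numeric on any pandas version
dfa = pd.get_dummies(df1, dtype=float)
dfb = pd.get_dummies(df2, dtype=float)

# Correlate all dummy columns with each other, then keep only the
# block whose rows come from df2 and whose columns come from df1
full = pd.concat([dfa, dfb], axis=1).corr()
cross = full.loc[dfb.columns, dfa.columns]
print(cross)
```

Each cell of cross is the Pearson correlation between one dummy column of df2 and one dummy column of df1, which matches the row and column layout sketched in the question.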