pandas: How to perform correlation between categorical columns
Original source: http://stackoverflow.com/questions/41827716/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
Stack Overflow
How to perform correlation between categorical columns
Asked by
I have a set of columns (col1, col2, col3) in dataframe df1 and another set of columns (col4, col5, col6) in dataframe df2. Assume these two dataframes have the same number of rows.
How do I generate a correlation table that computes the pairwise correlation between the columns of df1 and the columns of df2?
The table would look like this:
      col1  col2  col3
col4    ..    ..    ..
col5    ..    ..    ..
col6    ..    ..    ..
I tried df1.corrwith(df2), but it does not seem to generate the table as required.
I have asked a similar question here: How to perform Correlation between two dataframes with different column names, but now I am dealing with categorical columns.
If the columns are not directly comparable, is there a standard way to make them comparable (such as using get_dummies)? And is there a faster way to automatically process all fields (assume all are categorical) and calculate their correlation?
Answered by Ted Petrou
You are correct that pd.get_dummies would be needed to get the correlation. Below, I create some fake data with two categorical columns and then use corrwith:
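import numpy as np
import pandas as pd

# Two independent frames of random categorical data
df = pd.DataFrame({'col1': np.random.choice(list('abcde'), 100),
                   'col2': np.random.choice(list('xyz'), 100)}, dtype='category')
df1 = pd.DataFrame({'col1': np.random.choice(list('abcde'), 100),
                    'col2': np.random.choice(list('xyz'), 100)}, dtype='category')

# One-hot encode each frame, then correlate the matching indicator columns
dfa = pd.get_dummies(df)
dfb = pd.get_dummies(df1)
dfa.corrwith(dfb)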
col1_a -0.057735
col1_b 0.002513
col1_c 0.137956
col1_d -0.095050
col1_e -0.114022
col2_x 0.022568
col2_y -0.081699
col2_z -0.128350
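
Note that corrwith pairs columns by label, so it only reports a value for indicator columns that carry the same name in both frames. When the two frames have different column names, as in the question, one possible way to get the full cross table is to concatenate the dummy-encoded frames, call corr(), and slice out the block that relates one frame to the other. The following is a minimal sketch under assumptions: the columns col4 and col5, their category values, and the fixed seed are made up for illustration and do not come from the original answer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed (an assumption) so the run is repeatable
n = 100

# Hypothetical data: two frames whose categorical columns have different names.
df1 = pd.DataFrame({
    "col1": rng.choice(list("abcde"), n),
    "col2": rng.choice(list("xyz"), n),
}).astype("category")
df2 = pd.DataFrame({
    "col4": rng.choice(list("pq"), n),
    "col5": rng.choice(list("uvw"), n),
}).astype("category")

# One-hot encode each frame; dtype=float keeps the indicators numeric.
dfa = pd.get_dummies(df1, dtype=float)
dfb = pd.get_dummies(df2, dtype=float)

# Correlate every indicator with every other indicator, then keep only
# the block whose rows come from df2 and whose columns come from df1.
full = pd.concat([dfa, dfb], axis=1).corr()
cross = full.loc[dfb.columns, dfa.columns]
print(cross.round(3))
```

Each cell is a Pearson correlation between two 0/1 indicator columns, so the table has one row and one column per category level rather than one per original column.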