
Disclaimer: this page is a mirror of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse it, you must follow the same license and attribute it to the original authors (not this site). Original question: http://stackoverflow.com/questions/47894387/


How to correlate an Ordinal Categorical column in pandas?

python pandas scikit-learn correlation categorical-data

Asked by yousraHazem

I have a DataFrame df with a non-numerical column CatColumn.


   A         B         CatColumn
0  381.1396  7.343921  Medium
1  481.3268  6.786945  Medium
2  263.3766  7.628746  High
3  177.2400  5.225647  Medium-High

I want to include CatColumn in the correlation analysis with the other columns in the DataFrame. I tried DataFrame.corr, but it does not include columns with nominal values in the correlation analysis.


Answered by FatihAkici

I am going to strongly disagree with the other comments.


They miss the main point of correlation: how much variable 1 increases or decreases as variable 2 increases or decreases. So, first of all, the order of the ordinal variable must be preserved during factorization/encoding. If you alter the order of the values, the correlation will change completely. If you are building a tree-based method this is a non-issue, but for a correlation analysis special attention must be paid to preserving the order of an ordinal variable.


Let me make my argument reproducible. In the following table, A and B are numeric and C is an ordinal categorical column; the data are intentionally slightly altered from those in the question.


from io import StringIO
import pandas as pd

rawText = StringIO("""
 A         B         C
0  100.1396  1.343921  Medium
1  105.3268  1.786945  Medium
2  200.3766  9.628746  High
3  150.2400  4.225647  Medium-High
""")
myData = pd.read_csv(rawText, sep=r"\s+")

Notice: as C moves from Medium to Medium-High to High, both A and B increase monotonically. Hence we should see strong positive correlations between the pairs (C, A) and (C, B). Let's reproduce the two proposed answers:


In[226]: myData.assign(C=myData.C.astype('category').cat.codes).corr()
Out[226]: 
          A         B         C
A  1.000000  0.986493 -0.438466
B  0.986493  1.000000 -0.579650
C -0.438466 -0.579650  1.000000

Wait... What? Negative correlations? How come? Something is definitely not right. So what is going on?


What is going on is that C is factorized according to the alphanumerical sorting of its values. [High, Medium, Medium-High] are assigned [0, 1, 2], therefore the ordering is altered: 0 < 1 < 2 implies High < Medium < Medium-High, which is not true. Hence we accidentally calculated the response of A and B as C goes from High to Medium to Medium-High. The correct answer must preserve ordering, and assign [2, 0, 1] to [High, Medium, Medium-High]. Here is how:


In[227]: myData['C'] = myData['C'].astype('category')
# give each category the number matching its true rank: High -> 2, Medium -> 0, Medium-High -> 1
myData['C'] = myData['C'].cat.rename_categories([2, 0, 1])
myData['C'] = myData['C'].astype('float')
myData.corr()
Out[227]: 
          A         B         C
A  1.000000  0.986493  0.998874
B  0.986493  1.000000  0.982982
C  0.998874  0.982982  1.000000

Much better!

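An equivalent, arguably more explicit way to get this encoding is to declare the category order up front and use the resulting integer codes. A minimal, self-contained sketch (the order list and variable names here are spelled out by hand, not taken from the original answer):

from io import StringIO
import pandas as pd

rawText = StringIO("""
 A         B         C
0  100.1396  1.343921  Medium
1  105.3268  1.786945  Medium
2  200.3766  9.628746  High
3  150.2400  4.225647  Medium-High
""")
myData = pd.read_csv(rawText, sep=r"\s+")

# declare the true order once; the codes are then Medium=0, Medium-High=1, High=2
order = ["Medium", "Medium-High", "High"]
myData["C"] = pd.Categorical(myData["C"], categories=order, ordered=True).codes

print(myData.corr())  # same correlation matrix as above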

Note 1: If you want to treat your variable as a nominal variable, you can look at things like contingency tables, Cramér's V and the like, or group the continuous variable by the nominal categories. I don't think that would be the right approach here, though.


Note 2: If you had another category called Low, my answer could be criticized because I assigned equally spaced numbers to unequally spaced categories. You could argue that one should instead assign [2, 1, 1.5, 0] to [High, Medium, Medium-High, Low], which would be valid. I believe this is what people call the art part of data science.

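If you did want unequal spacing, an explicit mapping applied to the original string column (before any categorical recoding) is one option. A tiny sketch using the hypothetical values from the note above; the spacing itself is a judgment call:

import pandas as pd

# hypothetical spacing: Low=0, Medium=1, Medium-High=1.5, High=2
spacing = {"Low": 0.0, "Medium": 1.0, "Medium-High": 1.5, "High": 2.0}

c = pd.Series(["Low", "Medium", "Medium-High", "High"])
print(c.map(spacing))  # 0.0, 1.0, 1.5, 2.0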

Answered by cy-press

Basically, there is no good scientific way to do it. I would use the following approach:
1. Split the numeric field into n groups, where n = the number of groups in the categorical field.
2. Calculate the Cramér correlation between the two categorical fields.

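A rough, self-contained sketch of that recipe, assuming equal-width bins from pd.cut and Cramér's V computed via scipy.stats.chi2_contingency (the cramers_v helper and the toy data are made up for illustration, not part of the original answer):

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Cramér's V between two categorical series."""
    table = pd.crosstab(x, y)
    chi2, _, _, _ = chi2_contingency(table)
    n = table.to_numpy().sum()
    return np.sqrt(chi2 / (n * (min(table.shape) - 1)))

# toy data: the numeric column A increases with the category level
df = pd.DataFrame({
    "A": [1.0, 1.5, 2.0, 2.5, 4.0, 4.5, 5.0, 5.5, 7.0, 7.5, 8.0, 8.5],
    "CatColumn": ["Low"] * 4 + ["Medium"] * 4 + ["High"] * 4,
})

# 1. split the numeric field into n groups, n = number of categories
n_groups = df["CatColumn"].nunique()
binned_A = pd.cut(df["A"], bins=n_groups)

# 2. Cramér's V between the binned numeric field and the categorical field
print(cramers_v(binned_A, df["CatColumn"]))  # 1.0 for this perfectly separated toy data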

Answered by ei-grad

The right way to correlate a categorical column with N values is to split this column into N separate boolean columns.


Let's take the DataFrame from the original question and make the category indicator columns:


# create one boolean indicator column per distinct value of CatColumn
for cat in df.CatColumn.unique():
    df[cat] = df.CatColumn == cat

Then it is possible to calculate the correlation between every category and other columns:


# drop the original string column so corr() sees only numeric/boolean columns
df.drop(columns='CatColumn').corr()

Output:


                    A         B    Medium      High  Medium-High
A            1.000000  0.490608  0.914322 -0.312309    -0.743459
B            0.490608  1.000000  0.343620  0.548589    -0.945367
Medium       0.914322  0.343620  1.000000 -0.577350    -0.577350
High        -0.312309  0.548589 -0.577350  1.000000    -0.333333
Medium-High -0.743459 -0.945367 -0.577350 -0.333333     1.000000
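For reference, the same one-hot expansion can be written more compactly with pd.get_dummies. A sketch of an equivalent computation (the indicator columns come out in alphabetical rather than first-appearance order):

import pandas as pd

df = pd.DataFrame({
    "A": [381.1396, 481.3268, 263.3766, 177.2400],
    "B": [7.343921, 6.786945, 7.628746, 5.225647],
    "CatColumn": ["Medium", "Medium", "High", "Medium-High"],
})

# one indicator column per category value, cast to float so corr() treats them numerically
dummies = pd.get_dummies(df["CatColumn"]).astype(float)

# correlate the numeric columns with every indicator column
print(df[["A", "B"]].join(dummies).corr())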