Note: this content is translated from a popular StackOverflow Q&A and is provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must attribute it to the original authors (not me): StackOverflow. Original source: http://stackoverflow.com/questions/20892799/

Using pandas, calculate Cramér's coefficient matrix

python, pandas, statistics

Asked by notconfusing

I have a dataframe in pandas which contains metrics calculated on Wikipedia articles. There are two categorical variables: nation, which nation the article is about, and lang, which language Wikipedia it was taken from. For a single metric, I would like to see how closely the nation and language variables correlate; I believe this is done using Cramér's statistic.

index   qid     subj    nation  lang    metric          value
5   Q3488399    economy     cdi     fr  informativeness 0.787117
6   Q3488399    economy     cdi     fr  referencerate   0.000945
7   Q3488399    economy     cdi     fr  completeness    43.200000
8   Q3488399    economy     cdi     fr  numheadings     11.000000
9   Q3488399    economy     cdi     fr  articlelength   3176.000000
10  Q7195441    economy     cdi     en  informativeness 0.626570
11  Q7195441    economy     cdi     en  referencerate   0.008610
12  Q7195441    economy     cdi     en  completeness    6.400000
13  Q7195441    economy     cdi     en  numheadings     7.000000
14  Q7195441    economy     cdi     en  articlelength   2323.000000

I would like to generate a matrix that displays Cramér's coefficient between all combinations of nation (France, USA, Côte d'Ivoire, and Uganda) ['fra','usa','cdi','uga'] and three languages ['fr','en','sw']. So there would be a resulting 4 by 3 matrix like:

       en         fr          sw
usa    Cramer11   Cramer12    ... 
fra    Cramer21   Cramer22    ... 
cdi    ...
uga    ...

Eventually then I will do this over all the different metrics I am tracking.

for subject in list_of_subjects:
    for metric in list_of_metrics:
        cramer_matrix(metric, df)

Then I can test my hypothesis that metrics will be higher for articles whose language matches the language of the Wikipedia they come from. Thanks

Answered by Ziggy Eunicien

Cramér's V seems pretty over-optimistic in a few tests that I did. Wikipedia recommends a corrected version.

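For reference, this is the bias correction the code below implements: with \chi^2 the contingency-table chi-squared statistic, n the total count, and r, k the numbers of rows and columns,

\tilde{\varphi}^2 = \max\left(0,\ \frac{\chi^2}{n} - \frac{(k-1)(r-1)}{n-1}\right),\quad
\tilde{r} = r - \frac{(r-1)^2}{n-1},\quad
\tilde{k} = k - \frac{(k-1)^2}{n-1},\quad
\tilde{V} = \sqrt{\frac{\tilde{\varphi}^2}{\min(\tilde{k}-1,\ \tilde{r}-1)}}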

import numpy as np
import scipy.stats as ss

def cramers_corrected_stat(confusion_matrix):
    """ Calculate Cramér's V statistic for categorical-categorical association.
        Uses the correction from Bergsma and Wicher,
        Journal of the Korean Statistical Society 42 (2013): 323-328
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()  # grand total; works for numpy arrays and pandas crosstabs alike
    phi2 = chi2/n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))

Also note that the confusion matrix for two categorical columns can be calculated with a built-in pandas method:

import pandas as pd
confusion_matrix = pd.crosstab(df[column1], df[column2])
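For example, putting the two together on the question's data (a sketch; it assumes df is the question's dataframe, already filtered to a single metric):

# hypothetical usage: cross-tabulate nation vs. lang, then compute corrected Cramér's V
confusion_matrix = pd.crosstab(df['nation'], df['lang'])
print(cramers_corrected_stat(confusion_matrix))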

Answered by RomanS

Cramér's V statistic lets you measure the association between two categorical features in one data set, so it fits your case.

To calculate Cramér's V statistic you need to calculate the confusion matrix. So the solution steps are:
1. Filter the data for a single metric
2. Calculate the confusion matrix
3. Calculate Cramér's V statistic

Of course, you can do those steps in the loop nest provided in your post. But in your opening paragraph you mention only metrics as an outer parameter, so I am not sure you need both loops. I will provide code for steps 2-3, because filtering is simple and, as I mentioned, I am not sure exactly what you need.
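For step 1, a minimal sketch (assuming the dataframe and column names from the question; 'informativeness' is just an example metric):

# hypothetical step 1: keep only the rows for a single metric
data = df[df['metric'] == 'informativeness']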

Step 2. In the code below, data is a pandas.DataFrame filtered by whatever you want in step 1.

import numpy as np

confusions = []
for nation in list_of_nations:
    for language in list_of_languages:
        # use element-wise & (not `and`, which raises an error on pandas Series)
        cond = (data['nation'] == nation) & (data['lang'] == language)
        confusions.append(cond.sum())  # count of rows for this (nation, language) pair
confusion_matrix = np.array(confusions).reshape(len(list_of_nations), len(list_of_languages))
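As a side note, the same matrix can be built in one step with pd.crosstab, assuming list_of_nations and list_of_languages cover the values present in data; reindex fixes the row/column order and fills absent combinations with zeros:

import pandas as pd

# hypothetical equivalent of the loop above
confusion_matrix = (pd.crosstab(data['nation'], data['lang'])
                    .reindex(index=list_of_nations, columns=list_of_languages, fill_value=0)
                    .to_numpy())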

Step 3. In the code below, confusion_matrix is the numpy.ndarray obtained in step 2.

import numpy as np
import scipy.stats as ss

def cramers_stat(confusion_matrix):
    # chi-squared statistic of the contingency table
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    # total number of observations
    n = confusion_matrix.sum()
    # classic (uncorrected) Cramér's V
    return np.sqrt(chi2 / (n*(min(confusion_matrix.shape)-1)))

result = cramers_stat(confusion_matrix)

This code was tested on my own data set, but I hope it works without changes in your case.

Answered by Yury Wallet

A slightly modified version of the function from Ziggy Eunicien's answer, with two modifications added: 1) checking whether one of the variables is constant; 2) passing correction=correct to ss.chi2_contingency(conf_matrix, correction=correct), where correct is False if the confusion matrix is 2x2.

import scipy.stats as ss
import pandas as pd
import numpy as np

def cramers_corrected_stat(x, y):
    """ Calculate Cramér's V statistic for categorical-categorical association.
        Uses the correction from Bergsma and Wicher,
        Journal of the Korean Statistical Society 42 (2013): 323-328
    """
    result = -1
    if len(x.value_counts()) == 1:
        print("First variable is constant")
    elif len(y.value_counts()) == 1:
        print("Second variable is constant")
    else:
        conf_matrix = pd.crosstab(x, y)

        # Yates' continuity correction only makes sense for 2x2 tables
        if conf_matrix.shape[0] == 2:
            correct = False
        else:
            correct = True

        chi2 = ss.chi2_contingency(conf_matrix, correction=correct)[0]

        n = sum(conf_matrix.sum())  # grand total of the crosstab
        phi2 = chi2/n
        r, k = conf_matrix.shape
        phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
        rcorr = r - ((r-1)**2)/(n-1)
        kcorr = k - ((k-1)**2)/(n-1)
        result = np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))
    return round(result, 6)
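
For example (a sketch; assumes df is the question's dataframe and 'informativeness' is just an example metric):

# hypothetical usage on the question's nation/lang columns
sub = df[df['metric'] == 'informativeness']
print(cramers_corrected_stat(sub['nation'], sub['lang']))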