pandas: correlation of categorical features

Question

Asked by user8653080

I have some categorical features in my data along with continuous ones. Is it a good idea or an absolutely bad idea to one-hot encode the categorical features in order to find their correlation with the labels, along with the other continuous features?

Answer 1

Answered by Keiku

There is a way to calculate a correlation coefficient without one-hot encoding the categorical variable. Cramér's V statistic is one method for measuring the association between categorical variables. It can be calculated as follows. The following link is helpful: Using pandas, calculate Cramér's coefficient matrix. Variables with continuous values can first be binned into categories using pd.cut from pandas.
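For reference, the quantity computed by the cramers_v function in this answer can be written out as below. This is transcribed from the code itself rather than from the cited paper, with chi-squared taken from the r-by-k contingency table and n the total number of observations.

```latex
\begin{aligned}
\varphi^2 &= \frac{\chi^2}{n}, \\
\tilde\varphi^2 &= \max\!\left(0,\; \varphi^2 - \frac{(k-1)(r-1)}{n-1}\right), \\
\tilde r &= r - \frac{(r-1)^2}{n-1}, \qquad
\tilde k = k - \frac{(k-1)^2}{n-1}, \\
\tilde V &= \sqrt{\frac{\tilde\varphi^2}{\min(\tilde k - 1,\; \tilde r - 1)}}.
\end{aligned}
```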

import pandas as pd
import numpy as np
import scipy.stats as ss
import seaborn as sns

tips = sns.load_dataset("tips")

tips["total_bill_cut"] = pd.cut(tips["total_bill"],
                                np.arange(0, 55, 5),
                                include_lowest=True,
                                right=False)

def cramers_v(confusion_matrix):
    """Calculate Cramér's V statistic for categorical-categorical association.
        Uses the bias correction from Wicher Bergsma,
        Journal of the Korean Statistical Society 42 (2013): 323-328.
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))

# as_matrix() was removed in pandas 1.0; to_numpy() is the replacement
confusion_matrix = pd.crosstab(tips["day"], tips["time"]).to_numpy()
cramers_v(confusion_matrix)
# Out[10]: 0.93866193407222209

confusion_matrix = pd.crosstab(tips["total_bill_cut"], tips["time"]).to_numpy()
cramers_v(confusion_matrix)
# Out[24]: 0.16498707494988371
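
The link mentioned in this answer is about building a full matrix of Cramér's V values. A minimal sketch of that idea, assuming pandas 1.0 or later (for to_numpy) and using made-up data rather than the tips dataset, might look like this. The helper name cramers_v_matrix and the demo columns are illustrative and not taken from the linked post; cramers_v is repeated only so the block runs on its own.

```python
import itertools

import numpy as np
import pandas as pd
import scipy.stats as ss


def cramers_v(confusion_matrix):
    """Bias-corrected Cramér's V, same formula as the function above."""
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min(kcorr - 1, rcorr - 1))


def cramers_v_matrix(df, columns):
    """Symmetric DataFrame of pairwise Cramér's V (hypothetical helper)."""
    # The diagonal is set to 1: a column is perfectly associated with itself.
    result = pd.DataFrame(np.eye(len(columns)), index=columns, columns=columns)
    for a, b in itertools.combinations(columns, 2):
        table = pd.crosstab(df[a], df[b]).to_numpy()
        result.loc[a, b] = result.loc[b, a] = cramers_v(table)
    return result


# Made-up data: "size" is fully determined by "animal", "color" is random noise.
rng = np.random.default_rng(0)
animal = rng.choice(["cat", "dog", "horse"], size=300)
demo = pd.DataFrame({
    "animal": animal,
    "size": pd.Series(animal).map(
        {"cat": "small", "dog": "medium", "horse": "large"}),
    "color": rng.choice(["black", "white"], size=300),
})

v_matrix = cramers_v_matrix(demo, ["animal", "size", "color"])
print(v_matrix.round(3))
```

In this toy data the animal/size pair should come out at 1 (up to rounding), since one column fully determines the other, while the pairs involving color should be close to 0.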

Answer 2

Answered by Aleksey Vlasenko

I was looking to do the same thing in BigQuery. For numeric features you can use the built-in CORR(x, y) function. For categorical features, you can calculate it as: cardinality(cat1 x cat2) / max(cardinality(cat1), cardinality(cat2)), which translates to the following SQL:

SELECT 
COUNT(DISTINCT(CONCAT(cat1, cat2))) / GREATEST (COUNT(DISTINCT(cat1)), COUNT(DISTINCT(cat2))) as cat1_2,
COUNT(DISTINCT(CONCAT(cat1, cat3))) / GREATEST (COUNT(DISTINCT(cat1)), COUNT(DISTINCT(cat3))) as cat1_3,
....
FROM ...

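A higher number means lower correlation. The smallest possible value is 1, which occurs when every value of the higher-cardinality column maps to a single value of the other column.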

I used the following Python script to generate the SQL:

import itertools

# Column suffixes: cat1 .. cat9
arr = range(1, 10)

# One ratio expression per unordered pair of columns
template = ('COUNT(DISTINCT(CONCAT(cat{a}, cat{b}))) / '
            'GREATEST (COUNT(DISTINCT(cat{a})), COUNT(DISTINCT(cat{b}))) as cat{a}_{b}')
query = ',\n'.join(template.format(a=a, b=b)
                   for (a, b) in itertools.combinations(arr, 2))
query = 'SELECT \n ' + query + '\n FROM  `...`;'
print(query)

It should be straightforward to do the same thing in numpy.
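
As a rough illustration of the same ratio outside BigQuery, here is a minimal sketch that uses pandas rather than bare numpy. The function name cardinality_ratio and the demo data are made up for this example, and it assumes the columns contain no missing values (pandas and SQL treat NULLs differently when counting distinct values).

```python
import itertools

import pandas as pd


def cardinality_ratio(df, a, b):
    """Distinct (a, b) pairs divided by the larger of the two column cardinalities."""
    n_pairs = len(df[[a, b]].drop_duplicates())
    return n_pairs / max(df[a].nunique(), df[b].nunique())


# Made-up data; the column names cat1..cat3 mirror the SQL above.
demo = pd.DataFrame({
    "cat1": ["a", "a", "b", "b", "c", "c"],
    "cat2": ["x", "x", "y", "y", "z", "z"],  # fully determined by cat1
    "cat3": ["p", "q", "p", "q", "p", "q"],  # varies independently of cat1
})

# One ratio per unordered pair of columns, like the generated SQL.
ratios = {
    (a, b): cardinality_ratio(demo, a, b)
    for a, b in itertools.combinations(demo.columns, 2)
}
for pair, value in ratios.items():
    print(pair, value)
```

Counting distinct row pairs with drop_duplicates also avoids the ambiguity of string concatenation, where for instance "ab" + "c" and "a" + "bc" would collapse into the same key.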
