Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/42885239/

Date: 2020-09-14 03:13:37  Source: igfitidea

Correlation matrix of two Pandas dataframe, with P values

Tags: python, pandas, matrix, dataframe, correlation

Asked by valeten

I was using this function (see bottom) to calculate both the Pearson coefficients and the p-values from two dataframes, but I am not confident in the p-value results: too many negative correlations seem to be significant.

Is there a more elegant way (such as a one-liner) to calculate the p-values along with the Pearson coefficients?

These two answers (pandas.DataFrame corrwith() method) and (correlation matrix of one dataframe with another) provide elegant solutions, but the p-value calculation is missing.

Here is the code:


import numpy as np
import pandas as pd
from scipy import stats


def pearson_cross_map(df1, df2):
    """Correlate each Mvar with each Nvar.

    Parameters
    ----------
    df1 : dataframe1
    Shape Mobs X Mvar.

    df2 : dataframe2
    Shape Nobs X Nvar.

    Returns
    -------
    DFcorr, dataframe Mvar x Nvar in which each element is a Pearson
    correlation coefficient.
    DFpval, dataframe Mvar x Nvar in which each element is a P value (one-tailed).

    """

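    # Keep only the observations (index labels) present in both dataframes.
    intersection = df1.index.intersection(df2.index).tolist()
    # convert_objects() and DataFrame.sort() were removed from pandas;
    # pd.to_numeric and sort_index() are used here as their replacements.
    df1 = df1.apply(pd.to_numeric, errors='coerce')
    df1 = df1.loc[intersection]
    # Drop all-zero columns, then sort rows and columns by label.
    df1 = df1.loc[:, (df1 != 0).any(axis=0)].sort_index().sort_index(axis=1)
    df2 = df2.apply(pd.to_numeric, errors='coerce')
    df2 = df2.loc[intersection]
    df2 = df2.loc[:, (df2 != 0).any(axis=0)].sort_index().sort_index(axis=1)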
    x = df1.T.values
    y = df2.T.values
    mu_x = x.mean(1)
    mu_y = y.mean(1)
    n = x.shape[1]
    s_x = x.std(1, ddof=n - 1)
    s_y = y.std(1, ddof=n - 1)
    cov = np.dot(x,y.T) - n * np.dot(mu_x[:, np.newaxis], mu_y[np.newaxis, :])
    DFcoeff = pd.DataFrame(cov / np.dot(s_x[:, np.newaxis], s_y[np.newaxis, :]))
    DFcoeff.index = df1.columns.tolist()
    DFcoeff.columns = df2.columns.tolist()
    n = len(intersection)
    r = DFcoeff
    t = r*np.sqrt((n-2)/(1-r*r))
    DFpval = pd.DataFrame(stats.t.cdf(t, n-2))
    DFpval.index = df1.columns.tolist()
    DFpval.columns = df2.columns.tolist()
    return DFcoeff, DFpval

Thank you!


Answered by Parfait

You need a Pearson correlation test, not just the correlation calculation. Hence, use the scipy.stats.pearsonr method, which returns the estimated Pearson coefficient and the two-tailed p-value.
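To illustrate the difference between one-tailed and two-tailed p-values, here is a minimal sketch. The data, the `pvalues_from_r` helper, and the variable names are made up for illustration. It converts r to a t statistic with n - 2 degrees of freedom, as the function in the question does, and compares the lower-tail value `stats.t.cdf(t, n - 2)` with the two-tailed value `2 * stats.t.sf(abs(t), n - 2)`, which should agree with what `pearsonr` reports.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 30
x = rng.normal(size=n)
# Illustrative series: one built to correlate negatively with x, one positively.
series = {
    "negative": -2.0 * x + rng.normal(size=n),
    "positive": 2.0 * x + rng.normal(size=n),
}


def pvalues_from_r(r, n):
    """Hypothetical helper: lower-tail and two-tailed p-values for a Pearson r."""
    t = r * np.sqrt((n - 2) / (1 - r * r))
    lower_tail = stats.t.cdf(t, n - 2)             # what the question's code computes
    two_tailed = 2 * stats.t.sf(np.abs(t), n - 2)  # symmetric in the sign of r
    return lower_tail, two_tailed


results = {}
for label, y in series.items():
    r, p_scipy = stats.pearsonr(x, y)
    p_lower, p_two = pvalues_from_r(r, n)
    results[label] = (r, p_lower, p_two, p_scipy)
    print(f"{label}: r={r:.3f} lower-tail={p_lower:.3g} "
          f"two-tailed={p_two:.3g} pearsonr={p_scipy:.3g}")
```

The lower-tail value is small whenever r is negative and close to 1 whenever r is positive, so a one-tailed p-value of this kind flags only negative correlations as significant. That may explain the pattern described in the question.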

Since the method takes two one-dimensional inputs, consider iterating over each pair of columns from the two dataframes to fill pre-allocated matrices. You can then cast the results to dataframes with the needed columns and index:

import numpy as np
import pandas as pd
from scipy.stats import pearsonr

df1 = pd.DataFrame(np.random.rand(10, 5), columns=['Col1', 'Col2', 'Col3', 'Col4', 'Col5'])
df2 = pd.DataFrame(np.random.rand(10, 5), columns=['Col1', 'Col2', 'Col3', 'Col4', 'Col5'])

coeffmat = np.zeros((df1.shape[1], df2.shape[1]))
pvalmat = np.zeros((df1.shape[1], df2.shape[1]))

for i in range(df1.shape[1]):    
    for j in range(df2.shape[1]):        
        corrtest = pearsonr(df1[df1.columns[i]], df2[df2.columns[j]])  

        coeffmat[i,j] = corrtest[0]
        pvalmat[i,j] = corrtest[1]

dfcoeff = pd.DataFrame(coeffmat, columns=df2.columns, index=df1.columns)
print(dfcoeff)
#           Col1      Col2      Col3      Col4      Col5
# Col1 -0.791083  0.459101 -0.488463 -0.289265  0.494897
# Col2  0.059446 -0.395072  0.310900  0.297532  0.201669
# Col3 -0.062592  0.391469 -0.450600 -0.136554  0.299579
# Col4 -0.470203  0.797971 -0.193561 -0.338896 -0.244132
# Col5 -0.057848 -0.037053  0.042798  0.176966 -0.157344

dfpvals = pd.DataFrame(pvalmat, columns=df2.columns, index=df1.columns)
print(dfpvals)
#           Col1      Col2      Col3      Col4      Col5
# Col1  0.006421  0.181967  0.152007  0.417574  0.145871
# Col2  0.870421  0.258506  0.381919  0.403770  0.576357
# Col3  0.863615  0.263268  0.191245  0.706796  0.400385
# Col4  0.170260  0.005666  0.592096  0.338101  0.496668
# Col5  0.873881  0.919058  0.906551  0.624783  0.664206
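For wider dataframes, the double loop can be replaced by matrix operations. The following is only a sketch under stated assumptions: `cross_corr_pvalues` is a hypothetical helper name, both dataframes are assumed to have unique index labels, no missing values, and no constant columns, and the p-value comes from the usual t statistic with n - 2 degrees of freedom rather than from `pearsonr` itself.

```python
import numpy as np
import pandas as pd
from scipy import stats


def cross_corr_pvalues(df1, df2):
    """Hypothetical helper: Pearson r and two-tailed p-values for every
    column of df1 against every column of df2, with rows aligned on the index."""
    common = df1.index.intersection(df2.index)
    x = df1.loc[common].to_numpy(dtype=float)
    y = df2.loc[common].to_numpy(dtype=float)
    n = len(common)

    # Standardize each column (population std) so r becomes a scaled dot product.
    zx = (x - x.mean(axis=0)) / x.std(axis=0)
    zy = (y - y.mean(axis=0)) / y.std(axis=0)
    r = np.clip(zx.T @ zy / n, -1.0, 1.0)  # guard against rounding past +/-1

    # t statistic with n - 2 degrees of freedom, then the two-tailed p-value.
    # A perfect correlation (|r| == 1) yields an infinite t and a p-value of 0.
    t = r * np.sqrt((n - 2) / (1 - r ** 2))
    p = 2 * stats.t.sf(np.abs(t), n - 2)

    coeff = pd.DataFrame(r, index=df1.columns, columns=df2.columns)
    pvals = pd.DataFrame(p, index=df1.columns, columns=df2.columns)
    return coeff, pvals


# Usage with made-up data.
rng = np.random.default_rng(42)
a = pd.DataFrame(rng.random((10, 3)), columns=["A1", "A2", "A3"])
b = pd.DataFrame(rng.random((10, 2)), columns=["B1", "B2"])
coeff, pvals = cross_corr_pvalues(a, b)
print(coeff)
print(pvals)
```

Aligning on the shared index mirrors the intersection step in the question's function. If the data contains missing values, the pairwise loop above is likely the safer choice, since each pair can be cleaned separately.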

Answered by Krystian Zawistowski

You could compare this with a bootstrap significance estimate (i.e. if you randomly shuffle one series, what is the probability of getting the same or a greater correlation?). This is not the same thing as Pearson's p-value, as the latter is derived under the assumption that your data is normally distributed, so you could get a somewhat different result if that is not the case.

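import numpy as np

bootstrapLen = 1000  # number of random shuffles
leng = 10000         # length of each series
X, Y = [np.random.randn(leng) for _ in [1, 2]]
# For standard-normal data, the mean of the products approximates the correlation.
correlation = np.correlate(X, Y) / leng

# Absolute correlations obtained after randomly shuffling Y.
bootstrap = [abs(np.correlate(X, Y[np.random.permutation(leng)]) / leng) for _ in range(bootstrapLen)]
bootstrap = np.sort(np.ravel(bootstrap))
# Fraction of shuffles whose absolute correlation is smaller than the observed one
# (values close to 1 mean the observed correlation is unlikely to arise by chance).
significance = np.searchsorted(bootstrap, abs(correlation)) / bootstrapLen

print("correlation is {} with significance {}".format(correlation, significance))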
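The same shuffling idea can be applied to real data that is not standardized by using the Pearson coefficient directly. The sketch below is an assumption-laden variant, not the answer's own code: `permutation_pvalue` is a hypothetical helper, `np.corrcoef` replaces `np.correlate`, and the result is reported as a two-sided p-value (small means significant) instead of the fraction printed above.

```python
import numpy as np


def permutation_pvalue(x, y, n_permutations=1000, seed=None):
    """Hypothetical helper: two-sided permutation p-value for Pearson's r."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = np.corrcoef(x, y)[0, 1]

    # Shuffling y breaks its pairing with x, which simulates "no association".
    shuffled = np.array([
        np.corrcoef(x, rng.permutation(y))[0, 1]
        for _ in range(n_permutations)
    ])

    # Share of shuffles at least as extreme as the observed value. The +1 terms
    # count the observed arrangement itself, so the estimate is never exactly 0.
    extreme = np.sum(np.abs(shuffled) >= abs(observed))
    return observed, (extreme + 1) / (n_permutations + 1)


# Usage with made-up, positively related data.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)
r, p = permutation_pvalue(x, y, n_permutations=500, seed=2)
print("r = {:.3f}, permutation p-value = {:.4f}".format(r, p))
```

The smallest p-value this can report is 1 / (n_permutations + 1), so more shuffles are needed to resolve very small p-values.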