pandas columns correlation with statistical significance
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
原文地址: http://stackoverflow.com/questions/25571882/
pandas columns correlation with statistical significance
Asked by wolfsatthedoor
What is the best way, given a pandas dataframe df, to get the correlation between its columns df.1 and df.2?
I do not want the output to count rows with NaN, which pandas' built-in correlation does. But I also want it to output a p-value or a standard error, which the built-in does not.
SciPy seems to get caught up by the NaNs, though I believe it does report significance.
Data example:
     1    2
0    2  NaN
1  NaN    1
2    1    2
3   -4    3
4  1.3    1
5  NaN  NaN
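As a minimal sketch of the workaround the answers below converge on (variable names are illustrative): drop the rows that have a NaN in either column before handing them to scipy.stats.pearsonr, which would otherwise propagate the NaN.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# The example data from the question
df = pd.DataFrame({'1': [2, np.nan, 1, -4, 1.3, np.nan],
                   '2': [np.nan, 1, 2, 3, 1, np.nan]})

# Keep only rows where both columns are present, then correlate
clean = df[['1', '2']].dropna()
r, p = pearsonr(clean['1'], clean['2'])  # r is the correlation, p the two-sided p-value
```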
Accepted answer by BKay
Answer provided by @Shashank is nice. However, if you want a solution in pure pandas, you may like this:
import pandas as pd
from pandas.io.data import DataReader  # moved to the separate pandas_datareader package in later pandas versions
from datetime import datetime
import scipy.stats as stats

gdp = pd.DataFrame(DataReader("GDP", "fred", start=datetime(1990, 1, 1)))
vix = pd.DataFrame(DataReader("VIXCLS", "fred", start=datetime(1990, 1, 1)))
# Do it with a pandas regression to get the p-value from the F-test
# (pd.ols was removed in later pandas versions; statsmodels is the usual replacement)
df = gdp.merge(vix, left_index=True, right_index=True, how='left')
vix_on_gdp = pd.ols(y=df['VIXCLS'], x=df['GDP'], intercept=True)
print(df['VIXCLS'].corr(df['GDP']), vix_on_gdp.f_stat['p-value'])
Results:
-0.0422917932738 0.851762475093
Same results as stats function:
#Do it with stats functions.
df_clean = df.dropna()
stats.pearsonr(df_clean['VIXCLS'], df_clean['GDP'])
Results:
(-0.042291793273791969, 0.85176247509284908)
To extend to more variables, I give you an ugly loop-based approach:
# Add a third field
oil = pd.DataFrame(DataReader("DCOILWTICO", "fred", start=datetime(1990, 1, 1)))
df = df.merge(oil, left_index=True, right_index=True, how='left')

# Construct two arrays, one of the correlations and the other of the p-values
import numpy as np  # needed for np.zeros below
rho = df.corr()
pval = np.zeros([df.shape[1], df.shape[1]])
for i in range(df.shape[1]):  # iterate over the columns of the matrix
    for j in range(df.shape[1]):
        # df.icol(i) was later replaced by df.iloc[:, i]
        JonI = pd.ols(y=df.icol(i), x=df.icol(j), intercept=True)
        pval[i, j] = JonI.f_stat['p-value']
Results of rho:
GDP VIXCLS DCOILWTICO
GDP 1.000000 -0.042292 0.870251
VIXCLS -0.042292 1.000000 -0.004612
DCOILWTICO 0.870251 -0.004612 1.000000
Results of pval:
[[ 0.00000000e+00 8.51762475e-01 1.11022302e-16]
[ 8.51762475e-01 0.00000000e+00 9.83747425e-01]
[ 1.11022302e-16 9.83747425e-01 0.00000000e+00]]
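pd.ols and DataFrame.icol were removed in later pandas versions, so the loop above no longer runs as written. Here is a hedged sketch of the same idea with scipy.stats.pearsonr, which reports the p-value directly so no F-test is needed (the function name is mine, and NaNs are dropped per column pair rather than up front):

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

def corr_and_pvals(df):
    """Pairwise correlations and p-values, dropping NaNs per column pair."""
    n = df.shape[1]
    rho = np.ones((n, n))
    pval = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            # keep only rows where both columns of this pair are present
            pair = df.iloc[:, [i, j]].dropna()
            rho[i, j], pval[i, j] = pearsonr(pair.iloc[:, 0], pair.iloc[:, 1])
    return rho, pval
```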
Answered by Shashank Agarwal
You can use the scipy.stats correlation functions to get the p-value.
For example, if you are looking for a correlation such as the Pearson correlation, you can use the pearsonr function.
from scipy.stats import pearsonr
pearsonr([1, 2, 3], [4, 3, 7])
Gives output
(0.7205766921228921, 0.48775429164459994)
Where the first value in the tuple is the correlation value, and second is the p-value.
In your case, you can use pandas' dropna function to remove the NaN values first.
df_clean = df[['column1', 'column2']].dropna()
pearsonr(df_clean['column1'], df_clean['column2'])
Answered by Somendra Joshi
I have tried to sum up the logic in a function; it might not be the most efficient approach, but it will give you output similar to pandas' df.corr(). To use it, just put the following function in your code and call it with your dataframe object, i.e. corr_pvalue(your_dataframe).
I have rounded the values to 4 decimal places; if you want different output, please change the value in the round function.
from scipy.stats import pearsonr
import numpy as np
import pandas as pd

def corr_pvalue(df):
    numeric_df = df.dropna()._get_numeric_data()
    cols = numeric_df.columns
    mat = numeric_df.values
    arr = np.zeros((len(cols), len(cols)), dtype=object)
    for xi, x in enumerate(mat.T):
        for yi, y in enumerate(mat.T[xi:]):
            # wrap map in tuple() so this also works on Python 3, where map is lazy
            arr[xi, yi + xi] = tuple(map(lambda v: round(v, 4), pearsonr(x, y)))
            arr[yi + xi, xi] = arr[xi, yi + xi]
    return pd.DataFrame(arr, index=cols, columns=cols)
I have tested it with pandas v0.18.1.
Answered by toto_tico
To calculate all the p-values at once, you can use the calculate_pvalues function (code below):
df = pd.DataFrame({'A':[1,2,3], 'B':[2,5,3], 'C':[5,2,1], 'D':['text',2,3] })
calculate_pvalues(df)
The output is similar to that of corr() (but with p-values):
A B C
A 0 0.7877 0.1789
B 0.7877 0 0.6088
C 0.1789 0.6088 0
Details:
- Column D is automatically ignored as it contains text.
- p-values are rounded to 4 decimals.
- You can subset to indicate exact columns: calculate_pvalues(df[['A','B','C']])
Following is the code of the function:
from scipy.stats import pearsonr
import pandas as pd

def calculate_pvalues(df):
    df = df.dropna()._get_numeric_data()
    dfcols = pd.DataFrame(columns=df.columns)
    pvalues = dfcols.transpose().join(dfcols, how='outer')
    for r in df.columns:
        for c in df.columns:
            pvalues[r][c] = round(pearsonr(df[r], df[c])[1], 4)
    return pvalues
Answered by tozCSS
rho = df.corr()
rho = rho.round(2)
pval = calculate_pvalues(df) # toto_tico's answer
# create three masks
r1 = rho.applymap(lambda x: '{}*'.format(x))
r2 = rho.applymap(lambda x: '{}**'.format(x))
r3 = rho.applymap(lambda x: '{}***'.format(x))
# apply them where appropriate
rho = rho.mask(pval<=0.1,r1)
rho = rho.mask(pval<=0.05,r2)
rho = rho.mask(pval<=0.01,r3)
rho
# note I prefer readability over the conciseness of code,
# instead of six lines it could have been a single liner like this:
# [rho.mask(pval<=p,rho.applymap(lambda x: '{}*'.format(x)),inplace=True) for p in [.1,.05,.01]]
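The snippet above depends on the calculate_pvalues function from an earlier answer. Here is a self-contained sketch of the same star-annotation idea, assuming pandas >= 0.24 for the callable method= trick (the data and the stars helper are mine):

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [2, 1, 4, 3, 7],
                   'C': [5, 4, 3, 1, 0]})

rho = df.corr()
# p-values via corr(method=...); pandas forces the diagonal to 1.0,
# so subtract the identity matrix to zero it out
pval = df.corr(method=lambda x, y: pearsonr(x, y)[1]) - np.eye(len(df.columns))

def stars(p):
    # significance stars at the usual 10% / 5% / 1% thresholds
    return '***' if p <= 0.01 else '**' if p <= 0.05 else '*' if p <= 0.1 else ''

annotated = rho.round(2).astype(str) + pval.apply(lambda col: col.map(stars))
```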
Answered by user2730303
That was very useful code by oztalha. I just changed the formatting (rounded to 2 digits) wherever r was not significant.
rho = data.corr()
pval = calculate_pvalues(data) # toto_tico's answer
# create three masks
r1 = rho.applymap(lambda x: '{:.2f}*'.format(x))
r2 = rho.applymap(lambda x: '{:.2f}**'.format(x))
r3 = rho.applymap(lambda x: '{:.2f}***'.format(x))
r4 = rho.applymap(lambda x: '{:.2f}'.format(x))
# apply them where appropriate --this could be a single liner
rho = rho.mask(pval>0.1,r4)
rho = rho.mask(pval<=0.1,r1)
rho = rho.mask(pval<=0.05,r2)
rho = rho.mask(pval<=0.01,r3)
rho
Answered by Matheus Araujo
Great answers from @toto_tico and @Somendra-joshi. However, they drop NA values unnecessarily. In this snippet, I'm only dropping the NAs that belong to the pair of columns whose correlation is being computed at the moment. The actual corr implementation does the same.
def calculate_pvalues(df):
    df = df._get_numeric_data()
    dfcols = pd.DataFrame(columns=df.columns)
    pvalues = dfcols.transpose().join(dfcols, how='outer')
    for r in df.columns:
        for c in df.columns:
            if c == r:
                df_corr = df[[r]].dropna()
            else:
                df_corr = df[[r, c]].dropna()
            pvalues[r][c] = pearsonr(df_corr[r], df_corr[c])[1]
    return pvalues
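To see the difference concretely (illustrative data): with a pairwise dropna, a NaN in column C no longer costs a row for the A-B pair, whereas a global dropna would discard that row for every pair.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [2, 1, 4, 3],
                   'C': [1, np.nan, 2, 3]})

rows_global = len(df.dropna())          # global dropna: row 1 is lost for every pair
rows_ab = len(df[['A', 'B']].dropna())  # pairwise: the A-B pair keeps all four rows
```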
Answered by Fabian Rost
In pandas v0.24.0, a method argument was added to corr. Now, you can do:
import pandas as pd
import numpy as np
from scipy.stats import pearsonr
df = pd.DataFrame({'A':[1,2,3], 'B':[2,5,3], 'C':[5,2,1]})
df.corr(method=lambda x, y: pearsonr(x, y)[1]) - np.eye(len(df.columns))
A B C
A 0.000000 0.787704 0.178912
B 0.787704 0.000000 0.608792
C 0.178912 0.608792 0.000000
Please note the workaround with np.eye(len(df.columns)), which is needed because self-correlations are always set to 1.0 (see https://github.com/pandas-dev/pandas/issues/25726).


