pandas columns correlation with statistical significance
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
原文地址: http://stackoverflow.com/questions/25571882/
pandas columns correlation with statistical significance
Asked by wolfsatthedoor
What is the best way, given a pandas dataframe df, to get the correlation between its columns df.1 and df.2?
I do not want the output to count rows with NaN, which pandas' built-in correlation does. But I also want it to output a p-value or a standard error, which the built-in does not.
SciPy seems to get caught up by the NaNs, though I believe it does report significance.
Data example:
     1    2
0    2  NaN
1  NaN    1
2    1    2
3   -4    3
4  1.3    1
5  NaN  NaN
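As a minimal sketch of the workaround the answers below converge on (variable names are illustrative): drop the rows that have a NaN in either column before handing them to scipy.stats.pearsonr, which would otherwise propagate the NaN.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# The example data from the question
df = pd.DataFrame({'1': [2, np.nan, 1, -4, 1.3, np.nan],
                   '2': [np.nan, 1, 2, 3, 1, np.nan]})

# Keep only rows where both columns are present, then correlate
clean = df[['1', '2']].dropna()
r, p = pearsonr(clean['1'], clean['2'])  # r is the correlation, p the two-sided p-value
```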
Accepted answer by BKay
Answer provided by @Shashank is nice. However, if you want a solution in pure pandas, you may like this:
import pandas as pd
from pandas.io.data import DataReader  # moved to the separate pandas_datareader package in later pandas versions
from datetime import datetime
import scipy.stats as stats

gdp = pd.DataFrame(DataReader("GDP", "fred", start=datetime(1990, 1, 1)))
vix = pd.DataFrame(DataReader("VIXCLS", "fred", start=datetime(1990, 1, 1)))
# Do it with a pandas regression to get the p-value from the F-test
# (pd.ols was removed in later pandas versions; statsmodels is the usual replacement)
df = gdp.merge(vix, left_index=True, right_index=True, how='left')
vix_on_gdp = pd.ols(y=df['VIXCLS'], x=df['GDP'], intercept=True)
print(df['VIXCLS'].corr(df['GDP']), vix_on_gdp.f_stat['p-value'])
Results:
-0.0422917932738 0.851762475093
Same results as stats function:
#Do it with stats functions.
df_clean = df.dropna()
stats.pearsonr(df_clean['VIXCLS'], df_clean['GDP'])
Results:
(-0.042291793273791969, 0.85176247509284908)
To extend to more variables, I give you an ugly loop-based approach:
# Add a third field
oil = pd.DataFrame(DataReader("DCOILWTICO", "fred", start=datetime(1990, 1, 1)))
df = df.merge(oil, left_index=True, right_index=True, how='left')

# Construct two arrays, one of the correlations and the other of the p-values
import numpy as np  # needed for np.zeros below
rho = df.corr()
pval = np.zeros([df.shape[1], df.shape[1]])
for i in range(df.shape[1]):  # iterate over the columns of the matrix
    for j in range(df.shape[1]):
        # df.icol(i) was later replaced by df.iloc[:, i]
        JonI = pd.ols(y=df.icol(i), x=df.icol(j), intercept=True)
        pval[i, j] = JonI.f_stat['p-value']
Results of rho:
GDP VIXCLS DCOILWTICO
GDP 1.000000 -0.042292 0.870251
VIXCLS -0.042292 1.000000 -0.004612
DCOILWTICO 0.870251 -0.004612 1.000000
Results of pval:
[[ 0.00000000e+00 8.51762475e-01 1.11022302e-16]
[ 8.51762475e-01 0.00000000e+00 9.83747425e-01]
[ 1.11022302e-16 9.83747425e-01 0.00000000e+00]]
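pd.ols and DataFrame.icol were removed in later pandas versions, so the loop above no longer runs as written. Here is a hedged sketch of the same idea with scipy.stats.pearsonr, which reports the p-value directly so no F-test is needed (the function name is mine, and NaNs are dropped per column pair rather than up front):

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

def corr_and_pvals(df):
    """Pairwise correlations and p-values, dropping NaNs per column pair."""
    n = df.shape[1]
    rho = np.ones((n, n))
    pval = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            # keep only rows where both columns of this pair are present
            pair = df.iloc[:, [i, j]].dropna()
            rho[i, j], pval[i, j] = pearsonr(pair.iloc[:, 0], pair.iloc[:, 1])
    return rho, pval
```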
Answered by Shashank Agarwal
You can use the scipy.stats correlation functions to get the p-value.
For example, if you are looking for a correlation such as the Pearson correlation, you can use the pearsonr function.
from scipy.stats import pearsonr
pearsonr([1, 2, 3], [4, 3, 7])
Gives output
(0.7205766921228921, 0.48775429164459994)
Where the first value in the tuple is the correlation value, and second is the p-value.
In your case, you can use pandas' dropna function to remove the NaN values first.
df_clean = df[['column1', 'column2']].dropna()
pearsonr(df_clean['column1'], df_clean['column2'])
Answered by Somendra Joshi
I have tried to sum up the logic in a function; it might not be the most efficient approach, but it will give you output similar to pandas' df.corr(). To use it, just put the following function in your code and call it with your dataframe object, i.e. corr_pvalue(your_dataframe).
I have rounded the values to 4 decimal places; if you want different output, please change the value in the round function.
from scipy.stats import pearsonr
import numpy as np
import pandas as pd

def corr_pvalue(df):
    numeric_df = df.dropna()._get_numeric_data()
    cols = numeric_df.columns
    mat = numeric_df.values
    arr = np.zeros((len(cols), len(cols)), dtype=object)
    for xi, x in enumerate(mat.T):
        for yi, y in enumerate(mat.T[xi:]):
            # wrap map in tuple() so this also works on Python 3, where map is lazy
            arr[xi, yi + xi] = tuple(map(lambda v: round(v, 4), pearsonr(x, y)))
            arr[yi + xi, xi] = arr[xi, yi + xi]
    return pd.DataFrame(arr, index=cols, columns=cols)
I have tested it with pandas v0.18.1.
Answered by toto_tico
To calculate all the p-values at once, you can use the calculate_pvalues function (code below):
df = pd.DataFrame({'A':[1,2,3], 'B':[2,5,3], 'C':[5,2,1], 'D':['text',2,3] })
calculate_pvalues(df)
The output is similar to that of corr() (but with p-values):
A B C
A 0 0.7877 0.1789
B 0.7877 0 0.6088
C 0.1789 0.6088 0
Details:
- Column D is automatically ignored as it contains text.
- p-values are rounded to 4 decimals.
- You can subset to indicate exact columns: calculate_pvalues(df[['A','B','C']])
Following is the code of the function:
from scipy.stats import pearsonr
import pandas as pd

def calculate_pvalues(df):
    df = df.dropna()._get_numeric_data()
    dfcols = pd.DataFrame(columns=df.columns)
    pvalues = dfcols.transpose().join(dfcols, how='outer')
    for r in df.columns:
        for c in df.columns:
            pvalues[r][c] = round(pearsonr(df[r], df[c])[1], 4)
    return pvalues
Answered by tozCSS
rho = df.corr()
rho = rho.round(2)
pval = calculate_pvalues(df) # toto_tico's answer
# create three masks
r1 = rho.applymap(lambda x: '{}*'.format(x))
r2 = rho.applymap(lambda x: '{}**'.format(x))
r3 = rho.applymap(lambda x: '{}***'.format(x))
# apply them where appropriate
rho = rho.mask(pval<=0.1,r1)
rho = rho.mask(pval<=0.05,r2)
rho = rho.mask(pval<=0.01,r3)
rho
# note I prefer readability over the conciseness of code,
# instead of six lines it could have been a single liner like this:
# [rho.mask(pval<=p,rho.applymap(lambda x: '{}*'.format(x)),inplace=True) for p in [.1,.05,.01]]
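The snippet above depends on the calculate_pvalues function from an earlier answer. Here is a self-contained sketch of the same star-annotation idea, assuming pandas >= 0.24 for the callable method= trick (the data and the stars helper are mine):

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [2, 1, 4, 3, 7],
                   'C': [5, 4, 3, 1, 0]})

rho = df.corr()
# p-values via corr(method=...); pandas forces the diagonal to 1.0,
# so subtract the identity matrix to zero it out
pval = df.corr(method=lambda x, y: pearsonr(x, y)[1]) - np.eye(len(df.columns))

def stars(p):
    # significance stars at the usual 10% / 5% / 1% thresholds
    return '***' if p <= 0.01 else '**' if p <= 0.05 else '*' if p <= 0.1 else ''

annotated = rho.round(2).astype(str) + pval.apply(lambda col: col.map(stars))
```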
Answered by user2730303
That was very useful code by oztalha. I just changed the formatting (rounded to 2 digits) wherever r was not significant.
rho = data.corr()
pval = calculate_pvalues(data) # toto_tico's answer
# create three masks
r1 = rho.applymap(lambda x: '{:.2f}*'.format(x))
r2 = rho.applymap(lambda x: '{:.2f}**'.format(x))
r3 = rho.applymap(lambda x: '{:.2f}***'.format(x))
r4 = rho.applymap(lambda x: '{:.2f}'.format(x))
# apply them where appropriate --this could be a single liner
rho = rho.mask(pval>0.1,r4)
rho = rho.mask(pval<=0.1,r1)
rho = rho.mask(pval<=0.05,r2)
rho = rho.mask(pval<=0.01,r3)
rho
Answered by Matheus Araujo
Great answers from @toto_tico and @Somendra-joshi. However, they drop NA values unnecessarily. In this snippet, I'm only dropping the NAs that belong to the pair of columns whose correlation is being computed at the moment. The actual corr implementation does the same.
def calculate_pvalues(df):
    df = df._get_numeric_data()
    dfcols = pd.DataFrame(columns=df.columns)
    pvalues = dfcols.transpose().join(dfcols, how='outer')
    for r in df.columns:
        for c in df.columns:
            if c == r:
                df_corr = df[[r]].dropna()
            else:
                df_corr = df[[r, c]].dropna()
            pvalues[r][c] = pearsonr(df_corr[r], df_corr[c])[1]
    return pvalues
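To see the difference concretely (illustrative data): with a pairwise dropna, a NaN in column C no longer costs a row for the A-B pair, whereas a global dropna would discard that row for every pair.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [2, 1, 4, 3],
                   'C': [1, np.nan, 2, 3]})

rows_global = len(df.dropna())          # global dropna: row 1 is lost for every pair
rows_ab = len(df[['A', 'B']].dropna())  # pairwise: the A-B pair keeps all four rows
```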
Answered by Fabian Rost
In pandas v0.24.0, a method argument was added to corr. Now, you can do:
import pandas as pd
import numpy as np
from scipy.stats import pearsonr
df = pd.DataFrame({'A':[1,2,3], 'B':[2,5,3], 'C':[5,2,1]})
df.corr(method=lambda x, y: pearsonr(x, y)[1]) - np.eye(len(df.columns))
A B C
A 0.000000 0.787704 0.178912
B 0.787704 0.000000 0.608792
C 0.178912 0.608792 0.000000
Please note the workaround with np.eye(len(df.columns)), which is needed because self-correlations are always set to 1.0 (see https://github.com/pandas-dev/pandas/issues/25726).


