Original question: http://stackoverflow.com/questions/52229220/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Partial Correlation in Python
Asked by user8834780
I ran a correlation matrix:
import seaborn as sns
sns.pairplot(data.dropna())
corr = data.dropna().corr()
corr.style.background_gradient(cmap='coolwarm').format(precision=2)  # set_precision(2) in pandas < 1.3
and it looks like advisory_pct is fairly (0.57) negatively correlated with all_brokerage_pct. As far as I understand, I can claim that we are fairly certain that "when an advisor has a low % of advisory in his portfolio, he has a high % of all brokerage in his portfolio".
However this is a "pairwise" correlation, and we are not controlling for the effect of the rest of the possible variables.
I searched SO and was not able to find how to run a "partial correlation", where the correlation matrix gives the correlation between every two variables while controlling for the rest of the variables. For this purpose, let's assume brokerage % + etf brokerage % + advisory % + all brokerage % = ~100% of portfolio.
Does such a function exist?
-- EDIT -- Running the data as per https://stats.stackexchange.com/questions/288273/partial-correlation-in-panda-dataframe-python:
import numpy as np
import pandas as pd
data_dict = {'x1': [1, 2, 3, 4, 5], 'x2': [2, 2, 3, 4, 2], 'x3': [10, 9, 5, 4, 9], 'y' : [5.077, 32.330, 65.140, 47.270, 80.570]}
data = pd.DataFrame(data_dict, columns=['x1', 'x2', 'x3', 'y'])
partial_corr_array = data.to_numpy()  # as_matrix() was removed in pandas 1.0
# Prepend a column of ones so that each fit gets an intercept term
data_int = np.hstack((np.ones((partial_corr_array.shape[0],1)), partial_corr_array))
print(data_int)
[[ 1. 1. 2. 10. 5.077]
[ 1. 2. 2. 9. 32.33 ]
[ 1. 3. 3. 5. 65.14 ]
[ 1. 4. 4. 4. 47.27 ]
[ 1. 5. 2. 9. 80.57 ]]
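# partial_corr is the function from the linked answer; the 4x4 result below implies the 5-column data_int was passed in
arr = np.round(partial_corr(data_int)[1:, 1:], decimals=2)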
print(arr)
[[ 1. 0.99 0.99 1. ]
[ 0.99 1. -1. -0.99]
[ 0.99 -1. 1. -0.99]
[ 1. -0.99 -0.99 1. ]]
corr_df = pd.DataFrame(arr, columns = data.columns, index = data.columns)
print(corr_df)
x1 x2 x3 y
x1 1.00 0.99 0.99 1.00
x2 0.99 1.00 -1.00 -0.99
x3 0.99 -1.00 1.00 -0.99
y 1.00 -0.99 -0.99 1.00
These correlations don't make much sense. Using my real data, I get a very similar result, where all correlations are rounded to -1.
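One possible explanation for partial correlations that collapse to -1 is the sum constraint mentioned in the question: if the columns add up to roughly 100%, then once all the other shares are held fixed, the remaining two are forced to move in opposite directions. The sketch below illustrates this with made-up shares; the data, the seed, and the helper name residual_partial_corr are assumptions for illustration, not the asker's data or code.

```python
import numpy as np

def residual_partial_corr(x, y, controls):
    """Correlate the residuals of x and y after regressing each on controls (with intercept)."""
    design = np.column_stack([np.ones(len(x)), controls])
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return np.corrcoef(res_x, res_y)[0, 1]

rng = np.random.default_rng(0)
# Hypothetical portfolio shares: four columns that sum to exactly 100 in every row.
shares = rng.dirichlet([2.0, 2.0, 2.0, 2.0], size=500) * 100

pairwise = np.corrcoef(shares[:, 0], shares[:, 1])[0, 1]
partial = residual_partial_corr(shares[:, 0], shares[:, 1], shares[:, 2:])

print(round(pairwise, 2))  # moderately negative
print(round(partial, 2))   # essentially -1, forced by the sum constraint
```

With only five rows, as in the toy example above, controlling for two variables plus an intercept also leaves very few degrees of freedom, so values near ±1 are not surprising there either.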
Answered by Raphael
To compute the correlation between two columns of a pandas DataFrame whilst controlling for one or more covariates (i.e. other columns in the dataframe), you can use the partial_corr function of the Pingouin package (disclaimer: I am its creator):
from pingouin import partial_corr
partial_corr(data=df, x='X', y='Y', covar=['covar1', 'covar2'], method='pearson')
Answered by Matthew
AFAIK, there is no official implementation of partial correlation in scipy / numpy. As pointed out by @J. C. Rocamonde, the function from that stats website can be used to calculate partial correlation.
I believe here's the original source:
https://gist.github.com/fabianp/9396204419c7b638d38f
Note:
As discussed on the GitHub page, you may want to add a column of ones to add a bias term to your fits if your data is not standardized (judging from your data, it's not).
If I'm not mistaken, it calculates partial correlation by controlling for all other remaining variables in the matrix. If you just want to control for one variable, you may change idx to the index of that particular variable.
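For the special case of controlling for just one variable, the partial correlation can also be written directly in terms of the three pairwise correlations: r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2)). A minimal sketch of that formula, independent of the gist's function; the name partial_corr_one_control and the synthetic data are assumptions for illustration.

```python
import numpy as np

def partial_corr_one_control(x, y, z):
    """First-order partial correlation of x and y, controlling for z only."""
    r_xy = np.corrcoef(x, y)[0, 1]
    r_xz = np.corrcoef(x, z)[0, 1]
    r_yz = np.corrcoef(y, z)[0, 1]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

rng = np.random.default_rng(1)
z = rng.normal(size=2000)
x = 2.0 * z + rng.normal(size=2000)   # x depends on z
y = -1.5 * z + rng.normal(size=2000)  # y depends on z, but not directly on x

pairwise = np.corrcoef(x, y)[0, 1]
partial = partial_corr_one_control(x, y, z)

print(round(pairwise, 2))  # strongly negative, driven entirely by z
print(round(partial, 2))   # close to 0 once z is controlled for
```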
Edit 1 (How to add ones + What to do with df):
If you look into the link, they have already discussed how to add ones.
To illustrate how it works, I added another way of using hstack, with the data given in the link:
data_int = np.hstack((np.ones((data.shape[0],1)), data))
test1 = partial_corr(data_int)[1:, 1:]
print(test1)
# You can also add it on the right, as long as you select the correct coefficients
data_int_2 = np.hstack((data, np.ones((data.shape[0],1))))
test2 = partial_corr(data_int_2)[:-1, :-1]
print(test2)
data_std = data.copy()
data_std -= data.mean(axis=0)[np.newaxis, :]
data_std /= data.std(axis=0)[np.newaxis, :]
test3 = partial_corr(data_std)
print(test3)
Output:
[[ 1. -0.54341003 -0.14076948]
[-0.54341003 1. -0.76207595]
[-0.14076948 -0.76207595 1. ]]
[[ 1. -0.54341003 -0.14076948]
[-0.54341003 1. -0.76207595]
[-0.14076948 -0.76207595 1. ]]
[[ 1. -0.54341003 -0.14076948]
[-0.54341003 1. -0.76207595]
[-0.14076948 -0.76207595 1. ]]
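All three variants above give the same numbers because each one accounts for the mean of the data, either through a column of ones or by standardizing first. A rough sketch of what can go wrong otherwise, using synthetic independent variables with nonzero means; the data and the helper name residual_corr are assumptions, not part of the gist.

```python
import numpy as np

def residual_corr(x, y, design):
    """Pearson correlation of the residuals of x and y after least-squares fits on design."""
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return np.corrcoef(res_x, res_y)[0, 1]

rng = np.random.default_rng(2)
n = 5000
# Three mutually independent variables with nonzero means.
x = 10 + rng.normal(size=n)
y = 20 + rng.normal(size=n)
z = 5 + rng.normal(size=n)

# Fit through the origin (no column of ones) versus fit with an intercept.
no_intercept = residual_corr(x, y, z[:, None])
with_intercept = residual_corr(x, y, np.column_stack([np.ones(n), z]))

print(round(no_intercept, 2))    # large and spurious
print(round(with_intercept, 2))  # close to 0, as expected for independent variables
```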
And if you want to keep the column names, the easiest way is to extract them and put them back after the calculation:
# Assume that we have a DataFrame with columns x, y, z
data_as_df = pd.DataFrame(data, columns=['x','y','z'])
data_as_array = data_as_df.values
data_with_ones = np.hstack((np.ones((data_as_array.shape[0], 1)), data_as_array))
partial_corr_array = partial_corr(data_with_ones)[1:, 1:]
corr_df = pd.DataFrame(partial_corr_array, columns = data_as_df.columns)
print(corr_df)
Output:
x y z
0 1.000 -0.543 -0.141
1 -0.543 1.000 -0.762
2 -0.141 -0.762 1.000
Hope it's helpful! Let me know if anything is unclear!
Edit 2:
I think the problem lies in not having a constant term in each of the fits... I rewrote the code using sklearn to make it easier to add an intercept:
import numpy as np
import pandas as pd
from scipy import stats
from sklearn import linear_model

def calculate_partial_correlation(input_df):
    """
    Returns the sample linear partial correlation coefficients between pairs of variables,
    controlling for all other remaining variables

    Parameters
    ----------
    input_df : pandas DataFrame, shape (n, p)
        Table with the different variables. Each column is taken as a variable.

    Returns
    -------
    P : pandas DataFrame, shape (p, p)
        P.iloc[i, j] contains the partial correlation of columns i and j of input_df,
        controlling for all other remaining variables.
    """
    partial_corr_matrix = np.zeros((input_df.shape[1], input_df.shape[1]))
    for i, column1 in enumerate(input_df):
        for j, column2 in enumerate(input_df):
            # Every column except i and j acts as a control variable
            control_variables = np.delete(np.arange(input_df.shape[1]), [i, j])
            if i == j:
                partial_corr_matrix[i, j] = 1
                continue
            data_control_variable = input_df.iloc[:, control_variables]
            data_column1 = input_df[column1].values
            data_column2 = input_df[column2].values
            fit1 = linear_model.LinearRegression(fit_intercept=True)
            fit2 = linear_model.LinearRegression(fit_intercept=True)
            fit1.fit(data_control_variable, data_column1)
            fit2.fit(data_control_variable, data_column2)
            # Residuals of each column after removing the linear effect of the controls
            residual1 = data_column1 - (np.dot(data_control_variable, fit1.coef_) + fit1.intercept_)
            residual2 = data_column2 - (np.dot(data_control_variable, fit2.coef_) + fit2.intercept_)
            partial_corr_matrix[i, j] = stats.pearsonr(residual1, residual2)[0]
    return pd.DataFrame(partial_corr_matrix, columns=input_df.columns, index=input_df.columns)
# Generating data in our minion world
test_sample = 10000
Math_score = np.random.randint(100,600, size=test_sample) + 20 * np.random.random(size=test_sample)
Eng_score = np.random.randint(100,600, size=test_sample) - 10 * Math_score + 20 * np.random.random(size=test_sample)
Phys_score = Math_score * 5 - Eng_score + np.random.randint(100,600, size=test_sample) + 20 * np.random.random(size=test_sample)
Econ_score = np.random.randint(100,200, size=test_sample) + 20 * np.random.random(size=test_sample)
Hist_score = Econ_score + 100 * np.random.random(size=test_sample)
minions_df = pd.DataFrame(np.vstack((Math_score, Eng_score, Phys_score, Econ_score, Hist_score)).T,
                          columns=['Math', 'Eng', 'Phys', 'Econ', 'Hist'])
calculate_partial_correlation(minions_df)
Output:
      Math        Eng         Phys          Econ         Hist
----  ----------  ----------  ------------  -----------  ------------
Math  1           -0.322462   0.436887      0.0104036    -0.0140536
Eng   -0.322462   1           -0.708277     0.00802087   -0.010939
Phys  0.436887    -0.708277   1             0.000340397  -0.000250916
Econ  0.0104036   0.00802087  0.000340397   1            0.721472
Hist  -0.0140536  -0.010939   -0.000250916  0.721472     1
----  ----------  ----------  ------------  -----------  ------------
Please let me know if that's not working!
Answered by seralouk
Half-line code:
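import numpy as np
X = np.random.normal(0, 1, (5, 5000))  # 5 variables stored as rows
P = np.linalg.inv(np.corrcoef(X))  # inverse correlation (precision) matrix, 5x5
d = np.sqrt(np.diag(P))
Par_corr = -P / np.outer(d, d)  # rescale by the diagonal: off-diagonals are the partial correlations
np.fill_diagonal(Par_corr, 1.0)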
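As a cross-check, the matrix-inversion route and the regression-residual route should agree once the inverse is rescaled by its diagonal, i.e. partial_corr[i, j] = -P[i, j] / sqrt(P[i, i] * P[j, j]), where P is the inverse of the correlation matrix. A small sketch under that assumption; the function names and the synthetic data are illustrative only, and variables are stored as columns here rather than rows.

```python
import numpy as np

def partial_corr_from_inverse(data):
    """Partial correlations (controlling for all other columns) via the inverse correlation matrix."""
    precision = np.linalg.inv(np.corrcoef(data, rowvar=False))
    scale = np.sqrt(np.diag(precision))
    pcorr = -precision / np.outer(scale, scale)
    np.fill_diagonal(pcorr, 1.0)
    return pcorr

def partial_corr_from_residuals(data, i, j):
    """Same quantity for one pair (i, j), from regression residuals with an intercept."""
    others = np.delete(np.arange(data.shape[1]), [i, j])
    design = np.column_stack([np.ones(data.shape[0]), data[:, others]])
    res_i = data[:, i] - design @ np.linalg.lstsq(design, data[:, i], rcond=None)[0]
    res_j = data[:, j] - design @ np.linalg.lstsq(design, data[:, j], rcond=None)[0]
    return np.corrcoef(res_i, res_j)[0, 1]

rng = np.random.default_rng(3)
# Synthetic data: 1000 observations of 4 variables, with some dependence added by hand.
data = rng.normal(size=(1000, 4))
data[:, 1] += 0.8 * data[:, 0]
data[:, 2] -= 0.5 * data[:, 1]

pcorr = partial_corr_from_inverse(data)
difference = abs(pcorr[0, 1] - partial_corr_from_residuals(data, 0, 1))
print(difference)  # expected to be at floating-point noise level
```

The inversion route computes the whole matrix in one step, while the residual route makes it explicit which variables are being controlled for in each pair.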