Correlation between columns in DataFrame
Notice: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/15854878/
Asked by Zach Moshe
I'm pretty new to pandas, so I guess I'm doing something wrong.
I have a DataFrame:
     a     b
0  0.5  0.75
1  0.5  0.75
2  0.5  0.75
3  0.5  0.75
4  0.5  0.75
df.corr() gives me:
    a   b
a NaN NaN
b NaN NaN
but np.correlate(df["a"], df["b"]) gives: 1.875
Why is that?
I want the correlation matrix for my DataFrame and thought that corr() does that (at least according to the documentation). Why does it return NaN?
What's the correct way to calculate it?
Many thanks!
Accepted answer by unutbu
np.correlate calculates the (unnormalized) cross-correlation between two 1-dimensional sequences:
z[k] = sum_n a[n] * conj(v[n+k])
while df.corr (by default) calculates the Pearson correlation coefficient.
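As a quick check of the number in the question, here is a minimal sketch, assuming the default mode of np.correlate ("valid"). For two inputs of equal length only one lag fits, so the result is a single value: the plain sum of element-wise products. The arrays below just mirror the question's two columns.

```python
import numpy as np

# Same values as the two columns in the question.
a = np.full(5, 0.5)
b = np.full(5, 0.75)

# Default mode with equal-length inputs: one lag, one output value.
cross = np.correlate(a, b)

# The same number computed by hand: sum of element-wise products.
manual = np.sum(a * b)

print(cross)
print(manual)
```

Nothing here subtracts a mean or divides by a spread, which is why the result is not confined to [-1, 1].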
The correlation coefficient (if it exists) is always between -1 and 1 inclusive. The cross-correlation is not bounded.
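The difference in boundedness can be seen by rescaling the inputs. The sketch below uses made-up random data (the names x, y and the scale factors are assumptions for illustration only): the Pearson coefficient stays the same under rescaling, while the unnormalized cross-correlation grows with the square of the scale.

```python
import numpy as np

# Hypothetical data: y depends linearly on x plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(size=1000)

pearson_values = []
cross_values = []
for scale in (1, 10, 100):
    # Pearson coefficient of the rescaled series.
    pearson_values.append(np.corrcoef(scale * x, scale * y)[0, 1])
    # Unnormalized cross-correlation at the single "valid" lag.
    cross_values.append(np.correlate(scale * x, scale * y)[0])

print(pearson_values)
print(cross_values)
```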
The formulas are related, but notice that the cross-correlation formula (above) involves no subtraction of the means and no division by the standard deviations, both of which are part of the formula for the Pearson correlation coefficient.
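To make the relationship concrete, here is a rough sketch on made-up data (x and y are assumptions, not the question's columns). Subtracting the means first, then dividing the cross-correlation by n times the two population standard deviations, should reproduce the value given by np.corrcoef.

```python
import numpy as np

# Hypothetical data with a clear linear relationship.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)

# Step 1: subtract the means.
xc = x - x.mean()
yc = y - y.mean()

# Step 2: cross-correlate, then divide by n and by both standard
# deviations (NumPy's default std uses ddof=0, matching the n here).
pearson_manual = np.correlate(xc, yc)[0] / (len(x) * x.std() * y.std())

# Reference value from NumPy.
pearson_numpy = np.corrcoef(x, y)[0, 1]

print(pearson_manual, pearson_numpy)
```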
The standard deviations of df['a'] and df['b'] are both zero, so the Pearson formula ends up dividing zero by zero. That is what causes df.corr to be NaN everywhere.
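The following sketch reproduces the question's DataFrame and checks the zero standard deviations directly. The second DataFrame (df2) is hypothetical data added only for contrast, to show that df.corr returns ordinary numbers once the columns vary.

```python
import pandas as pd

# The DataFrame from the question: both columns are constant.
df = pd.DataFrame({"a": [0.5] * 5, "b": [0.75] * 5})

print(df.std())   # both standard deviations are zero
print(df.corr())  # every entry is NaN

# Hypothetical data with some variation, for contrast.
df2 = pd.DataFrame({"a": [0.5, 0.6, 0.4, 0.7, 0.5],
                    "b": [0.75, 0.80, 0.70, 0.90, 0.72]})
print(df2.corr())
```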
From the comment below, it sounds like you are looking for Beta. It is related to Pearson's correlation coefficient, but instead of dividing by the product of standard deviations:
corr(a, b) = cov(a, b) / (std(a) * std(b))
you divide by a variance:
beta = cov(a, b) / var(a)
You can compute Beta using np.cov:
cov = np.cov(a, b)
beta = cov[1, 0] / cov[0, 0]
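One way to sanity-check this ratio: covariance divided by the variance of a is also the slope of an ordinary least-squares line fitting b against a. The sketch below uses made-up data with a known slope of 2 (the data and the use of np.polyfit are assumptions for illustration, not part of the original answer).

```python
import numpy as np

# Hypothetical data: b is roughly 2 * a + 1 with a little noise.
rng = np.random.default_rng(2)
a = rng.normal(size=500)
b = 2.0 * a + 1.0 + rng.normal(scale=0.1, size=500)

# Beta from the covariance matrix, as in the snippet above.
cov = np.cov(a, b)
beta = cov[1, 0] / cov[0, 0]

# Slope of the least-squares line b ~ slope * a + intercept.
slope, intercept = np.polyfit(a, b, 1)

print(beta, slope)
```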
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(100)

def geometric_brownian_motion(T=1, N=100, mu=0.1, sigma=0.01, S0=20):
    """
    http://stackoverflow.com/a/13203189/190597 (unutbu)
    """
    dt = float(T) / N
    t = np.linspace(0, T, N)
    W = np.random.standard_normal(size=N)
    W = np.cumsum(W) * np.sqrt(dt)  # standard Brownian motion
    X = (mu - 0.5 * sigma ** 2) * t + sigma * W
    S = S0 * np.exp(X)  # geometric Brownian motion
    return S

N = 10 ** 6
a = geometric_brownian_motion(T=1, mu=0.1, sigma=0.01, N=N)
b = geometric_brownian_motion(T=1, mu=0.2, sigma=0.01, N=N)

cov = np.cov(a, b)
print(cov)
# [[ 0.38234755 0.80525967]
# [ 0.80525967 1.73517501]]

beta = cov[1, 0] / cov[0, 0]
print(beta)
# 2.10609347015

plt.plot(a)
plt.plot(b)
plt.show()


The ratio of the mus is 2, and beta is ~2.1.
You could also compute it with df.corr, though this is a much more roundabout way of doing it (but it is nice to see that the results are consistent):
import pandas as pd

df = pd.DataFrame({'a': a, 'b': b})
# .ix was removed from pandas; .iloc does the same positional lookup here
beta2 = (df.corr() * df['b'].std() * df['a'].std() / df['a'].var()).iloc[0, 1]
print(beta2)
# 2.10609347015
assert np.allclose(beta, beta2)
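If the data already lives in a DataFrame, a more direct route is Series.cov together with Series.var. This is a sketch on made-up data (the arrays below are assumptions, not the simulated paths above); it compares that route with the np.cov and correlation-based versions.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a slope of about 2.
rng = np.random.default_rng(3)
a = rng.normal(size=1000)
b = 2.0 * a + rng.normal(size=1000)
df = pd.DataFrame({"a": a, "b": b})

# Beta from NumPy's covariance matrix.
cov = np.cov(a, b)
beta_np = cov[1, 0] / cov[0, 0]

# Beta from pandas: covariance of the two columns over the variance of a.
beta_pd = df["a"].cov(df["b"]) / df["a"].var()

# Beta recovered from the Pearson coefficient: r * std(b) / std(a).
beta_corr = df["a"].corr(df["b"]) * df["b"].std() / df["a"].std()

print(beta_np, beta_pd, beta_corr)
```

All three use the sample (n - 1) normalization by default, so they should agree up to floating-point error.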

