Correlation between columns in a Python DataFrame

Question

Asked by Zach Moshe

I'm pretty new to pandas, so I guess I'm doing something wrong -

I have a DataFrame:

     a     b
0  0.5  0.75
1  0.5  0.75
2  0.5  0.75
3  0.5  0.75
4  0.5  0.75

df.corr() gives me:

    a   b
a NaN NaN
b NaN NaN

but np.correlate(df["a"], df["b"]) gives: 1.875

Why is that? I want the correlation matrix for my DataFrame and thought that corr() does that (at least according to the documentation). Why does it return NaN?

What's the correct way to calculate?

Many thanks!

Answer 1

Accepted answer by unutbu

np.correlate calculates the (unnormalized) cross-correlation between two 1-dimensional sequences:

z[k] = sum_n a[n] * conj(v[n+k])

while df.corr (by default) calculates the Pearson correlation coefficient.

The correlation coefficient (if it exists) is always between -1 and 1 inclusive. The cross-correlation is not bounded.

The formulas are somewhat related, but notice that the cross-correlation formula (above) involves no subtraction of the means and no division by the standard deviations, both of which are part of the formula for the Pearson correlation coefficient.

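To make the difference concrete, here is a minimal sketch that computes both quantities by hand. The arrays a and b are made-up illustration data, not values from the question.

```python
import numpy as np

# Made-up illustration data (an assumption, not from the question).
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

# Unnormalized cross-correlation at zero lag: just the sum of products.
cross = np.correlate(a, b)[0]

# Pearson: subtract the means, then divide by the standard deviations.
a_centered = a - a.mean()
b_centered = b - b.mean()
pearson = (a_centered * b_centered).mean() / (a.std() * b.std())

print(cross)    # same value as np.dot(a, b)
print(pearson)  # same value as np.corrcoef(a, b)[0, 1]
```

The first number scales with the magnitude of the data, while the second always lies in [-1, 1].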

The fact that the standard deviations of df['a'] and df['b'] are zero is what causes df.corr to be NaN everywhere: the Pearson formula divides by their product, so every entry is 0/0.

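A short sketch that rebuilds the DataFrame from the question shows both effects side by side (the variable names stds, corr and cross are only illustrative):

```python
import numpy as np
import pandas as pd

# The constant-valued DataFrame from the question.
df = pd.DataFrame({"a": [0.5] * 5, "b": [0.75] * 5})

# Both columns are constant, so their standard deviations are zero.
stds = df.std()

# Pearson correlation divides by std(a) * std(b), which is 0 / 0 here.
corr = df.corr()

# np.correlate only sums the products: 5 * 0.5 * 0.75.
cross = np.correlate(df["a"], df["b"])[0]

print(stds)
print(corr)
print(cross)
```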

From the comment below, it sounds like you are looking for Beta. It is related to Pearson's correlation coefficient, but instead of dividing by the product of standard deviations:

corr(a, b) = cov(a, b) / (std(a) * std(b))

you divide by a variance:

beta = cov(a, b) / var(a)

You can compute Beta using np.cov:

cov = np.cov(a, b)
beta = cov[1, 0] / cov[0, 0]

import numpy as np
import matplotlib.pyplot as plt
np.random.seed(100)


def geometric_brownian_motion(T=1, N=100, mu=0.1, sigma=0.01, S0=20):
    """
    http://stackoverflow.com/a/13203189/190597 (unutbu)
    """
    dt = float(T) / N
    t = np.linspace(0, T, N)
    W = np.random.standard_normal(size=N)
    W = np.cumsum(W) * np.sqrt(dt)  # standard brownian motion ###
    X = (mu - 0.5 * sigma ** 2) * t + sigma * W
    S = S0 * np.exp(X)  # geometric brownian motion ###
    return S

N = 10 ** 6
a = geometric_brownian_motion(T=1, mu=0.1, sigma=0.01, N=N)
b = geometric_brownian_motion(T=1, mu=0.2, sigma=0.01, N=N)

cov = np.cov(a, b)
print(cov)
# [[ 0.38234755  0.80525967]
#  [ 0.80525967  1.73517501]]
beta = cov[1, 0] / cov[0, 0]
print(beta)
# 2.10609347015

plt.plot(a)
plt.plot(b)
plt.show()

The ratio of the mus is 2, and beta is ~2.1.

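As a sanity check on this definition, cov(a, b) / var(a) is also the slope of the ordinary least-squares line of b against a. The sketch below illustrates that on synthetic data; the data, the seed and the true slope of 2.0 are assumptions chosen for the example, not part of the answer above.

```python
import numpy as np

# Synthetic data (an assumption): b is roughly 2 * a plus noise.
rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = 2.0 * a + rng.normal(scale=0.5, size=1000)

# Beta from the covariance matrix, as in the answer.
cov = np.cov(a, b)
beta = cov[1, 0] / cov[0, 0]

# Slope of the degree-1 least-squares fit of b against a.
slope, intercept = np.polyfit(a, b, 1)

print(beta, slope)
```

The two printed values should agree up to floating-point error.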

You could also compute it with df.corr, though this is a much more roundabout way of doing it (but it is nice to see that the results are consistent):

import pandas as pd
df = pd.DataFrame({'a': a, 'b': b})
beta2 = (df.corr() * df['b'].std() * df['a'].std() / df['a'].var()).iloc[0, 1]
print(beta2)
# 2.10609347015
assert np.allclose(beta, beta2)
