Correlation between columns in DataFrame

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/15854878/

Date: 2020-08-18 21:10:39  Source: igfitidea

Tags: python, pandas

Asked by Zach Moshe

I'm pretty new to pandas, so I guess I'm doing something wrong -

I have a DataFrame:

     a     b
0  0.5  0.75
1  0.5  0.75
2  0.5  0.75
3  0.5  0.75
4  0.5  0.75

df.corr() gives me:

    a   b
a NaN NaN
b NaN NaN

but np.correlate(df["a"], df["b"]) gives: 1.875

Why is that? I want to have the correlation matrix for my DataFrame and thought that corr() does that (at least according to the documentation). Why does it return NaN?

What's the correct way to calculate?

Many thanks!

Accepted answer by unutbu

np.correlate calculates the (unnormalized) cross-correlation between two 1-dimensional sequences:

z[k] = sum_n a[n] * conj(v[n+k])

while df.corr (by default) calculates the Pearson correlation coefficient.

The correlation coefficient (if it exists) is always between -1 and 1 inclusive. The cross-correlation is not bounded.

The formulas are somewhat related, but notice that in the cross-correlation formula (above) there is no subtraction of the means and no division by the standard deviations, both of which are part of the formula for the Pearson correlation coefficient.
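To make the difference concrete, here is a minimal sketch on a small made-up pair of arrays (the values of `a` and `b` are illustrative only, not taken from the question). It computes the zero-lag cross-correlation with `np.correlate`, builds the Pearson coefficient by hand from the mean-subtracted values, and then rescales one input to show that only the cross-correlation changes.

```python
import numpy as np

# Illustrative data (an assumption for this sketch, not from the question).
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 1.0, 4.0, 3.0])

# Unnormalized cross-correlation at zero lag: a plain sum of products.
# 1*2 + 2*1 + 3*4 + 4*3 = 28
cross = np.correlate(a, b)[0]

# Pearson: subtract the means, then divide by the product of the
# (unscaled) standard deviations.
da = a - a.mean()
db = b - b.mean()
pearson = (da * db).sum() / np.sqrt((da ** 2).sum() * (db ** 2).sum())

# Rescaling one input changes the cross-correlation but not Pearson.
cross_scaled = np.correlate(1000 * a, b)[0]
pearson_scaled = np.corrcoef(1000 * a, b)[0, 1]

print(cross, pearson)
print(cross_scaled, pearson_scaled)
```

The hand-computed `pearson` should agree with `np.corrcoef(a, b)[0, 1]`, which is a convenient cross-check.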

The fact that the standard deviation of df['a'] and df['b'] is zero is what causes df.corr to be NaN everywhere.
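A short sketch that rebuilds the DataFrame from the question shows this directly. Both columns are constant, so both standard deviations are zero and the Pearson formula divides zero by zero, while `np.correlate` simply sums the products.

```python
import numpy as np
import pandas as pd

# The DataFrame from the question: two constant columns.
df = pd.DataFrame({"a": [0.5] * 5, "b": [0.75] * 5})

stds = df.std()    # both standard deviations are 0.0
corr = df.corr()   # Pearson is undefined here, so every entry is NaN

# Sum of products: 5 * (0.5 * 0.75) = 1.875
cross = np.correlate(df["a"].to_numpy(), df["b"].to_numpy())

print(stds)
print(corr)
print(cross)
```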


From the comment below, it sounds like you are looking for Beta. It is related to Pearson's correlation coefficient, but instead of dividing by the product of standard deviations:

corr(a, b) = cov(a, b) / (std(a) * std(b))

you divide by a variance:

beta = cov(a, b) / var(a)



You can compute Beta using np.cov:

cov = np.cov(a, b)
beta = cov[1, 0] / cov[0, 0]


import numpy as np
import matplotlib.pyplot as plt
np.random.seed(100)


def geometric_brownian_motion(T=1, N=100, mu=0.1, sigma=0.01, S0=20):
    """
    http://stackoverflow.com/a/13203189/190597 (unutbu)
    """
    dt = float(T) / N
    t = np.linspace(0, T, N)
    W = np.random.standard_normal(size=N)
    W = np.cumsum(W) * np.sqrt(dt)  # standard Brownian motion
    X = (mu - 0.5 * sigma ** 2) * t + sigma * W
    S = S0 * np.exp(X)  # geometric Brownian motion
    return S

N = 10 ** 6
a = geometric_brownian_motion(T=1, mu=0.1, sigma=0.01, N=N)
b = geometric_brownian_motion(T=1, mu=0.2, sigma=0.01, N=N)

cov = np.cov(a, b)
print(cov)
# [[ 0.38234755  0.80525967]
#  [ 0.80525967  1.73517501]]
beta = cov[1, 0] / cov[0, 0]
print(beta)
# 2.10609347015

plt.plot(a)
plt.plot(b)
plt.show()

[Figure: plot of the two simulated series a and b]

The ratio of the mus is 2, and beta is ~2.1.


You could also compute it with df.corr, though this is a much more roundabout way of doing it (but it is nice to see that the results are consistent):

import pandas as pd
df = pd.DataFrame({'a': a, 'b': b})
# corr(a, b) * std(a) * std(b) recovers cov(a, b); dividing by var(a) gives beta.
beta2 = (df.corr() * df['b'].std() * df['a'].std() / df['a'].var()).iloc[0, 1]
print(beta2)
# 2.10609347015
assert np.allclose(beta, beta2)
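
As one more consistency check, cov(a, b) / var(a) is also the slope of an ordinary least-squares line fitted to b as a function of a. The sketch below is not part of the original answer. It uses small synthetic data (the generator seed, the true slope of 2.0 and the noise level are all assumptions chosen for illustration) and compares Beta computed from `df.cov()` with the slope returned by `np.polyfit`.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only: b is roughly 2 * a plus small noise.
rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = 2.0 * a + rng.normal(scale=0.1, size=1000)
df = pd.DataFrame({"a": a, "b": b})

# Beta from the covariance matrix: cov(a, b) / var(a).
beta_cov = df.cov().loc["a", "b"] / df["a"].var()

# Slope of the least-squares line b = slope * a + intercept.
slope, intercept = np.polyfit(df["a"].to_numpy(), df["b"].to_numpy(), 1)

print(beta_cov, slope)
```

The two numbers should match up to floating-point error, since the degrees-of-freedom factor cancels in the ratio.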