Notice: this page is a translated copy of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/38641691/

Time: 2020-09-14 01:41:22  Source: igfitidea

Weighted correlation coefficient with pandas

Tags: python, pandas, correlation, pearson-correlation

Asked by Yehuda Karlinsky

Is there any way to compute a weighted correlation coefficient with pandas? I saw that R has such a method. I'd also like to get the p-value of the correlation, which I could not find in R either. Wikipedia explains weighted correlation here: https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#Weighted_correlation_coefficient

Answered by root

I don't know of any Python packages that implement this, but it should be fairly straightforward to roll your own implementation. Using the naming conventions of the Wikipedia article:

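import numpy as np

def m(x, w):
    """Weighted Mean"""
    return np.sum(x * w) / np.sum(w)

def cov(x, y, w):
    """Weighted Covariance"""
    # Weighted average of the products of deviations from the weighted means.
    return np.sum(w * (x - m(x, w)) * (y - m(y, w))) / np.sum(w)

def corr(x, y, w):
    """Weighted Correlation"""
    # cov(x, x, w) and cov(y, y, w) are the weighted variances of x and y.
    return cov(x, y, w) / np.sqrt(cov(x, x, w) * cov(y, y, w))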

I tried to make the functions above match the formulas in the Wikipedia article as closely as possible, but there are some potential simplifications and performance improvements. For example, as pointed out by @Alberto Garcia-Raboso, m(x, w) is really just np.average(x, weights=w), so there's no need to actually write a function for it.
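As a rough illustration of the simplification mentioned above, the sketch below leans on NumPy's built-ins instead of hand-written helpers. The name weighted_corr and the sample arrays are made up for this example; it assumes the weights are non-negative, per-observation weights, which is what np.cov's aweights argument expects.

```python
import numpy as np

def weighted_corr(x, y, w):
    """Weighted Pearson correlation via np.cov with observation weights."""
    # aweights assigns one relative weight per observation; ddof=0 divides
    # by sum(w), matching the weighted covariance formula. That
    # normalization cancels in the ratio below in any case.
    c = np.cov(x, y, aweights=w, ddof=0)
    return c[0, 1] / np.sqrt(c[0, 0] * c[1, 1])

# Illustrative data, not from the original answer.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 1.0, 4.0, 3.0])
w = np.array([0.5, 1.0, 2.0, 1.5])

# The hand-written weighted mean agrees with np.average.
assert np.isclose(np.sum(x * w) / np.sum(w), np.average(x, weights=w))
print(weighted_corr(x, y, w))
```

This should give the same value as the explicit functions, while letting NumPy handle the centering and summation.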

The functions are pretty bare-bones, just doing the calculations. You may want to consider forcing inputs to be arrays prior to doing the calculations, i.e. x = np.asarray(x), as these functions will not work if lists are passed. Additional checks to verify all inputs have equal length, non-null values, etc. could also be implemented.
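The checks described above could look something like the following sketch. The helper names checked_inputs and safe_corr are hypothetical, and the specific rules (one-dimensional inputs, no NaN, non-negative weights) are assumptions about what "valid" should mean here; adjust them to your data.

```python
import numpy as np

def checked_inputs(x, y, w):
    """Coerce inputs to float arrays and validate them (hypothetical helper)."""
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, w))
    if not (x.ndim == y.ndim == w.ndim == 1):
        raise ValueError("x, y and w must be one-dimensional")
    if not (x.shape == y.shape == w.shape):
        raise ValueError("x, y and w must have the same length")
    if np.isnan(x).any() or np.isnan(y).any() or np.isnan(w).any():
        raise ValueError("inputs must not contain NaN")
    if (w < 0).any() or w.sum() == 0:
        raise ValueError("weights must be non-negative and not all zero")
    return x, y, w

def safe_corr(x, y, w):
    """Weighted correlation that also accepts plain Python lists."""
    x, y, w = checked_inputs(x, y, w)
    dx = x - np.average(x, weights=w)  # deviations from the weighted mean
    dy = y - np.average(y, weights=w)
    return np.sum(w * dx * dy) / np.sqrt(np.sum(w * dx**2) * np.sum(w * dy**2))

# Plain lists now work because of the np.asarray coercion.
print(safe_corr([1, 2, 3, 4], [2, 4, 6, 8], [1, 1, 2, 2]))
```

Raising ValueError early tends to give clearer messages than letting a shape mismatch or a NaN propagate silently into the result.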

Example usage:

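import numpy as np
import pandas as pd

# Initialize a DataFrame with one million rows.
np.random.seed([3, 1415])
n = 10**6
df = pd.DataFrame({
    'x': np.random.choice(3, size=n),   # integers 0..2
    'y': np.random.choice(4, size=n),   # integers 0..3
    'w': np.random.random(size=n)       # weights in [0, 1)
    })

# Compute the correlation between x and y, weighted by w
# (uses corr as defined above).
r = corr(df['x'], df['y'], df['w'])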
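Two quick sanity checks can help confirm that an implementation like this behaves sensibly. The sketch below redefines corr in a compact form so it runs on its own, and uses a smaller random DataFrame than the example above; the data and seed are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

def corr(x, y, w):
    """Weighted correlation, same formula as the functions above."""
    dx = x - np.average(x, weights=w)
    dy = y - np.average(y, weights=w)
    return np.sum(w * dx * dy) / np.sqrt(np.sum(w * dx**2) * np.sum(w * dy**2))

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'x': rng.normal(size=1000),
    'y': rng.normal(size=1000),
    'w': rng.random(size=1000),
})

# Check 1: equal weights should reproduce the ordinary Pearson correlation.
ones = np.ones(len(df))
unweighted = corr(df['x'], df['y'], ones)
print(np.isclose(unweighted, df['x'].corr(df['y'])))

# Check 2: multiplying all weights by a constant should not change the result.
r = corr(df['x'], df['y'], df['w'])
print(np.isclose(r, corr(df['x'], df['y'], 10 * df['w'])))
```

Both comparisons use np.isclose rather than exact equality, since the two computations round differently in floating point.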

There's a discussion here regarding the p-value. There doesn't appear to be a generic calculation; it depends on how you're actually getting the weights.
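One possibility, which the answer itself does not propose, is a permutation test. The sketch below is only that: the helper permutation_pvalue and its parameters are hypothetical, and it assumes that under the null hypothesis y is exchangeable across observations while each (x, w) pair stays fixed. Whether that assumption is reasonable depends on where the weights come from, as noted above.

```python
import numpy as np

def corr(x, y, w):
    """Weighted correlation, same formula as the functions above."""
    dx = x - np.average(x, weights=w)
    dy = y - np.average(y, weights=w)
    return np.sum(w * dx * dy) / np.sqrt(np.sum(w * dx**2) * np.sum(w * dy**2))

def permutation_pvalue(x, y, w, n_perm=999, seed=0):
    """Two-sided permutation p-value (hypothetical helper, not from the answer)."""
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, w))
    rng = np.random.default_rng(seed)
    observed = abs(corr(x, y, w))
    # Shuffle y while keeping each (x, w) pair intact, then count how often
    # the shuffled |r| is at least as large as the observed |r|.
    hits = sum(abs(corr(x, rng.permutation(y), w)) >= observed
               for _ in range(n_perm))
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (n_perm + 1)

# Illustrative data with a strong linear relationship.
rng = np.random.default_rng(1)
x = np.arange(50.0)
y = 2 * x + rng.normal(size=50)
w = rng.random(size=50)
print(permutation_pvalue(x, y, w))
```

The smallest p-value this can report is 1 / (n_perm + 1), so increase n_perm if finer resolution is needed.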
