pandas: compute the correlation between all columns of one DataFrame and all columns of another DataFrame?

Question

Asked by Deets McGeets

I have a DataFrame object stocks filled with stock returns. I have another DataFrame object industries filled with industry returns. I want to find each stock's correlation with each industry.

import numpy as np
import pandas as pd

np.random.seed(123)

# df1 stands in for the stock returns, df2 for the industry returns
df1 = pd.DataFrame({'s1': np.random.randn(10000), 's2': np.random.randn(10000)})
df2 = pd.DataFrame({'i1': np.random.randn(10000), 'i2': np.random.randn(10000)})

The expensive way to do this is to merge the two DataFrame objects, calculate the full correlation matrix, and then throw out all the stock-to-stock and industry-to-industry correlations. Is there a more efficient way to do this?
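
For reference, the merge-then-discard approach described above might look roughly like the following. This is only a sketch that reuses the df1 and df2 sample frames from the question; the names merged, full_corr and cross_corr are made up for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df1 = pd.DataFrame({'s1': np.random.randn(10000), 's2': np.random.randn(10000)})
df2 = pd.DataFrame({'i1': np.random.randn(10000), 'i2': np.random.randn(10000)})

# Merge column-wise, then compute the full correlation matrix (4x4 here).
merged = pd.concat([df1, df2], axis=1)
full_corr = merged.corr()

# Keep only the stock-vs-industry block; the other blocks are discarded work.
cross_corr = full_corr.loc[df1.columns, df2.columns]
print(cross_corr)
```

With n stock columns and m industry columns this computes (n + m)^2 correlations to keep only n * m of them, which is the waste the question is asking about.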

Answer 1

Answered by ytsaig

And here's a one-liner that uses apply on the columns and avoids the nested for loops. The main benefit is that apply builds the result in a DataFrame.

df1.apply(lambda s: df2.corrwith(s))
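
To make the shape of that result concrete, here is a small sketch using the sample frames from the question. The variable names result and by_stock are only illustrative. Each column of df1 is passed to the lambda as a Series, and df2.corrwith returns one correlation per df2 column, so the industries end up on the index and the stocks on the columns.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df1 = pd.DataFrame({'s1': np.random.randn(10000), 's2': np.random.randn(10000)})
df2 = pd.DataFrame({'i1': np.random.randn(10000), 'i2': np.random.randn(10000)})

# Rows are df2's columns (industries), columns are df1's columns (stocks).
result = df1.apply(lambda s: df2.corrwith(s))
print(result)

# Transpose if you would rather have one row per stock.
by_stock = result.T

# Spot-check one cell against the plain Series-to-Series correlation.
assert np.isclose(result.loc['i1', 's1'], df1['s1'].corr(df2['i1']))
```

Only the cross correlations are computed here; no stock-to-stock or industry-to-industry values are produced and thrown away.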

Answer 2

Answered by failwhale

Here's a slightly simpler answer than JohnE's that uses pandas natively instead of using numpy.corrcoef. As an added bonus, you don't have to retrieve the correlation value out of a silly 2x2 correlation matrix, because pandas's series-to-series correlation function simply returns a number, not a matrix.

for s in ['s1', 's2']:
    for i in ['i1', 'i2']:
        print(df1[s].corr(df2[i]))
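
The difference in return types mentioned above can be seen side by side in the following sketch, again assuming the df1 and df2 sample frames from the question. The names r_pandas, r_matrix and r_numpy are hypothetical.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df1 = pd.DataFrame({'s1': np.random.randn(10000), 's2': np.random.randn(10000)})
df2 = pd.DataFrame({'i1': np.random.randn(10000), 'i2': np.random.randn(10000)})

# pandas: Series.corr returns a single scalar.
r_pandas = df1['s1'].corr(df2['i1'])

# numpy: corrcoef returns a 2x2 matrix; the cross term sits at [0, 1].
r_matrix = np.corrcoef(df1['s1'], df2['i1'])
r_numpy = r_matrix[0, 1]

print(r_pandas, r_matrix.shape)
assert np.isclose(r_pandas, r_numpy)
```

The two agree on this data because it has no missing values. With NaNs present they can differ, since Series.corr drops incomplete pairs while np.corrcoef does not.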

Answer 3

Answered by JohnE

(Edit to add: Instead of this answer please check out @yt's answer which was added later but is clearly better.)

You could go with numpy.corrcoef(), which is basically the same as corr in pandas, but the syntax may be more amenable to what you want.

for s in ['s1','s2']:
    for i in ['i1','i2']:
        print( 'corrcoef',s,i,np.corrcoef(df1[s],df2[i])[0,1] )

That prints:

corrcoef s1 i1 -0.00416977553597
corrcoef s1 i2 -0.0096393047035
corrcoef s2 i1 -0.026278689352
corrcoef s2 i2 -0.00402030582064

Alternatively, you could load the results into a DataFrame with appropriate labels:

rows = {}
for s in ['s1','s2']:
    for i in ['i1','i2']:
        # DataFrame.append was removed in pandas 2.0, so collect the values first
        rows[s+'_'+i] = np.corrcoef(df1[s],df2[i])[0,1]
cc = pd.Series(rows, name='corrcoef').to_frame()

Which looks like this:

       corrcoef
s1_i1 -0.004170
s1_i2 -0.009639
s2_i1 -0.026279
s2_i2 -0.004020
