Pandas Correlation Between List of Columns X Whole Dataframe

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/45487145/

Time: 2020-09-14 04:10:22  Source: igfitidea

Tags: python, pandas, data-visualization, data-science

Asked by julianstanley

I'm looking for help with the Pandas .corr() method.

As is, I can use the .corr() method to compute the correlation between every possible pair of columns and plot the result as a heatmap:

corr = data.corr()
sns.heatmap(corr)

Which, on my dataframe of 23,000 columns, may terminate near the heat death of the universe.

I can also compute the more reasonable correlation between a subset of columns:

data2 = data[list_of_column_names]
corr = data2.corr(method="pearson")
sns.heatmap(corr)

That gives me something that I can use. Here's an example of what that looks like: [image: Example Heatmap]

What I would like to do is compare a list of 20 columns with the whole dataset. The normal .corr() function can give me a 20x20 or 23,000x23,000 heatmap, but essentially I would like a 20x23,000 heatmap.

How can I add more specificity to my correlations?

Thanks for the help!

Answered by Andrew

Make a list of the subset that you want (in this example it is A, B, and C), create an empty dataframe, then fill it with the desired values using a nested loop.

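import numpy as np
import pandas as pd
import seaborn as sns

df = pd.DataFrame(np.random.randn(50, 7), columns=list('ABCDEFG'))

# compute the full correlation matrix once, outside the loops
full_corr = df.corr()

# initiate empty dataframe
corr = pd.DataFrame()
for a in list('ABC'):
    for b in list(df.columns.values):
        corr.loc[a, b] = full_corr.loc[a, b]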

corr
Out[137]: 
          A         B         C         D         E         F         G
A  1.000000  0.183584 -0.175979 -0.087252 -0.060680 -0.209692 -0.294573
B  0.183584  1.000000  0.119418  0.254775 -0.131564 -0.226491 -0.202978
C -0.175979  0.119418  1.000000  0.146807 -0.045952 -0.037082 -0.204993

sns.heatmap(corr)
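
As a possible shortcut, the same 3 x 7 table can be taken directly from the full matrix by selecting rows, with no nested loop. The sketch below assumes random data like the example above; the `subset` name and the seeded generator are illustrative and not part of the original answer. Note that it still computes the full correlation matrix first, so on its own it would not avoid the cost of a 23,000-column `.corr()` call.

```python
import numpy as np
import pandas as pd

# Random example data, similar in shape to the answer above (seeded for repeatability)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((50, 7)), columns=list("ABCDEFG"))

subset = ["A", "B", "C"]  # hypothetical subset of interest

# Compute the full matrix once, then keep only the rows for the subset
corr_subset = df.corr().loc[subset]

print(corr_subset.shape)
```

The result has one row per subset column and one column per dataframe column, so it can be passed to sns.heatmap in the same way.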

Answered by julianstanley

After working through this last night, I came to the following answer:

import pandas as pd
import scipy.stats
import seaborn as sns

# Data table imported earlier as 'data'; list_1 and list_2 hold the column names to compare
# Create a new dictionary
plotDict = {}
# Loop across each of the two lists that contain the items you want to compare
for gene1 in list_1:
    for gene2 in list_2:
        # Do a Pearson r comparison between the two items; the result is an (r, p-value) pair
        tempDict = {(gene1, gene2): scipy.stats.pearsonr(data[gene1], data[gene2])}
        # Update the dictionary each time you do a comparison
        plotDict.update(tempDict)
# Unstack the dictionary into a DataFrame
dfOutput = pd.Series(plotDict).unstack()
# Optional: take just the Pearson r value out of each output tuple
dfOutputPearson = dfOutput.apply(lambda col: col.apply(lambda pair: pair[0]))
# Optional: generate a heatmap
sns.heatmap(dfOutputPearson)

Much like the other answers, this generates a heatmap (see below), but it scales to a 20,000x30 matrix without computing the correlations for all 20,000x20,000 combinations, so it finishes much sooner.
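
The same subset-versus-everything idea can also be sketched without Python-level loops, using the fact that Pearson's r is the mean of the products of standardized values. The `cross_corr` helper below is hypothetical (it is not a pandas or scipy function), and it assumes the data has no missing values and no constant columns, since either would produce NaN.

```python
import numpy as np
import pandas as pd


def cross_corr(data: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Pearson correlation of data[cols] against every column of data.

    Hypothetical helper: assumes no NaN values and no constant columns,
    and that every name in cols is a column of data.
    """
    values = data.to_numpy(dtype=float)
    # Standardize each column: zero mean, unit (population) standard deviation
    z = (values - values.mean(axis=0)) / values.std(axis=0)
    # Positions of the requested columns within the dataframe
    z_sub = z[:, data.columns.get_indexer(cols)]
    # The mean of products of standardized values is Pearson's r
    r = z_sub.T @ z / len(data)
    return pd.DataFrame(r, index=cols, columns=data.columns)


# Small random example (seeded for repeatability)
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.standard_normal((50, 7)), columns=list("ABCDEFG"))
result = cross_corr(data, ["A", "B", "C"])
print(result.shape)
```

This only ever builds a len(cols) x n_columns matrix, so the 23,000x23,000 table is never materialized. Unlike scipy.stats.pearsonr, it returns no p-values.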

HeatMap Final

Answered by Marcel Flygare

Usually, calculating correlation coefficients pairwise for all variables makes the most sense. DataFrame.corr() is a convenience function that calculates the correlation coefficient for every pair of columns. With scipy, you can also calculate it only for specified pairs inside a loop.

Example:

d=pd.DataFrame([[1,5,8],[2,5,4],[7,3,1]], columns=['A','B','C'])

One pair in pandas could be:

d.corr().loc['A','B']

-0.98782916114726194

Equivalent in scipy:

import scipy.stats
scipy.stats.pearsonr(d['A'].values,d['B'].values)[0]

-0.98782916114726194
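
For one column against all others, pandas also offers DataFrame.corrwith, which may be a convenient middle ground between the full matrix and a scipy loop. A minimal sketch using the same example dataframe (the `a_vs_all` name is illustrative):

```python
import pandas as pd

d = pd.DataFrame([[1, 5, 8], [2, 5, 4], [7, 3, 1]], columns=["A", "B", "C"])

# Correlate column A with every column of d (including A itself)
a_vs_all = d.corrwith(d["A"])
print(a_vs_all)
```

The entry for B matches the pairwise value shown above, and calling corrwith once per column of interest would give one row of a subset-versus-everything table.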
