pandas: correlation between a list of columns and the whole DataFrame

Question

Asked by julianstanley

I'm looking for help with the Pandas .corr() method.

As is, I can use the .corr() method to compute the correlation of every possible pair of columns and plot the result as a heatmap:

corr = data.corr()
sns.heatmap(corr)

Which, on my dataframe of 23,000 columns, may terminate near the heat death of the universe.

I can also compute the more reasonable correlation among a subset of columns:

data2 = data[list_of_column_names]
corr = data2.corr(method="pearson")
sns.heatmap(corr)

That gives me something that I can use.

What I would like to do is compare a list of 20 columns with the whole dataset. The normal .corr() function can give me a 20x20 or 23,000x23,000 heatmap, but essentially I would like a 20x23,000 heatmap.

How can I add more specificity to my correlations?

Thanks for the help!

Answer 1

Answered by Andrew

Make a list of the subset that you want (in this example it is A, B, and C), create an empty dataframe, then fill it with the desired values using a nested loop.

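import numpy as np
import pandas as pd
import seaborn as sns

df = pd.DataFrame(np.random.randn(50, 7), columns=list('ABCDEFG'))
# compute the full correlation matrix once instead of on every loop iteration
full_corr = df.corr()

# initiate empty dataframe
corr = pd.DataFrame()
for a in list('ABC'):
    for b in list(df.columns.values):
        corr.loc[a, b] = full_corr.loc[a, b]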

corr
Out[137]: 
          A         B         C         D         E         F         G
A  1.000000  0.183584 -0.175979 -0.087252 -0.060680 -0.209692 -0.294573
B  0.183584  1.000000  0.119418  0.254775 -0.131564 -0.226491 -0.202978
C -0.175979  0.119418  1.000000  0.146807 -0.045952 -0.037082 -0.204993

sns.heatmap(corr)
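
As a hedged sketch, the rows filled in by the nested loop can also be taken from the correlation matrix with a single .loc selection. The name `subset` and the seeded random generator are assumptions made for illustration. Note that this still computes the full correlation matrix first, so it only suits frames with a modest number of columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen only to make the sketch reproducible
df = pd.DataFrame(rng.standard_normal((50, 7)), columns=list('ABCDEFG'))

subset = ['A', 'B', 'C']  # hypothetical list of columns of interest

# Full 7x7 correlation matrix, then keep only the rows for the subset
corr_subset = df.corr().loc[subset]

print(corr_subset.shape)  # (3, 7)
```

The resulting 3x7 frame can be passed to sns.heatmap in the same way as `corr`.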

Answer 2

Answered by julianstanley

After working through this last night, I came to the following answer:

import pandas as pd
import scipy.stats
import seaborn as sns

# 'data' is the DataFrame imported earlier;
# list_1 and list_2 hold the column names you want to compare
plotDict = {}
for gene1 in list_1:
    for gene2 in list_2:
        # pearsonr returns (r, p-value); store the result keyed by the column pair
        plotDict[(gene1, gene2)] = scipy.stats.pearsonr(data[gene1], data[gene2])
# Unstack the dictionary into a DataFrame (rows: list_1, columns: list_2)
dfOutput = pd.Series(plotDict).unstack()
# Optional: take just the Pearson r value out of each (r, p-value) result
dfOutputPearson = dfOutput.apply(lambda col: col.map(lambda res: res[0])).astype(float)
# Optional: generate a heatmap
sns.heatmap(dfOutputPearson)

Much like the other answers, this generates a heatmap, but it scales to a 20,000x30 matrix without computing the correlations for all 20,000x20,000 combinations, and therefore terminates much sooner.
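
For larger inputs, the pairwise loop can probably be replaced by one matrix product over standardized columns. The following is only a sketch under stated assumptions: the helper name `cross_corr` and the `demo` frame are hypothetical, and the data is assumed to contain no missing values and no constant columns (either would produce NaN).

```python
import numpy as np
import pandas as pd

def cross_corr(data, list_1, list_2=None):
    """Pearson r between the columns in list_1 and those in list_2 (default: all columns)."""
    if list_2 is None:
        list_2 = list(data.columns)
    a = data[list_1].to_numpy(dtype=float)
    b = data[list_2].to_numpy(dtype=float)
    # Standardize each column: zero mean, unit population standard deviation
    a = (a - a.mean(axis=0)) / a.std(axis=0)
    b = (b - b.mean(axis=0)) / b.std(axis=0)
    # The mean of products of z-scores is the Pearson correlation
    r = a.T @ b / len(data)
    return pd.DataFrame(r, index=list_1, columns=list_2)

# Small hypothetical frame to exercise the helper
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.standard_normal((50, 7)), columns=list('ABCDEFG'))
result = cross_corr(demo, ['A', 'B', 'C'])
print(result.shape)  # (3, 7)
```

Only the len(list_1) x len(list_2) block is ever computed, so the full square matrix is never built.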

Answer 3

Answered by Marcel Flygare

Usually, calculating correlation coefficients pairwise for all variables makes the most sense. DataFrame.corr() is a convenience function that calculates the correlation coefficient for all pairs of columns. With scipy, you can instead compute it only for specified pairs within a loop.

Example:

d=pd.DataFrame([[1,5,8],[2,5,4],[7,3,1]], columns=['A','B','C'])

One pair in pandas could be:

d.corr().loc['A','B']

-0.98782916114726194

Equivalent in scipy:

import scipy.stats
scipy.stats.pearsonr(d['A'].values,d['B'].values)[0]

-0.98782916114726194
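
For one column against every column of the frame, pandas also provides DataFrame.corrwith. A minimal sketch using the same small frame `d` follows; the list `cols` is a hypothetical stand-in for the columns of interest.

```python
import pandas as pd

d = pd.DataFrame([[1, 5, 8], [2, 5, 4], [7, 3, 1]], columns=['A', 'B', 'C'])

# Correlation of column 'A' with every column of d (returns a Series)
a_vs_all = d.corrwith(d['A'])

# One row per column of interest, one column per column of d
cols = ['A', 'B']  # hypothetical list of columns of interest
table = pd.DataFrame({col: d.corrwith(d[col]) for col in cols}).T

print(table.shape)  # (2, 3)
```

Here a_vs_all['B'] matches the single-pair value shown above, and `table` has the rectangular shape asked for in the question.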
