pandas: calculating pairwise correlation between all columns

Question

Asked by z991

I am working with a large biological dataset.

I want to calculate the PCC (Pearson's correlation coefficient) of all 2-column combinations in my data table and save the result as a DataFrame or CSV file.

The data table looks like the one below: columns are gene names, and rows are dataset codes. The float values indicate how strongly each gene is activated in the dataset.

      GeneA GeneB GeneC ...
DataA 1.5 2.5 3.5 ...
DataB 5.5 6.5 7.5 ...
DataC 8.5 8.5 8.5 ...
...

As output, I want to build a table (DataFrame or CSV file) like the one below, because the scipy.stats.pearsonr function returns (PCC, p-value). In my example, XX and YY are the results of pearsonr([1.5, 5.5, 8.5], [2.5, 6.5, 8.5]). Likewise, ZZ and AA are the results of pearsonr([1.5, 5.5, 8.5], [3.5, 7.5, 8.5]). I do not need redundant pairs such as GeneB_GeneA or GeneC_GeneB.

               PCC P-value
GeneA_GeneB    XX YY
GeneA_GeneC    ZZ AA
GeneB_GeneC    BB CC
...

Because there are many columns and rows (over 100) and their names are complicated, referring to columns or rows by name would be difficult.

This might be a simple problem for experts, but I do not know how to handle this kind of table with Python and the pandas library. In particular, creating a new DataFrame and adding the results to it seems very difficult.

Sorry for my poor explanation, but I hope someone could help me.

Answer 1

Answered by Stefan

import itertools

import numpy as np
from pandas import DataFrame
from scipy.stats import pearsonr

Creating random sample data:

df = DataFrame(np.random.random((5, 5)), columns=['gene_' + chr(i + ord('a')) for i in range(5)]) 
print(df)

     gene_a    gene_b    gene_c    gene_d    gene_e
0  0.471257  0.854139  0.781204  0.678567  0.697993
1  0.292909  0.046159  0.250902  0.064004  0.307537
2  0.422265  0.646988  0.084983  0.822375  0.713397
3  0.113963  0.016122  0.227566  0.206324  0.792048
4  0.357331  0.980479  0.157124  0.560889  0.973161

correlations = {}
columns = df.columns.tolist()

for col_a, col_b in itertools.combinations(columns, 2):
    correlations[col_a + '__' + col_b] = pearsonr(df.loc[:, col_a], df.loc[:, col_b])

result = DataFrame.from_dict(correlations, orient='index')
result.columns = ['PCC', 'p-value']

print(result.sort_index())

                     PCC   p-value
gene_a__gene_b  0.461357  0.434142
gene_a__gene_c  0.177936  0.774646
gene_a__gene_d -0.854884  0.064896
gene_a__gene_e -0.155440  0.802887
gene_b__gene_c -0.575056  0.310455
gene_b__gene_d -0.097054  0.876621
gene_b__gene_e  0.061175  0.922159
gene_c__gene_d -0.633302  0.251381
gene_c__gene_e -0.771120  0.126836
gene_d__gene_e  0.531805  0.356315

Get the unique combinations of DataFrame columns using itertools.combinations(iterable, r)
Iterate through these combinations and calculate the pairwise correlations using scipy.stats.pearsonr
Add the results (PCC and p-value tuple) to a dictionary
Build a DataFrame from the dictionary

You could then also save the result with result.to_csv(). You might find it convenient to use a MultiIndex (two index levels containing the names of the two columns in each pair) instead of the concatenated names for the pairwise correlations.
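
The MultiIndex suggestion could look roughly like the sketch below. It is only an illustration: the sample data, the column names, and the level names gene_1 and gene_2 are made up for the example and are not part of the original answer.

```python
import itertools

import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Illustrative data only: 5 rows and 3 hypothetical gene columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((5, 3)), columns=["gene_a", "gene_b", "gene_c"])

pairs = list(itertools.combinations(df.columns, 2))
rows = []
for col_a, col_b in pairs:
    pcc, p_value = pearsonr(df[col_a], df[col_b])
    rows.append((pcc, p_value))

# Each index level holds one of the two column names of a pair
# (the level names are hypothetical).
index = pd.MultiIndex.from_tuples(pairs, names=["gene_1", "gene_2"])
result = pd.DataFrame(rows, index=index, columns=["PCC", "p-value"])

print(result)
# Select every pair whose first column is gene_a:
print(result.loc["gene_a"])

# Without a path, to_csv returns the CSV text; both index levels become columns.
csv_text = result.to_csv()
```

With this layout, a single pair can be looked up as result.loc[("gene_a", "gene_b")] without building or parsing a combined name.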

Answer 2

Answered by chenzhongpu

Getting the pairs is a combinations problem. You can concat all the per-pair rows into one result DataFrame.

import pandas
from itertools import combinations
from scipy.stats import pearsonr

# index_col=0 assumes the first CSV column holds the row labels (DataA, DataB, ...)
df = pandas.read_csv('gene.csv', index_col=0)
# get the column names as a list; these are the gene names
column_list = df.columns.values.tolist()
result = []
for first_gene, second_gene in combinations(column_list, 2):
    first_gene_data = df[first_gene].tolist()
    second_gene_data = df[second_gene].tolist()
    # get the PCC and p-value using scipy
    pcc, p_value = pearsonr(first_gene_data, second_gene_data)
    # one single-row DataFrame per pair, indexed by "GeneA_GeneB"
    result.append(pandas.DataFrame([{'PCC': pcc, 'P-value': p_value}],
                                   index=[str(first_gene) + '_' + str(second_gene)],
                                   columns=['PCC', 'P-value']))

result_df = pandas.concat(result)
#result_df.to_csv(...)
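
If only the coefficients are needed and the p-values can be skipped, the explicit loop may not be necessary. The sketch below is an alternative to the answers here, not part of them: it uses pandas' built-in DataFrame.corr and keeps only the entries above the diagonal, so each pair appears once. The sample data and column names are made up for the example.

```python
import numpy as np
import pandas as pd

# Illustrative data only: 5 rows and 4 hypothetical gene columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((5, 4)),
                  columns=["gene_a", "gene_b", "gene_c", "gene_d"])

# Full symmetric correlation matrix (Pearson is also the default method).
corr = df.corr(method="pearson")

# Row/column positions strictly above the diagonal: one entry per unique pair.
i, j = np.triu_indices(len(corr.columns), k=1)
labels = [f"{corr.columns[a]}_{corr.columns[b]}" for a, b in zip(i, j)]

pcc = pd.Series(corr.to_numpy()[i, j], index=labels, name="PCC")
print(pcc)
```

This yields no p-values, so it does not replace the pearsonr-based loop when significance is needed.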

Answer 3

Answered by Raphael

A simple solution is to use the pairwise_corr function of the Pingouin package (which I created):

import pingouin as pg
pg.pairwise_corr(data, method='pearson')

This will give you a DataFrame with all combinations of columns, and, for each of those, the r-value, p-value, sample size, and more.

There are also a number of options to specify one or more columns (e.g. one-vs-all behavior), as well as covariates for partial correlation and different methods to calculate the correlation coefficient. Please see this example Jupyter Notebook for a more in-depth demo.
