Notice: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/37969282/

Pearson correlation for all rows in Data Frames Pandas

Tags: python, pandas, correlation, feature-extraction

Asked by Batuhan B

I have a DataFrame in Pandas with shape (136, 1445). I am trying to create a Pearson correlation matrix for my 136 rows, so the result should be a 136x136 matrix.

I tried two different ways, but either I cannot get a result from them, or, when I do create a 136x136 correlation matrix, I lose the column names of the DataFrame.

First,

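import pandas as pd

gene_expression = pd.read_csv('padel_all_drug_results_original.csv', dtype='unicode')
# Note: convert_objects was deprecated and later removed in pandas 1.0.
gene_expression = gene_expression.convert_objects(convert_numeric=True)
gene_expression.corr()  # column-wise Pearson correlation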

This gives the column-based Pearson correlation matrix (1445x1445). When I transpose my DataFrame and then compute the correlation, the structure of the DataFrame is broken (for example, the column names are lost, and I am not even sure whether the correlations are correct).

Secondly,

from scipy.stats import pearsonr

distance = lambda column1, column2: pearsonr(column1, column2)[0]
result = gene_expression.apply(lambda col1: gene_expression.apply(lambda col2: distance(col1, col2)))

What should I do to calculate the 136x136 Pearson correlation matrix without changing the original DataFrame?

Also, I have 1445 features, and some of the columns are nearly full of zeros. I dropped those columns because they are noisy, but do you have any other ideas for feature reduction?
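
The column filtering described above might look roughly like the following. This is only a sketch: the small `features` table, the column names, and the `max_zero_fraction` cutoff are all hypothetical and not taken from the question's data.

```python
import pandas as pd

# Hypothetical feature table: rows are samples, columns are features.
features = pd.DataFrame(
    {
        "dense": [0.5, 1.2, 3.1, 0.7, 2.2],
        "mostly_zero": [0.0, 0.0, 0.0, 0.0, 4.0],
        "all_zero": [0.0, 0.0, 0.0, 0.0, 0.0],
    }
)

max_zero_fraction = 0.7  # assumed cutoff, to be tuned for the data at hand

# Fraction of zero entries in each column.
zero_fraction = (features == 0).mean()

# Keep only columns at or below the cutoff; `features` itself is not modified.
reduced = features.loc[:, zero_fraction <= max_zero_fraction]

print(list(reduced.columns))
```

Because the selection returns a new object, the original table stays available for comparison.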

Thanks in advance

Answered by Stefan

To get the correlation matrix containing pairwise correlation between all rows, you can:

gene_expression.T.corr()
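
On current pandas releases `convert_objects` is no longer available, so the loading step from the question needs a substitute before `.T.corr()` can run. A minimal sketch, assuming the values were read as strings: the small `raw` frame and its labels are made up for illustration, and using `pd.to_numeric` with `errors="coerce"` is a substitution of mine, not part of the original answer.

```python
import pandas as pd

# Hypothetical stand-in for a CSV read with dtype='unicode': every value is a string.
raw = pd.DataFrame(
    {"f1": ["1", "2", "4"], "f2": ["2", "4", "9"], "f3": ["3", "1", "x"]},
    index=["drug_a", "drug_b", "drug_c"],
)

# Convert each column to numbers; entries that cannot be parsed become NaN.
numeric = raw.apply(pd.to_numeric, errors="coerce")

# Transposing returns a new object, so `numeric` itself is left unchanged.
row_corr = numeric.T.corr()

print(row_corr.shape)        # one row and one column per original row
print(list(row_corr.index))  # the row labels are kept on both axes
```

The row labels of the input end up as both the index and the columns of the result, which addresses the concern about losing names after transposing.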

Using a toy example:

import string, numpy as np, pandas as pd
df = pd.DataFrame(np.random.randint(0, high=100, size=(5, 10)), index=list(string.ascii_lowercase[:5]))

with 5 labeled rows and 10 columns:

df.info()
Index: 5 entries, a to e
Data columns (total 10 columns):
0    5 non-null int64
1    5 non-null int64
2    5 non-null int64
3    5 non-null int64
4    5 non-null int64
5    5 non-null int64
6    5 non-null int64
7    5 non-null int64
8    5 non-null int64
9    5 non-null int64
dtypes: int64(10)
memory usage: 440.0+ bytes

Using

df.T.corr()

yields

          a         b         c         d         e
a  1.000000  0.209460 -0.205302 -0.294427  0.353803
b  0.209460  1.000000 -0.530715 -0.117949  0.775848
c -0.205302 -0.530715  1.000000 -0.245101 -0.344358
d -0.294427 -0.117949 -0.245101  1.000000  0.058302
e  0.353803  0.775848 -0.344358  0.058302  1.000000
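
For anyone unsure whether the transposed result is correct, one way to check it is to compare it against two independent computations. The sketch below rebuilds a toy frame like the one above (the fixed seed is an assumption added only for reproducibility) and compares `df.T.corr()` with `np.corrcoef` and with a row-wise variant of the nested `apply` from the question, using `axis=1` so that rows rather than columns are paired.

```python
import string

import numpy as np
import pandas as pd
from scipy.stats import pearsonr

rng = np.random.RandomState(0)  # fixed seed, chosen only for reproducibility
df = pd.DataFrame(
    rng.randint(0, high=100, size=(5, 10)),
    index=list(string.ascii_lowercase[:5]),
)

row_corr = df.T.corr()

# np.corrcoef treats each row as one variable by default (rowvar=True).
reference = np.corrcoef(df.values)

# Row-wise variant of the nested apply from the question: axis=1 iterates over rows.
nested = df.apply(
    lambda r1: df.apply(lambda r2: pearsonr(r1, r2)[0], axis=1),
    axis=1,
)

print(np.allclose(row_corr.values, reference))
print(np.allclose(row_corr.values, nested.values))
print(row_corr.index.equals(df.index) and row_corr.columns.equals(df.index))
```

The nested `apply` is much slower than `df.T.corr()` on a frame with 136 rows and 1445 columns, so it is useful mainly as a sanity check on small inputs.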