How to do Pearson correlation of selected columns of a Pandas data frame in Python

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/34896455/

Time: 2020-08-19 15:41:03  Source: igfitidea

python pandas

Asked by neversaint

I have a CSV that looks like this:

gene,stem1,stem2,stem3,b1,b2,b3,special_col
foo,20,10,11,23,22,79,3
bar,17,13,505,12,13,88,1
qui,17,13,5,12,13,88,3

As a data frame, it looks like this:

In [17]: import pandas as pd
In [20]: df = pd.read_table("http://dpaste.com/3PQV3FA.txt",sep=",")
In [21]: df
Out[21]:
  gene  stem1  stem2  stem3  b1  b2  b3  special_col
0  foo     20     10     11  23  22  79            3
1  bar     17     13    505  12  13  88            1
2  qui     17     13      5  12  13  88            3

What I want to do is compute the Pearson correlation between the last column (special_col) and every column between the gene column and special_col, i.e. colnames[1:number_of_column-1].

In the end, we should have a data frame of length 6:

Coln   PearCorr
stem1   0.5
stem2  -0.5
stem3  -0.9999453506011533
b1      0.5
b2      0.5
b3     -0.5

The values above were computed manually, for example:

In [27]: import scipy.stats
In [39]: scipy.stats.pearsonr([3, 1, 3], [11,505,5])
Out[39]: (-0.9999453506011533, 0.0066556395400007278)

How can I do that?

Accepted answer by Phlya

Note that there is a mistake in your data: special_col there is all 3s, so no correlation can be computed.
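To illustrate why a constant column is a problem, here is a minimal sketch. The frame and its column names (`x`, `const`) are hypothetical and not taken from the question's file; the point is only that a column with zero variance has no defined Pearson correlation, so pandas reports NaN.

```python
import pandas as pd

# Hypothetical frame: "const" never varies, so its standard deviation is 0.
frame = pd.DataFrame({"x": [20, 17, 17], "const": [3, 3, 3]})

# Pearson correlation divides by the product of the standard deviations,
# so any pair involving the constant column comes out as NaN.
corr_matrix = frame.corr()
const_vs_x = corr_matrix.loc["x", "const"]

print(corr_matrix)
```

If you see NaN in your own result, checking `frame.nunique()` for columns with a single distinct value is a quick way to find the culprit.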

If you remove the column selection at the end, you'll get a correlation matrix of all the columns you are analysing. The final [:-1] removes the correlation of 'special_col' with itself.

In [15]: data[data.columns[1:]].corr()['special_col'][:-1]
Out[15]: 
stem1    0.500000
stem2   -0.500000
stem3   -0.999945
b1       0.500000
b2       0.500000
b3      -0.500000
Name: special_col, dtype: float64

If you are interested in speed, this is slightly faster on my machine:

In [33]: np.corrcoef(data[data.columns[1:]].T)[-1][:-1]
Out[33]: 
array([ 0.5       , -0.5       , -0.99994535,  0.5       ,  0.5       ,
       -0.5       ])

In [34]: %timeit np.corrcoef(data[data.columns[1:]].T)[-1][:-1]
1000 loops, best of 3: 437 μs per loop

In [35]: %timeit data[data.columns[1:]].corr()['special_col']
1000 loops, best of 3: 526 μs per loop

But obviously, it returns a NumPy array, not a pandas Series/DataFrame.
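If you want the NumPy speed but still need labels, one option is to wrap the array back into a Series yourself. This is only a sketch: the inline CSV copies the sample rows from the question, and the names `numeric`, `coefs` and `result` are my own.

```python
import io

import numpy as np
import pandas as pd

csv_text = """gene,stem1,stem2,stem3,b1,b2,b3,special_col
foo,20,10,11,23,22,79,3
bar,17,13,505,12,13,88,1
qui,17,13,5,12,13,88,3
"""
data = pd.read_csv(io.StringIO(csv_text))

# Drop the string column "gene"; everything left is numeric.
numeric = data[data.columns[1:]]

# np.corrcoef treats each row as one variable, hence the transpose.
# The last row holds special_col against every column; [:-1] drops itself.
coefs = np.corrcoef(numeric.to_numpy().T)[-1][:-1]

# Re-attach the column names so the result reads like the pandas version.
result = pd.Series(coefs, index=numeric.columns[:-1], name="special_col")
print(result)
```

The extra Series construction costs a little time, so it is worth measuring whether the NumPy route still wins on your data.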

Answer by EdChum

You can use apply on your column range with a lambda that calls corr and passes the Series 'special_col':

In [126]:
df[df.columns[1:-1]].apply(lambda x: x.corr(df['special_col']))

Out[126]:
stem1    0.500000
stem2   -0.500000
stem3   -0.999945
b1       0.500000
b2       0.500000
b3      -0.500000
dtype: float64
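The question asked for a two-column data frame (Coln, PearCorr) rather than a Series. A possible way to get that layout from the apply result is sketched below; the inline CSV copies the question's sample rows, and `target`, `corr` and `table` are names I made up for the example.

```python
import io

import pandas as pd

csv_text = """gene,stem1,stem2,stem3,b1,b2,b3,special_col
foo,20,10,11,23,22,79,3
bar,17,13,505,12,13,88,1
qui,17,13,5,12,13,88,3
"""
df = pd.read_csv(io.StringIO(csv_text))

target = df["special_col"]

# Correlate every column between "gene" and "special_col" with the target.
corr = df[df.columns[1:-1]].apply(lambda col: col.corr(target))

# Name the index, then move it into a regular column next to the values.
table = corr.rename_axis("Coln").reset_index(name="PearCorr")
print(table)
```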

Timings

Actually, the other method (taking one column of the full corr() matrix) is quicker, so I expect it to scale better:

In [130]:
%timeit df[df.columns[1:-1]].apply(lambda x: x.corr(df['special_col']))
%timeit df[df.columns[1:]].corr()['special_col']

1000 loops, best of 3: 1.75 ms per loop
1000 loops, best of 3: 836 μs per loop

Answer by Anton Protopopov

Why not just do:

In [34]: df.corr().iloc[:-1,-1]
Out[34]:
stem1    0.500000
stem2   -0.500000
stem3   -0.999945
b1       0.500000
b2       0.500000
b3      -0.500000
Name: special_col, dtype: float64
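One caveat, as far as I know: the session above relies on corr() silently skipping the string column gene. Since pandas 2.0 that is no longer the default, so a bare df.corr() on this frame is expected to raise instead. A sketch that should work on both old and new versions selects the numeric columns explicitly (passing numeric_only=True to corr() is an alternative on recent versions). The inline CSV copies the question's sample rows; `numeric` and `corr_last` are names chosen for the example.

```python
import io

import pandas as pd

csv_text = """gene,stem1,stem2,stem3,b1,b2,b3,special_col
foo,20,10,11,23,22,79,3
bar,17,13,505,12,13,88,1
qui,17,13,5,12,13,88,3
"""
df = pd.read_csv(io.StringIO(csv_text))

# Keep only numeric columns so the string column "gene" cannot break corr().
numeric = df.select_dtypes(include="number")

# Last column of the matrix is special_col; drop its final row (itself).
corr_last = numeric.corr().iloc[:-1, -1]
print(corr_last)
```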

or:

In [39]: df.corr().ix['special_col', :-1]
Out[39]:
stem1    0.500000
stem2   -0.500000
stem3   -0.999945
b1       0.500000
b2       0.500000
b3      -0.500000
Name: special_col, dtype: float64
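Note that the .ix indexer was deprecated and then removed in pandas 1.0, so the second form no longer runs on current versions. Two rough equivalents are sketched below, one mixing a label lookup with a positional slice and one using labels only. The inline CSV copies the question's sample rows, and the variable names are mine.

```python
import io

import pandas as pd

csv_text = """gene,stem1,stem2,stem3,b1,b2,b3,special_col
foo,20,10,11,23,22,79,3
bar,17,13,505,12,13,88,1
qui,17,13,5,12,13,88,3
"""
df = pd.read_csv(io.StringIO(csv_text))

# Drop the string column up front so corr() only sees numeric data.
corr = df.drop(columns="gene").corr()

# Label lookup for the row, then a positional slice to drop the last entry.
by_label_then_position = corr.loc["special_col"].iloc[:-1]

# Labels only: .loc slices include both endpoints.
by_label_only = corr.loc["special_col", "stem1":"b3"]

print(by_label_then_position)
```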

Timings

In [35]: %timeit df.corr().iloc[-1,:-1]
1000 loops, best of 3: 576 us per loop

In [40]: %timeit df.corr().ix['special_col', :-1]
1000 loops, best of 3: 634 us per loop

In [36]: %timeit df[df.columns[1:]].corr()['special_col']
1000 loops, best of 3: 968 us per loop

In [37]: %timeit df[df.columns[1:-1]].apply(lambda x: x.corr(df['special_col']))
100 loops, best of 3: 2.12 ms per loop

Answer by Naga Pakalapati

pd.DataFrame.corrwith() can be used instead of df.corr().

Pass in the column that you want to correlate with the rest of the columns.

For the specific example above, the code will be: df.corrwith(df['special_col'])

Or simply use df.corr()['special_col'] to compute the full correlation matrix of every column against every other column, and then subset what you need.
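A slightly fuller sketch of the corrwith() route follows. Called on the whole frame, it would also have to deal with the string column gene and would report special_col against itself, so this version drops both first. The inline CSV copies the question's sample rows; `numeric`, `features` and `result` are names invented for the example.

```python
import io

import pandas as pd

csv_text = """gene,stem1,stem2,stem3,b1,b2,b3,special_col
foo,20,10,11,23,22,79,3
bar,17,13,505,12,13,88,1
qui,17,13,5,12,13,88,3
"""
df = pd.read_csv(io.StringIO(csv_text))

# Remove the non-numeric column, then separate the target from the rest.
numeric = df.drop(columns="gene")
features = numeric.drop(columns="special_col")

# corrwith() correlates each remaining column with the given Series.
result = features.corrwith(numeric["special_col"])
print(result)
```

Unlike df.corr(), this computes only the six pairs that are needed instead of the full matrix.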