How to do Pearson correlation of selected columns of a Pandas data frame
Source: http://stackoverflow.com/questions/34896455/ (Stack Overflow, licensed under CC BY-SA 4.0; if you reuse this content you must keep the same license and attribute the original authors)
Asked by neversaint
I have a CSV that looks like this:
gene,stem1,stem2,stem3,b1,b2,b3,special_col
foo,20,10,11,23,22,79,3
bar,17,13,505,12,13,88,1
qui,17,13,5,12,13,88,3
And as data frame it looks like this:
In [17]: import pandas as pd
In [20]: df = pd.read_table("http://dpaste.com/3PQV3FA.txt",sep=",")
In [21]: df
Out[21]:
gene stem1 stem2 stem3 b1 b2 b3 special_col
0 foo 20 10 11 23 22 79 3
1 bar 17 13 505 12 13 88 1
2 qui 17 13 5 12 13 88 3
What I want to do is compute the Pearson correlation of the last column (special_col) with every column between the gene column and special_col, i.e. colnames[1:number_of_column-1].
The result should be a data frame of length 6:
Coln PearCorr
stem1 0.5
stem2 -0.5
stem3 -0.9999453506011533
b1 0.5
b2 0.5
b3 -0.5
The values above were computed manually, for example for stem3:
In [27]: import scipy.stats
In [39]: scipy.stats.pearsonr([3, 1, 3], [11,505,5])
Out[39]: (-0.9999453506011533, 0.0066556395400007278)
How can I do that?
Accepted answer by Phlya
Note that there is a mistake in your data: there, special_col is all 3s, so no correlation can be computed (a constant column has zero variance).
If you remove the column selection at the end, you get the correlation matrix of all the columns you are analysing. The final [:-1] drops the correlation of 'special_col' with itself.
In [15]: data[data.columns[1:]].corr()['special_col'][:-1]
Out[15]:
stem1 0.500000
stem2 -0.500000
stem3 -0.999945
b1 0.500000
b2 0.500000
b3 -0.500000
Name: special_col, dtype: float64
If you are interested in speed, this is slightly faster on my machine:
In [33]: np.corrcoef(data[data.columns[1:]].T)[-1][:-1]
Out[33]:
array([ 0.5 , -0.5 , -0.99994535, 0.5 , 0.5 ,
-0.5 ])
In [34]: %timeit np.corrcoef(data[data.columns[1:]].T)[-1][:-1]
1000 loops, best of 3: 437 μs per loop
In [35]: %timeit data[data.columns[1:]].corr()['special_col']
1000 loops, best of 3: 526 μs per loop
But obviously, it returns a NumPy array, not a pandas Series/DataFrame.
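If the labels matter, the NumPy result can be wrapped back into a Series. This is only a minimal sketch, assuming the question's CSV is pasted inline; the names `csv_text`, `numeric`, `coefs` and `result` are illustrative and not part of the original answer:

```python
import io

import numpy as np
import pandas as pd

# The question's CSV, inlined so the example does not depend on the dpaste URL.
csv_text = """gene,stem1,stem2,stem3,b1,b2,b3,special_col
foo,20,10,11,23,22,79,3
bar,17,13,505,12,13,88,1
qui,17,13,5,12,13,88,3
"""
data = pd.read_csv(io.StringIO(csv_text))

# Drop the non-numeric 'gene' column, as in the answer above.
numeric = data[data.columns[1:]]

# np.corrcoef treats each row as one variable, hence the transpose.
# The last row holds special_col's correlations; [:-1] drops the self-correlation.
coefs = np.corrcoef(numeric.T.to_numpy())[-1][:-1]

# Re-attach the column labels that the plain array lost.
result = pd.Series(coefs, index=numeric.columns[:-1], name="special_col")
print(result)
```

Whether the small speed gain is worth the extra wrapping step probably depends on how large the frame is.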
Answer by EdChum
You can apply on your column range with a lambda that calls corr and pass the Series 'special_col':
In [126]:
df[df.columns[1:-1]].apply(lambda x: x.corr(df['special_col']))
Out[126]:
stem1 0.500000
stem2 -0.500000
stem3 -0.999945
b1 0.500000
b2 0.500000
b3 -0.500000
dtype: float64
Timings
Actually, the other method is quicker, so I expect it to scale better:
In [130]:
%timeit df[df.columns[1:-1]].apply(lambda x: x.corr(df['special_col']))
%timeit df[df.columns[1:]].corr()['special_col']
1000 loops, best of 3: 1.75 ms per loop
1000 loops, best of 3: 836 μs per loop
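The apply approach returns a Series, whereas the question asked for a two-column frame (Coln, PearCorr). One possible way to get that shape is sketched below, assuming the question's CSV is pasted inline; the names `csv_text`, `pear` and `out` are illustrative:

```python
import io

import pandas as pd

# The question's CSV, inlined so the example is self-contained.
csv_text = """gene,stem1,stem2,stem3,b1,b2,b3,special_col
foo,20,10,11,23,22,79,3
bar,17,13,505,12,13,88,1
qui,17,13,5,12,13,88,3
"""
df = pd.read_csv(io.StringIO(csv_text))

# Correlate every column between 'gene' and 'special_col' with 'special_col'.
pear = df[df.columns[1:-1]].apply(lambda col: col.corr(df["special_col"]))

# Turn the Series into the two-column layout shown in the question.
out = pear.rename_axis("Coln").reset_index(name="PearCorr")
print(out)
```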
Answer by Anton Protopopov
Why not just do:
In [34]: df.corr().iloc[:-1,-1]
Out[34]:
stem1 0.500000
stem2 -0.500000
stem3 -0.999945
b1 0.500000
b2 0.500000
b3 -0.500000
Name: special_col, dtype: float64
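or (note that .ix was deprecated and then removed in pandas 1.0; on current versions use df.corr().loc['special_col'].iloc[:-1] instead):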
In [39]: df.corr().ix['special_col', :-1]
Out[39]:
stem1 0.500000
stem2 -0.500000
stem3 -0.999945
b1 0.500000
b2 0.500000
b3 -0.500000
Name: special_col, dtype: float64
Timings
In [35]: %timeit df.corr().iloc[-1,:-1]
1000 loops, best of 3: 576 us per loop
In [40]: %timeit df.corr().ix['special_col', :-1]
1000 loops, best of 3: 634 us per loop
In [36]: %timeit df[df.columns[1:]].corr()['special_col']
1000 loops, best of 3: 968 us per loop
In [37]: %timeit df[df.columns[1:-1]].apply(lambda x: x.corr(df['special_col']))
100 loops, best of 3: 2.12 ms per loop
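The session above reflects the pandas version of its time. On recent releases .ix no longer exists, and, as far as I know, calling corr() on a frame that still contains the string column 'gene' may raise instead of silently skipping it. The following is a hedged sketch of the same idea that avoids both issues by selecting the numeric columns first; the names `csv_text`, `corr_matrix`, `by_row` and `by_col` are illustrative:

```python
import io

import pandas as pd

# The question's CSV, inlined so the example is self-contained.
csv_text = """gene,stem1,stem2,stem3,b1,b2,b3,special_col
foo,20,10,11,23,22,79,3
bar,17,13,505,12,13,88,1
qui,17,13,5,12,13,88,3
"""
df = pd.read_csv(io.StringIO(csv_text))

# Keep only numeric columns so 'gene' never reaches corr().
corr_matrix = df.select_dtypes(include="number").corr()

# Label-based row lookup, then drop the trailing self-correlation by position.
by_row = corr_matrix.loc["special_col"].iloc[:-1]

# Purely positional equivalent: last column, all rows except the last.
by_col = corr_matrix.iloc[:-1, -1]

print(by_row)
```

Both selections rely on special_col being the last column; if that is not guaranteed, dropping it by label with `.drop("special_col")` is probably the safer choice.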
Answer by Naga Pakalapati
pd.DataFrame.corrwith() can be used instead of df.corr().
Pass in the column whose correlation with the rest of the columns you want.
For the specific example above, the code would be: df.corrwith(df['special_col'])
Or simply use df.corr()['special_col'] to compute the full correlation matrix of every column with every other column, and then subset what you need.
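A self-contained sketch of the corrwith() route, assuming the question's CSV is pasted inline. Dropping 'gene' and 'special_col' before the call is my addition, so that the result contains only the six requested columns; `csv_text`, `features` and `result` are illustrative names:

```python
import io

import pandas as pd

# The question's CSV, inlined so the example is self-contained.
csv_text = """gene,stem1,stem2,stem3,b1,b2,b3,special_col
foo,20,10,11,23,22,79,3
bar,17,13,505,12,13,88,1
qui,17,13,5,12,13,88,3
"""
df = pd.read_csv(io.StringIO(csv_text))

# Leave out the string column and the target itself.
features = df.drop(columns=["gene", "special_col"])

# Pairwise Pearson correlation of each remaining column with special_col.
result = features.corrwith(df["special_col"])
print(result)
```

Unlike df.corr(), this computes only the correlations against the one target column rather than the full matrix.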