Python: use .corr to get the correlation between two columns

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/42579908/

Time: 2020-08-19 21:55:09  Source: igfitidea

Use .corr to get the correlation between two columns

Tags: python, pandas, correlation

Asked by tong zhu

I have the following pandas DataFrame, Top15: [image of the dataframe not available]

I create a column that estimates the number of citable documents per person:

Top15['PopEst'] = Top15['Energy Supply'] / Top15['Energy Supply per Capita']
Top15['Citable docs per Capita'] = Top15['Citable documents'] / Top15['PopEst']

I want to know the correlation between the number of citable documents per capita and the energy supply per capita, so I use the .corr() method (Pearson's correlation):

data = Top15[['Citable docs per Capita','Energy Supply per Capita']]
correlation = data.corr(method='pearson')

I want to return a single number, but the result is: [image of the output not available]

Answered by Cleb

Without actual data it is hard to answer the question but I guess you are looking for something like this:

Top15['Citable docs per Capita'].corr(Top15['Energy Supply per Capita'])

That calculates the correlation between your two columns 'Citable docs per Capita' and 'Energy Supply per Capita'.

To give an example:

import pandas as pd

df = pd.DataFrame({'A': range(4), 'B': [2*i for i in range(4)]})

   A  B
0  0  0
1  1  2
2  2  4
3  3  6

Then

df['A'].corr(df['B'])

gives 1, as expected.

Now, if you change a value, e.g.

df.loc[2, 'B'] = 4.5

   A    B
0  0  0.0
1  1  2.0
2  2  4.5
3  3  6.0

the command

df['A'].corr(df['B'])

returns

0.99586

which is still close to 1, as expected.

If you apply .corr directly to your dataframe, it returns all pairwise correlations between your columns; that is why you see 1s on the diagonal of your matrix (each column is perfectly correlated with itself).

df.corr()

will therefore return

          A         B
A  1.000000  0.995862
B  0.995862  1.000000
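
If the pairwise matrix has already been computed, the single coefficient can also be read out of it by label. The following is a minimal sketch that rebuilds the toy A/B frame from the example above; the variable names matrix, from_matrix and from_series are only illustrative.

```python
import pandas as pd

# Same toy data as in the example above, after changing one value in B.
df = pd.DataFrame({'A': range(4), 'B': [0, 2, 4.5, 6]})

matrix = df.corr()                    # full pairwise matrix (2 x 2 here)
from_matrix = matrix.loc['A', 'B']    # one off-diagonal entry, selected by label
from_series = df['A'].corr(df['B'])   # the same coefficient, computed directly

print(round(from_matrix, 6))
print(round(from_series, 6))
```

Both lines should print the same number, since the off-diagonal entry of the matrix is exactly the Series-to-Series correlation.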

In the graphic you show, only the upper left corner of the correlation matrix is represented (I assume).

There can be cases where you get NaNs in your result; check this post for an example.
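
The linked post is not reproduced here, so the following is only a sketch of two situations that yield NaN, using made-up series: a column with zero variance, and too few complete pairs once missing values are dropped. The names x, constant and y are hypothetical.

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0])
constant = pd.Series([5.0, 5.0, 5.0, 5.0])

# A constant column has zero variance, so Pearson's r is undefined (0 / 0).
# errstate only silences the NumPy floating-point warning for this case.
with np.errstate(divide='ignore', invalid='ignore'):
    r_constant = x.corr(constant)

# Missing values are dropped pair by pair; only 2 complete pairs remain here,
# which is fewer than the min_periods requested, so the result is NaN.
y = pd.Series([2.0, np.nan, np.nan, 8.0])
r_too_few = x.corr(y, min_periods=3)

print(r_constant, r_too_few)
```

Checking a column with nunique() or counting its non-missing values before calling .corr is a cheap way to spot both situations.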

If you want to filter entries above or below a certain threshold, you can check this question. If you want to plot a heatmap of the correlation coefficients, you can check this answer, and if you then run into the issue of overlapping axis labels, check the following post.
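
The linked question is not reproduced here. As a rough sketch of threshold filtering, assuming a made-up three-column frame and an arbitrary cutoff of 0.9, one option is to keep only the upper triangle of the matrix and filter the stacked pairs:

```python
import numpy as np
import pandas as pd

# Made-up data: A and B are almost linearly related, C is not related to either.
df = pd.DataFrame({
    'A': [0, 1, 2, 3, 4],
    'B': [0, 2, 4.5, 6, 8],
    'C': [1, -1, 1, -1, 1],
})

corr = df.corr()

# Keep each pair once: mask out the diagonal and the lower triangle.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

# Keep only pairs whose absolute correlation exceeds the (arbitrary) cutoff.
strong = pairs[pairs.abs() > 0.9]
print(strong)
```

The comparison is written so that any NaN entries left by the mask are simply filtered out, whichever way stack() treats them.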

Answered by Gary

I ran into the same issue. It appeared that Citable Documents per Person was a float, and Python somehow skips it by default. All the other columns of my dataframe were in NumPy formats, so I solved it by converting the column to np.float64:

Top15['Citable Documents per Person']=np.float64(Top15['Citable Documents per Person'])

Remember that this is exactly the column you calculated yourself.

Answered by ibozkurt79

My solution, after converting the data to a numeric type, would be:

Top15[['Citable docs per Capita','Energy Supply per Capita']].corr()
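
The conversion step itself is not shown in this answer. Here is a minimal sketch of what it might look like, assuming the computed column ended up with object dtype; the frame top15 and its values are made up for illustration and are not the asker's data.

```python
import pandas as pd

# Made-up stand-in for Top15: the computed column is stored as object dtype.
top15 = pd.DataFrame({
    'Energy Supply per Capita': [93.0, 286.0, 149.0, 296.0],
    'Citable docs per Capita': pd.Series([0.9, 2.9, 1.4, 3.1], dtype=object),
})
print(top15.dtypes)   # the second column reports object, not float64

# Convert the column to a real float dtype before correlating.
top15['Citable docs per Capita'] = top15['Citable docs per Capita'].astype(float)

r = top15['Citable docs per Capita'].corr(top15['Energy Supply per Capita'])
print(r)
```

As far as I know, older pandas versions silently leave non-numeric columns out of DataFrame.corr(), which would explain a result that is missing the expected column.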

Answered by Orca

It works like this:

Top15['Citable docs per Capita']=np.float64(Top15['Citable docs per Capita'])

Top15['Energy Supply per Capita']=np.float64(Top15['Energy Supply per Capita'])

Top15['Energy Supply per Capita'].corr(Top15['Citable docs per Capita'])

Answered by aumpen

When you call this:

data = Top15[['Citable docs per Capita','Energy Supply per Capita']]
correlation = data.corr(method='pearson')

Since DataFrame.corr() computes pairwise correlations, two variables give you four pairs. The diagonal values are the correlation of each variable with itself (two values, because you have two variables), and the other two values are the cross-correlations of one variable against the other and vice versa.

Either perform correlation between two series to get a single value:

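from scipy.stats import pearsonr  # public import path; scipy.stats.stats is deprecated
docs_col = Top15['Citable docs per Capita'].values
energy_col = Top15['Energy Supply per Capita'].values
# pearsonr returns the coefficient and a p-value; keep only the coefficient
corr, _ = pearsonr(docs_col, energy_col)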

or, if you want a single value from the same function (DataFrame's corr):

single_value = correlation.iloc[0, 1]  # row 0, column 1; correlation[0][1] would look for a column labelled 0

Hope this helps.

Answered by mgoldwasser

If you want the correlations between all pairs of columns, you could do something like this:

import pandas as pd
import numpy as np

def get_corrs(df):
    col_correlations = df.corr()
    col_correlations.loc[:, :] = np.tril(col_correlations, k=-1)
    cor_pairs = col_correlations.stack()
    return cor_pairs.to_dict()

my_corrs = get_corrs(df)
# and the following line to retrieve the single correlation
print(my_corrs[('Citable docs per Capita','Energy Supply per Capita')])
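
One thing to watch with the function above, as far as I can tell: the entries on and above the diagonal are set to zero rather than removed, so the dictionary keys are (row, column) pairs and a lookup in the other order returns 0.0. The following is a sketch of a variant that lists each unordered pair once; pairwise_corrs and the sample frame are hypothetical names and data, not part of the original answer.

```python
from itertools import combinations

import pandas as pd


def pairwise_corrs(frame):
    """Return {(col_a, col_b): r}, with each unordered pair of columns listed once."""
    corr = frame.corr()
    return {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}


# Made-up data for illustration.
df = pd.DataFrame({'A': range(4), 'B': [0, 2, 4.5, 6], 'C': [3, 1, 2, 0]})

pairs = pairwise_corrs(df)
print(sorted(pairs))        # keys follow the column order of the frame
print(pairs[('A', 'B')])
```

Keys follow the column order of the frame, so a pair is always looked up with the earlier column first.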

Answered by BID

I solved this problem by changing the data type. If you look closely, 'Energy Supply per Capita' is a numerical type while 'Citable docs per Capita' is an object type. I converted the column to float using astype. I had the same problem with some np functions: count_nonzero and sum worked, while mean and std didn't.
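
When a column is non-numeric because it holds text (for example numbers read in as strings, with a placeholder for missing entries), astype(float) fails on the placeholder. A possible alternative, sketched here with made-up data and the hypothetical names raw and clean, is pd.to_numeric with errors='coerce', which turns unparseable entries into NaN:

```python
import pandas as pd

# Made-up data: the second column holds strings, including one placeholder.
raw = pd.DataFrame({
    'Energy Supply per Capita': [93.0, 286.0, 149.0, 296.0, 221.0],
    'Citable docs per Capita': ['0.9', '2.9', 'n/a', '3.1', '2.2'],
})
print(raw.dtypes)   # the second column is reported with a non-numeric dtype

clean = raw.copy()
# Entries that cannot be parsed as numbers become NaN instead of raising.
clean['Citable docs per Capita'] = pd.to_numeric(
    clean['Citable docs per Capita'], errors='coerce'
)
n_missing = int(clean['Citable docs per Capita'].isna().sum())

# Series.corr ignores pairs in which either value is missing.
r = clean['Citable docs per Capita'].corr(clean['Energy Supply per Capita'])
print(n_missing, r)
```

Counting the coerced NaNs afterwards shows how many rows were silently excluded from the correlation.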