Original source: http://stackoverflow.com/questions/28988627/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
Stack Overflow
Pandas Correlation Groupby
Asked by bsheehy
Assuming I have a dataframe similar to the below, how would I get the correlation between 2 specific columns and then group by the 'ID' column? I believe the Pandas 'corr' method finds the correlation between all columns. If possible I would also like to know how I could find the 'groupby' correlation using the .agg function (i.e. np.correlate).
What I have:
ID  Val1  Val2  OtherData  OtherData
A   5     4     x          x
A   4     5     x          x
A   6     6     x          x
B   4     1     x          x
B   8     2     x          x
B   7     9     x          x
C   4     8     x          x
C   5     5     x          x
C   2     1     x          x
What I need:
ID  Correlation_Val1_Val2
A   0.12
B   0.22
C   0.05
Thanks!
Answered by JohnE
You pretty much figured out all the pieces, just need to combine them:
>>> df.groupby('ID')[['Val1','Val2']].corr()
             Val1      Val2
ID
A  Val1  1.000000  0.500000
   Val2  0.500000  1.000000
B  Val1  1.000000  0.385727
   Val2  0.385727  1.000000
In your case, printing out a 2x2 for each ID is excessively verbose. I don't see an option to print a scalar correlation instead of the whole matrix, but you can do something simple like this if you only have two variables:
>>> df.groupby('ID')[['Val1','Val2']].corr().iloc[0::2,-1]
ID
A   Val1    0.500000
B   Val1    0.385727
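If one scalar per ID is all that is needed, a possible alternative (not part of the original answer) is to call Series.corr inside groupby(...).apply. The sketch below rebuilds the question's data without the OtherData columns, and the output name Correlation_Val1_Val2 is taken from the question. On the .agg idea: np.correlate computes a cross-correlation of two sequences rather than a Pearson coefficient, so it would not give the requested numbers.

```python
import pandas as pd

# Sample data copied from the question (the OtherData columns are omitted).
df = pd.DataFrame({
    'ID':   ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'Val1': [5, 4, 6, 4, 8, 7, 4, 5, 2],
    'Val2': [4, 5, 6, 1, 2, 9, 8, 5, 1],
})

# Select the two columns first, then compute one Pearson correlation per group.
per_group = (
    df.groupby('ID')[['Val1', 'Val2']]
      .apply(lambda g: g['Val1'].corr(g['Val2']))
      .rename('Correlation_Val1_Val2')
)
print(per_group)
```

The result is a Series indexed by ID; for groups A and B it should agree with the 0.500000 and 0.385727 shown above.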
For the more general case of 3+ variables
For 3 or more variables, it is not straightforward to create concise output but you could do something like this:
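groups = ['Val1', 'Val2', 'Val3', 'Val4']
# Stack each group's correlation matrix into a Series indexed by (ID, v1, v2)
corr = df.groupby('ID')[groups].corr().stack()
corr.index.names = ['ID', 'v1', 'v2']
df2 = corr.rename('corr').reset_index()
# Keep each pair once: v1 must come before v2 in the order given by groups
pos = {g: i for i, g in enumerate(groups)}
df2 = df2[df2['v1'].map(pos) < df2['v2'].map(pos)]
df2.set_index(['ID', 'v1', 'v2']).sort_index()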
Note that if we didn't have the groupby element, it would be straightforward to use an upper or lower triangle function from numpy. But since that element is present, it is not so easy to produce concise output in a more elegant manner, as far as I can tell.
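To illustrate the numpy triangle idea, here is a minimal sketch that uses np.triu_indices on a single correlation matrix and then reuses the same helper once per group. The helper name upper_triangle_pairs and the three-column sample data are hypothetical, made up for this example; this is one possible approach, not the answer author's method.

```python
import numpy as np
import pandas as pd

def upper_triangle_pairs(corr):
    """Return the pairs above the diagonal of a square correlation matrix."""
    rows, cols = np.triu_indices(len(corr), k=1)  # k=1 skips the diagonal
    return pd.DataFrame({
        'v1': corr.index.to_numpy()[rows],
        'v2': corr.columns.to_numpy()[cols],
        'corr': corr.to_numpy()[rows, cols],
    })

# Hypothetical data: two IDs and three value columns.
df = pd.DataFrame({
    'ID':   ['A'] * 4 + ['B'] * 4,
    'Val1': [1, 2, 3, 4, 4, 3, 2, 1],
    'Val2': [2, 4, 6, 8, 1, 2, 3, 4],
    'Val3': [1, 3, 2, 4, 2, 1, 4, 3],
})
cols = ['Val1', 'Val2', 'Val3']

# Without groupby: one correlation matrix, one pass over its upper triangle.
overall = upper_triangle_pairs(df[cols].corr())

# With groupby: apply the same helper to each group's matrix and label by ID.
per_group = (
    pd.concat(
        {key: upper_triangle_pairs(g[cols].corr()) for key, g in df.groupby('ID')},
        names=['ID'],
    )
    .reset_index(level='ID')
    .reset_index(drop=True)
)
print(per_group)
```

Each group contributes one row per unordered pair of columns, so three columns give three rows per ID.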
Answered by Ravaging Care
In the above answer, since ix has been deprecated, use iloc instead, with some other minor changes:
df.groupby('ID')[['Val1','Val2']].corr().iloc[0::2][['Val2']] # to get pandas DataFrame
or
df.groupby('ID')[['Val1','Val2']].corr().iloc[0::2]['Val2'] # to get pandas Series
Answered by VovaM
One more simple solution:
df.groupby('ID')[['Val1','Val2']].corr().unstack().iloc[:,1]
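As a variation on the line above, the unstacked column can also be picked by label instead of by position, which may be easier to read. The sketch below assumes the question's data (OtherData columns omitted) and borrows the output name Correlation_Val1_Val2 from the question.

```python
import pandas as pd

# Sample data copied from the question (the OtherData columns are omitted).
df = pd.DataFrame({
    'ID':   ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'Val1': [5, 4, 6, 4, 8, 7, 4, 5, 2],
    'Val2': [4, 5, 6, 1, 2, 9, 8, 5, 1],
})

wide = df.groupby('ID')[['Val1', 'Val2']].corr().unstack()
# Columns are now a MultiIndex of (column, row) pairs from each 2x2 matrix;
# pick the Val1/Val2 entry by label rather than by position.
result = wide[('Val1', 'Val2')].rename('Correlation_Val1_Val2').reset_index()
print(result)
```

This yields a two-column DataFrame (ID and Correlation_Val1_Val2) in the layout the question asked for.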

