Pandas Correlation Groupby

Question

Asked by bsheehy

Assuming I have a dataframe similar to the below, how would I get the correlation between 2 specific columns and then group by the 'ID' column? I believe the Pandas 'corr' method finds the correlation between all columns. If possible I would also like to know how I could find the 'groupby' correlation using the .agg function (i.e. np.correlate).

What I have:

ID  Val1    Val2    OtherData   OtherData
A   5       4       x           x
A   4       5       x           x
A   6       6       x           x
B   4       1       x           x
B   8       2       x           x
B   7       9       x           x
C   4       8       x           x
C   5       5       x           x
C   2       1       x           x

What I need:

ID  Correlation_Val1_Val2
A   0.12
B   0.22
C   0.05

Thanks!

Answer 1

Answered by JohnE

You pretty much figured out all the pieces; you just need to combine them:

>>> df.groupby('ID')[['Val1','Val2']].corr()

             Val1      Val2
ID                         
A  Val1  1.000000  0.500000
   Val2  0.500000  1.000000
B  Val1  1.000000  0.385727
   Val2  0.385727  1.000000

In your case, printing out a 2x2 for each ID is excessively verbose. I don't see an option to print a scalar correlation instead of the whole matrix, but you can do something simple like this if you only have two variables:

>>> df.groupby('ID')[['Val1','Val2']].corr().iloc[0::2,-1]

ID       
A   Val1    0.500000
B   Val1    0.385727
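
As an alternative to slicing the 2x2 matrices, the scalar correlation can be computed per group directly. The following is a minimal sketch, assuming the sample data from the question with the OtherData columns left out; it uses Series.corr inside groupby(...).apply rather than .agg, because a correlation needs both columns at once. Note that np.correlate computes a cross-correlation of two sequences, not a Pearson coefficient, so it is not used here.

```python
import pandas as pd

# Sample data mirroring the question (the OtherData columns are omitted).
df = pd.DataFrame({
    "ID":   ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "Val1": [5, 4, 6, 4, 8, 7, 4, 5, 2],
    "Val2": [4, 5, 6, 1, 2, 9, 8, 5, 1],
})

# Pearson correlation of Val1 and Val2 within each ID group.
# Selecting the two columns first keeps the grouping column out of each chunk.
result = (
    df.groupby("ID")[["Val1", "Val2"]]
      .apply(lambda g: g["Val1"].corr(g["Val2"]))
      .rename("Correlation_Val1_Val2")
)
print(result)
```

The result is a Series indexed by ID with one value per group, which is the shape the question asks for.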

For the more general case of 3+ variables

For 3 or more variables, it is not straightforward to create concise output but you could do something like this:

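import pandas as pd

groups = ['Val1', 'Val2', 'Val3', 'Val4']
# Stack the per-ID correlation matrices into one Series indexed by (ID, v1, v2)
corr = df.groupby('ID')[groups].corr().stack()
pieces = []
for i in range(len(groups) - 1):
    # For each ID, keep the pairs of groups[i] with the variables after it
    pieces.append(corr.loc[:, groups[i], groups[i+1]:].reset_index())

# DataFrame.append was removed in pandas 2.0, so combine the pieces with pd.concat
df2 = pd.concat(pieces)
df2.columns = ['ID', 'v1', 'v2', 'corr']
df2.set_index(['ID','v1','v2']).sort_index()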

Note that if we didn't have the groupby element, it would be straightforward to use an upper or lower triangle function from numpy. But since that element is present, I don't know of a more elegant way to produce concise output.
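
Another way to get one row per ID and one column per variable pair is to loop over the pairs instead of slicing the stacked matrices. This is only a sketch under assumptions: the Val3 column is hypothetical (set to twice Val1 purely for illustration), and the column names of the form Val1_Val2 are an arbitrary choice.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({
    "ID":   ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "Val1": [5, 4, 6, 4, 8, 7, 4, 5, 2],
    "Val2": [4, 5, 6, 1, 2, 9, 8, 5, 1],
    "Val3": [10, 8, 12, 8, 16, 14, 8, 10, 4],  # hypothetical: 2 * Val1
})

cols = ["Val1", "Val2", "Val3"]
grouped = df.groupby("ID")

# One column per unordered pair of variables, one row per ID.
pair_corr = pd.DataFrame({
    f"{a}_{b}": grouped[[a, b]].apply(lambda g, a=a, b=b: g[a].corr(g[b]))
    for a, b in combinations(cols, 2)
})
print(pair_corr)
```

This runs one groupby pass per pair, so it trades some speed for a flat, readable result; for many columns the stacked-matrix approach above may be preferable.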

Answer 2

Answered by Ravaging Care

Regarding the answer above: since ix has been deprecated, use iloc instead, along with a few other minor changes:

df.groupby('ID')[['Val1','Val2']].corr().iloc[0::2][['Val2']] # to get pandas DataFrame

or

df.groupby('ID')[['Val1','Val2']].corr().iloc[0::2]['Val2'] # to get pandas Series

Answer 3

Answered by VovaM

One more simple solution:

df.groupby('ID')[['Val1','Val2']].corr().unstack().iloc[:,1]
