pandas: Distance matrix for rows in a pandas DataFrame

Question

Asked by misterte

I have a pandas dataframe that looks as follows:

In [23]: dataframe.head()
Out[23]: 
column_id   1  10  11  12  13  14  15  16  17  18 ...  46  47  48  49   5  50  \
row_id                                            ...                           
1         NaN NaN   1   1   1   1   1   1   1   1 ...   1   1 NaN   1 NaN NaN   
10          1   1   1   1   1   1   1   1   1 NaN ...   1   1   1 NaN   1 NaN   
100         1   1 NaN   1   1   1   1   1 NaN   1 ... NaN NaN   1   1   1 NaN   
11        NaN   1   1   1   1   1   1   1   1 NaN ... NaN   1   1   1   1   1   
12          1   1   1 NaN   1   1   1   1 NaN   1 ...   1 NaN   1   1 NaN   1

The problem is that I'm currently using the Pearson correlation to calculate similarity between rows, and given the nature of the data, the standard deviation is sometimes zero (all values are 1 or NaN), so the Pearson correlation returns this:

In [24]: dataframe.transpose().corr().head()
Out[24]: 
row_id   1  10  100  11  12  13  14  15  16  17 ...  90  91  92  93  94  95  \
row_id                                          ...                           
1      NaN NaN  NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN   
10     NaN NaN  NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN   
100    NaN NaN  NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN   
11     NaN NaN  NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN   
12     NaN NaN  NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN

Is there any other way of computing correlations that avoids this? Maybe an easy way to calculate the Euclidean distance between rows with just one method, just as Pearson correlation has?

Thanks!

A.

Answer 1

Answered by S Anand

The key question here is what distance metric to use.

Let's say this is your data.

>>> import numpy as np
>>> import pandas as pd
>>> data = pd.DataFrame(np.random.rand(100, 50))
>>> data[data > 0.2] = 1
>>> data[data <= 0.2] = np.nan
>>> data.head()
   0   1   2   3   4   5   6   7   8   9  ...  40  41  42  43  44  45  46  47  \
0   1   1   1 NaN   1 NaN NaN   1   1   1 ...   1   1 NaN   1 NaN   1   1   1
1   1   1   1 NaN   1   1   1   1   1   1 ... NaN   1   1 NaN NaN   1   1   1
2   1   1   1   1   1   1   1   1   1   1 ...   1 NaN   1   1   1   1   1 NaN
3   1 NaN   1 NaN   1 NaN   1 NaN   1   1 ...   1   1   1   1 NaN   1   1   1
4   1   1   1   1   1   1   1   1 NaN   1 ... NaN   1   1   1   1   1   1   1

What is the % difference?

You can compute a distance metric as the percentage of values that differ between each pair of columns. The result shows the % difference between any two columns. (To compare rows instead, as in the question, apply the same code to the transpose, data.T.)

>>> zero_data = data.fillna(0)
>>> distance = lambda column1, column2: (column1 - column2).abs().sum() / len(column1)
>>> result = zero_data.apply(lambda col1: zero_data.apply(lambda col2: distance(col1, col2)))
>>> result.head()
     0     1     2     3     4     5     6     7     8     9   ...     40  \
0  0.00  0.36  0.33  0.37  0.32  0.41  0.35  0.33  0.39  0.33  ...   0.37
1  0.36  0.00  0.37  0.29  0.30  0.37  0.33  0.37  0.33  0.31  ...   0.35
2  0.33  0.37  0.00  0.36  0.29  0.38  0.40  0.34  0.30  0.28  ...   0.28
3  0.37  0.29  0.36  0.00  0.29  0.30  0.34  0.26  0.32  0.36  ...   0.36
4  0.32  0.30  0.29  0.29  0.00  0.31  0.35  0.29  0.29  0.25  ...   0.27
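
For 0/1 data, the % difference above coincides with the Hamming distance, so a vectorized alternative is possible. The following is a minimal sketch, assuming SciPy is installed and the frame has already been filled with zeroes; the seed and the names col_dist and row_dist are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
data = pd.DataFrame(np.where(rng.random((100, 50)) > 0.2, 1.0, np.nan))
zero_data = data.fillna(0)

# Nested-apply version from the answer (pairwise over columns).
distance = lambda c1, c2: (c1 - c2).abs().sum() / len(c1)
nested = zero_data.apply(lambda c1: zero_data.apply(lambda c2: distance(c1, c2)))

# pdist compares the rows of its input, so transpose to compare columns.
col_dist = pd.DataFrame(
    squareform(pdist(zero_data.T.to_numpy(), metric="hamming")),
    index=zero_data.columns, columns=zero_data.columns)

# Row-by-row version, which is what the question asks for.
row_dist = pd.DataFrame(
    squareform(pdist(zero_data.to_numpy(), metric="hamming")),
    index=zero_data.index, columns=zero_data.index)

print(np.allclose(nested.to_numpy(), col_dist.to_numpy()))
print(row_dist.shape)
```

pdist only computes each pair once, which is usually much faster than the nested apply on larger frames.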

What is the correlation coefficient?

Here, we use the Pearson correlation coefficient. This is a perfectly valid metric. Specifically, it reduces to the phi coefficient in the case of binary data.

>>> import scipy.stats
>>> zero_data = data.fillna(0)
>>> distance = lambda column1, column2: scipy.stats.pearsonr(column1, column2)[0]
>>> result = zero_data.apply(lambda col1: zero_data.apply(lambda col2: distance(col1, col2)))
>>> result.head()
         0         1         2         3         4         5         6   \
0  1.000000  0.013158  0.026262 -0.059786 -0.024293 -0.078056  0.054074
1  0.013158  1.000000 -0.093109  0.170159  0.043187  0.027425  0.108148
2  0.026262 -0.093109  1.000000 -0.124540 -0.048485 -0.064881 -0.161887
3 -0.059786  0.170159 -0.124540  1.000000  0.004245  0.184153  0.042524
4 -0.024293  0.043187 -0.048485  0.004245  1.000000  0.079196 -0.099834

Incidentally, this is the same result that you would get with the Spearman R coefficient as well.
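
The Pearson/Spearman equivalence on binary data can be checked with the built-in DataFrame.corr. This is a small sketch on freshly generated 0/1 data (the seed is arbitrary), not the exact frame used above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
zero_data = pd.DataFrame(np.where(rng.random((100, 50)) > 0.2, 1.0, 0.0))

# Column-by-column correlation matrices.
pearson = zero_data.corr(method="pearson")
spearman = zero_data.corr(method="spearman")

# With only two distinct values, ranking is an increasing linear map of the
# data, so the two coefficients agree up to floating-point error.
same = np.allclose(pearson.to_numpy(), spearman.to_numpy())
print(same)

# For row-by-row correlations, transpose first.
row_corr = zero_data.T.corr()
print(row_corr.shape)
```

Note that a row or column that is constant after filling still has zero variance, so its correlations remain NaN.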

What is the Euclidean distance?

>>> import numpy as np
>>> zero_data = data.fillna(0)
>>> distance = lambda column1, column2: np.linalg.norm(column1 - column2)
>>> result = zero_data.apply(lambda col1: zero_data.apply(lambda col2: distance(col1, col2)))
>>> result.head()
         0         1         2         3         4         5         6   \
0  0.000000  6.000000  5.744563  6.082763  5.656854  6.403124  5.916080
1  6.000000  0.000000  6.082763  5.385165  5.477226  6.082763  5.744563
2  5.744563  6.082763  0.000000  6.000000  5.385165  6.164414  6.324555
3  6.082763  5.385165  6.000000  0.000000  5.385165  5.477226  5.830952
4  5.656854  5.477226  5.385165  5.385165  0.000000  5.567764  5.916080
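
Since the question asks about distances between rows, here is a hedged sketch of the same Euclidean computation done row-wise with scipy.spatial.distance.pdist. The small frame and the name row_euclid are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Tiny illustrative frame with a string row index, as in the question.
data = pd.DataFrame(
    [[1, np.nan, 1, 1],
     [1, 1, np.nan, 1],
     [np.nan, 1, 1, 1]],
    index=["1", "10", "100"])
zero_data = data.fillna(0)

# pdist works on rows directly; squareform expands the condensed result
# into a symmetric matrix with zeros on the diagonal.
row_euclid = pd.DataFrame(
    squareform(pdist(zero_data.to_numpy(), metric="euclidean")),
    index=zero_data.index, columns=zero_data.index)

# Cross-check one pair against the lambda used in the answer.
check = np.linalg.norm(zero_data.loc["1"] - zero_data.loc["10"])
print(row_euclid)
```

Each pair of rows here differs in exactly two positions, so every off-diagonal entry is the square root of 2.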

By now, you should have a sense of the pattern. Create a distance method, then apply it pairwise to every column using

data.apply(lambda col1: data.apply(lambda col2: method(col1, col2)))

If your distance method relies on the presence of zeroes instead of NaNs, convert to zeroes using .fillna(0).
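
The pattern can also be wrapped in a small reusable function. The helper below, pairwise_columns, is hypothetical (it is not a pandas API). It assumes the metric is symmetric, so each pair is evaluated once and mirrored.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def pairwise_columns(df, metric, diagonal=0.0):
    """Hypothetical helper: apply a symmetric metric(col1, col2) to every pair of columns."""
    n = df.shape[1]
    out = np.full((n, n), diagonal, dtype=float)
    for i, j in combinations(range(n), 2):
        # Evaluate each unordered pair once and fill both triangles.
        out[i, j] = out[j, i] = metric(df.iloc[:, i], df.iloc[:, j])
    return pd.DataFrame(out, index=df.columns, columns=df.columns)


rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
zero_data = pd.DataFrame(np.where(rng.random((20, 6)) > 0.2, 1.0, 0.0))

euclid = lambda c1, c2: np.linalg.norm(c1 - c2)
by_columns = pairwise_columns(zero_data, euclid)
by_rows = pairwise_columns(zero_data.T, euclid)  # transpose to compare rows

# Reference: the nested-apply pattern from the answer.
nested = zero_data.apply(lambda c1: zero_data.apply(lambda c2: euclid(c1, c2)))
print(np.allclose(by_columns.to_numpy(), nested.to_numpy()))
```

For a similarity such as a correlation, pass diagonal=1.0 instead of the default 0.0.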

Answer 2

Answered by maparent

A proposal to improve the excellent answer from @s-anand for Euclidean distance: instead of

import numpy as np
zero_data = data.fillna(0)
distance = lambda column1, column2: np.linalg.norm(column1 - column2)

we can apply fillna after the subtraction, so that it fills only the missing differences:

distance = lambda column1, column2: np.linalg.norm((column1 - column2).fillna(0))

This way, the distance on missing dimensions will not be counted.
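
A tiny worked example of the difference between the two variants, using made-up vectors:

```python
import numpy as np
import pandas as pd

# Illustrative vectors: they agree wherever both are observed.
col1 = pd.Series([1.0, np.nan, 1.0, 1.0])
col2 = pd.Series([1.0, 1.0, np.nan, np.nan])

# Filling before subtracting treats every NaN as a real 0,
# so the three positions with a missing value each count as a mismatch.
fill_first = np.linalg.norm(col1.fillna(0) - col2.fillna(0))

# Filling after subtracting zeroes out any dimension where either value
# is missing, so only the jointly observed position contributes.
fill_after = np.linalg.norm((col1 - col2).fillna(0))

print(fill_first)  # sqrt(3), about 1.732
print(fill_after)  # 0.0
```

One caveat with this approach: two vectors that share few observed dimensions will look close simply because most dimensions are skipped, so it may be worth tracking the overlap count as well.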

Answer 3

Answered by MyCarta

This is my numpy-only version of @S Anand's fantastic answer, which I put together in order to help myself understand his explanation better.

Happy to share it with a short, reproducible example:

# Preliminaries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Get iris dataset into a DataFrame
from sklearn.datasets import load_iris
iris = load_iris()
iris_df = pd.DataFrame(data= np.c_[iris['data'], iris['target']],
                     columns= iris['feature_names'] + ['target'])

Let's try scipy.stats.pearsonr first.

Executing:

from scipy.stats import pearsonr
distance = lambda column1, column2: pearsonr(column1, column2)[0]
rslt = iris_df.apply(lambda col1: iris_df.apply(lambda col2: distance(col1, col2)))
pd.options.display.float_format = '{:,.2f}'.format
rslt

returns:

[Output: 5 × 5 table of pairwise Pearson coefficients for the iris_df columns]

and:

rslt_np = np.apply_along_axis(lambda col1: np.apply_along_axis(lambda col2: pearsonr(col1, col2)[0], 
                                                               axis = 0, arr=iris_df), 
                              axis =0, arr=iris_df)
float_formatter = lambda x: "%.2f" % x
np.set_printoptions(formatter={'float_kind':float_formatter})
rslt_np

returns:

array([[1.00, -0.12, 0.87, 0.82, 0.78],
       [-0.12, 1.00, -0.43, -0.37, -0.43],
       [0.87, -0.43, 1.00, 0.96, 0.95],
       [0.82, -0.37, 0.96, 1.00, 0.96],
       [0.78, -0.43, 0.95, 0.96, 1.00]])
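
For the Pearson case specifically, NumPy can produce the whole matrix in one call with np.corrcoef. The sketch below checks it against the nested apply_along_axis pattern. It uses random stand-in data rather than the iris frame so that it runs without scikit-learn.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
arr = rng.normal(size=(150, 5))  # stand-in for the iris_df values
arr[:, 1] += arr[:, 0]           # make two columns correlated

# Nested pattern from the answer: pearsonr for every pair of columns.
nested = np.apply_along_axis(
    lambda col1: np.apply_along_axis(
        lambda col2: pearsonr(col1, col2)[0], axis=0, arr=arr),
    axis=0, arr=arr)

# rowvar=False tells corrcoef that variables are in columns.
direct = np.corrcoef(arr, rowvar=False)

print(np.allclose(nested, direct))
```

The nested form is still the one to reach for when the metric has no vectorized equivalent.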

As a second example, let's try the distance correlation from the dcor library.

Executing:

import dcor
dist_corr = lambda column1, column2: dcor.distance_correlation(column1, column2)
rslt = iris_df.apply(lambda col1: iris_df.apply(lambda col2: dist_corr(col1, col2)))
pd.options.display.float_format = '{:,.2f}'.format
rslt

returns:

[Output: 5 × 5 table of pairwise distance correlations for the iris_df columns]

while:

rslt_np = np.apply_along_axis(lambda col1: np.apply_along_axis(lambda col2: dcor.distance_correlation(col1, col2), 
                                                               axis = 0, arr=iris_df), 
                              axis =0, arr=iris_df)
float_formatter = lambda x: "%.2f" % x
np.set_printoptions(formatter={'float_kind':float_formatter})
rslt_np

returns:

array([[1.00, 0.31, 0.86, 0.83, 0.78],
       [0.31, 1.00, 0.54, 0.51, 0.51],
       [0.86, 0.54, 1.00, 0.97, 0.95],
       [0.83, 0.51, 0.97, 1.00, 0.95],
       [0.78, 0.51, 0.95, 0.95, 1.00]])
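
To show what distance correlation measures, here is a NumPy-only sketch of the standard (biased) sample estimator for two 1-D vectors. It is written from the textbook definition and is an assumption about what the library computes by default, not a copy of the dcor implementation. The function name is local to this example.

```python
import numpy as np


def distance_correlation(x, y):
    """Biased sample distance correlation of two 1-D arrays (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Pairwise absolute differences within each variable.
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    # Double-center each distance matrix.
    A = a - a.mean(axis=0, keepdims=True) - a.mean(axis=1, keepdims=True) + a.mean()
    B = b - b.mean(axis=0, keepdims=True) - b.mean(axis=1, keepdims=True) + b.mean()
    dcov2 = (A * B).mean()                      # squared distance covariance
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    if denom == 0:
        return 0.0                              # a constant input has no dependence to measure
    return float(np.sqrt(max(dcov2, 0.0) / denom))


x = np.linspace(-1, 1, 101)
y = x ** 2  # fully dependent on x, but not linearly

pearson_xy = np.corrcoef(x, y)[0, 1]
dcor_xy = distance_correlation(x, y)
dcor_xx = distance_correlation(x, x)
print(pearson_xy, dcor_xy, dcor_xx)
```

Here the Pearson coefficient is essentially zero while the distance correlation is clearly positive, which is the reason to consider it when relationships may be nonlinear.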
