Distance calculation between rows in a Pandas DataFrame using a distance matrix

Question

Asked by cmiller8

I have the following Pandas DataFrame:

In [31]:
import pandas as pd
sample = pd.DataFrame({'Sym1': ['a','a','a','d'],'Sym2':['a','c','b','b'],'Sym3':['a','c','b','d'],'Sym4':['b','b','b','a']},index=['Item1','Item2','Item3','Item4'])
In [32]: print(sample)
Out [32]:
      Sym1 Sym2 Sym3 Sym4
Item1    a    a    a    b
Item2    a    c    c    b
Item3    a    b    b    b
Item4    d    b    d    a

and I want to find an elegant way to get the distance between each pair of Items according to this distance matrix:

In [34]:
DistMatrix = pd.DataFrame({'a': [0,0,0.67,1.34],'b':[0,0,0,0.67],'c':[0.67,0,0,0],'d':[1.34,0.67,0,0]},index=['a','b','c','d'])
print(DistMatrix)
Out[34]:
      a     b     c     d
a  0.00  0.00  0.67  1.34
b  0.00  0.00  0.00  0.67
c  0.67  0.00  0.00  0.00
d  1.34  0.67  0.00  0.00

For example, comparing Item1 to Item2 would compare aaab -> accb; using the distance matrix this would be 0+0.67+0.67+0=1.34

Ideal output:

       Item1  Item2  Item3  Item4
Item1      0   1.34      0   2.68
Item2   1.34      0      0   1.34
Item3      0      0      0   2.01
Item4   2.68   1.34   2.01      0

Answer 1

Accepted answer by behzad.nouri

This does twice as much work as needed, but technically it also works for non-symmetric distance matrices (whatever that is supposed to mean):

pd.DataFrame({idx1: {idx2: sum(DistMatrix[x][y]
                               for (x, y) in zip(row1, row2))
                     for (idx2, row2) in sample.iterrows()}
              for (idx1, row1) in sample.iterrows()})

You can make it more readable by writing it in pieces:

# a helper function to compute distance of two items
dist = lambda xs, ys: sum( DistMatrix[ x ][ y ] for ( x, y ) in zip( xs, ys ) )

# a second helper function to compute distances from a given item
xdist = lambda x: { idx: dist( x, y ) for (idx, y) in sample.iterrows( ) }

# the pairwise distance matrix
pd.DataFrame( { idx: xdist( x ) for ( idx, x ) in sample.iterrows( ) } )
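
For larger frames, the same lookup-and-sum can be done without Python-level loops. The following is only a sketch, not part of the original answer: it assumes every symbol in `sample` appears in `DistMatrix.index`, and the names `D`, `codes`, `pairwise`, and `result` are introduced here for illustration. It encodes each symbol as its position in the distance table and uses NumPy fancy indexing with broadcasting.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame(
    {'Sym1': ['a', 'a', 'a', 'd'],
     'Sym2': ['a', 'c', 'b', 'b'],
     'Sym3': ['a', 'c', 'b', 'd'],
     'Sym4': ['b', 'b', 'b', 'a']},
    index=['Item1', 'Item2', 'Item3', 'Item4'])

DistMatrix = pd.DataFrame(
    {'a': [0, 0, 0.67, 1.34],
     'b': [0, 0, 0, 0.67],
     'c': [0.67, 0, 0, 0],
     'd': [1.34, 0.67, 0, 0]},
    index=['a', 'b', 'c', 'd'])

# Order the columns like the index so a position means the same symbol on both axes.
D = DistMatrix.loc[:, DistMatrix.index].to_numpy()

# Replace every symbol in `sample` by its integer position in DistMatrix.index
# (assumption: no symbol is missing from the table; get_indexer returns -1 for those).
codes = DistMatrix.index.get_indexer(sample.to_numpy().ravel()).reshape(sample.shape)

# pairwise[i, j] = sum over columns k of D[codes[i, k], codes[j, k]]
pairwise = D[codes[:, None, :], codes[None, :, :]].sum(axis=2)

result = pd.DataFrame(pairwise, index=sample.index, columns=sample.index)
print(result)
```

The intermediate array has shape (n_items, n_items, n_columns), so memory grows quadratically with the number of rows; for very large inputs the loop-based version above may still be the more practical choice.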

Answer 2

Answer by shadowtalker

This is an old question, but there is a Scipy function that does this:

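from scipy.spatial.distance import pdist, squareform

# note: built-in metrics such as 'euclidean' expect numeric data, so a
# string-valued frame like the one in the question must be encoded first
distances = pdist(sample.values, metric='euclidean')
dist_matrix = squareform(distances)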

pdist operates on NumPy arrays, and DataFrame.values is the underlying NumPy ndarray representation of the data frame. The metric argument allows you to select one of several built-in distance metrics, or you can pass in any binary function to use a custom distance. It's very powerful and, in my experience, very fast. The result is a "flat" array that consists only of the upper triangle of the distance matrix (because it's symmetric), not including the diagonal (because it's always 0). squareform then translates this flattened form into a full matrix.

The docs have more info, including a mathematical rundown of the many built-in distance functions.
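
To connect this to the question's data, here is a hedged sketch of the custom-metric route. The frame in the question holds strings, so a built-in numeric metric does not apply to it directly; one option (an assumption, not something this answer spells out) is to encode each symbol as its position in `DistMatrix.index` and pass a callable that looks up the table. The names `D`, `codes`, and `lookup_distance` are hypothetical helpers introduced for illustration.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

sample = pd.DataFrame(
    {'Sym1': ['a', 'a', 'a', 'd'],
     'Sym2': ['a', 'c', 'b', 'b'],
     'Sym3': ['a', 'c', 'b', 'd'],
     'Sym4': ['b', 'b', 'b', 'a']},
    index=['Item1', 'Item2', 'Item3', 'Item4'])

DistMatrix = pd.DataFrame(
    {'a': [0, 0, 0.67, 1.34],
     'b': [0, 0, 0, 0.67],
     'c': [0.67, 0, 0, 0],
     'd': [1.34, 0.67, 0, 0]},
    index=['a', 'b', 'c', 'd'])

# Lookup table as a plain array, with columns ordered like the index.
D = DistMatrix.loc[:, DistMatrix.index].to_numpy()

# Encode symbols as integer positions (assumes every symbol is in the table).
codes = DistMatrix.index.get_indexer(sample.to_numpy().ravel()).reshape(sample.shape)


def lookup_distance(u, v):
    """Hypothetical custom metric: sum the table entries for each symbol pair."""
    return D[u.astype(int), v.astype(int)].sum()


# pdist receives the rows as float vectors, hence the cast back to int above.
condensed = pdist(codes.astype(float), metric=lookup_distance)
result = pd.DataFrame(squareform(condensed), index=sample.index, columns=sample.index)
print(result)
```

Because pdist only evaluates each unordered pair once and squareform fills the diagonal with zeros, this sketch is only appropriate when the lookup table is symmetric with a zero diagonal, as it is in the question.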

Answer 3

Answer by Michelle Owen

For large data, I found a faster way to do this. Assume your data is already a NumPy array named a.

from sklearn.metrics.pairwise import euclidean_distances
dist = euclidean_distances(a, a)

Below is an experiment comparing the time needed by the two approaches:

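import time

import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics.pairwise import euclidean_distances

a = np.random.rand(1000, 1000)

# approach 1: scipy pdist + squareform
time1 = time.time()
distances = pdist(a, metric='euclidean')
dist_matrix = squareform(distances)
time2 = time.time()
time2 - time1  # 0.3639109134674072

# approach 2: sklearn euclidean_distances
time1 = time.time()
dist = euclidean_distances(a, a)
time2 = time.time()
time2 - time1  # 0.08735871315002441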
