Python: Efficiently Calculating a Euclidean Distance Matrix Using Numpy

Note: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original address and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/22720864/

Time: 2020-08-19 01:29:11  Source: igfitidea

Efficiently Calculating a Euclidean Distance Matrix Using Numpy

Tags: python, numpy, matrix, performance, euclidean-distance

Asked by Wes Modes

I have a set of points in 2-dimensional space and need to calculate the distance from each point to each other point.

I have a relatively small number of points, maybe at most 100. But since I need to do it often and rapidly in order to determine the relationships between these moving points, and since I'm aware that iterating through the points could be as bad as O(n^2) complexity, I'm looking for ways to take advantage of numpy's matrix magic (or scipy).

As it stands in my code, the coordinates of each object are stored in its class. However, I could also update them in a numpy array when I update the class coordinate.

class Cell(object):
    """Represents one object in the field."""
    def __init__(self,id,x=0,y=0):
        self.m_id = id
        self.m_x = x
        self.m_y = y

It occurs to me to create a Euclidean distance matrix to prevent duplication, but perhaps you have a cleverer data structure.

I'm open to pointers to nifty algorithms as well.

Also, I note that there are similar questions dealing with Euclidean distance and numpy but didn't find any that directly address this question of efficiently populating a full distance matrix.

Accepted answer by Kiwi

You can take advantage of the complex type:

# build a complex array of your cells
z = np.array([complex(c.m_x, c.m_y) for c in cells])

First solution

# mesh this array so that you will have all combinations
m, n = np.meshgrid(z, z)
# get the distance via the norm
out = abs(m-n)

Second solution

Meshing is the main idea. But numpy is clever, so you don't have to generate m and n. Just compute the difference using a transposed version of z. The mesh is done automatically through broadcasting:

out = abs(z[..., np.newaxis] - z)
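
To make the broadcasting step concrete, here is a minimal sketch that prints nothing but keeps the intermediate shapes in comments and checks the result against the explicit meshgrid version. The three sample points are illustrative values, not data from the question.

```python
import numpy as np

# Three sample points encoded as complex numbers (x + yj); illustrative values.
z = np.array([0 + 0j, 2 + 1j, -1 + 4j])

col = z[..., np.newaxis]   # shape (3, 1): one point per row
diff = col - z             # (3, 1) minus (3,) broadcasts to (3, 3)
out = abs(diff)            # modulus of each difference = Euclidean distance

# The explicit meshgrid version yields the same matrix.
m, n = np.meshgrid(z, z)
out_mesh = abs(m - n)
```

Here out[i, j] is the distance between point i and point j, so the matrix is symmetric with zeros on the diagonal.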

Third solution

And if z is directly set as a 2-dimensional array, you can use z.T instead of the weird z[..., np.newaxis]. So finally, your code will look like this:

z = np.array([[complex(c.m_x, c.m_y) for c in cells]]) # notice the [[ ... ]]
out = abs(z.T-z)
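
Put together with the Cell class from the question, the third solution might look like the sketch below. The distance_matrix helper and the sample coordinates are hypothetical, introduced only for illustration.

```python
import numpy as np

class Cell(object):
    """Represents one object in the field (same layout as in the question)."""
    def __init__(self, id, x=0, y=0):
        self.m_id = id
        self.m_x = x
        self.m_y = y

def distance_matrix(cells):
    """Hypothetical helper: pairwise Euclidean distances between cells."""
    z = np.array([[complex(c.m_x, c.m_y) for c in cells]])  # shape (1, n)
    return abs(z.T - z)                                      # shape (n, n)

# Sample cells with made-up coordinates.
cells = [Cell(0, 0, 0), Cell(1, 3, 4), Cell(2, 6, 8)]
dist = distance_matrix(cells)
# dist[i, j] is the distance between cells[i] and cells[j]
```

Since the points move, the array has to be rebuilt (or updated in place) before each call; for around 100 points that rebuild is likely to be a noticeable share of the total cost.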

Example

>>> z = np.array([[0.+0.j, 2.+1.j, -1.+4.j]])
>>> abs(z.T-z)
array([[ 0.        ,  2.23606798,  4.12310563],
       [ 2.23606798,  0.        ,  4.24264069],
       [ 4.12310563,  4.24264069,  0.        ]])

As a complement, you may want to remove duplicates afterwards by taking the upper triangle:

>>> np.triu(out)
array([[ 0.        ,  2.23606798,  4.12310563],
       [ 0.        ,  0.        ,  4.24264069],
       [ 0.        ,  0.        ,  0.        ]])
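
np.triu keeps the square shape and fills the lower half with zeros. If a flat list of unique pairs is more convenient, one possible alternative (not part of the original answer) is np.triu_indices, sketched here on the same three points:

```python
import numpy as np

z = np.array([[0 + 0j, 2 + 1j, -1 + 4j]])
out = abs(z.T - z)

# Index pairs (i, j) with i < j; k=1 skips the zero diagonal.
i, j = np.triu_indices(out.shape[0], k=1)
pair_dist = out[i, j]   # 1-D array, one distance per unordered pair

for a, b, d in zip(i, j, pair_dist):
    print(f"points {a} and {b}: {d:.4f}")
```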

Some benchmarks

>>> timeit.timeit('abs(z.T-z)', setup='import numpy as np;z = np.array([[0.+0.j, 2.+1.j, -1.+4.j]])')
4.645645342274779
>>> timeit.timeit('abs(z[..., np.newaxis] - z)', setup='import numpy as np;z = np.array([0.+0.j, 2.+1.j, -1.+4.j])')
5.049334864854522
>>> timeit.timeit('m, n = np.meshgrid(z, z); abs(m-n)', setup='import numpy as np;z = np.array([0.+0.j, 2.+1.j, -1.+4.j])')
22.489568296184686
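
The timings above use only three points, while the question mentions up to about 100. A sketch for repeating the comparison at that size follows; the seed, the point count and the number of repetitions are arbitrary assumptions, and the absolute numbers will depend on the machine and the numpy version.

```python
import timeit
import numpy as np

rng = np.random.RandomState(0)      # fixed seed, arbitrary choice
pts = rng.rand(100, 2)              # 100 random 2-D points (assumed size)
z_row = (pts[:, 0] + 1j * pts[:, 1])[np.newaxis, :]   # shape (1, 100)
z_flat = z_row[0]                                      # shape (100,)

def transposed():
    return abs(z_row.T - z_row)

def newaxis():
    return abs(z_flat[..., np.newaxis] - z_flat)

def meshed():
    m, n = np.meshgrid(z_flat, z_flat)
    return abs(m - n)

# Time each variant over the same number of calls.
for fn in (transposed, newaxis, meshed):
    seconds = timeit.timeit(fn, number=1000)
    print(f"{fn.__name__}: {seconds:.4f} s per 1000 calls")
```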

Answer by shx2

Here is how you can do it using numpy:

import numpy as np

x = np.array([0,1,2])
y = np.array([2,4,6])

# take advantage of broadcasting, to make a 2dim array of diffs
dx = x[..., np.newaxis] - x[np.newaxis, ...]
dy = y[..., np.newaxis] - y[np.newaxis, ...]
dx
=> array([[ 0, -1, -2],
          [ 1,  0, -1],
          [ 2,  1,  0]])

# stack in one array, to speed up calculations
d = np.array([dx,dy])
d.shape
=> (2, 3, 3)

Now all that is left is computing the L2-norm along the 0-axis (as discussed here):

(d**2).sum(axis=0)**0.5
=> array([[ 0.        ,  2.23606798,  4.47213595],
          [ 2.23606798,  0.        ,  2.23606798],
          [ 4.47213595,  2.23606798,  0.        ]])
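
The same norm can also be written with library calls instead of the explicit square, sum and root. The sketch below shows two variants that should give the same matrix, np.hypot on the two difference arrays and np.linalg.norm along axis 0 of the stacked array; neither appears in the original answer.

```python
import numpy as np

x = np.array([0, 1, 2])
y = np.array([2, 4, 6])

dx = x[:, np.newaxis] - x[np.newaxis, :]
dy = y[:, np.newaxis] - y[np.newaxis, :]

# Element-wise sqrt(dx**2 + dy**2), without building the stacked array.
dist_hypot = np.hypot(dx, dy)

# Or keep the stacked (2, n, n) array and take the vector norm along axis 0.
d = np.array([dx, dy])
dist_norm = np.linalg.norm(d, axis=0)
```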

Answer by Sturla Molden

If you don't need the full distance matrix, you will be better off using a kd-tree. Consider scipy.spatial.cKDTree or sklearn.neighbors.KDTree. This is because a kd-tree can find k-nearest neighbors in O(n log n) time, and therefore you avoid the O(n**2) complexity of computing all n by n distances.
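
A minimal sketch of the kd-tree route with scipy.spatial.cKDTree is shown below. The random points, the seed and the radius are assumptions made for illustration; the question's own data would replace them.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.RandomState(0)   # fixed seed, arbitrary choice
points = rng.rand(100, 2)        # 100 random 2-D points (assumed data)

tree = cKDTree(points)

# k=2 because the closest hit for each query point is the point itself
# (distance 0); column 1 holds the nearest *other* point.
dist, idx = tree.query(points, k=2)
nearest_dist = dist[:, 1]
nearest_idx = idx[:, 1]

# All pairs closer than a radius, without forming the n-by-n matrix.
close_pairs = tree.query_pairs(r=0.05)   # set of (i, j) tuples with i < j
```

Because the points in the question move, the tree would need to be rebuilt after each update, so for only about 100 points it is worth timing this against the plain broadcasting approach.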

Answer by Rich Pauloo

Jake Vanderplas gives this example using broadcasting in the Python Data Science Handbook, which is very similar to what @shx2 proposed.

import numpy as np
rand = np.random.RandomState(42)
X = rand.rand(3, 2)
dist_sq = np.sum((X[:, np.newaxis, :] - X[np.newaxis, :, :]) ** 2, axis=-1)

dist_sq
array([[0.        , 0.18543317, 0.81602495],
       [0.18543317, 0.        , 0.22819282],
       [0.81602495, 0.22819282, 0.        ]])
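
Note that dist_sq holds squared distances. If the actual Euclidean distances or the nearest neighbor of each point are needed, a possible continuation is sketched below; the variable names diff, dist, order and nearest are introduced here for illustration.

```python
import numpy as np

rand = np.random.RandomState(42)
X = rand.rand(3, 2)

diff = X[:, np.newaxis, :] - X[np.newaxis, :, :]   # shape (3, 3, 2)
dist_sq = np.sum(diff ** 2, axis=-1)               # shape (3, 3)
dist = np.sqrt(dist_sq)                            # actual Euclidean distances

# Sorting each row by distance: column 0 is the point itself,
# column 1 is its nearest neighbor.
order = np.argsort(dist_sq, axis=1)
nearest = order[:, 1]
```

Sorting on the squared distances is enough for ranking neighbors, since the square root does not change the order.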