python中最快的成对距离度量

Question

提问by roblanf

I have an 1D array of numbers, and want to calculate all pairwise euclidean distances. I have a method (thanks to SO) of doing this with broadcasting, but it's inefficient because it calculates each distance twice. And it doesn't scale well.

我有一个一维数字数组，想计算所有成对的欧几里德距离。我有一种方法（感谢 SO）通过广播来做到这一点，但效率低下，因为它计算每个距离两次。它不能很好地扩展。

Here's an example that gives me what I want with an array of 1000 numbers.

这是一个示例，它为我提供了一个包含 1000 个数字的数组。

import numpy as np
import random
r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
dists = np.abs(r - r[:, None])

What's the fastest implementation in scipy/numpy/scikit-learn that I can use to do this, given that it has to scale to situations where the 1D array has >10k values.

scipy/numpy/scikit-learn 中我可以用来执行此操作的最快实现是什么，因为它必须扩展到一维数组具有 >10k 值的情况。

Note: the matrix is symmetric, so I'm guessing that it's possible to get at least a 2x speedup by addressing that, I just don't know how.

注意：矩阵是对称的，所以我猜通过解决这个问题可以获得至少 2 倍的加速，我只是不知道如何。

Answer 1

采纳答案by roblanf

Neither of the other answers quite answered the question - 1 was in Cython, one was slower. But both provided very useful hints. Following up on them suggests that scipy.spatial.distance.pdistis the way to go.

其他答案都没有完全回答这个问题 - 1 在 Cython 中，一个较慢。但两者都提供了非常有用的提示。跟进他们表明这scipy.spatial.distance.pdist是要走的路。

Here's some code:

这是一些代码：

import numpy as np
import random
import sklearn.metrics.pairwise
import scipy.spatial.distance

r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
c = r[:, None]

def option1(r):
    dists = np.abs(r - r[:, None])

def option2(r):
    dists = scipy.spatial.distance.pdist(r, 'cityblock')

def option3(r):
    dists = sklearn.metrics.pairwise.manhattan_distances(r)

Timing with IPython:

使用 IPython 计时：

In [36]: timeit option1(r)
100 loops, best of 3: 5.31 ms per loop

In [37]: timeit option2(c)
1000 loops, best of 3: 1.84 ms per loop

In [38]: timeit option3(c)
100 loops, best of 3: 11.5 ms per loop

I didn't try the Cython implementation (I can't use it for this project), but comparing my results to the other answer that did, it looks like scipy.spatial.distance.pdistis roughly a third slower than the Cython implementation (taking into account the different machines by benchmarking on the np.abs solution).

我没有尝试 Cython 实现（我不能在这个项目中使用它），但是将我的结果与其他答案进行比较，它看起来scipy.spatial.distance.pdist比 Cython 实现慢了大约三分之一（考虑到不同的机器通过对 np.abs 解决方案进行基准测试）。

Answer 2

回答by Saullo G. P. Castro

Here is a Cython implementation that gives more than 3X speed improvement for this example on my computer. This timing should be reviewed for bigger arrays tough, because the BLAS routines can probably scale much better than this rather naive code.

这是一个 Cython 实现，它在我的计算机上为这个示例提供了 3 倍以上的速度改进。对于更大的数组，应该检查这个时间，因为 BLAS 例程可能比这个相当幼稚的代码更好地扩展。

I know you asked for something inside scipy/numpy/scikit-learn, but maybe this will open new possibilities for you:

我知道你在 scipy/numpy/scikit-learn 中要求一些东西，但也许这会为你打开新的可能性：

File my_cython.pyx:

文件my_cython.pyx：

import numpy as np
cimport numpy as np
import cython

cdef extern from "math.h":
    double abs(double t)

@cython.wraparound(False)
@cython.boundscheck(False)
def pairwise_distance(np.ndarray[np.double_t, ndim=1] r):
    cdef int i, j, c, size
    cdef np.ndarray[np.double_t, ndim=1] ans
    size = sum(range(1, r.shape[0]+1))
    ans = np.empty(size, dtype=r.dtype)
    c = -1
    for i in range(r.shape[0]):
        for j in range(i, r.shape[0]):
            c += 1
            ans[c] = abs(r[i] - r[j])
    return ans

The answer is a 1-D array containing all non-repeated evaluations.

答案是包含所有非重复评估的一维数组。

To import into Python:

要导入 Python：

import numpy as np
import random

import pyximport; pyximport.install()
from my_cython import pairwise_distance

r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)], dtype=float)

def solOP(r):
    return np.abs(r - r[:, None])

Timing with IPython:

使用 IPython 计时：

In [2]: timeit solOP(r)
100 loops, best of 3: 7.38 ms per loop

In [3]: timeit pairwise_distance(r)
1000 loops, best of 3: 1.77 ms per loop

Answer 3

回答by cyborg

Using half the memory, but 6 times slower than np.abs(r - r[:, None]):

使用一半的内存，但比慢 6 倍np.abs(r - r[:, None])：

triu = np.triu_indices(r.shape[0],1)
dists2 = abs(r[triu[1]]-r[triu[0]])

python中最快的成对距离度量

提问by roblanf

采纳答案by roblanf

回答by Saullo G. P. Castro

回答by cyborg

相关推荐

最近更新

标签

python中最快的成对距离度量

提问by roblanf

采纳答案by roblanf

回答by Saullo G. P. Castro

回答by cyborg

相关推荐

Python 使用循环创建一个列表，用 float() 填充它

Python：内联 if 语句 else 什么都不做

Python 使用 Flask 从 WTForm 验证电子邮件

Python 熊猫计数（不同）等价物

相关推荐

最近更新

标签