python中最快的成对距离度量
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/20277982/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Fastest pairwise distance metric in python
提问by roblanf
I have an 1D array of numbers, and want to calculate all pairwise euclidean distances. I have a method (thanks to SO) of doing this with broadcasting, but it's inefficient because it calculates each distance twice. And it doesn't scale well.
我有一个一维数字数组,想计算所有成对的欧几里德距离。我有一种方法(感谢 SO)通过广播来做到这一点,但效率低下,因为它计算每个距离两次。它不能很好地扩展。
Here's an example that gives me what I want with an array of 1000 numbers.
这是一个示例,它为我提供了一个包含 1000 个数字的数组。
import numpy as np
import random
r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
dists = np.abs(r - r[:, None])
What's the fastest implementation in scipy/numpy/scikit-learn that I can use to do this, given that it has to scale to situations where the 1D array has >10k values.
scipy/numpy/scikit-learn 中我可以用来执行此操作的最快实现是什么,因为它必须扩展到一维数组具有 >10k 值的情况。
Note: the matrix is symmetric, so I'm guessing that it's possible to get at least a 2x speedup by addressing that, I just don't know how.
注意:矩阵是对称的,所以我猜通过解决这个问题可以获得至少 2 倍的加速,我只是不知道如何。
采纳答案by roblanf
Neither of the other answers quite answered the question - 1 was in Cython, one was slower. But both provided very useful hints. Following up on them suggests that scipy.spatial.distance.pdistis the way to go.
其他答案都没有完全回答这个问题 - 1 在 Cython 中,一个较慢。但两者都提供了非常有用的提示。跟进他们表明这scipy.spatial.distance.pdist是要走的路。
Here's some code:
这是一些代码:
import numpy as np
import random
import sklearn.metrics.pairwise
import scipy.spatial.distance
r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
c = r[:, None]
def option1(r):
dists = np.abs(r - r[:, None])
def option2(r):
dists = scipy.spatial.distance.pdist(r, 'cityblock')
def option3(r):
dists = sklearn.metrics.pairwise.manhattan_distances(r)
Timing with IPython:
使用 IPython 计时:
In [36]: timeit option1(r)
100 loops, best of 3: 5.31 ms per loop
In [37]: timeit option2(c)
1000 loops, best of 3: 1.84 ms per loop
In [38]: timeit option3(c)
100 loops, best of 3: 11.5 ms per loop
I didn't try the Cython implementation (I can't use it for this project), but comparing my results to the other answer that did, it looks like scipy.spatial.distance.pdistis roughly a third slower than the Cython implementation (taking into account the different machines by benchmarking on the np.abs solution).
我没有尝试 Cython 实现(我不能在这个项目中使用它),但是将我的结果与其他答案进行比较,它看起来scipy.spatial.distance.pdist比 Cython 实现慢了大约三分之一(考虑到不同的机器通过对 np.abs 解决方案进行基准测试)。
回答by Saullo G. P. Castro
Here is a Cython implementation that gives more than 3X speed improvement for this example on my computer. This timing should be reviewed for bigger arrays tough, because the BLAS routines can probably scale much better than this rather naive code.
这是一个 Cython 实现,它在我的计算机上为这个示例提供了 3 倍以上的速度改进。对于更大的数组,应该检查这个时间,因为 BLAS 例程可能比这个相当幼稚的代码更好地扩展。
I know you asked for something inside scipy/numpy/scikit-learn, but maybe this will open new possibilities for you:
我知道你在 scipy/numpy/scikit-learn 中要求一些东西,但也许这会为你打开新的可能性:
File my_cython.pyx:
文件my_cython.pyx:
import numpy as np
cimport numpy as np
import cython
cdef extern from "math.h":
double abs(double t)
@cython.wraparound(False)
@cython.boundscheck(False)
def pairwise_distance(np.ndarray[np.double_t, ndim=1] r):
cdef int i, j, c, size
cdef np.ndarray[np.double_t, ndim=1] ans
size = sum(range(1, r.shape[0]+1))
ans = np.empty(size, dtype=r.dtype)
c = -1
for i in range(r.shape[0]):
for j in range(i, r.shape[0]):
c += 1
ans[c] = abs(r[i] - r[j])
return ans
The answer is a 1-D array containing all non-repeated evaluations.
答案是包含所有非重复评估的一维数组。
To import into Python:
要导入 Python:
import numpy as np
import random
import pyximport; pyximport.install()
from my_cython import pairwise_distance
r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)], dtype=float)
def solOP(r):
return np.abs(r - r[:, None])
Timing with IPython:
使用 IPython 计时:
In [2]: timeit solOP(r)
100 loops, best of 3: 7.38 ms per loop
In [3]: timeit pairwise_distance(r)
1000 loops, best of 3: 1.77 ms per loop
回答by cyborg
Using half the memory, but 6 times slower than np.abs(r - r[:, None]):
使用一半的内存,但比 慢 6 倍np.abs(r - r[:, None]):
triu = np.triu_indices(r.shape[0],1)
dists2 = abs(r[triu[1]]-r[triu[0]])

