finding nearest items across two lists/arrays in Python
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/15363419/
Asked by Jaime
I have two numpy arrays x and y containing float values. For each value in x, I want to find the closest element in y, without reusing elements from y. The output should be a 1-1 mapping from indices of elements in x to indices of elements in y. Here's a bad way to do it that relies on sorting: it removes each paired element from the list. Without sorting this would be bad because the pairing would depend on the order of the original input arrays.
import operator
import numpy as np

def min_i(values):
    min_index, min_value = min(enumerate(values),
                               key=operator.itemgetter(1))
    return min_index, min_value

# unsorted elements
unsorted_x = np.random.randn(10) * 10
unsorted_y = np.random.randn(10) * 10

# sort lists
x = np.sort(unsorted_x)
y = np.sort(unsorted_y)

pairs = []
indx_to_search = list(range(len(y)))

for x_indx, x_item in enumerate(x):
    if len(indx_to_search) == 0:
        print("ran out of items to match...")
        break
    # until match is found look for closest item
    possible_values = y[indx_to_search]
    nearest_indx, nearest_item = min_i(possible_values)
    orig_indx = indx_to_search[nearest_indx]
    # remove it
    indx_to_search.remove(orig_indx)
    pairs.append((x_indx, orig_indx))

print("paired items: ")
for k, v in pairs:
    print(x[k], " paired with ", y[v])
I'd prefer to do it without sorting the elements first, but if they are sorted then I want to get the indices in the original, unsorted lists unsorted_x and unsorted_y. What's the best way to do this in numpy/scipy/Python, or using pandas? Thanks.
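If reusing elements of y were allowed, pandas could do the nearest lookup in a single call; this is a sketch of that simpler variant with made-up values (merge_asof does not enforce the 1-1 constraint asked for here, so it only covers the with-reuse case):

```python
import pandas as pd

x = pd.DataFrame({"x": [0.2, 1.1, 3.7]})
y = pd.DataFrame({"y": [0.0, 1.0, 2.5, 4.0]})

# merge_asof requires both frames to be sorted on their join keys
matched = pd.merge_asof(x.sort_values("x"), y.sort_values("y"),
                        left_on="x", right_on="y", direction="nearest")
# each x is paired with its nearest y, but a y value may be matched twice
print(matched)
```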
Edit: to clarify, I'm not trying to find the best fit across all elements (not minimizing the sum of distances, for example) but rather the best fit for each element, and it's okay if that sometimes comes at the expense of other elements. I assume that y is generally much larger than x, contrary to the above example, so there are usually many very good fits for each value of x in y, and I just want to find one of them efficiently.
Can someone show an example of scipy KD-trees for this? The docs are quite sparse.
kdtree = scipy.spatial.cKDTree([x,y])
kdtree.query([-3]*10) # ?? unsure about what query takes as arg
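For reference, cKDTree expects an (n, d) array of n points in d dimensions, so 1-D data has to be passed as a column vector rather than as [x, y]; a minimal sketch with made-up values:

```python
import numpy as np
from scipy.spatial import cKDTree

y = np.array([0.0, 1.0, 2.5, 4.0])
tree = cKDTree(y[:, None])          # 4 points in 1 dimension: shape (4, 1)

# query() takes the lookup points (also shaped (m, d)) and k, the number
# of nearest neighbors to return for each point
dist, idx = tree.query(np.array([[1.2], [3.9]]), k=2)
print(idx)  # row j holds the indices into y of the 2 nearest neighbors of point j
```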
Answered by Jaime
EDIT 2: A solution using KDTree can perform very well if you can choose a number of neighbors that guarantees a unique neighbor for every item in your array. With the following code:
import numpy as np
import scipy.spatial

def nearest_neighbors_kd_tree(x, y, k):
    x, y = map(np.asarray, (x, y))
    tree = scipy.spatial.cKDTree(y[:, None])
    ordered_neighbors = tree.query(x[:, None], k)[1]
    nearest_neighbor = np.empty((len(x),), dtype=np.intp)
    nearest_neighbor.fill(-1)
    used_y = set()
    for j, neigh_j in enumerate(ordered_neighbors):
        for idx in neigh_j:  # renamed from k to avoid shadowing the parameter
            if idx not in used_y:
                nearest_neighbor[j] = idx
                used_y.add(idx)
                break
    return nearest_neighbor
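If you don't want to guess k by hand, you can double it until no -1 sentinel remains; a sketch, with the function above repeated (and lightly adapted so the k=1 case also works) to keep the snippet self-contained:

```python
import numpy as np
import scipy.spatial

def nearest_neighbors_kd_tree(x, y, k):
    x, y = map(np.asarray, (x, y))
    tree = scipy.spatial.cKDTree(y[:, None])
    # reshape so k=1 also yields one row of neighbor indices per point
    ordered_neighbors = tree.query(x[:, None], k)[1].reshape(len(x), -1)
    nearest_neighbor = np.full((len(x),), -1, dtype=np.intp)
    used_y = set()
    for j, neigh_j in enumerate(ordered_neighbors):
        for idx in neigh_j:
            if idx not in used_y:
                nearest_neighbor[j] = idx
                used_y.add(idx)
                break
    return nearest_neighbor

rng = np.random.default_rng(42)
x, y = rng.random(1000), rng.random(2000)
k = 1
while np.any(nearest_neighbors_kd_tree(x, y, k) == -1):
    k = min(2 * k, len(y))   # capped: k = len(y) always succeeds when len(y) >= len(x)
print("smallest tried k that pairs everything:", k)
```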
and a sample of n=1000 points, I get:
In [9]: np.any(nearest_neighbors_kd_tree(x, y, 12) == -1)
Out[9]: True
In [10]: np.any(nearest_neighbors_kd_tree(x, y, 13) == -1)
Out[10]: False
So the optimum is k=13, and then the timing is:
In [11]: %timeit nearest_neighbors_kd_tree(x, y, 13)
100 loops, best of 3: 9.26 ms per loop
But in the worst case, you could need k=1000, and then:
In [12]: %timeit nearest_neighbors_kd_tree(x, y, 1000)
1 loops, best of 3: 424 ms per loop
Which is slower than the other options:
In [13]: %timeit nearest_neighbors(x, y)
10 loops, best of 3: 60 ms per loop
In [14]: %timeit nearest_neighbors_sorted(x, y)
10 loops, best of 3: 47.4 ms per loop
EDIT: Sorting the array before searching pays off for arrays of more than 1000 items:
def nearest_neighbors_sorted(x, y):
    x, y = map(np.asarray, (x, y))
    y_idx = np.argsort(y)
    y = y[y_idx]
    nearest_neighbor = np.empty((len(x),), dtype=np.intp)
    for j, xj in enumerate(x):
        idx = np.searchsorted(y, xj)
        if idx == len(y) or idx != 0 and y[idx] - xj > xj - y[idx - 1]:
            idx -= 1
        nearest_neighbor[j] = y_idx[idx]
        y = np.delete(y, idx)
        y_idx = np.delete(y_idx, idx)
    return nearest_neighbor
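The searchsorted logic above is the core trick: it finds the insertion point and then chooses between that element and its left neighbor, whichever is closer. In isolation, with made-up values:

```python
import numpy as np

y = np.array([0.0, 1.0, 2.5, 4.0])   # must already be sorted
xj = 2.0
idx = np.searchsorted(y, xj)          # insertion point that keeps y sorted
# step left when the left neighbor is closer (guarding both array ends)
if idx == len(y) or (idx != 0 and y[idx] - xj > xj - y[idx - 1]):
    idx -= 1
print(idx, y[idx])  # index 2, value 2.5 (closer to 2.0 than 1.0 is)
```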
With a 10000 element long array:
In [2]: %timeit nearest_neighbors_sorted(x, y)
1 loops, best of 3: 557 ms per loop
In [3]: %timeit nearest_neighbors(x, y)
1 loops, best of 3: 1.53 s per loop
For smaller arrays it performs slightly worse.
You are going to have to loop over all your items to implement your greedy nearest-neighbor algorithm, if only to discard duplicates. With that in mind, this is the fastest I have been able to come up with:
def nearest_neighbors(x, y):
    x, y = map(np.asarray, (x, y))
    y = y.copy()
    y_idx = np.arange(len(y))
    nearest_neighbor = np.empty((len(x),), dtype=np.intp)
    for j, xj in enumerate(x):
        idx = np.argmin(np.abs(y - xj))
        nearest_neighbor[j] = y_idx[idx]
        y = np.delete(y, idx)
        y_idx = np.delete(y_idx, idx)
    return nearest_neighbor
And now with:
n = 1000
x = np.random.rand(n)
y = np.random.rand(2*n)
I get:
In [11]: %timeit nearest_neighbors(x, y)
10 loops, best of 3: 52.4 ms per loop
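Whichever variant you use, it's worth asserting that the result really is a 1-1 mapping, i.e. no index of y is handed out twice; a sketch using the greedy version (repeated so the snippet runs on its own):

```python
import numpy as np

def nearest_neighbors(x, y):
    x, y = map(np.asarray, (x, y))
    y = y.copy()
    y_idx = np.arange(len(y))
    nearest_neighbor = np.empty((len(x),), dtype=np.intp)
    for j, xj in enumerate(x):
        idx = np.argmin(np.abs(y - xj))
        nearest_neighbor[j] = y_idx[idx]
        y = np.delete(y, idx)
        y_idx = np.delete(y_idx, idx)
    return nearest_neighbor

rng = np.random.default_rng(0)
x, y = rng.random(100), rng.random(200)
nn = nearest_neighbors(x, y)
assert len(set(nn)) == len(x)   # every x got a distinct partner from y
print("all", len(x), "pairings are unique")
```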

