Python: testing whether a Numpy array contains a given row

Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/14766194/

Time: 2020-08-18 12:22:18  Source: igfitidea

testing whether a Numpy array contains a given row

python numpy

Asked by Nathaniel

Is there a Pythonic and efficient way to check whether a Numpy array contains at least one instance of a given row? By "efficient" I mean it terminates upon finding the first matching row rather than iterating over the entire array even if a result has already been found.


With Python lists this can be accomplished very cleanly with "if row in array:", but this does not work as I would expect for Numpy arrays, as illustrated below.


With Python lists:


>>> a = [[1,2],[10,20],[100,200]]
>>> [1,2] in a
True
>>> [1,20] in a
False

but Numpy arrays give different and rather odd-looking results. (The __contains__ method of ndarray seems to be undocumented.)


>>> a = np.array([[1,2],[10,20],[100,200]])
>>> np.array([1,2]) in a
True
>>> np.array([1,20]) in a
True
>>> np.array([1,42]) in a
True
>>> np.array([42,1]) in a
False

Accepted answer by seberg

Numpy's __contains__ is, at the time of writing this, (a == b).any(), which is arguably only correct if b is a scalar. (It is a bit hairy, but I believe the right general method would be (a == b).all(np.arange(a.ndim - b.ndim, a.ndim)).any(), which makes sense for all combinations of a and b dimensionality; reducing over several axes like this works only in 1.7 or later.)


EDIT: Just to be clear, this is not necessarily the expected result when broadcasting is involved. Also someone might argue that it should handle the items in a separately, as np.in1d does. I am not sure there is one clear way it should work.
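To make the difference concrete, here is a minimal sketch comparing what `in` effectively computes with the "reduce over the trailing axes first" idea described above. The helper name `contains_subarray` is hypothetical (it is not part of numpy), and the axes are passed as a tuple of ints, which is an assumption about how the expression would be spelled on current numpy versions.

```python
import numpy as np

def contains_subarray(a, b):
    """Hypothetical helper: True if b matches a along a's trailing axes."""
    a = np.asarray(a)
    b = np.asarray(b)
    eq = (a == b)  # b is broadcast against a
    # Require equality over the trailing b.ndim axes, then ask whether
    # any remaining position matched.
    axes = tuple(range(a.ndim - b.ndim, a.ndim))
    return bool(eq.all(axis=axes).any())

a = np.array([[1, 2], [10, 20], [100, 200]])

# What `in` effectively computes: any single element equal after broadcasting.
print(bool((a == np.array([1, 20])).any()))  # True, although no row is [1, 20]

print(contains_subarray(a, [1, 20]))   # False
print(contains_subarray(a, [10, 20]))  # True
```

For a scalar b the tuple of axes is empty, so the sketch falls back to a plain element-wise test.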


Now you want numpy to stop when it finds the first occurrence. AFAIK this does not exist at this time. It is difficult because numpy is based mostly on ufuncs, which do the same thing over the whole array. Numpy does optimize these kinds of reductions, but effectively that only works when the array being reduced is already a boolean array (e.g. np.ones(10, dtype=bool).any()).


Otherwise it would need a special function for __contains__, which does not exist. That may seem odd, but you have to remember that numpy supports many data types and has a bigger machinery to select the correct ones and select the correct function to work on them. In other words, the ufunc machinery cannot do it, and implementing __contains__ or the like specially is not actually that trivial because of data types.


You can of course write it in Python, or, since you probably know your data type, write it yourself in Cython/C, which is very simple.




That said, it is often much better anyway to use a sorting-based approach for these things. That is a little tedious as well, since there is no such thing as searchsorted for a lexsort, but it works (you could also abuse scipy.spatial.cKDTree if you like). This assumes you want to compare along the last axis only:


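import numpy as np

# Unfortunately you need to use structured arrays
# (a and b must share the same dtype for the views to line up):
a_sorted = np.ascontiguousarray(a).view([('', a.dtype)] * a.shape[-1]).ravel()

# Actually at this point, you can also use np.in1d, if you already have many b
# then that is even better.

a_sorted.sort()

b_comp = np.ascontiguousarray(b).view(a_sorted.dtype)
ind = a_sorted.searchsorted(b_comp)
# searchsorted returns len(a_sorted) for values past the end; clip to stay in bounds
ind = np.minimum(ind, len(a_sorted) - 1)

result = a_sorted[ind] == b_comp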

This also works for an array b, and if you keep the sorted array around, it is also much better when you look up a single value (row) of b at a time while a stays the same (otherwise I would just use np.in1d after viewing it as a recarray). Important: you must do the np.ascontiguousarray for safety. It will typically do nothing, but if it does, skipping it would be a big potential bug.
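If the goal is mainly "preprocess a once, then answer many row lookups", a plain Python set of tuples is another option. Note that this is a different technique from the sort-plus-searchsorted approach above (hashing instead of sorting), shown only as a rough sketch; `row_set` and `has_row` are hypothetical names, and it assumes the rows hold hashable scalars such as integers.

```python
import numpy as np

a = np.array([[1, 2], [10, 20], [100, 200]])

# Build the lookup once; each row becomes a hashable tuple of Python scalars.
row_set = set(map(tuple, a.tolist()))

def has_row(row):
    """Hypothetical helper: membership test against the prebuilt set."""
    return tuple(np.asarray(row).tolist()) in row_set

print(has_row([10, 20]))  # True
print(has_row([1, 20]))   # False
```

Building the set costs a full pass and extra memory, so it is only likely to pay off when a stays fixed across many queries.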


Answer by tom10

I think


np.equal([1,2], a).all(axis=1)   # also,  ([1,2]==a).all(axis=1)
# array([ True, False, False], dtype=bool)

will list the rows that match. As Jamie points out, to know whether at least one such row exists, use any:


np.equal([1,2], a).all(axis=1).any()
# True

Aside:
I suspect in (and __contains__) is just as above but using any instead of all.


Answer by tom10

You can use .tolist()


>>> a = np.array([[1,2],[10,20],[100,200]])
>>> [1,2] in a.tolist()
True
>>> [1,20] in a.tolist()
False
>>> [1,42] in a.tolist()
False
>>> [42,1] in a.tolist()
False

Or use a view:


>>> any((a[:]==[1,2]).all(1))
True
>>> any((a[:]==[1,20]).all(1))
False

Or generate over the numpy list (potentially VERY SLOW):


any(([1,2] == x).all() for x in a)     # stops on first occurrence 

Or use numpy logic functions:


any(np.equal(a,[1,2]).all(1))

If you time these:


import numpy as np
import time

n=300000
a=np.arange(n*3).reshape(n,3)
b=a.tolist()

t1,t2,t3=a[n//100][0],a[n//2][0],a[-10][0]

tests=[ ('early hit',[t1, t1+1, t1+2]),
        ('middle hit',[t2,t2+1,t2+2]),
        ('late hit', [t3,t3+1,t3+2]),
        ('miss',[0,2,0])]

fmt='\t{:20}{:.5f} seconds and is {}'     

for test, tgt in tests:
    print('\n{}: {} in {:,} elements:'.format(test,tgt,n))

    name='view'
    t1=time.time()
    result=(a[...]==tgt).all(1).any()
    t2=time.time()
    print(fmt.format(name,t2-t1,result))

    name='python list'
    t1=time.time()
    result = True if tgt in b else False
    t2=time.time()
    print(fmt.format(name,t2-t1,result))

    name='gen over numpy'
    t1=time.time()
    result=any((tgt == x).all() for x in a)
    t2=time.time()
    print(fmt.format(name,t2-t1,result))

    name='logic equal'
    t1=time.time()
    result=np.equal(a,tgt).all(1).any()
    t2=time.time()
    print(fmt.format(name,t2-t1,result))

You can see that, hit or miss, the numpy routines take the same time to search the array. The Python in operator is potentially a lot faster for an early hit, and the generator is just bad news if you have to go all the way through the array.


Here are the results for a 300,000 x 3 element array:


early hit: [9000, 9001, 9002] in 300,000 elements:
    view                0.01002 seconds and is True
    python list         0.00305 seconds and is True
    gen over numpy      0.06470 seconds and is True
    logic equal         0.00909 seconds and is True

middle hit: [450000, 450001, 450002] in 300,000 elements:
    view                0.00915 seconds and is True
    python list         0.15458 seconds and is True
    gen over numpy      3.24386 seconds and is True
    logic equal         0.00937 seconds and is True

late hit: [899970, 899971, 899972] in 300,000 elements:
    view                0.00936 seconds and is True
    python list         0.30604 seconds and is True
    gen over numpy      6.47660 seconds and is True
    logic equal         0.00965 seconds and is True

miss: [0, 2, 0] in 300,000 elements:
    view                0.00936 seconds and is False
    python list         0.01287 seconds and is False
    gen over numpy      6.49190 seconds and is False
    logic equal         0.00965 seconds and is False

And for a 3,000,000 x 3 array:


early hit: [90000, 90001, 90002] in 3,000,000 elements:
    view                0.10128 seconds and is True
    python list         0.02982 seconds and is True
    gen over numpy      0.66057 seconds and is True
    logic equal         0.09128 seconds and is True

middle hit: [4500000, 4500001, 4500002] in 3,000,000 elements:
    view                0.09331 seconds and is True
    python list         1.48180 seconds and is True
    gen over numpy      32.69874 seconds and is True
    logic equal         0.09438 seconds and is True

late hit: [8999970, 8999971, 8999972] in 3,000,000 elements:
    view                0.09868 seconds and is True
    python list         3.01236 seconds and is True
    gen over numpy      65.15087 seconds and is True
    logic equal         0.09591 seconds and is True

miss: [0, 2, 0] in 3,000,000 elements:
    view                0.09588 seconds and is False
    python list         0.12904 seconds and is False
    gen over numpy      64.46789 seconds and is False
    logic equal         0.09671 seconds and is False

Which seems to indicate that np.equal is the fastest pure numpy way to do this...
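Single time.time() measurements can be noisy, so it may be worth repeating each measurement. A rough sketch using the standard-library timeit module follows; the array size, the `candidates` dictionary, and the repeat counts are arbitrary choices made for illustration, and no particular timings are claimed.

```python
import timeit
import numpy as np

n = 100_000  # smaller than the arrays timed above, only to keep the sketch quick
a = np.arange(n * 3).reshape(n, 3)
b = a.tolist()
tgt = a[n // 2].tolist()  # a row taken from the middle of the array

# Each candidate is a zero-argument callable returning a plain bool.
candidates = {
    "broadcast ==": lambda: bool((a == tgt).all(1).any()),
    "np.equal": lambda: bool(np.equal(a, tgt).all(1).any()),
    "python list": lambda: tgt in b,
}

for name, func in candidates.items():
    # Take the best of several repeats to reduce the effect of system noise.
    seconds = min(timeit.repeat(func, number=5, repeat=3)) / 5
    print("{:15}{:.5f} s per call, result {}".format(name, seconds, func()))
```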


Answer by Bálint Aradi

If you really want to stop at the first occurrence, you could write a loop, like:


import numpy as np

needle = np.array([10, 20])
haystack = np.array([[1,2],[10,20],[100,200]])
found = False
for row in haystack:
    if np.all(row == needle):
        found = True
        break
print("Found: ", found)

However, I strongly suspect that it will be much slower than the other suggestions, which use numpy routines to do it for the whole array.
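One possible middle ground, offered only as a sketch and not taken from any of the answers here, is to loop over blocks of rows instead of single rows: each block is compared with vectorized numpy code, and the loop stops at the first block that contains a match. The function name `contains_row_chunked` and the default `chunk_size` are hypothetical and untuned.

```python
import numpy as np

def contains_row_chunked(a, row, chunk_size=4096):
    """Hypothetical helper: block-wise row search that stops at the first hit."""
    a = np.asarray(a)
    row = np.asarray(row)
    for start in range(0, a.shape[0], chunk_size):
        block = a[start:start + chunk_size]  # a view, no copy
        # Vectorized comparison inside the block only.
        if (block == row).all(axis=1).any():
            return True
    return False

haystack = np.arange(30_000).reshape(10_000, 3)
print(contains_row_chunked(haystack, [3, 4, 5]))  # True, found in the first block
print(contains_row_chunked(haystack, [0, 2, 0]))  # False, every block is scanned
```

An early hit then costs only one block comparison, while a miss costs roughly the same as the whole-array comparison plus some loop overhead.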
