Python: find unique rows in numpy.array

Question

Asked by Akavall

I need to find unique rows in a numpy.array.

For example:

>>> a # I have
array([[1, 1, 1, 0, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0]])
>>> new_a # I want to get to
array([[1, 1, 1, 0, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [1, 1, 1, 1, 1, 0]])

I know that I can create a set and loop over the array, but I am looking for an efficient pure numpy solution. I believe that there is a way to set the data type to void and then just use numpy.unique, but I couldn't figure out how to make it work.

Answer 1

Accepted answer by aiwabdn

As of NumPy 1.13, one can simply choose the axis for selection of unique values in any N-dim array. To get unique rows, one can do:

unique_rows = np.unique(original_array, axis=0)
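
As a small illustration of the call above, here is a sketch using the array from the question. The variable names are my own, and the order-preserving step with return_index is an optional extra rather than part of the answer; it is only needed if the rows should come back in their original order instead of sorted order.

```python
import numpy as np

a = np.array([[1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0]])

# Rows come back in sorted (lexicographic) order, not in input order.
sorted_unique = np.unique(a, axis=0)

# return_index gives the position of the first occurrence of each unique row;
# sorting those positions restores the original ordering.
_, first_idx = np.unique(a, axis=0, return_index=True)
ordered_unique = a[np.sort(first_idx)]

print(sorted_unique)
print(ordered_unique)
```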

Answer 2

Answered by codeape

np.unique works given a list of tuples:

>>> np.unique([(1, 1), (2, 2), (3, 3), (4, 4), (2, 2)])
Out[9]: 
array([[1, 1],
       [2, 2],
       [3, 3],
       [4, 4]])

With a list of lists it raises a TypeError: unhashable type: 'list'
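
A caveat worth checking on your own installation: on the NumPy versions I am familiar with, np.unique converts its input to an array and flattens it unless an axis is given, so a list of tuples and a list of lists may behave the same way and yield individual elements rather than rows. The sketch below (variable names are mine) shows both calls side by side; the axis=0 form assumes NumPy 1.13 or later.

```python
import numpy as np

pairs = [(1, 1), (2, 2), (3, 3), (4, 4), (2, 2)]

# Without axis, the input is converted to a 2-D array and flattened,
# so the result holds unique scalar elements.
flat = np.unique(pairs)

# With axis=0 (NumPy >= 1.13), whole rows are compared instead.
rows = np.unique(pairs, axis=0)

print(flat)
print(rows)
```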

Answer 3

Answered by Ryan Saxe

When I run np.unique on np.random.random(100).reshape(10,10), it returns all the unique individual elements, but you want the unique rows, so first you need to put them into tuples:

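import numpy as np

array = np.random.random(100).reshape(10, 10)  # stand-in for your 2-D numpy array
new_array = [tuple(row) for row in array]
uniques = np.unique(new_array)  # note: recent NumPy flattens this input unless axis=0 is passed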

That is the only way I can see of changing the types to do what you want, and I am not sure whether iterating over the rows to convert them to tuples is compatible with your "no looping" requirement.

Answer 4

Answered by Joe Kington

If you want to avoid the memory expense of converting to a series of tuples or another similar data structure, you can exploit numpy's structured arrays.

The trick is to view your original array as a structured array where each item corresponds to a row of the original array. This doesn't make a copy, and is quite efficient.

As a quick example:

import numpy as np

data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [1, 1, 1, 0, 0, 0],
                 [1, 1, 1, 1, 1, 0]])

ncols = data.shape[1]
dtype = data.dtype.descr * ncols
struct = data.view(dtype)

uniq = np.unique(struct)
uniq = uniq.view(data.dtype).reshape(-1, ncols)
print(uniq)

To understand what's going on, have a look at the intermediary results.

Once we view things as a structured array, each element in the array is a row in your original array. (Basically, it's a similar data structure to a list of tuples.)

In [71]: struct
Out[71]:
array([[(1, 1, 1, 0, 0, 0)],
       [(0, 1, 1, 1, 0, 0)],
       [(0, 1, 1, 1, 0, 0)],
       [(1, 1, 1, 0, 0, 0)],
       [(1, 1, 1, 1, 1, 0)]],
      dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8')])

In [72]: struct[0]
Out[72]:
array([(1, 1, 1, 0, 0, 0)],
      dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8')])

Once we run numpy.unique, we'll get a structured array back:

In [73]: np.unique(struct)
Out[73]:
array([(0, 1, 1, 1, 0, 0), (1, 1, 1, 0, 0, 0), (1, 1, 1, 1, 1, 0)],
      dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8')])

We then need to view that as a "normal" array (_ stores the result of the last calculation in IPython, which is why you're seeing _.view...):

In [74]: _.view(data.dtype)
Out[74]: array([0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0])

And then reshape back into a 2D array (-1 is a placeholder that tells numpy to calculate the correct number of rows, given the number of columns):

In [75]: _.reshape(-1, ncols)
Out[75]:
array([[0, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0]])

Obviously, if you wanted to be more concise, you could write it as:

import numpy as np

def unique_rows(data):
    uniq = np.unique(data.view(data.dtype.descr * data.shape[1]))
    return uniq.view(data.dtype).reshape(-1, data.shape[1])

data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [1, 1, 1, 0, 0, 0],
                 [1, 1, 1, 1, 1, 0]])
print(unique_rows(data))

Which results in:

[[0 1 1 1 0 0]
 [1 1 1 0 0 0]
 [1 1 1 1 1 0]]

Answer 5

Answered by cge

np.unique works by sorting a flattened array, then looking at whether each item is equal to the previous. This can be done manually without flattening:

ind = np.lexsort(a.T)
a[ind[np.concatenate(([True],np.any(a[ind[1:]]!=a[ind[:-1]],axis=1)))]]

This method does not use tuples, and should be much faster and simpler than other methods given here.

NOTE: A previous version of this did not have the ind right after a[, which meant that the wrong indices were used. Also, Joe Kington makes a good point that this does make a variety of intermediate copies. The following method makes fewer, by making a sorted copy and then using views of it:

b = a[np.lexsort(a.T)]
b[np.concatenate(([True], np.any(b[1:] != b[:-1],axis=1)))]

This is faster and uses less memory.
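
The two-line version above can be wrapped in a function along these lines. This is only a sketch: the name unique_rows_lexsort and the empty-input guard are my own additions, not part of the answer. Note that np.lexsort treats the last key as the primary one, so the result is ordered by the last column first rather than the first.

```python
import numpy as np

def unique_rows_lexsort(a):
    """Return the distinct rows of a 2-D array (hypothetical helper name)."""
    a = np.asarray(a)
    if a.shape[0] == 0:
        return a
    # lexsort uses the LAST key as the primary one, so rows end up ordered
    # by the last column first; equal rows are still adjacent afterwards.
    b = a[np.lexsort(a.T)]
    # Keep the first row, plus every row that differs from its predecessor.
    keep = np.concatenate(([True], np.any(b[1:] != b[:-1], axis=1)))
    return b[keep]

a = np.array([[1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0]])
result = unique_rows_lexsort(a)
print(result)
```

The set of rows returned should match np.unique(a, axis=0); only the ordering may differ.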

Also, if you want to find unique rows in an ndarray regardless of how many dimensions are in the array, the following will work:

b = a[np.lexsort(a.reshape((a.shape[0], -1)).T)]
b[np.concatenate(([True], np.any(b[1:]!=b[:-1],axis=tuple(range(1,a.ndim)))))]

An interesting remaining issue would be if you wanted to sort/unique along an arbitrary axis of an arbitrary-dimension array, something that would be more difficult.
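
For what it is worth, the axis argument of np.unique described in the accepted answer (NumPy 1.13 and later) appears to cover the unique half of this case. The sketch below uses a made-up 3-D array and removes duplicate slices along axis 1; it does not address sorting along an arbitrary axis in general.

```python
import numpy as np

# Shape (2, 3, 2): the slices a[:, 0, :] and a[:, 1, :] are identical.
a = np.array([[[1, 2], [1, 2], [3, 4]],
              [[5, 6], [5, 6], [7, 8]]])

# Treat each a[:, k, :] as one item and drop the duplicates along axis 1.
u = np.unique(a, axis=1)

print(u.shape)
print(u)
```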

Edit:

To demonstrate the speed differences, I ran a few tests in IPython of the three different methods described in the answers. With your exact a, there isn't too much of a difference, though this version is a bit faster:

In [87]: %timeit unique(a.view(dtype)).view('<i8')
10000 loops, best of 3: 48.4 us per loop

In [88]: %timeit ind = np.lexsort(a.T); a[np.concatenate(([True], np.any(a[ind[1:]]!= a[ind[:-1]], axis=1)))]
10000 loops, best of 3: 37.6 us per loop

In [89]: %timeit b = [tuple(row) for row in a]; np.unique(b)
10000 loops, best of 3: 41.6 us per loop

With a larger a, however, this version ends up being much, much faster:

In [96]: a = np.random.randint(0,2,size=(10000,6))

In [97]: %timeit unique(a.view(dtype)).view('<i8')
10 loops, best of 3: 24.4 ms per loop

In [98]: %timeit b = [tuple(row) for row in a]; np.unique(b)
10 loops, best of 3: 28.2 ms per loop

In [99]: %timeit ind = np.lexsort(a.T); a[np.concatenate(([True],np.any(a[ind[1:]]!= a[ind[:-1]],axis=1)))]
100 loops, best of 3: 3.25 ms per loop

Answer 6

Answered by Jaime

An alternative to structured arrays is to use a view of a void type that joins the whole row into a single item:

a = np.array([[1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0]])

b = np.ascontiguousarray(a).view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
_, idx = np.unique(b, return_index=True)

unique_a = a[idx]

>>> unique_a
array([[0, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0]])

EDIT: Added np.ascontiguousarray following @seberg's recommendation. This will slow the method down if the array is not already contiguous.

EDIT: The above can be slightly sped up, perhaps at the cost of clarity, by doing:

unique_a = np.unique(b).view(a.dtype).reshape(-1, a.shape[1])

Also, at least on my system, performance-wise it is on par with, or even better than, the lexsort method:

a = np.random.randint(2, size=(10000, 6))

%timeit np.unique(a.view(np.dtype((np.void, a.dtype.itemsize*a.shape[1])))).view(a.dtype).reshape(-1, a.shape[1])
100 loops, best of 3: 3.17 ms per loop

%timeit ind = np.lexsort(a.T); a[np.concatenate(([True],np.any(a[ind[1:]]!=a[ind[:-1]],axis=1)))]
100 loops, best of 3: 5.93 ms per loop

a = np.random.randint(2, size=(10000, 100))

%timeit np.unique(a.view(np.dtype((np.void, a.dtype.itemsize*a.shape[1])))).view(a.dtype).reshape(-1, a.shape[1])
10 loops, best of 3: 29.9 ms per loop

%timeit ind = np.lexsort(a.T); a[np.concatenate(([True],np.any(a[ind[1:]]!=a[ind[:-1]],axis=1)))]
10 loops, best of 3: 116 ms per loop

Answer 7

Answered by Greg von Winckel

Yet another possible solution:

np.vstack(list({tuple(row) for row in a}))  # list(): recent NumPy requires a sequence, not a set

Answer 8

Answered by Arash_D_B

Based on the answers on this page, I have written a function that replicates the capability of MATLAB's unique(input,'rows') function, with the additional feature of accepting a tolerance for checking uniqueness. It also returns the indices such that c = data[ia,:] and data = c[ic,:]. Please report if you see any discrepancies or errors.

def unique_rows(data, prec=5):
    import numpy as np
    d_r = np.fix(data * 10 ** prec) / 10 ** prec + 0.0
    b = np.ascontiguousarray(d_r).view(np.dtype((np.void, d_r.dtype.itemsize * d_r.shape[1])))
    _, ia = np.unique(b, return_index=True)
    _, ic = np.unique(b, return_inverse=True)
    return np.unique(b).view(d_r.dtype).reshape(-1, d_r.shape[1]), ia, ic
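
For comparison, here is a sketch of the same idea built on np.unique(..., axis=0) instead of the void view. It assumes NumPy 1.13 or later, and the name unique_rows_tol is hypothetical. It keeps the truncation step from the function above, so the index relations hold for the truncated data; for the raw input they hold only up to the chosen precision.

```python
import numpy as np

def unique_rows_tol(data, prec=5):
    """Unique rows after truncating to `prec` decimals (hypothetical variant)."""
    data = np.asarray(data, dtype=float)
    scale = 10.0 ** prec
    # Same truncation as above; adding 0.0 turns -0.0 into 0.0.
    truncated = np.fix(data * scale) / scale + 0.0
    c, ia, ic = np.unique(truncated, axis=0,
                          return_index=True, return_inverse=True)
    # ravel() guards against NumPy versions that return a non-1-D inverse.
    return c, ia, np.ravel(ic)

data = np.array([[1.000001, 2.0],
                 [1.000002, 2.0],
                 [3.0, 4.0]])
c, ia, ic = unique_rows_tol(data)
print(c)   # unique truncated rows
print(ia)  # index of the first occurrence of each unique row
print(ic)  # for every input row, the index of its unique row in c
```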

Answer 9

Answered by divenex

Here is another variation on @Greg's Pythonic answer:

np.vstack(list(set(map(tuple, a))))  # list(): recent NumPy requires a sequence, not a set

Answer 10

Answered by kalu

Why not use drop_duplicates from pandas:

>>> timeit pd.DataFrame(image.reshape(-1,3)).drop_duplicates().values
1 loops, best of 3: 3.08 s per loop

>>> timeit np.vstack({tuple(r) for r in image.reshape(-1,3)})
1 loops, best of 3: 51 s per loop

Answer 11

Answered by Eelco Hoogendoorn

The numpy_indexed package (disclaimer: I am its author) wraps the solution posted by Jaime in a nice and tested interface, plus many more features:

import numpy_indexed as npi
new_a = npi.unique(a)  # unique elements over axis=0 (rows) by default
