Python: remove duplicate rows from a numpy array

Question

Asked by thunder

How can I remove duplicate rows of a 2-dimensional numpy array?

data = np.array([[1,8,3,3,4],
                 [1,8,9,9,4],
                 [1,8,3,3,4]])

The answer should be as follows:

ans = array([[1,8,3,3,4],
             [1,8,9,9,4]])

If two rows are identical, I would like to remove one of the duplicates.

Answer 1

Accepted answer by ThePredator

You can use np.unique. Since you want the unique rows, each row has to be treated as a single item (here, a tuple):

import numpy as np

data = np.array([[1,8,3,3,4],
                 [1,8,9,9,4],
                 [1,8,3,3,4]])

Applying np.unique directly to the data array gives:

>>> uniques = np.unique(data)
>>> uniques
array([1, 3, 4, 8, 9])

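This returns the unique elements of the flattened array, not the unique rows. Converting the rows to tuples is not enough on its own, because np.unique turns the list back into a 2-D array and flattens it; passing axis=0 (available in NumPy 1.13 and later) makes it compare whole rows:

new_array = [tuple(row) for row in data]
uniques = np.unique(new_array, axis=0)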

which prints:

>>> uniques
array([[1, 8, 3, 3, 4],
       [1, 8, 9, 9, 4]])
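
As a minimal sketch, assuming NumPy 1.13 or later, np.unique can be called on the 2-D array directly with axis=0. The names unique_rows and unique_in_order below are illustrative only. The second call uses return_index to recover the rows in their original order of first appearance, since the plain result comes back sorted.

```python
import numpy as np

data = np.array([[1, 8, 3, 3, 4],
                 [1, 8, 9, 9, 4],
                 [1, 8, 3, 3, 4]])

# axis=0 treats each row as one item; the result is sorted row by row.
unique_rows = np.unique(data, axis=0)

# return_index gives the position of each unique row's first occurrence,
# so sorting those positions restores the original row order.
_, first_idx = np.unique(data, axis=0, return_index=True)
unique_in_order = data[np.sort(first_idx)]

print(unique_rows)
print(unique_in_order)
```

For this input both results contain the same two rows; they differ only when the first occurrences are not already in sorted order.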

Answer 2

Answer by omerbp

A simple solution can be:

import numpy as np
def unique_rows(a):
    a = np.ascontiguousarray(a)
    unique_a = np.unique(a.view([('', a.dtype)]*a.shape[1]))
    return unique_a.view(a.dtype).reshape((unique_a.shape[0], a.shape[1]))

data = np.array([[1,8,3,3,4],
                 [1,8,9,9,4],
                 [1,8,3,3,4]])


print(unique_rows(data))
# prints:
# [[1 8 3 3 4]
#  [1 8 9 9 4]]

You can check this for many more solutions to this problem.
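
To make the view trick in unique_rows easier to follow, here is a small sketch of the intermediate shapes. The names row_dtype, records, and restored are illustrative and not part of the answer. Viewing the contiguous array with one field per column turns each row into a single record, which np.unique can then compare as a whole.

```python
import numpy as np

data = np.array([[1, 8, 3, 3, 4],
                 [1, 8, 9, 9, 4],
                 [1, 8, 3, 3, 4]])

a = np.ascontiguousarray(data)

# One unnamed field per column, so each row becomes a single record.
row_dtype = [('', a.dtype)] * a.shape[1]
records = a.view(row_dtype)
print(records.shape)         # (3, 1): three rows, one record each

# np.unique flattens its input and compares the records field by field.
unique_records = np.unique(records)
print(unique_records.shape)  # (2,): two distinct rows remain

# Viewing back as the original dtype and reshaping restores the 2-D layout.
restored = unique_records.view(a.dtype).reshape(-1, a.shape[1])
print(restored)
```

The call to np.ascontiguousarray matters because the view reinterprets the raw memory of each row, which only lines up with the columns when the rows are stored contiguously.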

Answer 3

Answer by Divakar

One approach uses lex-sorting:

# Perform lex sort and get sorted data
sorted_idx = np.lexsort(data.T)
sorted_data =  data[sorted_idx,:]

# Get unique row mask
row_mask = np.append([True],np.any(np.diff(sorted_data,axis=0),1))

# Get unique rows
out = sorted_data[row_mask]

Sample run:

In [199]: data
Out[199]: 
array([[1, 8, 3, 3, 4],
       [1, 8, 9, 9, 4],
       [1, 8, 3, 3, 4],
       [1, 8, 3, 3, 4],
       [1, 8, 0, 3, 4],
       [1, 8, 9, 9, 4]])

In [200]: sorted_idx = np.lexsort(data.T)
     ...: sorted_data =  data[sorted_idx,:]
     ...: row_mask = np.append([True],np.any(np.diff(sorted_data,axis=0),1))
     ...: out = sorted_data[row_mask]
     ...: 

In [201]: out
Out[201]: 
array([[1, 8, 0, 3, 4],
       [1, 8, 3, 3, 4],
       [1, 8, 9, 9, 4]])
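
The steps above can be wrapped in a function, sketched here under two assumptions: the input has at least one row, and the function name lexsort_unique_rows is hypothetical. One detail worth knowing is that np.lexsort treats the last key as the primary one, so lexsort(data.T) sorts by the last column first. This variant reverses the columns so that column 0 is the primary key, which matches the ordering of np.unique(data, axis=0).

```python
import numpy as np

def lexsort_unique_rows(data):
    """Return the unique rows of a 2-D array, sorted with column 0 as primary key."""
    # np.lexsort uses the LAST key as the primary one, so reverse the columns.
    order = np.lexsort(data.T[::-1])
    sorted_data = data[order]
    # After sorting, duplicates are adjacent: keep a row only if it differs
    # from the row before it. The first row is always kept.
    changed = np.any(np.diff(sorted_data, axis=0) != 0, axis=1)
    keep = np.append([True], changed)
    return sorted_data[keep]

data = np.array([[1, 8, 3, 3, 4],
                 [1, 8, 9, 9, 4],
                 [1, 8, 3, 3, 4],
                 [1, 8, 3, 3, 4],
                 [1, 8, 0, 3, 4],
                 [1, 8, 9, 9, 4]])

out = lexsort_unique_rows(data)
print(out)
```

On the sample data the result is the same three rows as in the sample run, because the columns that differ happen to give the same order under either key priority.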

Runtime tests:

This section times all the approaches proposed in the answers so far.

In [34]: data = np.random.randint(0,10,(10000,10))

In [35]: def tuple_based(data):
    ...:     new_array = [tuple(row) for row in data]
    ...:     return np.unique(new_array)
    ...: 
    ...: def lexsort_based(data):                 
    ...:     sorted_data =  data[np.lexsort(data.T),:]
    ...:     row_mask = np.append([True],np.any(np.diff(sorted_data,axis=0),1))
    ...:     return sorted_data[row_mask]
    ...: 
    ...: def unique_based(a):
    ...:     a = np.ascontiguousarray(a)
    ...:     unique_a = np.unique(a.view([('', a.dtype)]*a.shape[1]))
    ...:     return unique_a.view(a.dtype).reshape((unique_a.shape[0], a.shape[1]))
    ...: 

In [36]: %timeit tuple_based(data)
10 loops, best of 3: 63.1 ms per loop

In [37]: %timeit lexsort_based(data)
100 loops, best of 3: 8.92 ms per loop

In [38]: %timeit unique_based(data)
10 loops, best of 3: 29.1 ms per loop
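
To rerun a comparison like this outside IPython, a script along the following lines could be used. It is only a sketch: the function names view_based and axis_based, the seed, and the repeat counts are assumptions, and it adds np.unique(data, axis=0) (NumPy 1.13 or later) as a further candidate. The tuple-based variant is left out because np.unique without axis flattens its input. No timings are shown, since they depend on the machine and the NumPy version.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)               # fixed seed so runs are comparable
data = rng.integers(0, 10, size=(10000, 10))

def lexsort_based(data):
    sorted_data = data[np.lexsort(data.T), :]
    row_mask = np.append([True], np.any(np.diff(sorted_data, axis=0), 1))
    return sorted_data[row_mask]

def view_based(a):
    a = np.ascontiguousarray(a)
    unique_a = np.unique(a.view([('', a.dtype)] * a.shape[1]))
    return unique_a.view(a.dtype).reshape((unique_a.shape[0], a.shape[1]))

def axis_based(data):
    return np.unique(data, axis=0)

for fn in (lexsort_based, view_based, axis_based):
    # Best of 3 repeats, 10 calls each, reported per call.
    best = min(timeit.repeat(lambda: fn(data), number=10, repeat=3)) / 10
    print(f"{fn.__name__}: {best * 1e3:.2f} ms per call")
```

All three functions return the same set of rows; lexsort_based orders them with the last column as the primary key, while the other two use the first column.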
