Python: How to turn a Numpy array into a set efficiently?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license, keep the original link and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/33196102/

Date: 2020-08-19 12:59:37  Source: igfitidea

How to turn Numpy array to set efficiently?

python numpy set

Asked by ALH

I used:


df['ids'] = df['ids'].values.astype(set)

to turn lists into sets, but the output was a list, not a set:


>>> x = np.array([[1, 2, 2.5],[12,35,12]])

>>> x.astype(set)
array([[1.0, 2.0, 2.5],
       [12.0, 35.0, 12.0]], dtype=object)
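(Editor's note, a small sketch of my own to show what is happening here: astype(set) only casts each element to an object, it never builds any sets. To get one Python set per row you have to iterate over the rows.)

```python
import numpy as np

x = np.array([[1, 2, 2.5], [12, 35, 12]])

# astype(set) is interpreted as a cast to object dtype: every element
# becomes a plain scalar object, not a set.
obj = x.astype(set)
print(obj.dtype)  # object

# To actually get one set per row, iterate over the rows:
row_sets = [set(row) for row in x]
print(row_sets)
```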

Is there an efficient way to turn a list into a set in Numpy?


EDIT 1:
My input is as large as this:
I have 3,000 records. Each has 30,000 ids: [[1,...,12,13,...,30000], [1,..,43,45,...,30000],...,[...]]


Accepted answer by Andras Deak

The current state of your question (it can change any time): how can I efficiently remove duplicate elements from a large array of large arrays?


import numpy as np

rng = np.random.default_rng()
arr = rng.random((3000, 30000))
out1 = list(map(np.unique, arr))
#or
out2 = [np.unique(subarr) for subarr in arr]

Runtimes in an IPython shell:


>>> %timeit list(map(np.unique, arr))
5.39 s ± 37.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit [np.unique(subarr) for subarr in arr]
5.42 s ± 58.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Update: as @hpaulj pointed out in his comment, my dummy example is biased, since floating-point random numbers will almost certainly be unique. So here's a more lifelike example with integer numbers:


>>> arr = rng.integers(low=1, high=15000, size=(3000, 30000))

>>> %timeit list(map(np.unique, arr))
4.98 s ± 83.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit [np.unique(subarr) for subarr in arr]
4.95 s ± 51.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In this case the elements of the output list have varying lengths, since there are actual duplicates to remove.

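(Editor's note: a tiny illustration of my own of why the result must stay a list. With real duplicates present, np.unique returns rows of different lengths, which cannot be stacked back into a 2-D array.)

```python
import numpy as np

# Two rows with different numbers of duplicates
arr = np.array([[1, 2, 2, 3],
                [4, 4, 4, 4]])

out = [np.unique(row) for row in arr]

# The per-row results now have different lengths:
lengths = [len(u) for u in out]
print(lengths)  # [3, 1]
```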

Answer by P. Camilleri

First flatten your ndarray to obtain a one-dimensional array, then apply set() to it:


set(x.flatten())

Edit: since it seems you just want an array of sets, not a set of the whole array, you can do value = [set(v) for v in x] to obtain a list of sets.

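(Editor's note: both forms from this answer can be checked with a short sketch; the small array here is my own example, not from the original post.)

```python
import numpy as np

x = np.array([[1, 2, 2.5], [12, 35, 12]])

# One set for the entire (flattened) array:
whole = set(x.flatten())

# One set per row, as suggested in the edit above:
per_row = [set(v) for v in x]

print(whole == {1.0, 2.0, 2.5, 12.0, 35.0})  # True
print(per_row[1] == {12.0, 35.0})            # True
```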

Answer by hpaulj

A couple of earlier 'row-wise' unique questions:


vectorize numpy unique for subarrays


Numpy: Row Wise Unique elements


Count unique elements row wise in an ndarray


In a couple of these the count is more interesting than the actual unique values.


If the number of unique values per row differs, then the result cannot be a (2d) array. That's a pretty good indication that the problem cannot be fully vectorized. You need some sort of iteration over the rows.

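(Editor's note: a sketch of my own of that row-wise iteration. return_counts is a standard np.unique parameter, and it yields the per-value counts that some of the linked questions are after.)

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.integers(low=0, high=5, size=(3, 10))

# Row-wise unique values; return_counts also gives each value's multiplicity.
uniques_and_counts = [np.unique(row, return_counts=True) for row in arr]

# The number of distinct values generally differs per row, hence a list,
# not a 2-D array:
n_unique = [len(values) for values, counts in uniques_and_counts]
print(n_unique)
```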