Python: How to turn a Numpy array into a set efficiently?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license, keep the original link and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/33196102/

Date: 2020-08-19 12:59:37  Source: igfitidea

How to turn Numpy array to set efficiently?

python numpy set

Asked by ALH

I used:


df['ids'] = df['ids'].values.astype(set)

to turn lists into sets, but the output was a list, not a set:


>>> x = np.array([[1, 2, 2.5],[12,35,12]])

>>> x.astype(set)
array([[1.0, 2.0, 2.5],
       [12.0, 35.0, 12.0]], dtype=object)
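(Editor's note, a small sketch of my own to show what is happening here: astype(set) only casts each element to an object, it never builds any sets. To get one Python set per row you have to iterate over the rows.)

```python
import numpy as np

x = np.array([[1, 2, 2.5], [12, 35, 12]])

# astype(set) is interpreted as a cast to object dtype: every element
# becomes a plain scalar object, not a set.
obj = x.astype(set)
print(obj.dtype)  # object

# To actually get one set per row, iterate over the rows:
row_sets = [set(row) for row in x]
print(row_sets)
```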

Is there an efficient way to turn a list into a set in Numpy?


EDIT 1:
My input is as large as this:
I have 3,000 records. Each has 30,000 ids: [[1,...,12,13,...,30000], [1,..,43,45,...,30000],...,[...]]


Accepted answer by Andras Deak

The current state of your question (it can change any time): how can I efficiently remove duplicate elements from a large array of large arrays?


import numpy as np

rng = np.random.default_rng()
arr = rng.random((3000, 30000))
out1 = list(map(np.unique, arr))
#or
out2 = [np.unique(subarr) for subarr in arr]

Runtimes in an IPython shell:


>>> %timeit list(map(np.unique, arr))
5.39 s ± 37.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit [np.unique(subarr) for subarr in arr]
5.42 s ± 58.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Update: as @hpaulj pointed out in his comment, my dummy example is biased, since floating-point random numbers will almost certainly be unique. So here's a more lifelike example with integer numbers:


>>> arr = rng.integers(low=1, high=15000, size=(3000, 30000))

>>> %timeit list(map(np.unique, arr))
4.98 s ± 83.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit [np.unique(subarr) for subarr in arr]
4.95 s ± 51.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In this case the elements of the output list have varying lengths, since there are actual duplicates to remove.

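(Editor's note: a tiny illustration of my own of why the result must stay a list. With real duplicates present, np.unique returns rows of different lengths, which cannot be stacked back into a 2-D array.)

```python
import numpy as np

# Two rows with different numbers of duplicates
arr = np.array([[1, 2, 2, 3],
                [4, 4, 4, 4]])

out = [np.unique(row) for row in arr]

# The per-row results now have different lengths:
lengths = [len(u) for u in out]
print(lengths)  # [3, 1]
```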

Answer by P. Camilleri

First flatten your ndarray to obtain a one-dimensional array, then apply set() to it:


set(x.flatten())

Edit: since it seems you just want an array of sets, not a set of the whole array, you can do value = [set(v) for v in x] to obtain a list of sets.

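(Editor's note: both forms from this answer can be checked with a short sketch; the small array here is my own example, not from the original post.)

```python
import numpy as np

x = np.array([[1, 2, 2.5], [12, 35, 12]])

# One set for the entire (flattened) array:
whole = set(x.flatten())

# One set per row, as suggested in the edit above:
per_row = [set(v) for v in x]

print(whole == {1.0, 2.0, 2.5, 12.0, 35.0})  # True
print(per_row[1] == {12.0, 35.0})            # True
```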

Answer by hpaulj

A couple of earlier 'row-wise' unique questions:


vectorize numpy unique for subarrays


Numpy: Row Wise Unique elements


Count unique elements row wise in an ndarray


In a couple of these the count is more interesting than the actual unique values.


If the number of unique values per row differs, then the result cannot be a (2d) array. That's a pretty good indication that the problem cannot be fully vectorized. You need some sort of iteration over the rows.

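(Editor's note: a sketch of my own of that row-wise iteration. return_counts is a standard np.unique parameter, and it yields the per-value counts that some of the linked questions are after.)

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.integers(low=0, high=5, size=(3, 10))

# Row-wise unique values; return_counts also gives each value's multiplicity.
uniques_and_counts = [np.unique(row, return_counts=True) for row in arr]

# The number of distinct values generally differs per row, hence a list,
# not a 2-D array:
n_unique = [len(values) for values, counts in uniques_and_counts]
print(n_unique)
```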