Python: Translate every element in a numpy array according to a key

Question

Asked by Akavall

I am trying to translate every element of a numpy.array according to a given key:

For example:

a = np.array([[1,2,3],
              [3,2,4]])

my_dict = {1:23, 2:34, 3:36, 4:45}

I want to get:

array([[ 23.,  34.,  36.],
       [ 36.,  34.,  45.]])

I can see how to do it with a loop:

def loop_translate(a, my_dict):
    new_a = np.empty(a.shape)
    for i,row in enumerate(a):
        # list() is needed on Python 3, where map returns an iterator
        new_a[i,:] = list(map(my_dict.get, row))
    return new_a

Is there a more efficient and/or pure numpy way?

Edit:

I timed it, and the np.vectorize method proposed by DSM is considerably faster for larger arrays:

In [13]: def loop_translate(a, my_dict):
   ....:     new_a = np.empty(a.shape)
   ....:     for i,row in enumerate(a):
   ....:         new_a[i,:] = map(my_dict.get, row)
   ....:     return new_a
   ....: 

In [14]: def vec_translate(a, my_dict):    
   ....:     return np.vectorize(my_dict.__getitem__)(a)
   ....: 

In [15]: a = np.random.randint(1,5, (4,5))

In [16]: a
Out[16]: 
array([[2, 4, 3, 1, 1],
       [2, 4, 3, 2, 4],
       [4, 2, 1, 3, 1],
       [2, 4, 3, 4, 1]])

In [17]: %timeit loop_translate(a, my_dict)
10000 loops, best of 3: 77.9 us per loop

In [18]: %timeit vec_translate(a, my_dict)
10000 loops, best of 3: 70.5 us per loop

In [19]: a = np.random.randint(1, 5, (500,500))

In [20]: %timeit loop_translate(a, my_dict)
1 loops, best of 3: 298 ms per loop

In [21]: %timeit vec_translate(a, my_dict)
10 loops, best of 3: 37.6 ms per loop

Answer 1

Accepted answer by DSM

I don't know about efficient, but you could use np.vectorize on the .get method of dictionaries:

>>> a = np.array([[1,2,3],
...               [3,2,4]])
>>> my_dict = {1:23, 2:34, 3:36, 4:45}
>>> np.vectorize(my_dict.get)(a)
array([[23, 34, 36],
       [36, 34, 45]])
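
One thing to watch with this approach: `my_dict.get` returns `None` for a value that is not a key of the dictionary, and `None` generally does not fit into an integer result. A minimal sketch of one way around that, assuming a fallback value is acceptable for your data; the helper name `vec_translate_default` and the `default=-1` sentinel are illustrative choices, not part of the answer above:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [3, 2, 9]])          # 9 is deliberately absent from the dictionary
my_dict = {1: 23, 2: 34, 3: 36, 4: 45}

def vec_translate_default(arr, mapping, default=-1):
    # otypes fixes the output dtype up front instead of letting
    # np.vectorize infer it from the first element it evaluates.
    lookup = np.vectorize(lambda key: mapping.get(key, default),
                          otypes=[np.int64])
    return lookup(arr)

result = vec_translate_default(a, my_dict)
print(result)
```

Keep in mind that np.vectorize still calls the Python function once per element, so this is a convenience rather than a speed-up over the plain version.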

Answer 2

Answered by John Vinyard

I think it'd be better to iterate over the dictionary, and set values in all the rows and columns "at once":

>>> a = np.array([[1,2,3],[3,2,1]])
>>> a
array([[1, 2, 3],
       [3, 2, 1]])
>>> d = {1 : 11, 2 : 22, 3 : 33}
>>> for k,v in d.items():
...     a[a == k] = v
... 
>>> a
array([[11, 22, 33],
       [33, 22, 11]])
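
Note that assigning into `a` while looping can translate an element twice if a replacement value happens to equal a key that is visited later. A sketch that sidesteps this by computing each mask on the untouched input and writing into a copy; `mask_translate` is a name chosen here for illustration, and the cyclic dictionary is made up to trigger the problem:

```python
import numpy as np

def mask_translate(arr, mapping):
    out = arr.copy()
    for key, value in mapping.items():
        # The mask comes from the original array, so values written
        # earlier in the loop can never match a later key.
        out[arr == key] = value
    return out

a = np.array([[1, 2, 3],
              [3, 2, 1]])
d = {1: 2, 2: 3, 3: 1}   # every value is also a key

result = mask_translate(a, d)
print(result)
```

With the in-place loop, the same dictionary would collapse every element to 1, because each replacement is picked up again by the next iteration.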

Edit:

While it may not be as sexy as DSM's (really good) answer using numpy.vectorize, my tests of all the proposed methods show that this approach (using @jamylak's suggestion) is actually a bit faster:

from __future__ import division
import numpy as np
a = np.random.randint(1, 5, (500,500))
d = {1 : 11, 2 : 22, 3 : 33, 4 : 44}

def unique_translate(a,d):
    u,inv = np.unique(a,return_inverse = True)
    return np.array([d[x] for x in u])[inv].reshape(a.shape)

def vec_translate(a, d):    
    return np.vectorize(d.__getitem__)(a)

def loop_translate(a,d):
    n = np.ndarray(a.shape)
    for k in d:
        n[a == k] = d[k]
    return n

def orig_translate(a, d):
    new_a = np.empty(a.shape)
    for i,row in enumerate(a):
        new_a[i,:] = list(map(d.get, row))  # list() is needed on Python 3
    return new_a


if __name__ == '__main__':
    import timeit
    n_exec = 100
    print('orig')
    print(timeit.timeit("orig_translate(a,d)",
                        setup="from __main__ import np,a,d,orig_translate",
                        number = n_exec) / n_exec)
    print('unique')
    print(timeit.timeit("unique_translate(a,d)",
                        setup="from __main__ import np,a,d,unique_translate",
                        number = n_exec) / n_exec)
    print('vec')
    print(timeit.timeit("vec_translate(a,d)",
                        setup="from __main__ import np,a,d,vec_translate",
                        number = n_exec) / n_exec)
    print('loop')
    print(timeit.timeit("loop_translate(a,d)",
                        setup="from __main__ import np,a,d,loop_translate",
                        number = n_exec) / n_exec)

Outputs:

orig
0.222067718506
unique
0.0472617006302
vec
0.0357889199257
loop
0.0285375618935

Answer 3

Answered by John Vinyard

Here's another approach, using numpy.unique:

>>> a = np.array([[1,2,3],[3,2,1]])
>>> a
array([[1, 2, 3],
       [3, 2, 1]])
>>> d = {1 : 11, 2 : 22, 3 : 33}
>>> u,inv = np.unique(a,return_inverse = True)
>>> np.array([d[x] for x in u])[inv].reshape(a.shape)
array([[11, 22, 33],
       [33, 22, 11]])
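
Because the dictionary is consulted only once per distinct value, the same pattern should carry over to keys that are not integers. A small sketch with string keys; the sample array and dictionary are made up for illustration:

```python
import numpy as np

a = np.array([["a", "b"],
              ["c", "a"]])
d = {"a": 1, "b": 2, "c": 3}

# u holds the sorted distinct values; inv maps every element of a
# back to its position in u.
u, inv = np.unique(a, return_inverse=True)

# Translate the distinct values once, then expand through inv.
translated = np.array([d[x] for x in u.tolist()])[inv].reshape(a.shape)
print(translated)
```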

Answer 4

Answered by Mikhail V

If you don't really have to use a dictionary as the substitution table, a simple solution would be (for your example):

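import numpy as np

a = np.array([[1, 2, 3],
              [3, 2, 4]])                    # the array from the question
my_dict = np.array([0, 23, 34, 36, 45])     # your dictionary as array: my_dict[key] == value

def Sub (myarr, table) :
    return table[myarr]     # integer-array indexing performs the lookup

values = Sub(a, my_dict)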

This will work, of course, only if the indexes of the table (my_dict) cover all possible values in a; in other words, only when a holds unsigned (non-negative) integers.

Answer 5

Answered by Eelco Hoogendoorn

The numpy_indexed package (disclaimer: I am its author) provides an elegant and efficient vectorized solution to this type of problem:

import numpy_indexed as npi
remapped_a = npi.remap(a, list(my_dict.keys()), list(my_dict.values()))

The method implemented is similar to the approach mentioned by John Vinyard, but even more general. For instance, the items of the array do not need to be ints, but can be any type, even nd-subarrays themselves.

If you set the optional 'missing' kwarg to 'raise' (default is 'ignore'), performance will be slightly better, and you will get a KeyError if not all elements of 'a' are present in the keys.

Answer 6

Answered by Maxim

Assuming your dict keys are non-negative integers without huge gaps (similar to a range from 0 to N), you would be better off converting your translation dict to an array such that my_array[i] = my_dict[i], and using numpy indexing to do the translation.

Code using this approach:

def direct_translate(a, d):
    # lists are needed on Python 3, where keys()/values() return views
    src, values = list(d.keys()), list(d.values())
    d_array = np.arange(a.max() + 1)
    d_array[src] = values
    return d_array[a]

Testing with random arrays:

N = 10000
shape = (5000, 5000)
a = np.random.randint(N, size=shape)
my_dict = dict(zip(np.arange(N), np.random.randint(N, size=N)))

For these sizes I get around 140 ms for this approach. The np.vectorize approach takes around 5.8 s and unique_translate around 8 s.

Possible generalizations:

If you have negative values to translate, you could shift the values in a and in the keys of the dictionary by a constant to map them back to non-negative integers:

def direct_translate(a, d): # handles negative source keys
    min_a = a.min()
    # list() is needed on Python 3 before converting keys to an array
    src, values = np.array(list(d.keys())) - min_a, list(d.values())
    d_array = np.arange(a.max() - min_a + 1)
    d_array[src] = values
    return d_array[a - min_a]

If the source keys have huge gaps, the initial array creation would waste memory. I would resort to cython to speed up that function.
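
For sparse keys, one possibility that stays within NumPy (a different technique from the lookup table above, and not the Cython route mentioned) is a binary search over the sorted keys with np.searchsorted. A rough sketch, assuming the keys are sortable and every element of the array is expected to be a key; `searchsorted_translate` is a hypothetical helper name:

```python
import numpy as np

def searchsorted_translate(arr, mapping):
    sorted_keys = sorted(mapping)
    keys = np.array(sorted_keys)
    values = np.array([mapping[k] for k in sorted_keys])
    # Position of each element of arr within the sorted keys.
    idx = np.searchsorted(keys, arr)
    # Elements larger than every key land at len(keys); clip them
    # so the membership check below can index safely.
    idx = np.clip(idx, 0, len(keys) - 1)
    if not np.array_equal(keys[idx], arr):
        raise KeyError("array contains values missing from the mapping")
    return values[idx]

a = np.array([[-5, 1000000],
              [7, -5]])
d = {-5: 1, 7: 2, 1000000: 3}

result = searchsorted_translate(a, d)
print(result)
```

Memory use here grows with the number of keys rather than with the span between the smallest and largest key.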
