Python: fast replacement of values in a numpy array

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/3403973/

Time: 2020-08-18 10:53:16  Source: igfitidea

Fast replacement of values in a numpy array

Tags: python, replace, numpy

Asked by dzhelil

I have a very large numpy array (containing up to a million elements) like the one below:

[ 0  1  6  5  1  2  7  6  2  3  8  7  3  4  9  8  5  6 11 10  6  7 12 11  7
  8 13 12  8  9 14 13 10 11 16 15 11 12 17 16 12 13 18 17 13 14 19 18 15 16
 21 20 16 17 22 21 17 18 23 22 18 19 24 23]

and a small dictionary map for replacing some of the elements in the above array

{4: 0, 9: 5, 14: 10, 19: 15, 20: 0, 21: 1, 22: 2, 23: 3, 24: 0}

I would like to replace some of the elements according to the map above. The numpy array is really large, and only a small subset of the elements (occurring as keys in the dictionary) will be replaced with the corresponding values. What is the fastest way to do this?

Accepted answer by kennytm

I believe there is an even more efficient method, but for now, try

from numpy import copy

newArray = copy(theArray)
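# Masks are computed from theArray, not newArray, so a replaced value is never replaced again
for k, v in d.items():  # use d.iteritems() on Python 2
    newArray[theArray == k] = v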
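
Why the mask is taken from the untouched array can be seen in a small sketch. The array and dictionary below are made up for illustration (they are not from the question), and the snippet assumes Python 3.7+ so that the dictionary is iterated in insertion order.

```python
import numpy as np

# Hypothetical toy data: the value 2 is both a replacement target and a key.
the_array = np.array([1, 2, 3, 1])
d = {1: 2, 2: 3}

# Masks computed from the original array: each element is replaced at most once.
safe = the_array.copy()
for k, v in d.items():
    safe[the_array == k] = v

# Masks computed from the array being modified: 1 -> 2 is followed by 2 -> 3.
chained = the_array.copy()
for k, v in d.items():
    chained[chained == k] = v

print(safe)     # [2 3 3 2]
print(chained)  # [3 3 3 3]
```

With the question's dictionary the difference matters too, because some replacement values (for example 0, 5 and 10) could collide with keys in other mappings.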


Microbenchmark and test for correctness:

#!/usr/bin/env python2.7

from numpy import copy, random, arange

random.seed(0)
data = random.randint(30, size=10**5)

d = {4: 0, 9: 5, 14: 10, 19: 15, 20: 0, 21: 1, 22: 2, 23: 3, 24: 0}
dk = d.keys()
dv = d.values()

def f1(a, d):
    b = copy(a)
    for k, v in d.iteritems():
        b[a==k] = v
    return b

def f2(a, d):
    for i in xrange(len(a)):
        a[i] = d.get(a[i], a[i])
    return a

def f3(a, dk, dv):
    mp = arange(0, max(a)+1)
    mp[dk] = dv
    return mp[a]


a = copy(data)
res = f2(a, d)

assert (f1(data, d) == res).all()
assert (f3(data, dk, dv) == res).all()

Result:

$ python2.7 -m timeit -s 'from w import f1,f3,data,d,dk,dv' 'f1(data,d)'
100 loops, best of 3: 6.15 msec per loop

$ python2.7 -m timeit -s 'from w import f1,f3,data,d,dk,dv' 'f3(data,dk,dv)'
100 loops, best of 3: 19.6 msec per loop

Answer by Katriel

Well, you need to make one pass through theArray, and for each element replace it if it is in the dictionary.

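# d is the replacement dictionary (named d rather than dict to avoid shadowing the builtin)
for i in range( len( theArray ) ):  # xrange on Python 2
    if theArray[ i ] in d:
        theArray[ i ] = d[ theArray[ i ] ]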

Answer by John La Rooy

for i in range(len(the_array)):  # xrange on Python 2
    the_array[i] = the_dict.get(the_array[i], the_array[i])

Answer by dzhelil

Assuming the values are between 0 and some maximum integer, one can implement a fast replace by using a numpy array as an int -> int dict, as below:

import numpy

mp = numpy.arange(0,max(data)+1)                   # identity table: mp[i] == i
mp[list(replace.keys())] = list(replace.values())  # list() is required on Python 3
data = mp[data]

where first

data = [ 0  1  6  5  1  2  7  6  2  3  8  7  3  4  9  8  5  6 11 10  6  7 12 11  7
  8 13 12  8  9 14 13 10 11 16 15 11 12 17 16 12 13 18 17 13 14 19 18 15 16
 21 20 16 17 22 21 17 18 23 22 18 19 24 23]

and replacing with

replace = {4: 0, 9: 5, 14: 10, 19: 15, 20: 0, 21: 1, 22: 2, 23: 3, 24: 0}

we obtain

data = [ 0  1  6  5  1  2  7  6  2  3  8  7  3  0  5  8  5  6 11 10  6  7 12 11  7
  8 13 12  8  5 10 13 10 11 16 15 11 12 17 16 12 13 18 17 13 10 15 18 15 16
  1  0 16 17  2  1 17 18  3  2 18 15  0  3]
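
The lookup-table idea can be wrapped in a small function. This is only a sketch under the same assumption as above (non-negative integer values with a modest maximum); the name `remap_with_lookup_table` is made up for illustration, and sizing the table to cover the largest key is an addition, not part of the original answer.

```python
import numpy as np

def remap_with_lookup_table(data, mapping):
    """Replace values of a non-negative integer array via an identity lookup table."""
    # The table must cover every value in data and every key in mapping.
    size = max(int(data.max()), max(mapping, default=0)) + 1
    table = np.arange(size)  # table[i] == i, so unmapped values stay unchanged
    table[list(mapping.keys())] = list(mapping.values())
    return table[data]  # one fancy-indexing pass over the whole array

data = np.array([0, 1, 6, 5, 4, 9, 24, 23])
replace = {4: 0, 9: 5, 14: 10, 19: 15, 20: 0, 21: 1, 22: 2, 23: 3, 24: 0}
print(remap_with_lookup_table(data, replace))  # [0 1 6 5 0 5 0 3]
```

The table costs memory proportional to the largest value, so this approach is a poor fit for sparse data with very large integers.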

Answer by Speckinius Flecksis

Another more general way to achieve this is function vectorization:

import numpy as np

data = np.array([0, 1, 6, 5, 1, 2, 7, 6, 2, 3, 8, 7, 3, 4, 9, 8, 5, 6, 11, 10, 6, 7, 12, 11, 7, 8, 13, 12, 8, 9, 14, 13, 10, 11, 16, 15, 11, 12, 17, 16, 12, 13, 18, 17, 13, 14, 19, 18, 15, 16, 21, 20, 16, 17, 22, 21, 17, 18, 23, 22, 18, 19, 24, 23])
mapper_dict = {4: 0, 9: 5, 14: 10, 19: 15, 20: 0, 21: 1, 22: 2, 23: 3, 24: 0}

def mp(entry):
    return mapper_dict[entry] if entry in mapper_dict else entry
mp = np.vectorize(mp)

print(mp(data))
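
A slightly more compact variant is sketched below, using dict.get and an explicit output type. The sample data is shortened for illustration. Note that the NumPy documentation describes np.vectorize as a convenience rather than a performance tool, so this is still a Python-level loop over the elements.

```python
import numpy as np

mapper_dict = {4: 0, 9: 5, 14: 10, 19: 15, 20: 0, 21: 1, 22: 2, 23: 3, 24: 0}
data = np.array([3, 4, 9, 8, 24])

# dict.get(entry, entry) returns the mapped value, or the entry itself if unmapped.
# Passing otypes avoids the extra call np.vectorize makes to infer the output type.
remap = np.vectorize(lambda entry: mapper_dict.get(entry, entry), otypes=[data.dtype])

result = remap(data)
print(result)  # [3 0 5 8 0]
```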

Answer by Pietro Battiston

No solution posted so far avoids a Python loop over the array (except Celil's, which however assumes the numbers are "small"), so here is an alternative:

from numpy import array, digitize

def replace(arr, rep_dict):
    """Assumes all elements of "arr" are keys of rep_dict"""

    # Removing the explicit "list" breaks python3
    rep_keys, rep_vals = array(list(zip(*sorted(rep_dict.items()))))

    idces = digitize(arr, rep_keys, right=True)
    # Notice rep_keys[digitize(arr, rep_keys, right=True)] == arr

    return rep_vals[idces]

The way "idces" is created comes from here.
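
Because the function above assumes every element of the array is a key, a partial mapping such as the one in the question needs to be completed first. One possible way, sketched here with a hypothetical helper `replace_partial` and made-up sample data, is to add an identity entry for every value that occurs in the array.

```python
import numpy as np

def replace(arr, rep_dict):
    """Assumes all elements of arr are keys of rep_dict."""
    rep_keys, rep_vals = np.array(list(zip(*sorted(rep_dict.items()))))
    # With sorted keys and right=True, rep_keys[idces] == arr
    idces = np.digitize(arr, rep_keys, right=True)
    return rep_vals[idces]

def replace_partial(arr, rep_dict):
    """Hypothetical wrapper: values missing from rep_dict map to themselves."""
    full = {v: v for v in np.unique(arr).tolist()}  # identity for every value present
    full.update(rep_dict)                           # real replacements take precedence
    return replace(arr, full)

arr = np.array([0, 4, 9, 3])
print(replace_partial(arr, {4: 0, 9: 5}))  # [0 0 5 3]
```

Building the identity entries costs one np.unique call, which sorts the array, so this wrapper gives up some of the speed of the original function.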

Answer by Jean Lescut

I benchmarked some solutions, and the result is clear-cut:

import timeit
import numpy as np

array = 2 * np.round(np.random.uniform(0,10000,300000)).astype(int)
from_values = np.unique(array) # even values from 0 to 20000
to_values = np.arange(from_values.size) # all values from 0 to 10000
d = dict(zip(from_values, to_values))

def method_for_loop():
    out = array.copy()
    for from_value, to_value in zip(from_values, to_values) :
        out[out == from_value] = to_value
    print('Check method_for_loop :', np.all(out == array/2)) # Just checking
print('Time method_for_loop :', timeit.timeit(method_for_loop, number = 1))

def method_list_comprehension():
    out = [d[i] for i in array]
    print('Check method_list_comprehension :', np.all(out == array/2)) # Just checking
print('Time method_list_comprehension :', timeit.timeit(method_list_comprehension, number = 1))

def method_bruteforce():
    idx = np.nonzero(from_values == array[:,None])[1]
    out = to_values[idx]
    print('Check method_bruteforce :', np.all(out == array/2)) # Just checking
print('Time method_bruteforce :', timeit.timeit(method_bruteforce, number = 1))

def method_searchsort():
    sort_idx = np.argsort(from_values)
    idx = np.searchsorted(from_values,array,sorter = sort_idx)
    out = to_values[sort_idx][idx]
    print('Check method_searchsort :', np.all(out == array/2)) # Just checking
print('Time method_searchsort :', timeit.timeit(method_searchsort, number = 1))

And I got the following results :

Check method_for_loop : True
Time method_for_loop : 2.6411612760275602

Check method_list_comprehension : True
Time method_list_comprehension : 0.07994363596662879

Check method_bruteforce : True
Time method_bruteforce : 11.960559037979692

Check method_searchsort : True
Time method_searchsort : 0.03770717792212963

The "searchsort" method is about 70 times faster than the "for" loop, and about 300 times faster than the numpy bruteforce method (going by the timings above). The list comprehension method is also a very good trade-off between code simplicity and speed.

Answer by caiohamamura

A Pythonic way that does not require the data to be integers; it can even be strings:

from scipy.stats import rankdata
import numpy as np

data = np.random.rand(100000)
replace = {data[0]: 1, data[5]: 8, data[8]: 10}

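# Table of (key, value) rows; list() is required on Python 3
arr = np.vstack((list(replace.keys()), list(replace.values()))).transpose()
# Sort rows by key so they line up with the sorted output of np.unique below
arr = arr[arr[:,0].argsort()]

unique = np.unique(data)
mp = np.vstack((unique, unique)).transpose()
# Compare against the key column only, then overwrite the mapped values
mp[np.in1d(mp[:,0], arr[:,0]),1] = arr[:,1]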
data = mp[rankdata(data, 'dense')-1][:,1]

Answer by Eelco Hoogendoorn

The numpy_indexed package (disclaimer: I am its author) provides an elegant and efficient vectorized solution to this type of problem:

import numpy_indexed as npi
remapped_array = npi.remap(theArray, list(dict.keys()), list(dict.values()))

The method implemented is similar to the searchsorted based approach mentioned by Jean Lescut, but even more general. For instance, the items of the array do not need to be ints, but can be any type, even nd-subarrays themselves; yet it should achieve the same kind of performance.

Answer by Nils Werner

A fully vectorized solution using np.in1d and np.searchsorted:

import numpy

# NOTE: numpy.searchsorted requires the keys in replace[0, :] to be in ascending order
replace = numpy.array([list(replace.keys()), list(replace.values())])    # Create 2D replacement matrix
mask = numpy.in1d(data, replace[0, :])                                   # Find elements that need replacement
data[mask] = replace[1, numpy.searchsorted(replace[0, :], data[mask])]   # Replace elements
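
Since np.searchsorted only gives correct positions for sorted keys, a variant that sorts the keys first may be safer when the dictionary order is not known. The following is a sketch rather than the answer's own code: the function name `remap_sorted` and the sample data are made up, and it returns a new array instead of modifying `data` in place.

```python
import numpy as np

def remap_sorted(data, mapping):
    """Replace values of data that occur as keys in mapping; leave the rest untouched."""
    keys = np.array(list(mapping.keys()))
    vals = np.array(list(mapping.values()))
    order = np.argsort(keys)           # searchsorted needs ascending keys
    keys, vals = keys[order], vals[order]

    out = data.copy()
    mask = np.isin(data, keys)         # elements that need replacement
    # For masked elements the key is guaranteed to exist, so the insertion
    # position returned by searchsorted is exactly the index of that key.
    out[mask] = vals[np.searchsorted(keys, data[mask])]
    return out

data = np.array([3, 24, 4, 9, 8])
print(remap_sorted(data, {24: 0, 4: 0, 9: 5}))  # [3 0 0 5 8]
```

The mask step is what allows a partial mapping: without it, searchsorted would still return a position for values that are not keys, and those values would be silently replaced.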