Numpy grouping using itertools.groupby performance
Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/4651683/
Asked by Donny
I have many large (>35,000,000) lists of integers that will contain duplicates. I need to get a count for each integer in a list. The following code works, but seems slow. Can anyone else better the benchmark using Python and preferably Numpy?
def group():
    import numpy as np
    from itertools import groupby
    values = np.array(np.random.randint(0,1<<32,size=35000000),dtype='u4')
    values.sort()
    groups = ((k,len(list(g))) for k,g in groupby(values))
    index = np.fromiter(groups,dtype='u4,u2')
if __name__=='__main__':
    from timeit import Timer
    t = Timer("group()","from __main__ import group")
    print t.timeit(number=1)
which returns:
$ python bench.py
111.377498865
Cheers!
Edit based on responses:
def group_original():
    import numpy as np
    from itertools import groupby
    values = np.array(np.random.randint(0,1<<32,size=35000000),dtype='u4')
    values.sort()
    groups = ((k,len(list(g))) for k,g in groupby(values))
    index = np.fromiter(groups,dtype='u4,u2')

def group_gnibbler():
    import numpy as np
    from itertools import groupby
    values = np.array(np.random.randint(0,1<<32,size=35000000),dtype='u4')
    values.sort()
    groups = ((k,sum(1 for i in g)) for k,g in groupby(values))
    index = np.fromiter(groups,dtype='u4,u2')

def group_christophe():
    import numpy as np
    values = np.array(np.random.randint(0,1<<32,size=35000000),dtype='u4')
    values.sort()
    counts = values.searchsorted(values, side='right') - values.searchsorted(values, side='left')
    index = np.zeros(len(values),dtype='u4,u2')
    index['f0'] = values
    index['f1'] = counts
    # Erroneous result!

def group_paul():
    import numpy as np
    values = np.array(np.random.randint(0,1<<32,size=35000000),dtype='u4')
    values.sort()
    diff = np.concatenate(([1],np.diff(values)))
    idx = np.concatenate((np.where(diff)[0],[len(values)]))
    index = np.empty(len(idx)-1,dtype='u4,u2')
    index['f0'] = values[idx[:-1]]
    index['f1'] = np.diff(idx)
if __name__=='__main__':
    from timeit import Timer
    timings=[
        ("group_original","Original"),
        ("group_gnibbler","Gnibbler"),
        ("group_christophe","Christophe"),
        ("group_paul","Paul"),
        ]
    for method,title in timings:
        t = Timer("%s()"%method,"from __main__ import %s"%method)
        print "%s: %s secs"%(title,t.timeit(number=1))
which returns:
$ python bench.py
Original: 113.385262966 secs
Gnibbler: 71.7464978695 secs
Christophe: 27.1690568924 secs
Paul: 9.06268405914 secs
Although Christophe's version currently gives incorrect results.
Accepted answer by Paul
I get a 3x improvement doing something like this:
def group():
    import numpy as np
    values = np.array(np.random.randint(0,3298,size=35000000),dtype='u4')
    values.sort()
    dif = np.ones(values.shape,values.dtype)
    dif[1:] = np.diff(values)
    idx = np.where(dif>0)
    vals = values[idx]
    count = np.diff(idx)
Answered by Rafał Dowgird
Sorting is Θ(N log N); I'd go for the amortized O(N) provided by Python's hashtable implementation. Just use defaultdict(int) for keeping counts of the integers and iterate over the array once:
import collections

counts = collections.defaultdict(int)
for v in values:
    counts[v] += 1
This is theoretically faster; unfortunately I have no way to check it right now. Allocating the additional memory might make it actually slower than your solution, which is in-place.
Edit: If you need to save memory, try radix sort, which is much faster on integers than quicksort (which I believe is what numpy uses).
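As a rough illustration only, here is a minimal sketch of a byte-wise LSD radix sort in pure numpy (the name radix_sort_u4 is made up for this sketch; a dedicated counting-sort loop in C or Cython would be faster than this argsort-based version):

import numpy as np

def radix_sort_u4(values):
    # LSD radix sort on unsigned 32-bit integers, one byte per pass.
    # Each pass must be stable so that the ordering established by the
    # earlier (less significant) bytes is preserved.
    for shift in (0, 8, 16, 24):
        byte = (values >> shift) & 0xFF
        values = values[np.argsort(byte, kind='stable')]
    return values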
Answered by ChristopheD
This is a numpy solution:
def group():
    import numpy as np
    values = np.array(np.random.randint(0,1<<32,size=35000000),dtype='u4')
    # we sort in place
    values.sort()
    # once sorted, the number of occurrences of a unique element is the index of
    # its first occurrence when searching from the right minus the index of its
    # first occurrence when searching from the left.
    #
    # np.dstack() is the numpy equivalent to Python's zip()
    l = np.dstack((values, values.searchsorted(values, side='right') - \
                   values.searchsorted(values, side='left')))
    index = np.fromiter(l, dtype='u4,u2')
if __name__=='__main__':
    from timeit import Timer
    t = Timer("group()","from __main__ import group")
    print t.timeit(number=1)
Runs in about 25 seconds on my machine compared to about 96 for your initial solution (which is a nice improvement).
There might still be room for improvement; I don't use numpy that often.
Edit: added some comments in the code.
Answered by John La Rooy
Replacing len(list(g)) with sum(1 for i in g) gives a 2x speedup.
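In the question's code, the change is just to the generator expression; here is a self-contained before/after comparison (the tiny list stands in for the sorted numpy array):

from itertools import groupby

values = [1, 1, 2, 3, 3, 3]   # stands in for the sorted array from the question

# before: materializes each group as a list only to take its length
before = [(k, len(list(g))) for k, g in groupby(values)]

# after: counts each group lazily, without building an intermediate list
after = [(k, sum(1 for i in g)) for k, g in groupby(values)]

assert before == after == [(1, 2), (2, 1), (3, 3)]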
Answered by Justin Peel
By request, here is a Cython version of this. I make two passes through the array. The first pass finds out how many unique elements there are, so that I can size my arrays for the unique values and counts appropriately.
import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
def dogroup():
    cdef unsigned long tot = 1
    cdef np.ndarray[np.uint32_t, ndim=1] values = np.array(np.random.randint(35000000,size=35000000),dtype=np.uint32)
    cdef unsigned long i, ind, lastval
    values.sort()
    for i in xrange(1,len(values)):
        if values[i] != values[i-1]:
            tot += 1
    cdef np.ndarray[np.uint32_t, ndim=1] vals = np.empty(tot,dtype=np.uint32)
    cdef np.ndarray[np.uint32_t, ndim=1] count = np.empty(tot,dtype=np.uint32)
    vals[0] = values[0]
    ind = 1
    lastval = 0
    for i in xrange(1,len(values)):
        if values[i] != values[i-1]:
            vals[ind] = values[i]
            count[ind-1] = i - lastval
            lastval = i
            ind += 1
    count[ind-1] = len(values) - lastval
The sorting is actually taking the most time here by far. Using the values array given in my code, the sorting takes 4.75 seconds and the actual finding of the unique values and counts takes 0.67 seconds. With the pure Numpy approach from Paul's code (but using the same form of the values array) plus the fix I suggested in a comment, finding the unique values and counts takes 1.9 seconds (sorting still takes the same amount of time, of course).
It makes sense for most of the time to be taken up by the sorting because it is O(N log N) and the counting is O(N). You can speed up the sort a little bit over Numpy's (which uses C's qsort if I remember correctly), but you have to really know what you are doing and it probably isn't worthwhile. Also, there might be some way to speed up my Cython code a little bit more, but it probably isn't worthwhile.
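For reference, a minimal build sketch for the Cython code above, assuming it is saved as dogroup.pyx (the file name is an assumption; the answer does not say how it was compiled):

# setup.py -- minimal build sketch for the Cython code above, assuming it
# lives in dogroup.pyx (file name chosen here, not given in the answer).
from setuptools import setup
from Cython.Build import cythonize
import numpy as np

setup(
    ext_modules=cythonize("dogroup.pyx"),
    include_dirs=[np.get_include()],
)

Building with python setup.py build_ext --inplace then makes dogroup() importable from the compiled module.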
Answered by senderle
This is a fairly old thread, but I thought I'd mention that there's a small improvement to be made on the currently-accepted solution:
def group_by_edge():
    import numpy as np
    values = np.array(np.random.randint(0,1<<32,size=35000000),dtype='u4')
    values.sort()
    # indices where a new group of equal values starts
    edges = (values[1:] != values[:-1]).nonzero()[0] + 1
    idx = np.concatenate(([0], edges, [len(values)]))
    index = np.empty(len(idx) - 1, dtype='u4,u2')
    index['f0'] = values[idx[:-1]]
    index['f1'] = np.diff(idx)
This tested as about half a second faster on my machine; not a huge improvement, but worth something. Additionally, I think it's clearer what's happening here; the two-step diff approach is a bit opaque at first glance.
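As a small worked illustration of the edge-based approach (the input numbers are made up for the example):

import numpy as np

values = np.array([3, 3, 7, 9, 9, 9], dtype='u4')     # already sorted
edges = (values[1:] != values[:-1]).nonzero()[0] + 1  # array([2, 3]): start of each new group
idx = np.concatenate(([0], edges, [len(values)]))     # array([0, 2, 3, 6])
uniques = values[idx[:-1]]                            # array([3, 7, 9], dtype=uint32)
counts = np.diff(idx)                                 # array([2, 1, 3])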
Answered by Michael
I guess the most obvious and still unmentioned approach is to simply use collections.Counter. Instead of building a huge number of temporary lists with groupby, it just counts up the integers. It's a one-liner and gives a 2-fold speedup, but it is still slower than the pure numpy solutions.
def group():
    import sys
    import numpy as np
    from collections import Counter
    values = np.array(np.random.randint(0,sys.maxint,size=35000000),dtype='u4')
    c = Counter(values)
if __name__=='__main__':
    from timeit import Timer
    t = Timer("group()","from __main__ import group")
    print t.timeit(number=1)
On my machine I get a speedup from 136 s to 62 s, compared to the initial solution.
Answered by joeln
You could try the following (ab)use of scipy.sparse:
import numpy as np
from scipy import sparse

def sparse_bincount(values):
    M = sparse.csr_matrix((np.ones(len(values)), values.astype(int), [0, len(values)]))
    M.sum_duplicates()
    index = np.empty(len(M.indices),dtype='u4,u2')
    index['f0'] = M.indices
    index['f1'] = M.data
    return index
This is slower than the winning answer, perhaps because scipy currently doesn't support unsigned integers as index types...
Answered by Ali
More than 5 years have passed since Paul's answer was accepted. Interestingly, the sort() is still the bottleneck in the accepted solution.
Line #      Hits         Time   Per Hit   % Time  Line Contents
==============================================================
     3                                            @profile
     4                                            def group_paul():
     5         1        99040   99040.0      2.4      import numpy as np
     6         1       305651  305651.0      7.4      values = np.array(np.random.randint(0, 2**32,size=35000000),dtype='u4')
     7         1      2928204 2928204.0     71.3      values.sort()
     8         1        78268   78268.0      1.9      diff = np.concatenate(([1],np.diff(values)))
     9         1       215774  215774.0      5.3      idx = np.concatenate((np.where(diff)[0],[len(values)]))
    10         1           95      95.0      0.0      index = np.empty(len(idx)-1,dtype='u4,u2')
    11         1       386673  386673.0      9.4      index['f0'] = values[idx[:-1]]
    12         1        91492   91492.0      2.2      index['f1'] = np.diff(idx)
The accepted solution runs for 4.0 s on my machine; with radix sort it drops to 1.7 s.
Just by switching to radix sort, I get an overall 2.35x speedup. The radix sort is more than 4x faster than quicksort in this case.
See How to sort an array of integers faster than quicksort?, which was motivated by your question.
Answered by Gabriel_F
In the latest version of numpy, we have this:
import numpy as np

# np.unique with return_counts=True returns a tuple of (unique_values, counts)
frequency = np.unique(values, return_counts=True)
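A sketch of packing that result into the same 'u4,u2' structured array used by the other solutions (the random values array and the packing lines are my addition, following the question's setup, and are not part of the answer):

import numpy as np

values = np.random.randint(0, 1 << 32, size=35000000).astype('u4')
unique_values, counts = np.unique(values, return_counts=True)

index = np.empty(len(unique_values), dtype='u4,u2')
index['f0'] = unique_values
index['f1'] = counts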

