Numpy grouping using itertools.groupby performance
Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/4651683/
Asked by Donny
I have many large (>35,000,000) lists of integers that will contain duplicates. I need to get a count for each integer in a list. The following code works, but seems slow. Can anyone else better the benchmark using Python and preferably Numpy?
def group():
    import numpy as np
    from itertools import groupby
    values = np.array(np.random.randint(0,1<<32,size=35000000),dtype='u4')
    values.sort()
    groups = ((k,len(list(g))) for k,g in groupby(values))
    index = np.fromiter(groups,dtype='u4,u2')
if __name__=='__main__':
    from timeit import Timer
    t = Timer("group()","from __main__ import group")
    print t.timeit(number=1)
which returns:
$ python bench.py
111.377498865
Cheers!
Edit based on responses:
def group_original():
    import numpy as np
    from itertools import groupby
    values = np.array(np.random.randint(0,1<<32,size=35000000),dtype='u4')
    values.sort()
    groups = ((k,len(list(g))) for k,g in groupby(values))
    index = np.fromiter(groups,dtype='u4,u2')

def group_gnibbler():
    import numpy as np
    from itertools import groupby
    values = np.array(np.random.randint(0,1<<32,size=35000000),dtype='u4')
    values.sort()
    groups = ((k,sum(1 for i in g)) for k,g in groupby(values))
    index = np.fromiter(groups,dtype='u4,u2')

def group_christophe():
    import numpy as np
    values = np.array(np.random.randint(0,1<<32,size=35000000),dtype='u4')
    values.sort()
    counts = values.searchsorted(values, side='right') - values.searchsorted(values, side='left')
    index = np.zeros(len(values),dtype='u4,u2')
    index['f0'] = values
    index['f1'] = counts
    # Erroneous result!

def group_paul():
    import numpy as np
    values = np.array(np.random.randint(0,1<<32,size=35000000),dtype='u4')
    values.sort()
    diff = np.concatenate(([1],np.diff(values)))
    idx = np.concatenate((np.where(diff)[0],[len(values)]))
    index = np.empty(len(idx)-1,dtype='u4,u2')
    index['f0'] = values[idx[:-1]]
    index['f1'] = np.diff(idx)
if __name__=='__main__':
    from timeit import Timer
    timings=[
        ("group_original","Original"),
        ("group_gnibbler","Gnibbler"),
        ("group_christophe","Christophe"),
        ("group_paul","Paul"),
        ]
    for method,title in timings:
        t = Timer("%s()"%method,"from __main__ import %s"%method)
        print "%s: %s secs"%(title,t.timeit(number=1))
which returns:
$ python bench.py
Original: 113.385262966 secs
Gnibbler: 71.7464978695 secs
Christophe: 27.1690568924 secs
Paul: 9.06268405914 secs
Although Christophe's version currently gives incorrect results.
Accepted answer by Paul
I get a 3x improvement doing something like this:
def group():
    import numpy as np
    values = np.array(np.random.randint(0,3298,size=35000000),dtype='u4')
    values.sort()
    dif = np.ones(values.shape,values.dtype)
    dif[1:] = np.diff(values)
    idx = np.where(dif>0)
    vals = values[idx]
    count = np.diff(idx)
Answered by Rafał Dowgird
Sorting is Θ(N log N); I'd go for the amortized O(N) provided by Python's hashtable implementation. Just use defaultdict(int) for keeping counts of the integers and iterate over the array once:
import collections

counts = collections.defaultdict(int)
for v in values:
    counts[v] += 1
This is theoretically faster; unfortunately I have no way to check it right now. Allocating the additional memory might make it actually slower than your solution, which is in-place.
Edit: If you need to save memory, try radix sort, which is much faster on integers than quicksort (which I believe is what numpy uses).
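As a rough illustration only, here is a minimal sketch of a byte-wise LSD radix sort in pure numpy (the name radix_sort_u4 is made up for this sketch; a dedicated counting-sort loop in C or Cython would be faster than this argsort-based version):

import numpy as np

def radix_sort_u4(values):
    # LSD radix sort on unsigned 32-bit integers, one byte per pass.
    # Each pass must be stable so that the ordering established by the
    # earlier (less significant) bytes is preserved.
    for shift in (0, 8, 16, 24):
        byte = (values >> shift) & 0xFF
        values = values[np.argsort(byte, kind='stable')]
    return values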
Answered by ChristopheD
This is a numpy solution:
def group():
    import numpy as np
    values = np.array(np.random.randint(0,1<<32,size=35000000),dtype='u4')
    # we sort in place
    values.sort()
    # once sorted, the number of occurrences of a unique element is the index of
    # its first occurrence when searching from the right minus the index of its
    # first occurrence when searching from the left.
    #
    # np.dstack() is the numpy equivalent to Python's zip()
    l = np.dstack((values, values.searchsorted(values, side='right') - \
                   values.searchsorted(values, side='left')))
    index = np.fromiter(l, dtype='u4,u2')
if __name__=='__main__':
    from timeit import Timer
    t = Timer("group()","from __main__ import group")
    print t.timeit(number=1)
Runs in about 25 seconds on my machine compared to about 96 for your initial solution (which is a nice improvement).
There might still be room for improvement; I don't use numpy that often.
Edit: added some comments in the code.
Answered by John La Rooy
Replacing len(list(g)) with sum(1 for i in g) gives a 2x speedup.
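In the question's code, the change is just to the generator expression; here is a self-contained before/after comparison (the tiny list stands in for the sorted numpy array):

from itertools import groupby

values = [1, 1, 2, 3, 3, 3]   # stands in for the sorted array from the question

# before: materializes each group as a list only to take its length
before = [(k, len(list(g))) for k, g in groupby(values)]

# after: counts each group lazily, without building an intermediate list
after = [(k, sum(1 for i in g)) for k, g in groupby(values)]

assert before == after == [(1, 2), (2, 1), (3, 3)]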
Answered by Justin Peel
By request, here is a Cython version of this. I make two passes through the array. The first pass finds out how many unique elements there are, so that I can size my arrays for the unique values and counts appropriately.
import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
def dogroup():
    cdef unsigned long tot = 1
    cdef np.ndarray[np.uint32_t, ndim=1] values = np.array(np.random.randint(35000000,size=35000000),dtype=np.uint32)
    cdef unsigned long i, ind, lastval
    values.sort()
    for i in xrange(1,len(values)):
        if values[i] != values[i-1]:
            tot += 1
    cdef np.ndarray[np.uint32_t, ndim=1] vals = np.empty(tot,dtype=np.uint32)
    cdef np.ndarray[np.uint32_t, ndim=1] count = np.empty(tot,dtype=np.uint32)
    vals[0] = values[0]
    ind = 1
    lastval = 0
    for i in xrange(1,len(values)):
        if values[i] != values[i-1]:
            vals[ind] = values[i]
            count[ind-1] = i - lastval
            lastval = i
            ind += 1
    count[ind-1] = len(values) - lastval
The sorting is actually taking the most time here by far. Using the values array given in my code, the sorting takes 4.75 seconds and the actual finding of the unique values and counts takes 0.67 seconds. With the pure Numpy approach from Paul's code (but using the same form of the values array) plus the fix I suggested in a comment, finding the unique values and counts takes 1.9 seconds (sorting still takes the same amount of time, of course).
It makes sense for most of the time to be taken up by the sorting because it is O(N log N) and the counting is O(N). You can speed up the sort a little bit over Numpy's (which uses C's qsort if I remember correctly), but you have to really know what you are doing and it probably isn't worthwhile. Also, there might be some way to speed up my Cython code a little bit more, but it probably isn't worthwhile.
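For reference, a minimal build sketch for the Cython code above, assuming it is saved as dogroup.pyx (the file name is an assumption; the answer does not say how it was compiled):

# setup.py -- minimal build sketch for the Cython code above, assuming it
# lives in dogroup.pyx (file name chosen here, not given in the answer).
from setuptools import setup
from Cython.Build import cythonize
import numpy as np

setup(
    ext_modules=cythonize("dogroup.pyx"),
    include_dirs=[np.get_include()],
)

Building with python setup.py build_ext --inplace then makes dogroup() importable from the compiled module.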
Answered by senderle
This is a fairly old thread, but I thought I'd mention that there's a small improvement to be made on the currently-accepted solution:
def group_by_edge():
    import numpy as np
    values = np.array(np.random.randint(0,1<<32,size=35000000),dtype='u4')
    values.sort()
    # indices where a new group of equal values starts
    edges = (values[1:] != values[:-1]).nonzero()[0] + 1
    idx = np.concatenate(([0], edges, [len(values)]))
    index = np.empty(len(idx) - 1, dtype='u4,u2')
    index['f0'] = values[idx[:-1]]
    index['f1'] = np.diff(idx)
This tested as about half a second faster on my machine; not a huge improvement, but worth something. Additionally, I think it's clearer what's happening here; the two-step diff approach is a bit opaque at first glance.
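As a small worked illustration of the edge-based approach (the input numbers are made up for the example):

import numpy as np

values = np.array([3, 3, 7, 9, 9, 9], dtype='u4')     # already sorted
edges = (values[1:] != values[:-1]).nonzero()[0] + 1  # array([2, 3]): start of each new group
idx = np.concatenate(([0], edges, [len(values)]))     # array([0, 2, 3, 6])
uniques = values[idx[:-1]]                            # array([3, 7, 9], dtype=uint32)
counts = np.diff(idx)                                 # array([2, 1, 3])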
Answered by Michael
I guess the most obvious and still unmentioned approach is to simply use collections.Counter. Instead of building a huge number of temporary lists with groupby, it just counts up the integers. It's a one-liner and gives a 2-fold speedup, but it is still slower than the pure numpy solutions.
def group():
    import sys
    import numpy as np
    from collections import Counter
    values = np.array(np.random.randint(0,sys.maxint,size=35000000),dtype='u4')
    c = Counter(values)
if __name__=='__main__':
    from timeit import Timer
    t = Timer("group()","from __main__ import group")
    print t.timeit(number=1)
On my machine I get a speedup from 136 s to 62 s, compared to the initial solution.
Answered by joeln
You could try the following (ab)use of scipy.sparse:
import numpy as np
from scipy import sparse

def sparse_bincount(values):
    M = sparse.csr_matrix((np.ones(len(values)), values.astype(int), [0, len(values)]))
    M.sum_duplicates()
    index = np.empty(len(M.indices),dtype='u4,u2')
    index['f0'] = M.indices
    index['f1'] = M.data
    return index
This is slower than the winning answer, perhaps because scipy currently doesn't support unsigned integers as index types...
Answered by Ali
More than 5 years have passed since Paul's answer was accepted. Interestingly, the sort() is still the bottleneck in the accepted solution.
Line #      Hits         Time   Per Hit   % Time  Line Contents
==============================================================
     3                                            @profile
     4                                            def group_paul():
     5         1        99040   99040.0      2.4      import numpy as np
     6         1       305651  305651.0      7.4      values = np.array(np.random.randint(0, 2**32,size=35000000),dtype='u4')
     7         1      2928204 2928204.0     71.3      values.sort()
     8         1        78268   78268.0      1.9      diff = np.concatenate(([1],np.diff(values)))
     9         1       215774  215774.0      5.3      idx = np.concatenate((np.where(diff)[0],[len(values)]))
    10         1           95      95.0      0.0      index = np.empty(len(idx)-1,dtype='u4,u2')
    11         1       386673  386673.0      9.4      index['f0'] = values[idx[:-1]]
    12         1        91492   91492.0      2.2      index['f1'] = np.diff(idx)
The accepted solution runs for 4.0 s on my machine; with radix sort it drops to 1.7 s.
Just by switching to radix sort, I get an overall 2.35x speedup. The radix sort is more than 4x faster than quicksort in this case.
See How to sort an array of integers faster than quicksort?, which was motivated by your question.
Answered by Gabriel_F
In the latest version of numpy, we have this:
import numpy as np

# np.unique with return_counts=True returns a tuple of (unique_values, counts)
frequency = np.unique(values, return_counts=True)
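A sketch of packing that result into the same 'u4,u2' structured array used by the other solutions (the random values array and the packing lines are my addition, following the question's setup, and are not part of the answer):

import numpy as np

values = np.random.randint(0, 1 << 32, size=35000000).astype('u4')
unique_values, counts = np.unique(values, return_counts=True)

index = np.empty(len(unique_values), dtype='u4,u2')
index['f0'] = unique_values
index['f1'] = counts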

