Python: most efficient way to find the mode in a numpy array

Question

Asked by Nik

I have a 2D array containing integers (both positive and negative). Each row represents the values over time for a particular spatial site, whereas each column represents values for various spatial sites for a given time.

So if the array is like:

1 3 4 2 2 7
5 2 2 1 4 1
3 3 2 2 1 1

The result should be

1 3 2 2 2 1

Note that when there are multiple values for mode, any one (selected randomly) may be set as mode.

I can iterate over the columns, finding the mode one at a time, but I was hoping numpy might have a built-in function to do that, or that there is a trick to find it efficiently without looping.

Answer 1

Accepted answer by fgb

Check scipy.stats.mode() (inspired by @tom10's comment):

import numpy as np
from scipy import stats

a = np.array([[1, 3, 4, 2, 2, 7],
              [5, 2, 2, 1, 4, 1],
              [3, 3, 2, 2, 1, 1]])

m = stats.mode(a)
print(m)

Output:

ModeResult(mode=array([[1, 3, 2, 2, 1, 1]]), count=array([[1, 2, 2, 2, 1, 2]]))

As you can see, it returns both the modes and the counts. You can select the modes directly via m[0]:

print(m[0])

Output:

[[1 3 2 2 1 1]]

Answer 2

Answer by Devin Cairns

Update

The scipy.stats.mode function has been significantly optimized since this post, and would be the recommended method.

Old answer

This is a tricky problem, since there is not much out there to calculate the mode along an axis. The solution is straightforward for 1-D arrays, where numpy.bincount is handy, along with numpy.unique with the return_counts argument set to True. The most common n-dimensional function I see is scipy.stats.mode, although it is prohibitively slow, especially for large arrays with many unique values. As a solution, I've developed this function, and use it heavily:

import numpy

def mode(ndarray, axis=0):
    # Check inputs
    ndarray = numpy.asarray(ndarray)
    ndim = ndarray.ndim
    if ndarray.size == 1:
        return (ndarray[0], 1)
    elif ndarray.size == 0:
        raise Exception('Cannot compute mode on empty array')
    try:
        axis = range(ndarray.ndim)[axis]
    except (IndexError, TypeError):
        raise Exception('Axis "{}" incompatible with the {}-dimension array'.format(axis, ndim))

    # If the array is 1-D, numpy.unique (with return_counts, NumPy >= 1.9) will suffice
    if ndim == 1:
        modals, counts = numpy.unique(ndarray, return_counts=True)
        index = numpy.argmax(counts)
        return modals[index], counts[index]

    # Sort array
    sort = numpy.sort(ndarray, axis=axis)
    # Create array to transpose along the axis and get padding shape
    transpose = numpy.roll(numpy.arange(ndim)[::-1], axis)
    shape = list(sort.shape)
    shape[axis] = 1
    # Create a boolean array along strides of unique values
    strides = numpy.concatenate([numpy.zeros(shape=shape, dtype='bool'),
                                 numpy.diff(sort, axis=axis) == 0,
                                 numpy.zeros(shape=shape, dtype='bool')],
                                axis=axis).transpose(transpose).ravel()
    # Count the stride lengths
    counts = numpy.cumsum(strides)
    counts[~strides] = numpy.concatenate([[0], numpy.diff(counts[~strides])])
    counts[strides] = 0
    # Get shape of padded counts and slice to return to the original shape
    shape = numpy.array(sort.shape)
    shape[axis] += 1
    shape = shape[transpose]
    slices = [slice(None)] * ndim
    slices[axis] = slice(1, None)
    # Reshape and compute final counts
    counts = counts.reshape(shape).transpose(transpose)[tuple(slices)] + 1

    # Find maximum counts and return modals/counts
    slices = [slice(None, i) for i in sort.shape]
    del slices[axis]
    # Index with tuples: recent NumPy versions reject lists of slices/arrays as indices
    index = list(numpy.ogrid[tuple(slices)])
    index.insert(axis, numpy.argmax(counts, axis=axis))
    return sort[tuple(index)], counts[tuple(index)]

Result:

In [2]: a = numpy.array([[1, 3, 4, 2, 2, 7],
                         [5, 2, 2, 1, 4, 1],
                         [3, 3, 2, 2, 1, 1]])

In [3]: mode(a)
Out[3]: (array([1, 3, 2, 2, 1, 1]), array([1, 2, 2, 2, 1, 2]))

Some benchmarks:

In [4]: import scipy.stats

In [5]: a = numpy.random.randint(1,10,(1000,1000))

In [6]: %timeit scipy.stats.mode(a)
10 loops, best of 3: 41.6 ms per loop

In [7]: %timeit mode(a)
10 loops, best of 3: 46.7 ms per loop

In [8]: a = numpy.random.randint(1,500,(1000,1000))

In [9]: %timeit scipy.stats.mode(a)
1 loops, best of 3: 1.01 s per loop

In [10]: %timeit mode(a)
10 loops, best of 3: 80 ms per loop

In [11]: a = numpy.random.random((200,200))

In [12]: %timeit scipy.stats.mode(a)
1 loops, best of 3: 3.26 s per loop

In [13]: %timeit mode(a)
1000 loops, best of 3: 1.75 ms per loop

EDIT: Provided more of a background and modified the approach to be more memory-efficient
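
The 1-D shortcuts mentioned in this answer (numpy.unique with return_counts, and numpy.bincount) can be sketched as follows. The function names mode_1d and mode_1d_bincount are hypothetical, chosen only for illustration, and the bincount variant assumes non-negative integers.

```python
import numpy as np

def mode_1d(values):
    """Return (mode, count) for a 1-D array using np.unique."""
    uniques, counts = np.unique(values, return_counts=True)
    best = np.argmax(counts)  # first maximum, so ties go to the smallest value
    return uniques[best], counts[best]

def mode_1d_bincount(values):
    """Same idea via np.bincount; assumes non-negative integers only."""
    counts = np.bincount(values)  # counts[v] is the number of occurrences of v
    best = np.argmax(counts)
    return best, counts[best]

x = np.array([2, 7, 2, 5, 7, 2])
print(mode_1d(x))
print(mode_1d_bincount(x))
```

np.unique sorts its input, so it also works for floats and negative values, while np.bincount allocates one counter per integer up to the maximum value.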

Answer 3

Answer by Lean Bravo

Expanding on this method: when finding the mode of the data, you may also need the index of that value in the original array, for example to see how far the value is from the center of the distribution.

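import numpy as np

# a is assumed to be a 1-D array; idx holds the first position of each unique value
(_, idx, counts) = np.unique(a, return_index=True, return_counts=True)
index = idx[np.argmax(counts)]  # position in a of the first occurrence of the mode
mode = a[index]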

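Remember to discard the mode when it is not unique, that is, when more than one value attains the maximum count (np.sum(counts == counts.max()) > 1); np.argmax(counts) always returns a single index, so its length cannot be used for this check. To validate whether the mode is actually representative of the center of your data's distribution, you may also check whether it falls inside your standard deviation interval.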
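
A minimal sketch of checking whether the mode is unique, assuming a small 1-D example array. The variable names tied, modes and is_unique_mode are illustrative and not part of the original answer.

```python
import numpy as np

a = np.array([4, 1, 4, 9, 1, 7])
values, first_idx, counts = np.unique(a, return_index=True, return_counts=True)

# Positions (within values) of every value that reaches the highest count
tied = np.flatnonzero(counts == counts.max())
modes = values[tied]
is_unique_mode = tied.size == 1

print(modes, is_unique_mode)
```

Here both 1 and 4 occur twice, so is_unique_mode is False and the caller can decide whether to discard the result or keep all tied values.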

Answer 4

Answer by Ali_Ayub

I think a very simple way would be to use the Counter class. You can then use the most_common() method of the Counter instance.

For 1-d arrays:

import numpy as np
from collections import Counter

nparr = np.arange(10) 
nparr[2] = 6 
nparr[3] = 6 #6 is now the mode
mode = Counter(nparr).most_common(1)
# mode will be [(6,3)] to give the count of the most occurring value, so ->
print(mode[0][0])

For multiple dimensional arrays (little difference):

import numpy as np
from collections import Counter

nparr = np.arange(10) 
nparr[2] = 6 
nparr[3] = 6 
nparr = nparr.reshape((2, 5))     # same data, reshaped into a 2-D ndarray (10 elements fit a 2x5 shape)
mode = Counter(nparr.flatten()).most_common(1)  # just use .flatten() method

# mode will be [(6,3)] to give the count of the most occurring value, so ->
print(mode[0][0])

This may or may not be an efficient implementation, but it is convenient.
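
Since the question asks for one mode per column rather than one mode for the whole array, the Counter approach could be applied column by column. This is a sketch under that assumption, using the array from the question; the name column_modes is illustrative.

```python
import numpy as np
from collections import Counter

a = np.array([[1, 3, 4, 2, 2, 7],
              [5, 2, 2, 1, 4, 1],
              [3, 3, 2, 2, 1, 1]])

# a.T iterates over columns; .tolist() gives plain Python ints as Counter keys
column_modes = [Counter(col.tolist()).most_common(1)[0][0] for col in a.T]
print(column_modes)
```

This still loops in Python over the columns, so it trades speed for readability on wide arrays.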

Answer 5

Answer by Def_Os

A neat solution that only uses numpy (not scipy nor the Counter class):

import numpy as np

A = np.array([[1,3,4,2,2,7], [5,2,2,1,4,1], [3,3,2,2,1,1]])

np.apply_along_axis(lambda x: np.bincount(x).argmax(), axis=0, arr=A)

array([1, 3, 2, 2, 1, 1])
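
np.bincount only accepts non-negative integers, while the question allows negative values. One possible workaround, sketched here as an assumption rather than as part of the original answer, is to shift the data by its minimum before counting. The helper name column_modes_bincount is hypothetical.

```python
import numpy as np

def column_modes_bincount(arr):
    """Per-column mode of an integer 2-D array that may contain negatives."""
    arr = np.asarray(arr)
    offset = arr.min()        # shift so the smallest value maps to 0
    shifted = arr - offset
    modes = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, shifted)
    return modes + offset     # undo the shift

B = np.array([[-1,  3, -4],
              [ 5, -2, -4],
              [-1, -2,  2]])
print(column_modes_bincount(B))
```

The counting array grows with the range of values (max minus min), so this suits data with a modest value range.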

Answer 6

Answer by Zeliha Bektas

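from collections import Counter

n = int(input())  # number of values (read but not otherwise used)
data = sorted([int(i) for i in input().split()])

# Sort (value, count) pairs by value, then stably by count, highest first
mode = sorted(sorted(Counter(data).items()), key = lambda x: x[1], reverse = True)[0][0]

print(mode)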

Counter(data) counts the frequency of each value and returns a Counter, which is a dict subclass. sorted(Counter(data).items()) sorts the (value, count) pairs by value, not by frequency. Finally, the pairs need to be sorted by frequency using another sorted with key = lambda x: x[1]. reverse = True tells Python to sort the frequencies from the largest to the smallest.
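
The same tie-breaking rule (highest count first, smallest value among ties) can be written without the double sort. This is a sketch with a hard-coded list instead of input(), which is an assumption made so that the example is self-contained.

```python
from collections import Counter

data = [4, 1, 2, 2, 4, 3]
counts = Counter(data)

# Highest count wins; among equal counts, the smallest value wins
mode = min(counts, key=lambda value: (-counts[value], value))
print(mode)
```

Both 2 and 4 occur twice here, and the key function selects 2, matching what the sorted-based version would return for the same data.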

Answer 7

Answer by Ashutosh K Singh

The simplest way in Python to get the mode of a list or array a:

import statistics
print("mode = " + str(statistics.mode(a)))

That's it
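
A short sketch of how the statistics module behaves when several values are tied, assuming Python 3.8 or later (where statistics.multimode exists and statistics.mode returns the first mode encountered instead of raising on ties). The list a is an illustrative example.

```python
import statistics

a = [1, 3, 3, 2, 2, 7]

single = statistics.mode(a)          # one mode; on ties, the first one encountered
all_modes = statistics.multimode(a)  # every value tied for most common

print(single, all_modes)
```

For a 2-D numpy array this would have to be called once per column, since the statistics functions work on flat sequences.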

Answer 8

Answer by poisonedivy

If you want to use numpy only:

import numpy as np

x = [-1, 2, 1, 3, 3]
vals, counts = np.unique(x, return_counts=True)

gives

(array([-1,  1,  2,  3]), array([1, 1, 1, 2]))

And extract it:

index = np.argmax(counts)
mode = vals[index]  # 3
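
To get one mode per column, as the question asks, the same np.unique steps could be wrapped in a helper and applied to each column. This is a sketch; the helper name unique_mode is hypothetical, and the array is the one from the question.

```python
import numpy as np

def unique_mode(x):
    """Mode of a 1-D sequence via np.unique; ties go to the smallest value."""
    vals, counts = np.unique(x, return_counts=True)
    return vals[np.argmax(counts)]

a = np.array([[1, 3, 4, 2, 2, 7],
              [5, 2, 2, 1, 4, 1],
              [3, 3, 2, 2, 1, 1]])

# a.T iterates over the columns of a
column_modes = np.array([unique_mode(col) for col in a.T])
print(column_modes)
```

Unlike np.bincount, this also handles negative integers and floats, at the cost of a Python-level loop over the columns.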
