Python: most efficient way to find the mode in a numpy array
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/16330831/
Asked by Nik
I have a 2D array containing integers (positive or negative). Each row represents the values over time for a particular spatial site, whereas each column represents values for various spatial sites at a given time.
So if the array is like:
1 3 4 2 2 7
5 2 2 1 4 1
3 3 2 2 1 1
The result should be
1 3 2 2 2 1
Note that when there are multiple values for mode, any one (selected randomly) may be set as mode.
I can iterate over the columns and find the mode one column at a time, but I was hoping numpy might have a built-in function for this, or that there is a trick to compute it efficiently without looping.
Accepted answer by fgb
Check scipy.stats.mode() (inspired by @tom10's comment):
import numpy as np
from scipy import stats
a = np.array([[1, 3, 4, 2, 2, 7],
              [5, 2, 2, 1, 4, 1],
              [3, 3, 2, 2, 1, 1]])
m = stats.mode(a)
print(m)
Output:
ModeResult(mode=array([[1, 3, 2, 2, 1, 1]]), count=array([[1, 2, 2, 2, 1, 2]]))
As you can see, it returns both the mode as well as the counts. You can select the modes directly via m[0]:
print(m[0])
Output:
[[1 3 2 2 1 1]]
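As a small sketch of the same call with the axis spelled out: axis=0 takes the mode down each column, which is what the question asks for. Depending on the SciPy version, the returned arrays may keep a leading length-1 axis (as in the output above) or drop it, so this example flattens them with np.ravel. The variable names modes and counts are only illustrative.

```python
import numpy as np
from scipy import stats

a = np.array([[1, 3, 4, 2, 2, 7],
              [5, 2, 2, 1, 4, 1],
              [3, 3, 2, 2, 1, 1]])

# axis=0 computes one mode per column.
result = stats.mode(a, axis=0)

# np.ravel gives a flat array whether SciPy returned shape (1, 6) or (6,).
modes = np.ravel(result.mode)
counts = np.ravel(result.count)

print(modes)
print(counts)
```

When several values tie within a column, the smallest one is reported here, which the question allows.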
Answer by Devin Cairns
Update
The scipy.stats.mode function has been significantly optimized since this post, and is now the recommended method.
Old answer
This is a tricky problem, since there is not much out there to calculate the mode along an axis. The solution is straightforward for 1-D arrays, where numpy.bincount is handy, along with numpy.unique with the return_counts arg as True. The most common n-dimensional function I see is scipy.stats.mode, although it is prohibitively slow, especially for large arrays with many unique values. As a solution, I've developed this function, and use it heavily:
import numpy

def mode(ndarray, axis=0):
    # Check inputs
    ndarray = numpy.asarray(ndarray)
    ndim = ndarray.ndim
    if ndarray.size == 1:
        return (ndarray[0], 1)
    elif ndarray.size == 0:
        raise Exception('Cannot compute mode on empty array')
    try:
        # Normalizes negative axes and rejects out-of-range ones
        axis = range(ndarray.ndim)[axis]
    except (IndexError, TypeError):
        raise Exception('Axis "{}" incompatible with the {}-dimension array'.format(axis, ndim))

    # If array is 1-D and numpy version is >= 1.9 numpy.unique will suffice
    # (compare (major, minor) as a tuple so that e.g. 2.0 is handled correctly)
    version = tuple(int(part) for part in numpy.__version__.split('.')[:2])
    if ndim == 1 and version >= (1, 9):
        modals, counts = numpy.unique(ndarray, return_counts=True)
        index = numpy.argmax(counts)
        return modals[index], counts[index]

    # Sort array
    sort = numpy.sort(ndarray, axis=axis)
    # Create array to transpose along the axis and get padding shape
    transpose = numpy.roll(numpy.arange(ndim)[::-1], axis)
    shape = list(sort.shape)
    shape[axis] = 1

    # Create a boolean array along strides of unique values
    strides = numpy.concatenate([numpy.zeros(shape=shape, dtype='bool'),
                                 numpy.diff(sort, axis=axis) == 0,
                                 numpy.zeros(shape=shape, dtype='bool')],
                                axis=axis).transpose(transpose).ravel()
    # Count the stride lengths
    counts = numpy.cumsum(strides)
    counts[~strides] = numpy.concatenate([[0], numpy.diff(counts[~strides])])
    counts[strides] = 0

    # Get shape of padded counts and slice to return to the original shape
    shape = numpy.array(sort.shape)
    shape[axis] += 1
    shape = shape[transpose]
    slices = [slice(None)] * ndim
    slices[axis] = slice(1, None)

    # Reshape and compute final counts
    # (index with a tuple: recent numpy versions reject a list of slices)
    counts = counts.reshape(shape).transpose(transpose)[tuple(slices)] + 1

    # Find maximum counts and return modals/counts
    slices = [slice(None, i) for i in sort.shape]
    del slices[axis]
    index = list(numpy.ogrid[tuple(slices)])
    index.insert(axis, numpy.argmax(counts, axis=axis))
    index = tuple(index)
    return sort[index], counts[index]
Result:
In [2]: a = numpy.array([[1, 3, 4, 2, 2, 7],
   ...:                  [5, 2, 2, 1, 4, 1],
   ...:                  [3, 3, 2, 2, 1, 1]])
In [3]: mode(a)
Out[3]: (array([1, 3, 2, 2, 1, 1]), array([1, 2, 2, 2, 1, 2]))
Some benchmarks:
In [4]: import scipy.stats
In [5]: a = numpy.random.randint(1,10,(1000,1000))
In [6]: %timeit scipy.stats.mode(a)
10 loops, best of 3: 41.6 ms per loop
In [7]: %timeit mode(a)
10 loops, best of 3: 46.7 ms per loop
In [8]: a = numpy.random.randint(1,500,(1000,1000))
In [9]: %timeit scipy.stats.mode(a)
1 loops, best of 3: 1.01 s per loop
In [10]: %timeit mode(a)
10 loops, best of 3: 80 ms per loop
In [11]: a = numpy.random.random((200,200))
In [12]: %timeit scipy.stats.mode(a)
1 loops, best of 3: 3.26 s per loop
In [13]: %timeit mode(a)
1000 loops, best of 3: 1.75 ms per loop
EDIT: Provided more of a background and modified the approach to be more memory-efficient
Answer by Lean Bravo
Expanding on this method: when finding the mode of the data, you may also need the index of the mode in the original array, for example to see how far the value is from the center of the distribution.
(_, idx, counts) = np.unique(a, return_index=True, return_counts=True)
index = idx[np.argmax(counts)]
mode = a[index]
Remember to discard the mode when more than one value attains the maximum count. Note that np.argmax(counts) returns only the first such index, so test for ties with something like np.sum(counts == counts.max()) > 1. To validate that the mode is actually representative of the center of your data, you can also check whether it falls inside your standard deviation interval.
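A minimal sketch of the tie check described above, assuming a 1-D array. The names is_max, tied_values and has_unique_mode are illustrative and not part of the original answer.

```python
import numpy as np

a = np.array([4, 1, 2, 2, 4, 3])

# values are sorted; first_idx holds the first position of each value in a.
values, first_idx, counts = np.unique(a, return_index=True, return_counts=True)

# Every value whose count equals the maximum is a candidate mode.
is_max = counts == counts.max()
tied_values = values[is_max]
tied_first_positions = first_idx[is_max]

# The mode is only unambiguous when exactly one value reaches the maximum.
has_unique_mode = tied_values.size == 1

print(tied_values, tied_first_positions, has_unique_mode)
```

Here both 2 and 4 occur twice, so has_unique_mode is False and the caller can decide whether to discard the result.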
Answer by Ali_Ayub
I think a very simple way would be to use the Counter class from the collections module. You can then use the most_common() method of the Counter instance.
For 1-d arrays:
import numpy as np
from collections import Counter
nparr = np.arange(10)
nparr[2] = 6
nparr[3] = 6 #6 is now the mode
mode = Counter(nparr).most_common(1)
# mode will be [(6,3)] to give the count of the most occurring value, so ->
print(mode[0][0])
For multidimensional arrays (little difference):
import numpy as np
from collections import Counter
nparr = np.arange(100) # 100 elements, so the (10,2,5) reshape below is valid
nparr[2] = 6
nparr[3] = 6
nparr = nparr.reshape((10,2,5)) #same thing but we add this to reshape into ndarray
mode = Counter(nparr.flatten()).most_common(1) # just use .flatten() method
# mode will be [(6,3)] to give the count of the most occurring value, so ->
print(mode[0][0])
This may or may not be an efficient implementation, but it is convenient.
Answer by Def_Os
A neat solution that only uses numpy (not scipy nor the Counter class):
A = np.array([[1,3,4,2,2,7], [5,2,2,1,4,1], [3,3,2,2,1,1]])
np.apply_along_axis(lambda x: np.bincount(x).argmax(), axis=0, arr=A)
array([1, 3, 2, 2, 1, 1])
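One caveat: np.bincount only accepts non-negative integers, while the question allows negative values. A possible workaround, sketched here under the assumption that the data are integers with a modest value range, is to shift everything by the minimum before counting and shift the result back. The function name column_modes_bincount is hypothetical.

```python
import numpy as np

def column_modes_bincount(arr):
    """Column-wise mode of an integer array that may contain negatives."""
    arr = np.asarray(arr)
    # Shift so the smallest value becomes 0, which np.bincount requires.
    offset = arr.min()
    shifted = arr - offset
    modes = np.apply_along_axis(lambda col: np.bincount(col).argmax(),
                                axis=0, arr=shifted)
    # Undo the shift to recover the original values.
    return modes + offset

A = np.array([[-1,  3, 4],
              [ 5, -2, 4],
              [-1, -2, 0]])
result = column_modes_bincount(A)
print(result)
```

np.bincount allocates one bin per integer between the minimum and maximum, so this approach is a poor fit for data with a very wide value range.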
Answer by Zeliha Bektas
from collections import Counter
n = int(input())  # number of values (read but not otherwise used)
data = sorted([int(i) for i in input().split()])
mode = sorted(sorted(Counter(data).items()), key = lambda x: x[1], reverse = True)[0][0]
print(mode)
Counter(data) counts the frequency of each value and returns a Counter, which is a dict subclass. sorted(Counter(data).items()) sorts by key, not by frequency. Finally, a second sorted with key = lambda x: x[1] sorts by frequency, and reverse = True orders the frequencies from largest to smallest. Because Python's sort is stable, values with equal frequency stay in key order, so the smallest of the tied values comes first.
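The same tie-breaking rule (highest frequency first, then smallest value) can also be written without sorting twice. This is only an alternative sketch, not the answer's code; the sample data and the names mode_value and mode_count are made up for illustration.

```python
from collections import Counter

data = [3, 1, 3, 2, 1]
counts = Counter(data)

# Rank by descending count, then by ascending value among equal counts.
mode_value, mode_count = min(counts.items(),
                             key=lambda item: (-item[1], item[0]))

print(mode_value, mode_count)
```

Both 1 and 3 appear twice in this sample, and the smaller value 1 is selected.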
Answer by Ashutosh K Singh
The simplest way in Python to get the mode of a list or 1-D array a:
import statistics
print("mode = " + str(statistics.mode(a)))
That's it
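statistics.mode works on a flat sequence, so for the 2-D array in the question one option is to call it once per column. The sketch below assumes Python 3.8 or later, where statistics.mode returns the first mode encountered when several values tie; older versions raise StatisticsError in that case. The name column_modes is illustrative.

```python
import statistics
import numpy as np

a = np.array([[1, 3, 4, 2, 2, 7],
              [5, 2, 2, 1, 4, 1],
              [3, 3, 2, 2, 1, 1]])

# a.T.tolist() yields one plain Python list per column.
column_modes = [statistics.mode(col) for col in a.T.tolist()]

print(column_modes)
```

This is a plain Python loop over columns, so it is convenient rather than fast.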
Answer by poisonedivy
If you want to use numpy only:
x = [-1, 2, 1, 3, 3]
vals,counts = np.unique(x, return_counts=True)
gives
(array([-1, 1, 2, 3]), array([1, 1, 1, 2]))
And extract it:
index = np.argmax(counts)
mode = vals[index]  # 3 for the x above
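Applied to the 2-D array from the question, the same np.unique idea can be wrapped in a small helper that handles one column at a time. This is a sketch with a hypothetical function name, column_modes_unique. It still loops over columns in Python, but unlike np.bincount it accepts negative values.

```python
import numpy as np

def column_modes_unique(arr):
    """Mode of each column of a 2-D array, using only np.unique."""
    arr = np.asarray(arr)
    modes = np.empty(arr.shape[1], dtype=arr.dtype)
    for j in range(arr.shape[1]):
        vals, counts = np.unique(arr[:, j], return_counts=True)
        # argmax picks the first maximum, i.e. the smallest tied value.
        modes[j] = vals[np.argmax(counts)]
    return modes

a = np.array([[1, 3, 4, 2, 2, 7],
              [5, 2, 2, 1, 4, 1],
              [3, 3, 2, 2, 1, 1]])
result = column_modes_unique(a)
print(result)
```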

