Python 使用 strides 实现高效的移动平均滤波器

Question

提问by Benjamin

I recently learned about stridesin the answer to this post, and was wondering how I could use them to compute a moving average filter more efficiently than what I proposed in this post(using convolution filters).

我最近在这篇文章的答案中了解到了strides，并想知道如何使用它们来计算比我在这篇文章中提出的（使用卷积滤波器）更有效的移动平均滤波器。

This is what I have so far. It takes a view of the original array then rolls it by the necessary amount and sums the kernel values to compute the average. I am aware that the edges are not handled correctly, but I can take care of that afterward... Is there a better and faster way? The objective is to filter large floating point arrays up to 5000x5000 x 16 layers in size, a task that scipy.ndimage.filters.convolveis fairly slow at.

这是我到目前为止。它查看原始数组，然后按必要的数量滚动它，并对内核值求和以计算平均值。我知道边缘处理不正确，但我可以在之后处理这个问题......有没有更好更快的方法？目标是过滤高达 5000x5000 x 16 层的大型浮点数组，这scipy.ndimage.filters.convolve是一项相当慢的任务。

Note that I am looking for 8-neighbour connectivity, that is a 3x3 filter takes the average of 9 pixels (8 around the focal pixel) and assigns that value to the pixel in the new image.

请注意，我正在寻找 8 邻域连接，即 3x3 过滤器取 9 个像素（焦点像素周围 8 个）的平均值，并将该值分配给新图像中的像素。

import numpy, scipy

filtsize = 3
a = numpy.arange(100).reshape((10,10))
b = numpy.lib.stride_tricks.as_strided(a, shape=(a.size,filtsize), strides=(a.itemsize, a.itemsize))
for i in range(0, filtsize-1):
    if i > 0:
        b += numpy.roll(b, -(pow(filtsize,2)+1)*i, 0)
filtered = (numpy.sum(b, 1) / pow(filtsize,2)).reshape((a.shape[0],a.shape[1]))
scipy.misc.imsave("average.jpg", filtered)

EDIT Clarification on how I see this working:

编辑澄清我如何看待这个工作：

Current code:

当前代码：

use stride_tricks to generate an array like [[0,1,2],[1,2,3],[2,3,4]...] which corresponds to the top row of the filter kernel.
Roll along the vertical axis to get the middle row of the kernel [[10,11,12],[11,12,13],[13,14,15]...] and add it to the array I got in 1)
Repeat to get the bottom row of the kernel [[20,21,22],[21,22,23],[22,23,24]...]. At this point, I take the sum of each row and divide it by the number of elements in the filter, giving me the average for each pixel, (shifted by 1 row and 1 col, and with some oddities around edges, but I can take care of that later).

使用 stride_tricks 生成一个像 [[0,1,2],[1,2,3],[2,3,4]...] 这样的数组，它对应于过滤器内核的顶行。
沿垂直轴滚动得到内核的中间行 [[10,11,12],[11,12,13],[13,14,15]...] 并将其添加到我输入的数组中1)
重复得到内核的底行 [[20,21,22],[21,22,23],[22,23,24]...]。此时，我将每行的总和除以过滤器中的元素数，得出每个像素的平均值（移动 1 行和 1 列，边缘周围有一些奇怪的东西，但我可以以后注意这个）。

What I was hoping for is a better use of stride_tricks to get the 9 values or the sum of the kernel elements directly, for the entire array, or that someone can convince me of another more efficient method...

我希望的是更好地使用 stride_tricks 来直接获取整个数组的 9 个值或内核元素的总和，或者有人可以说服我采用另一种更有效的方法......

Answer 1

采纳答案by Joe Kington

For what it's worth, here's how you'd do it using "fancy" striding tricks. I was going to post this yesterday, but got distracted by actual work! :)

对于它的价值，这里是你如何使用“花哨的”跨步技巧来做到这一点。我本来打算昨天发布这个，但被实际工作分散了注意力！:)

@Paul & @eat both have nice implementations using various other ways of doing this. Just to continue things from the earlier question, I figured I'd post the N-dimensional equivalent.

@Paul 和 @eat 都使用其他各种方式实现了很好的实现。只是为了继续之前的问题，我想我会发布 N 维等价物。

You're not going to be able to significantly beat scipy.ndimagefunctions for >1D arrays, however. (scipy.ndimage.uniform_filtershould beat scipy.ndimage.convolve, though)

scipy.ndimage但是，对于 > 1D 数组，您将无法显着击败函数。（虽然scipy.ndimage.uniform_filter应该击败scipy.ndimage.convolve）

Moreover, if you're trying to get a multidimensional moving window, you risk having memory usage blow up whenever you inadvertently make a copy of your array. While the initial "rolling" array is just a view into the memory of your original array, any intermediate steps that copy the array will make a copy that is orders of magnitudelarger than your original array (i.e. Let's say that you're working with a 100x100 original array... The view into it (for a filter size of (3,3)) will be 98x98x3x3 but use the same memory as the original. However, any copies will use the amount of memory that a full98x98x3x3 array would!!)

此外，如果您试图获得一个多维移动窗口，那么每当您不经意地复制数组时，就会面临内存使用量激增的风险。虽然最初的“滚动”数组只是原始数组内存的视图，但复制数组的任何中间步骤都会生成一个比原始数组大几个数量级的副本（即假设您正在使用一个 100x100 的原始数组......它的视图（对于 (3,3) 的过滤器大小）将是 98x98x3x3 但使用与原始数组相同的内存。但是，任何副本都将使用完整的98x98x3x3 数组的内存量将！！）

Basically, using crazy striding tricks is great for when you want to vectorize moving window operations on a single axisof an ndarray. It makes it really easy to calculate things like a moving standard deviation, etc with very little overhead. When you want to start doing this along multiple axes, it's possible, but you're usually better off with more specialized functions. (Such as scipy.ndimage, etc)

基本上，当您想要在ndarray的单个轴上矢量化移动窗口操作时，使用疯狂的跨步技巧非常有用。它使得计算诸如移动标准偏差之类的东西变得非常容易，而且开销很小。当您想开始沿多个轴执行此操作时，这是可能的，但通常最好使用更专业的功能。（例如scipy.ndimage，等）

At any rate, here's how you do it:

无论如何，这是你的方法：

import numpy as np

def rolling_window_lastaxis(a, window):
    """Directly taken from Erik Rigtorp's post to numpy-discussion.
    <http://www.mail-archive.com/[email protected]/msg29450.html>"""
    if window < 1:
       raise ValueError, "`window` must be at least 1."
    if window > a.shape[-1]:
       raise ValueError, "`window` is too long."
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

def rolling_window(a, window):
    if not hasattr(window, '__iter__'):
        return rolling_window_lastaxis(a, window)
    for i, win in enumerate(window):
        if win > 1:
            a = a.swapaxes(i, -1)
            a = rolling_window_lastaxis(a, win)
            a = a.swapaxes(-2, i)
    return a

filtsize = (3, 3)
a = np.zeros((10,10), dtype=np.float)
a[5:7,5] = 1

b = rolling_window(a, filtsize)
blurred = b.mean(axis=-1).mean(axis=-1)

So what we get when we do b = rolling_window(a, filtsize)is an 8x8x3x3 array, that's actually a view into the same memory as the original 10x10 array. We could have just as easily used different filter size along different axes or operated only along selected axes of an N-dimensional array (i.e. filtsize = (0,3,0,3)on a 4-dimensional array would give us a 6 dimensional view).

所以我们在做的时候得到的b = rolling_window(a, filtsize)是一个 8x8x3x3 的数组，这实际上是对与原始 10x10 数组相同的内存的一个视图。我们可以很容易地沿不同的轴使用不同的过滤器大小，或者仅沿 N 维数组的选定轴进行操作（即filtsize = (0,3,0,3)，在 4 维数组上会给我们一个 6 维视图）。

We can then apply an arbitrary function to the last axis repeatedly to effectively calculate things in a moving window.

然后，我们可以将任意函数重复应用于最后一个轴，以有效计算移动窗口中的事物。

However, because we're storing temporary arrays that are much bigger than our original array on each step of mean(or stdor whatever), this is not at all memory efficient! It's also not going to be terribly fast, either.

然而，因为我们在每一步mean（std或其他）中存储的临时数组比我们的原始数组大得多，所以这根本不是内存效率！它也不会非常快。

The equivalent for ndimageis just:

等价于ndimage：

blurred = scipy.ndimage.uniform_filter(a, filtsize, output=a)

This will handle a variety of boundary conditions, do the "blurring" in-place without requiring a temporary copy of the array, and be veryfast. Striding tricks are a good way to apply a function to a moving window along oneaxis, but they're not a good way to do it along multiple axes, usually....

这将处理各种边界条件，在不需要数组的临时副本的情况下就地进行“模糊”，并且速度非常快。跨步技巧是将函数应用于沿一个轴的移动窗口的好方法，但它们不是沿多个轴执行此操作的好方法，通常......

Just my $0.02, at any rate...

无论如何，只要我的 0.02 美元...

Answer 2

回答by Paul

One thing I am confident needs to be fixed is your view array b.

我确信需要修复的一件事是您的视图数组b。

It has a few items from unallocated memory, so you'll get crashes.

它有一些来自未分配内存的项目，所以你会崩溃。

Given your new description of your algorithm, the first thing that needs fixing is the fact that you are striding outside the allocation of a:

鉴于您对算法的新描述，需要修复的第一件事是您正在超出的分配范围a：

bshape = (a.size-filtsize+1, filtsize)
bstrides = (a.itemsize, a.itemsize)
b = numpy.lib.stride_tricks.as_strided(a, shape=bshape, strides=bstrides)

Update

更新

Because I'm still not quite grasping the method and there seems to be simpler ways to solve the problem, I'm just going to put this here:

因为我还没有完全掌握方法，而且似乎有更简单的方法可以解决问题，所以我只想把它放在这里：

A = numpy.arange(100).reshape((10,10))

shifts = [(-1,-1),(-1,0),(-1,1),(0,-1),(0,1),(1,-1),(1,0),(1,1)]
B = A[1:-1, 1:-1].copy()
for dx,dy in shifts:
    xstop = -1+dx or None
    ystop = -1+dy or None
    B += A[1+dx:xstop, 1+dy:ystop]
B /= 9

...which just seems like the straightforward approach. The only extraneous operation is that it has allocate and populate Bonly once. All the addition, division and indexing has to be done regardless. If you are doing 16 bands, you still only need to allocate Bonce if your intent is to save an image. Even if this is no help, it might clarify why I don't understand the problem, or at least serve as a benchmark to time the speedups of other methods. This runs in 2.6 sec on my laptop on a 5k x 5k array of float64's, 0.5 of which is the creation of B

...这似乎是一种简单的方法。唯一无关的操作是它只分配和填充B一次。不管怎样，所有的加法、除法和索引都必须完成。如果你正在做 16 个波段，B如果你的意图是保存图像，你仍然只需要分配一次。即使这没有帮助，它也可能澄清为什么我不理解这个问题，或者至少可以作为衡量其他方法加速时间的基准。这在我的笔记本电脑上以 5k x 5k 的 float64 数组运行 2.6 秒，其中 0.5 是创建B

Answer 3

回答by eat

Lets see:

让我们来看看：

It's not so clear form your question, but I'm assuming now that you'll like to improve significantly this kind of averaging.

你的问题不是很清楚，但我现在假设你会想显着提高这种平均水平。

import numpy as np
from numpy.lib import stride_tricks as st

def mf(A, k_shape= (3, 3)):
    m= A.shape[0]- 2
    n= A.shape[1]- 2
    strides= A.strides+ A.strides
    new_shape= (m, n, k_shape[0], k_shape[1])
    A= st.as_strided(A, shape= new_shape, strides= strides)
    return np.sum(np.sum(A, -1), -1)/ np.prod(k_shape)

if __name__ == '__main__':
    A= np.arange(100).reshape((10, 10))
    print mf(A)

Now, what kind of performance improvements you would actually expect?

现在，您实际上期望什么样的性能改进？

Update:
First of all, a warning: the code in it's current state does not adapt properly to the 'kernel' shape. However that's not my primary concern right now (anyway the idea is there allready how to adapt properly).

更新：
首先，警告：当前状态的代码无法正确适应“内核”形状。然而，这不是我现在主要关心的问题（无论如何，如何正确适应的想法已经存在）。

I have just chosen the new shape of a 4D A intuitively, for me it really make sense to think about a 2D 'kernel' center to be centered to each grid position of original 2D A.

我刚刚直观地选择了 4D A 的新形状，对我来说，考虑以原始 2D A 的每个网格位置为中心的 2D“内核”中心真的很有意义。

But that 4D shaping may not actually be the 'best' one. I think the real problem here is the performance of summing. One should to be able to find 'best order' (of the 4D A) inorder to fully utilize your machines cache architecture. However that order may not be the same for 'small' arrays which kind of 'co-operates' with your machines cache and those larger ones, which don't (at least not so straightforward manner).

但这种 4D 造型实际上可能不是“最好的”。我认为这里真正的问题是求和的性能。为了充分利用您的机器缓存架构，您应该能够找到（4D A 的）“最佳顺序”。但是，对于与您的机器缓存“合作”的“小型”阵列和那些与您的机器缓存“合作”的大型阵列，该顺序可能不同（至少不是那么简单的方式）。

Update 2:
Here is a slightly modified version of mf. Clearly it's better to reshape to a 3D array first and then instead of summing just do dot product (this has the advantage all so, that kernel can be arbitrary). However it's still some 3x slower (on my machine) than Pauls updated function.

更新 2：
这是mf. 显然，最好先重塑为 3D 数组，然后再进行求和，而不是仅进行点积（这具有所有优点，该内核可以是任意的）。然而，它仍然比 Pauls 更新的函数慢 3 倍（在我的机器上）。

def mf(A):
    k_shape= (3, 3)
    k= np.prod(k_shape)
    m= A.shape[0]- 2
    n= A.shape[1]- 2
    strides= A.strides* 2
    new_shape= (m, n)+ k_shape
    A= st.as_strided(A, shape= new_shape, strides= strides)
    w= np.ones(k)/ k
    return np.dot(A.reshape((m, n, -1)), w)

Answer 4

回答by Jonas

I'm not familiar enough with Python to write out code for that, but the two best ways to speed up convolutions is to either separate the filter or to use the Fourier transform.

我对 Python 不够熟悉，无法为此编写代码，但加速卷积的两种最佳方法是分离滤波器或使用傅立叶变换。

Separated filter: Convolution is O(M*N), where M and N are number of pixels in the image and the filter, respectively. Since average filtering with a 3-by-3 kernel is equivalent to filtering first with a 3-by-1 kernel and then a 1-by-3 kernel, you can get (3+3)/(3*3)= ~30% speed improvement by consecutive convolution with two 1-d kernels (this obviously gets better as the kernel gets larger). You may still be able to use stride tricks here, of course.

分离过滤器：卷积是 O(M*N)，其中 M 和 N 分别是图像和过滤器中的像素数。由于使用 3×3 内核进行平均过滤相当于先使用 3×1 内核过滤，然后使用 1×3 内核进行过滤，因此您可以(3+3)/(3*3)通过使用两个 1- 的连续卷积获得= ~30% 的速度提升d 内核（随着内核变大，这显然变得更好）。当然，您仍然可以在这里使用跨步技巧。

Fourier Transform: conv(A,B)is equivalent to ifft(fft(A)*fft(B)), i.e. a convolution in direct space becomes a multiplication in Fourier space, where Ais your image and Bis your filter. Since the (element-wise) multiplication of the Fourier transforms requires that A and B are the same size, B is an array of size(A)with your kernel at the very center of the image and zeros everywhere else. To place a 3-by-3 kernel at the center of an array, you may have to pad Ato odd size. Depending on your implementation of the Fourier transform, this can be a lot faster than the convolution (and if you apply the same filter multiple times, you can pre-compute fft(B), saving another 30% of computation time).

傅立叶变换：conv(A,B)相当于ifft(fft(A)*fft(B))，即直接空间中的卷积变成傅立叶空间中的乘法，其中A是您的图像，B是您的过滤器。由于傅立叶变换的（逐元素）乘法要求 A 和 B 的大小相同，因此 B 是一个数组，size(A)您的内核位于图像的正中心，其他地方都是零。要将 3×3 内核放置在数组的中心，您可能需要填充A到奇数大小。根据傅里叶变换的实现，这可能比卷积快很多（如果多次应用相同的过滤器，则可以预先计算fft(B)，另外节省 30% 的计算时间）。

Python 使用 strides 实现高效的移动平均滤波器

提问by Benjamin

采纳答案by Joe Kington

回答by Paul

回答by eat

回答by Jonas

相关推荐

最近更新

标签

Python 使用 strides 实现高效的移动平均滤波器

提问by Benjamin

采纳答案by Joe Kington

回答by Paul

回答by eat

回答by Jonas

相关推荐

Python/matplotlib mplot3d-如何设置 z 轴的最大值？

Python中的Haversine公式（两个GPS点之间的轴承和距离）

Python 如何为元组添加值？

Python 如何高效解析定宽文件？

相关推荐

最近更新

标签