C++ 为什么我的程序在循环 8192 个元素时很慢？

Question

提问by Mysticial

Here is the extract from the program in question. The matrix img[][]has the size SIZE×SIZE, and is initialized at:

这是有关程序的摘录。矩阵img[][]的大小为 SIZE×SIZE，并初始化为：

img[j][i] = 2 * j + i

Then, you make a matrix res[][], and each field in here is made to be the average of the 9 fields around it in the img matrix. The border is left at 0 for simplicity.

然后，您创建一个矩阵res[][]，并将此处的每个字段设为 img 矩阵中围绕它的 9 个字段的平均值。为简单起见，边界保留为 0。

for(i=1;i<SIZE-1;i++) 
    for(j=1;j<SIZE-1;j++) {
        res[j][i]=0;
        for(k=-1;k<2;k++) 
            for(l=-1;l<2;l++) 
                res[j][i] += img[j+l][i+k];
        res[j][i] /= 9;
}

That's all there's to the program. For completeness' sake, here is what comes before. No code comes after. As you can see, it's just initialization.

这就是程序的全部内容。为了完整起见，这里是之前的内容。后面没有代码。如您所见，这只是初始化。

#define SIZE 8192
float img[SIZE][SIZE]; // input image
float res[SIZE][SIZE]; //result of mean filter
int i,j,k,l;
for(i=0;i<SIZE;i++) 
    for(j=0;j<SIZE;j++) 
        img[j][i] = (2*j+i)%8196;

Basically, this program is slow when SIZE is a multiple of 2048, e.g. the execution times:

基本上，当 SIZE 是 2048 的倍数时，这个程序很慢，例如执行时间：

SIZE = 8191: 3.44 secs
SIZE = 8192: 7.20 secs
SIZE = 8193: 3.18 secs

The compiler is GCC. From what I know, this is because of memory management, but I don't really know too much about that subject, which is why I'm asking here.

编译器是 GCC。据我所知，这是因为内存管理，但我对这个主题并不太了解，这就是我在这里问的原因。

Also how to fix this would be nice, but if someone could explain these execution times I'd already be happy enough.

另外如何解决这个问题会很好，但是如果有人可以解释这些执行时间，我已经很高兴了。

I already know of malloc/free, but the problem is not amount of memory used, it's merely execution time, so I don't know how that would help.

我已经知道 malloc/free，但问题不在于使用的内存量，而只是执行时间，所以我不知道这会有什么帮助。

Answer 1

回答by Mysticial

The difference is caused by the same super-alignment issue from the following related questions:

差异是由以下相关问题的相同超级对齐问题引起的：

Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?
Matrix multiplication: Small difference in matrix size, large difference in timings

But that's only because there's one other problem with the code.

但这只是因为代码还有另一个问题。

Starting from the original loop:

从原始循环开始：

for(i=1;i<SIZE-1;i++) 
    for(j=1;j<SIZE-1;j++) {
        res[j][i]=0;
        for(k=-1;k<2;k++) 
            for(l=-1;l<2;l++) 
                res[j][i] += img[j+l][i+k];
        res[j][i] /= 9;
}

First notice that the two inner loops are trivial. They can be unrolled as follows:

首先注意两个内部循环是微不足道的。它们可以展开如下：

for(i=1;i<SIZE-1;i++) {
    for(j=1;j<SIZE-1;j++) {
        res[j][i]=0;
        res[j][i] += img[j-1][i-1];
        res[j][i] += img[j  ][i-1];
        res[j][i] += img[j+1][i-1];
        res[j][i] += img[j-1][i  ];
        res[j][i] += img[j  ][i  ];
        res[j][i] += img[j+1][i  ];
        res[j][i] += img[j-1][i+1];
        res[j][i] += img[j  ][i+1];
        res[j][i] += img[j+1][i+1];
        res[j][i] /= 9;
    }
}

So that leaves the two outer-loops that we're interested in.

这样就留下了我们感兴趣的两个外部循环。

Now we can see the problem is the same in this question: Why does the order of the loops affect performance when iterating over a 2D array?

现在我们可以看到这个问题的问题是相同的：为什么在迭代二维数组时循环的顺序会影响性能？

You are iterating the matrix column-wise instead of row-wise.

您正在按列而不是按行迭代矩阵。

To solve this problem, you should interchange the two loops.

要解决此问题，您应该互换两个循环。

for(j=1;j<SIZE-1;j++) {
    for(i=1;i<SIZE-1;i++) {
        res[j][i]=0;
        res[j][i] += img[j-1][i-1];
        res[j][i] += img[j  ][i-1];
        res[j][i] += img[j+1][i-1];
        res[j][i] += img[j-1][i  ];
        res[j][i] += img[j  ][i  ];
        res[j][i] += img[j+1][i  ];
        res[j][i] += img[j-1][i+1];
        res[j][i] += img[j  ][i+1];
        res[j][i] += img[j+1][i+1];
        res[j][i] /= 9;
    }
}

This eliminates all the non-sequential access completely so you no longer get random slow-downs on large powers-of-two.

这完全消除了所有非顺序访问，因此您不再会在较大的 2 次幂上随机减速。

Core i7 920 @ 3.5 GHz

酷睿 i7 920 @ 3.5 GHz

Original code:

原始代码：

8191: 1.499 seconds
8192: 2.122 seconds
8193: 1.582 seconds

Interchanged Outer-Loops:

交换外环：

8191: 0.376 seconds
8192: 0.357 seconds
8193: 0.351 seconds

Answer 2

回答by bokan

The following tests have been done with Visual C++ compiler as it is used by the default Qt Creator install (I guess with no optimization flag). When using GCC, there is no big difference between Mystical's version and my "optimized" code. So the conclusion is that compiler optimizations take care off micro optimization better than humans (me at last). I leave the rest of my answer for reference.

以下测试是使用 Visual C++ 编译器完成的，因为它被默认的 Qt Creator 安装使用（我猜没有优化标志）。使用 GCC 时，Mystical 的版本和我的“优化”代码之间没有太大区别。所以结论是编译器优化比人类（最后是我）更好地关注微优化。我留下我的其余答案以供参考。

It's not efficient to process images this way. It's better to use single dimension arrays. Processing all pixels is the done in one loop. Random access to points could be done using:

以这种方式处理图像效率不高。最好使用一维数组。处理所有像素是在一个循环中完成的。可以使用以下方法随机访问点：

pointer + (x + y*width)*(sizeOfOnePixel)

In this particular case, it's better to compute and cache the sum of three pixels groups horizontally because they are used three times each.

在这种特殊情况下，最好水平计算和缓存三个像素组的总和，因为它们每个都使用了 3 次。

I've done some tests and I think it's worth sharing. Each result is an average of five tests.

我做了一些测试，我认为值得分享。每个结果是五次测试的平均值。

Original code by user1615209:

用户1615209的原始代码：

8193: 4392 ms
8192: 9570 ms

Mystical's version:

神秘的版本：

8193: 2393 ms
8192: 2190 ms

Two pass using a 1D array: first pass for horizontal sums, second for vertical sum and average. Two pass addressing with three pointers and only increments like this:

使用一维数组的两次传递：第一次传递水平和，第二次传递垂直和和平均值。使用三个指针进行两次寻址，并且仅像这样递增：

imgPointer1 = &avg1[0][0];
imgPointer2 = &avg1[0][SIZE];
imgPointer3 = &avg1[0][SIZE+SIZE];

for(i=SIZE;i<totalSize-SIZE;i++){
    resPointer[i]=(*(imgPointer1++)+*(imgPointer2++)+*(imgPointer3++))/9;
}

8193: 938 ms
8192: 974 ms

Two pass using a 1D array and addressing like this:

使用一维数组进行两次传递并像这样寻址：

for(i=SIZE;i<totalSize-SIZE;i++){
    resPointer[i]=(hsumPointer[i-SIZE]+hsumPointer[i]+hsumPointer[i+SIZE])/9;
}

8193: 932 ms
8192: 925 ms

One pass caching horizontal sums just one row ahead so they stay in cache:

一次缓存水平总和仅向前一行，因此它们保留在缓存中：

// Horizontal sums for the first two lines
for(i=1;i<SIZE*2;i++){
    hsumPointer[i]=imgPointer[i-1]+imgPointer[i]+imgPointer[i+1];
}
// Rest of the computation
for(;i<totalSize;i++){
    // Compute horizontal sum for next line
    hsumPointer[i]=imgPointer[i-1]+imgPointer[i]+imgPointer[i+1];
    // Final result
    resPointer[i-SIZE]=(hsumPointer[i-SIZE-SIZE]+hsumPointer[i-SIZE]+hsumPointer[i])/9;
}

8193: 599 ms
8192: 652 ms

Conclusion:

结论：

No benefits of using several pointers and just increments (I thought it would have been faster)
Caching horizontal sums is better than computing them several time.
Two pass is not three times faster, two times only.
It's possible to achieve 3.6 times faster using both a single pass and caching an intermediary result

使用多个指针和只是增量没有好处（我认为它会更快）
缓存水平总和比多次计算它们要好。
二通不是快三倍，只有两倍。
使用单次传递和缓存中间结果可以使速度提高 3.6 倍

I'm sure it's possible to do much better.

我相信有可能做得更好。

NOTEPlease, note that I wrote this answer to target general performance issues rather than the cache problem explained in Mystical's excellent answer. At the beginning it was just pseudo code. I was asked to do tests in the comments... Here is a completely refactored version with tests.

注意请注意，我写这个答案是针对一般性能问题，而不是 Mystical 的优秀答案中解释的缓存问题。一开始它只是伪代码。我被要求在评论中做测试......这是一个完全重构的带有测试的版本。

C++ 为什么我的程序在循环 8192 个元素时很慢？

提问by Mysticial

回答by Mysticial

回答by bokan

相关推荐

最近更新

标签

C++ 为什么我的程序在循环 8192 个元素时很慢？

提问by Mysticial

回答by Mysticial

回答by bokan

相关推荐

C++ 为什么我应该按照声明的顺序初始化成员变量？

在 C++ 中，我可以从结构派生一个类吗

C++ 我如何遍历 QListWidget 项目并处理每个项目？

C++ 如何在 Linux 中以编程方式检测 IP 地址更改？

相关推荐

最近更新

标签