C++: Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/11413855/
Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?
Asked by Luchian Grigore
After conducting some experiments on square matrices of different sizes, a pattern came up. Invariably, transposing a matrix of size 2^n is slower than transposing one of size 2^n+1. For small values of n, the difference is not major.
Big differences occur, however, over a value of 512 (at least for me).
Disclaimer: I know the function doesn't actually transpose the matrix because of the double swap of elements, but it makes no difference.
The code follows:
#define SAMPLES 1000
#define MATSIZE 512

#include <time.h>
#include <iostream>

int mat[MATSIZE][MATSIZE];

void transpose()
{
    for ( int i = 0 ; i < MATSIZE ; i++ )
        for ( int j = 0 ; j < MATSIZE ; j++ )
        {
            int aux = mat[i][j];
            mat[i][j] = mat[j][i];
            mat[j][i] = aux;
        }
}

int main()
{
    //initialize matrix
    for ( int i = 0 ; i < MATSIZE ; i++ )
        for ( int j = 0 ; j < MATSIZE ; j++ )
            mat[i][j] = i+j;

    int t = clock();
    for ( int i = 0 ; i < SAMPLES ; i++ )
        transpose();
    int elapsed = clock() - t;

    std::cout << "Average for a matrix of " << MATSIZE << ": " << elapsed / SAMPLES;
}
Changing MATSIZE lets us alter the size (duh!). I posted two versions on ideone:
- size 512 - average 2.46 ms - http://ideone.com/1PV7m
- size 513 - average 0.75 ms - http://ideone.com/NShpo
In my environment (MSVS 2010, full optimizations), the difference is similar:
- size 512 - average 2.19 ms
- size 513 - average 0.57 ms
Why is this happening?
Accepted answer by Luchian Grigore
The explanation comes from Agner Fog in Optimizing software in C++ and it reduces to how data is accessed and stored in the cache.
For terms and detailed info, see the wiki entry on caching, I'm gonna narrow it down here.
A cache is organized in sets and lines. At a time, only one set is used, out of which any of the lines it contains can be used. The memory a line can mirror times the number of lines gives us the cache size.
For a particular memory address, we can calculate which set should mirror it with the formula:
set = ( address / lineSize ) % numberOfSets
This sort of formula ideally gives a uniform distribution across the sets, because each memory address is as likely to be read (I said ideally).
It's clear that overlaps can occur. In case of a cache miss, the memory is read in the cache and the old value is replaced. Remember each set has a number of lines, out of which the least recently used one is overwritten with the newly read memory.
I'll try to somewhat follow the example from Agner:
Assume each set has 4 lines, each holding 64 bytes. We first attempt to read the address 0x2710, which goes in set 28. And then we also attempt to read addresses 0x2F00, 0x3700, 0x3F00 and 0x4700. All of these belong to the same set. Before reading 0x4700, all lines in the set would have been occupied. Reading that memory evicts an existing line in the set, the line that initially was holding 0x2710. The problem lies in the fact that we read addresses that are (for this example) 0x800 apart. This is the critical stride (again, for this example).
The critical stride can also be calculated:
criticalStride = numberOfSets * lineSize
Variables spaced criticalStride or a multiple apart contend for the same cache lines.
This is the theory part. Next, the explanation (also Agner, I'm following it closely to avoid making mistakes):
Assume a matrix of 64x64 (remember, the effects vary according to the cache) with an 8kb cache, 4 lines per set * line size of 64 bytes. Each line can hold 8 of the elements in the matrix (64-bit int).
The critical stride would be 2048 bytes, which corresponds to 4 rows of the matrix (which is continuous in memory).
Assume we're processing row 28. We're attempting to take the elements of this row and swap them with the elements from column 28. The first 8 elements of the row make up a cache line, but they'll go into 8 different cache lines in column 28. Remember, critical stride is 4 rows apart (4 consecutive elements in a column).
When element 16 is reached in the column (4 cache lines per set & 4 rows apart = trouble) the ex-0 element will be evicted from the cache. When we reach the end of the column, all previous cache lines would have been lost and needed reloading on access to the next element (the whole line is overwritten).
Having a size that is not a multiple of the critical stride messes up this perfect scenario for disaster, as we're no longer dealing with elements that are critical stride apart on the vertical, so the number of cache reloads is severely reduced.
Another disclaimer - I just got my head around the explanation and hope I nailed it, but I might be mistaken. Anyway, I'm waiting for a response (or confirmation) from Mysticial. :)
Answered by Voo
Luchian gives an explanation of why this behavior happens, but I thought it'd be a nice idea to show one possible solution to this problem and at the same time show a bit about cache oblivious algorithms.
Your algorithm basically does:
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        A[j][i] = A[i][j];
which is just horrible for a modern CPU. One solution is to know the details about your cache system and tweak the algorithm to avoid those problems. Works great as long as you know those details.. not especially portable.
Can we do better than that? Yes we can: a general approach to this problem is cache oblivious algorithms that, as the name says, avoid being dependent on specific cache sizes [1].
The solution would look like this:
void recursiveTranspose(int i0, int i1, int j0, int j1) {
    int di = i1 - i0, dj = j1 - j0;
    const int LEAFSIZE = 32; // well ok caching still affects this one here
    if (di >= dj && di > LEAFSIZE) {
        int im = (i0 + i1) / 2;
        recursiveTranspose(i0, im, j0, j1);
        recursiveTranspose(im, i1, j0, j1);
    } else if (dj > LEAFSIZE) {
        int jm = (j0 + j1) / 2;
        recursiveTranspose(i0, i1, j0, jm);
        recursiveTranspose(i0, i1, jm, j1);
    } else {
        for (int i = i0; i < i1; i++)
            for (int j = j0; j < j1; j++)
                mat[j][i] = mat[i][j];
    }
}
Slightly more complex, but a short test shows something quite interesting on my ancient e8400 with VS2010 x64 release, testcode for MATSIZE 8192:
int main() {
    LARGE_INTEGER start, end, freq;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&start);
    recursiveTranspose(0, MATSIZE, 0, MATSIZE);
    QueryPerformanceCounter(&end);
    printf("recursive: %.2fms\n", (end.QuadPart - start.QuadPart) / (double(freq.QuadPart) / 1000));

    QueryPerformanceCounter(&start);
    transpose();
    QueryPerformanceCounter(&end);
    printf("iterative: %.2fms\n", (end.QuadPart - start.QuadPart) / (double(freq.QuadPart) / 1000));
    return 0;
}
results:
recursive: 480.58ms
iterative: 3678.46ms
Edit: About the influence of size: it is much less pronounced although still noticeable to some degree; that's because we're using the iterative solution as a leaf node instead of recursing down to 1 (the usual optimization for recursive algorithms). If we set LEAFSIZE = 1, the cache has no influence for me [8193: 1214.06ms; 8192: 1171.62ms; 8191: 1351.07ms - that's inside the margin of error, the fluctuations are in the 100ms area; this "benchmark" isn't something that I'd be too comfortable with if we wanted completely accurate values].
[1] Sources for this stuff: well, if you can't get a lecture from someone that worked with Leiserson and co on this... I assume their papers are a good starting point. Those algorithms are still quite rarely described - CLR has a single footnote about them. Still, it's a great way to surprise people.
Edit (note: I'm not the one who posted this answer; I just wanted to add this): here's a complete C++ version of the above code:
template<class InIt, class OutIt>
void transpose(InIt const input, OutIt const output,
    size_t const rows, size_t const columns,
    size_t const r1 = 0, size_t const c1 = 0,
    size_t r2 = ~(size_t) 0, size_t c2 = ~(size_t) 0,
    size_t const leaf = 0x20)
{
    if (!~c2) { c2 = columns - c1; }
    if (!~r2) { r2 = rows - r1; }
    size_t const di = r2 - r1, dj = c2 - c1;
    if (di >= dj && di > leaf)
    {
        transpose(input, output, rows, columns, r1, c1, (r1 + r2) / 2, c2);
        transpose(input, output, rows, columns, (r1 + r2) / 2, c1, r2, c2);
    }
    else if (dj > leaf)
    {
        transpose(input, output, rows, columns, r1, c1, r2, (c1 + c2) / 2);
        transpose(input, output, rows, columns, r1, (c1 + c2) / 2, r2, c2);
    }
    else
    {
        for (ptrdiff_t i1 = (ptrdiff_t) r1, i2 = (ptrdiff_t) (i1 * columns);
            i1 < (ptrdiff_t) r2; ++i1, i2 += (ptrdiff_t) columns)
        {
            for (ptrdiff_t j1 = (ptrdiff_t) c1, j2 = (ptrdiff_t) (j1 * rows);
                j1 < (ptrdiff_t) c2; ++j1, j2 += (ptrdiff_t) rows)
            {
                output[j2 + i1] = input[i2 + j1];
            }
        }
    }
}
Answered by Ruslan
As an illustration to the explanation in Luchian Grigore's answer, here's what the matrix cache presence looks like for the two cases of 64x64 and 65x65 matrices (see the link above for details on numbers).
Colors in the animations below mean the following:
The 64x64 case:
Notice how almost every access to a new row results in a cache miss. And now, here's how it looks for the normal case, a 65x65 matrix:
Here you can see that most of the accesses after the initial warming-up are cache hits. This is how CPU cache is intended to work in general.
The code that generated frames for the above animations can be seen here.