C++: What is "cache-friendly" code?

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/16699247/

What is a "cache-friendly" code?

Tags: c++, performance, caching, memory, cpu-cache

Asked by Noah Roth

What is the difference between "cache unfriendly code" and the "cache friendly" code?

How can I make sure I write cache-efficient code?

Answered by Marc Claesen

Preliminaries

On modern computers, only the lowest-level memory structures (the registers) can move data around in single clock cycles. However, registers are very expensive and most computer cores have fewer than a few dozen registers (a few hundred to maybe a thousand bytes total). At the other end of the memory spectrum (DRAM), the memory is very cheap (i.e. literally millions of times cheaper) but takes hundreds of cycles after a request to receive the data. Bridging this gap between super fast and expensive and super slow and cheap are the cache memories, named L1, L2, L3 in decreasing speed and cost. The idea is that most of the executing code will be hitting a small set of variables often, and the rest (a much larger set of variables) infrequently. If the processor can't find the data in the L1 cache, it looks in the L2 cache. If not there, then the L3 cache, and if not there, main memory. Each of these "misses" is expensive in time.

(The analogy is that cache memory is to system memory as system memory is to hard disk storage. Hard disk storage is super cheap but very slow.)

Caching is one of the main methods to reduce the impact of latency. To paraphrase Herb Sutter (cfr. links below): increasing bandwidth is easy, but we can't buy our way out of latency.

Data is always retrieved through the memory hierarchy (smallest == fastest to slowest). A cache hit/miss usually refers to a hit/miss in the highest level of cache in the CPU -- by highest level I mean the largest == slowest. The cache hit rate is crucial for performance, since every cache miss results in fetching data from RAM (or worse ...), which takes a lot of time (hundreds of cycles for RAM, tens of millions of cycles for HDD). In comparison, reading data from the (highest level) cache typically takes only a handful of cycles.

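To get a feel for why the hit rate matters so much, here is a rough back-of-the-envelope calculation with made-up but plausible latencies (real numbers vary a lot between CPUs):

  average access time ≈ hit time + miss rate × miss penalty

  with a 4-cycle L1 hit and a 200-cycle miss to DRAM:
    99% hit rate:  4 + 0.01 × 200 =  6 cycles per access
    90% hit rate:  4 + 0.10 × 200 = 24 cycles per access

Dropping from a 99% to a 90% hit rate makes the average memory access roughly four times slower.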

In modern computer architectures, the performance bottleneck is leaving the CPU die (e.g. accessing RAM or beyond). This will only get worse over time. Increasing processor frequency is currently no longer relevant for improving performance. The problem is memory access. Hardware design efforts in CPUs therefore currently focus heavily on optimizing caches, prefetching, pipelines and concurrency. For instance, modern CPUs spend around 85% of the die on caches and up to 99% for storing/moving data!

There is quite a lot to be said on the subject. Here are a few great references about caches, memory hierarchies and proper programming:

Main concepts for cache-friendly code

A very important aspect of cache-friendly code is all about the principle of locality, the goal of which is to place related data close in memory to allow efficient caching. In terms of the CPU cache, it's important to be aware of cache lines to understand how this works: How do cache lines work?

The following particular aspects are of high importance to optimize caching:

  1. Temporal locality: when a given memory location is accessed, it is likely that the same location will be accessed again in the near future. Ideally, that information will still be cached at that point.
  2. Spatial locality: this refers to placing related data close to each other. Caching happens on many levels, not just in the CPU. For example, when you read from RAM, typically a larger chunk of memory is fetched than what was specifically asked for, because very often the program will require that data soon. HDD caches follow the same line of thought. Specifically for CPU caches, the notion of cache lines is important (see the short code sketch after this list).
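
As a rough illustration of spatial locality in code (the container, element type and stride below are arbitrary choices, and actual speedups depend on the hardware), compare a stride-1 traversal with a strided one. Both functions compute the same sum, but the first uses every element of each cache line it fetches:

#include <cstddef>
#include <vector>

// Good spatial locality: stride-1 access uses every element of each
// cache line that is fetched, and the access pattern is easy to prefetch.
double sum_sequential(const std::vector<double>& v)
{
    double s = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i)
        s += v[i];                       // neighbours are already in cache
    return s;
}

// Poor spatial locality: with a large stride, (almost) every access pulls
// in a fresh cache line and uses only a single element from it.
double sum_strided(const std::vector<double>& v, std::size_t stride)
{
    double s = 0.0;
    for (std::size_t start = 0; start < stride; ++start)
        for (std::size_t i = start; i < v.size(); i += stride)
            s += v[i];
    return s;
}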

Use appropriate C++ containers

A simple example of cache-friendly versus cache-unfriendly is C++'s std::vector versus std::list. Elements of a std::vector are stored in contiguous memory, and as such accessing them is much more cache-friendly than accessing elements in a std::list, which stores its content all over the place. This is due to spatial locality.

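A minimal sketch of the difference (the exact speedup depends on element size, allocator behaviour and hardware): both functions below compute the same sum, but the vector traversal walks linearly through contiguous memory while the list traversal chases pointers to nodes scattered across the heap.

#include <list>
#include <numeric>
#include <vector>

// Contiguous storage: each fetched cache line holds several ints, and the
// hardware prefetcher can follow the linear access pattern.
long long sum_vector(const std::vector<int>& v)
{
    return std::accumulate(v.begin(), v.end(), 0LL);
}

// Node-based storage: every node lives at its own heap address, so each
// step may be a cache miss and the next address is not known in advance.
long long sum_list(const std::list<int>& l)
{
    return std::accumulate(l.begin(), l.end(), 0LL);
}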

A very nice illustration of this is given by Bjarne Stroustrup in this YouTube clip (thanks to @Mohammad Ali Baydoun for the link!).

Don't neglect the cache in data structure and algorithm design

Whenever possible, try to adapt your data structures and order of computations in a way that allows maximum use of the cache. A common technique in this regard is cache blocking (Archive.org version), which is of extreme importance in high-performance computing (cfr. for example ATLAS).

Know and exploit the implicit structure of data

Another simple example, which many people in the field sometimes forget, is column-major (e.g. Fortran, MATLAB) vs. row-major (e.g. C, C++) ordering for storing two-dimensional arrays. For example, consider the following matrix:

1 2
3 4

In row-major ordering, this is stored in memory as 1 2 3 4; in column-major ordering, this would be stored as 1 3 2 4. It is easy to see that implementations which do not exploit this ordering will quickly run into (easily avoidable!) cache issues. Unfortunately, I see stuff like this very often in my domain (machine learning). @MatteoItalia showed this example in more detail in his answer.

When fetching a certain element of a matrix from memory, elements near it will be fetched as well and stored in a cache line. If the ordering is exploited, this will result in fewer memory accesses (because the next few values which are needed for subsequent computations are already in a cache line).

For simplicity, assume the cache comprises a single cache line which can contain 2 matrix elements and that when a given element is fetched from memory, the next one is too. Say we want to take the sum over all elements in the example 2x2 matrix above (let's call it M):

Exploiting the ordering (e.g. changing column index first in c++):

M[0][0] (memory) + M[0][1] (cached) + M[1][0] (memory) + M[1][1] (cached)
= 1 + 2 + 3 + 4
--> 2 cache hits, 2 memory accesses

Not exploiting the ordering (e.g. changing row index first in c++):

M[0][0] (memory) + M[1][0] (memory) + M[0][1] (memory) + M[1][1] (memory)
= 1 + 3 + 2 + 4
--> 0 cache hits, 4 memory accesses

In this simple example, exploiting the ordering approximately doubles execution speed (since memory access requires many more cycles than computing the sums). In practice, the performance difference can be much larger.

Avoid unpredictable branches

Modern architectures feature pipelines, and compilers are becoming very good at reordering code to minimize delays due to memory access. When your critical code contains (unpredictable) branches, it is hard or impossible to prefetch data. This will indirectly lead to more cache misses.

This is explained very well here (thanks to @0x90 for the link): Why is processing a sorted array faster than processing an unsorted array?

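A small sketch of the idea from that question (the threshold and the data are arbitrary): the two functions below do the same work, but the first contains a data-dependent branch that is unpredictable on unsorted input, while the second replaces it with arithmetic that has nothing to mispredict.

#include <cstdint>
#include <vector>

// Branchy: on unsorted data, `x >= 128` is essentially a coin flip, the
// branch predictor misses often, and the pipeline keeps restarting.
int64_t sum_large_branchy(const std::vector<int>& data)
{
    int64_t sum = 0;
    for (int x : data)
        if (x >= 128)
            sum += x;
    return sum;
}

// Branchless: the condition becomes a 0/1 multiplier, so the cost is the
// same regardless of how the data is ordered.
int64_t sum_large_branchless(const std::vector<int>& data)
{
    int64_t sum = 0;
    for (int x : data) {
        int64_t take = (x >= 128);   // 0 or 1, no conditional jump needed
        sum += take * x;
    }
    return sum;
}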

Avoid virtual functions

In the context of C++, virtual methods represent a controversial issue with regard to cache misses (a general consensus exists that they should be avoided when possible in terms of performance). Virtual functions can induce cache misses during lookup, but this only happens if the specific function is not called often (otherwise it would likely be cached), so this is regarded as a non-issue by some. For reference about this issue, check out: What is the performance cost of having a virtual method in a C++ class?

c++的上下文中,virtual方法代表了一个关于缓存未命中的有争议的问题(普遍存在的共识是,就性能而言,应尽可能避免它们)。虚拟功能可以查找过程中引起高速缓存未命中,但是这只是发生,如果不经常被称为特定功能(否则它可能会被缓存),所以这被看作是由一些非的问题。有关此问题的参考,请查看:在 C++ 类中拥有虚方法的性能成本是多少?

Common problems

A common problem in modern architectures with multiprocessor caches is called false sharing. This occurs when different processors (or cores) each work on their own, separate data, but that data happens to fall in the same cache line. This causes the cache line -- which also contains data the other processor can use -- to be overwritten again and again. Effectively, different threads make each other wait by inducing cache misses in this situation. See also (thanks to @Matt for the link): How and when to align to cache line size?

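A minimal sketch of the problem and the usual fix (64 bytes is the common x86 line size; other platforms differ, and C++17 offers std::hardware_destructive_interference_size for this):

#include <atomic>
#include <functional>
#include <thread>

// Bad: the two counters share one cache line, so two threads writing them
// keep stealing the line from each other even though the data is independent.
struct CountersShared {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// Better: force each counter onto its own cache line.
struct CountersPadded {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

void bump(std::atomic<long>& counter, long n)
{
    for (long i = 0; i < n; ++i)
        counter.fetch_add(1, std::memory_order_relaxed);
}

int main()
{
    CountersPadded counters;   // try CountersShared to see the slowdown
    std::thread t1(bump, std::ref(counters.a), 10000000L);
    std::thread t2(bump, std::ref(counters.b), 10000000L);
    t1.join();
    t2.join();
}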

An extreme symptom of poor caching in RAM memory (which is probably not what you mean in this context) is so-called thrashing. This occurs when the process continuously generates page faults (e.g. accesses memory which is not currently resident in physical memory), which require disk access.

Answered by Matteo Italia

In addition to @Marc Claesen's answer, I think that an instructive classic example of cache-unfriendly code is code that scans a C bidimensional array (e.g. a bitmap image) column-wise instead of row-wise.

Elements that are adjacent in a row are also adjacent in memory, thus accessing them in sequence means accessing them in ascending memory order; this is cache-friendly, since the cache tends to prefetch contiguous blocks of memory.

Instead, accessing such elements column-wise is cache-unfriendly, since elements on the same column are distant in memory from each other (in particular, their distance is equal to the size of the row), so when you use this access pattern you are jumping around in memory, potentially wasting the effort the cache spent retrieving the elements nearby in memory.

And all that it takes to ruin the performance is to go from

// Cache-friendly version - processes pixels which are adjacent in memory
for(unsigned int y=0; y<height; ++y)
{
    for(unsigned int x=0; x<width; ++x)
    {
        ... image[y][x] ...
    }
}

to

// Cache-unfriendly version - jumps around in memory for no good reason
for(unsigned int x=0; x<width; ++x)
{
    for(unsigned int y=0; y<height; ++y)
    {
        ... image[y][x] ...
    }
}

This effect can be quite dramatic (several orders of magnitude in speed) in systems with small caches and/or working with big arrays (e.g. 10+ megapixel 24 bpp images on current machines); for this reason, if you have to do many vertical scans, often it's better to rotate the image by 90 degrees first and perform the various analyses later, limiting the cache-unfriendly code just to the rotation.

Answered by Jerry Coffin

Optimizing cache usage largely comes down to two factors.

Locality of Reference

The first factor (to which others have already alluded) is locality of reference. Locality of reference really has two dimensions though: space and time.

  • Spatial
The spatial dimension also comes down to two things: first, we want to pack our information densely, so more information will fit in that limited memory. This means (for example) that you need a major improvement in computational complexity to justify data structures based on small nodes joined by pointers.

Second, we want information that will be processed together also located together. A typical cache works in "lines", which means when you access some information, other information at nearby addresses will be loaded into the cache with the part we touched. For example, when I touch one byte, the cache might load 128 or 256 bytes near that one. To take advantage of that, you generally want the data arranged to maximize the likelihood that you'll also use that other data that was loaded at the same time.

For just a really trivial example, this can mean that a linear search can be much more competitive with a binary search than you'd expect. Once you've loaded one item from a cache line, using the rest of the data in that cache line is almost free. A binary search becomes noticeably faster only when the data is large enough that the binary search reduces the number of cache lines you access.

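One way this observation is sometimes exploited (just a sketch; the cut-over point of 16 elements is an illustrative guess, not a universal constant) is to binary-search only while the range is large, and finish with a linear scan once the remaining range fits in a cache line or two:

#include <cstddef>
#include <vector>

// Returns the index of the first element >= key, like std::lower_bound,
// but finishes small ranges with a linear scan: the remaining elements sit
// in one or two cache lines that are already (or about to be) loaded.
std::size_t lower_bound_hybrid(const std::vector<int>& v, int key)
{
    std::size_t lo = 0, hi = v.size();
    while (hi - lo > 16) {                    // illustrative cut-over point
        std::size_t mid = lo + (hi - lo) / 2;
        if (v[mid] < key) lo = mid + 1;
        else              hi = mid;
    }
    while (lo < hi && v[lo] < key)            // short, cache-resident scan
        ++lo;
    return lo;
}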

  • Time
The time dimension means that when you do some operations on some data, you want (as much as possible) to do all the operations on that data at once.

Since you've tagged this as C++, I'll point to a classic example of a relatively cache-unfriendly design: std::valarray. valarray overloads most arithmetic operators, so I can (for example) say a = b + c + d; (where a, b, c and d are all valarrays) to do element-wise addition of those arrays.

The problem with this is that it walks through one pair of inputs, puts results in a temporary, walks through another pair of inputs, and so on. With a lot of data, the result from one computation may disappear from the cache before it's used in the next computation, so we end up reading (and writing) the data repeatedly before we get our final result. If each element of the final result will be something like (a[n] + b[n]) * (c[n] + d[n]);, we'd generally prefer to read each a[n], b[n], c[n] and d[n] once, do the computation, write the result, increment n and repeat 'til we're done.²

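A sketch of the same computation written as plain loops (ignoring what an expression-template library would actually generate): the first version mirrors the pass-per-operator evaluation described above, the second touches each element exactly once while it is still in cache.

#include <cstddef>
#include <vector>

// Pass-per-operator, the way a naive valarray-style evaluation proceeds:
// for large N the temporaries no longer fit in cache, so the final multiply
// re-reads data that has already been evicted.
std::vector<double> eval_with_temporaries(const std::vector<double>& a,
                                          const std::vector<double>& b,
                                          const std::vector<double>& c,
                                          const std::vector<double>& d)
{
    std::vector<double> t1(a.size()), t2(a.size()), result(a.size());
    for (std::size_t n = 0; n < a.size(); ++n) t1[n] = a[n] + b[n];
    for (std::size_t n = 0; n < a.size(); ++n) t2[n] = c[n] + d[n];
    for (std::size_t n = 0; n < a.size(); ++n) result[n] = t1[n] * t2[n];
    return result;
}

// Fused single pass: each a[n], b[n], c[n], d[n] is read once and the
// intermediate sums never leave registers.
std::vector<double> eval_fused(const std::vector<double>& a,
                               const std::vector<double>& b,
                               const std::vector<double>& c,
                               const std::vector<double>& d)
{
    std::vector<double> result(a.size());
    for (std::size_t n = 0; n < a.size(); ++n)
        result[n] = (a[n] + b[n]) * (c[n] + d[n]);
    return result;
}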

Line Sharing

The second major factor is avoiding line sharing. To understand this, we probably need to back up and look a little at how caches are organized. The simplest form of cache is direct mapped. This means one address in main memory can only be stored in one specific spot in the cache. If we're using two data items that map to the same spot in the cache, it works badly -- each time we use one data item, the other has to be flushed from the cache to make room for the other. The rest of the cache might be empty, but those items won't use other parts of the cache.

To prevent this, most caches are what are called "set associative". For example, in a 4-way set-associative cache, any item from main memory can be stored at any of 4 different places in the cache. So, when the cache is going to load an item, it looks for the least recently used³ item among those four, flushes it to main memory, and loads the new item in its place.

The problem is probably fairly obvious: for a direct-mapped cache, two operands that happen to map to the same cache location can lead to bad behavior. An N-way set-associative cache increases the number from 2 to N+1. Organizing a cache into more "ways" takes extra circuitry and generally runs slower, so (for example) an 8192-way set associative cache is rarely a good solution either.

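A toy model of how an address maps to a set (real hardware differs in details such as physical vs. virtual indexing, but the arithmetic is the same). With a hypothetical 32 KiB, 8-way cache and 64-byte lines there are 32768 / 64 / 8 = 64 sets, so addresses that are a multiple of 64 × 64 = 4096 bytes apart compete for the same 8 ways:

#include <cstdint>
#include <cstdio>

constexpr std::uint64_t kLineSize = 64;   // bytes per cache line
constexpr std::uint64_t kNumSets  = 64;   // 32 KiB / 64 B / 8 ways

std::uint64_t set_index(std::uint64_t address)
{
    return (address / kLineSize) % kNumSets;
}

int main()
{
    // Both addresses land in set 0: a classic hazard of power-of-two strides.
    std::printf("set of 0x1000: %llu, set of 0x2000: %llu\n",
                static_cast<unsigned long long>(set_index(0x1000)),
                static_cast<unsigned long long>(set_index(0x2000)));
}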

Ultimately, this factor is more difficult to control in portable code though. Your control over where your data is placed is usually fairly limited. Worse, the exact mapping from address to cache varies between otherwise similar processors. In some cases, however, it can be worth doing things like allocating a large buffer, and then using only parts of what you allocated to ensure against data sharing the same cache lines (even though you'll probably need to detect the exact processor and act accordingly to do this).

  • False Sharing
There's another, related item called "false sharing". This arises in a multiprocessor or multicore system, where two (or more) processors/cores have data that's separate, but falls in the same cache line. This forces the two processors/cores to coordinate their access to the data, even though each has its own, separate data item. Especially if the two modify the data in alternation, this can lead to a massive slowdown as the data has to be constantly shuttled between the processors. This can't easily be cured by organizing the cache into more "ways" or anything like that either. The primary way to prevent it is to ensure that two threads rarely (preferably never) modify data that could possibly be in the same cache line (with the same caveats about difficulty of controlling the addresses at which data is allocated).



  1. Those who know C++ well might wonder if this is open to optimization via something like expression templates. I'm pretty sure the answer is that yes, it could be done and if it was, it would probably be a pretty substantial win. I'm not aware of anybody having done so, however, and given how little valarray gets used, I'd be at least a little surprised to see anybody do so either.

  2. In case anybody wonders how valarray (designed specifically for performance) could be this badly wrong, it comes down to one thing: it was really designed for machines like the older Crays, that used fast main memory and no cache. For them, this really was a nearly ideal design.

  3. Yes, I'm simplifying: most caches don't really measure the least recently used item precisely, but they use some heuristic that's intended to be close to that without having to keep a full time-stamp for each access.

Answered by arul

Welcome to the world of Data Oriented Design. The basic mantra is to Sort, Eliminate Branches, Batch, Eliminate virtual calls - all steps towards better locality.

Since you tagged the question with C++, here's the obligatory Typical C++ Bullshit. Tony Albrecht's Pitfalls of Object Oriented Programming is also a great introduction to the subject.

Answered by Krazy Glew

Just piling on: the classic example of cache-unfriendly versus cache-friendly code is the "cache blocking" of matrix multiply.

Naive matrix multiply looks like:

for(i=0;i<N;i++) {
   for(j=0;j<N;j++) {
      dest[i][j] = 0;
      for( k=0; k<N; k++) {
         dest[i][j] += src1[i][k] * src2[k][j];
      }
   }
}

If N is large, e.g. if N * sizeof(elemType) is greater than the cache size, then every single access to src2[k][j] will be a cache miss.

There are many different ways of optimizing this for a cache. Here's a very simple example: instead of reading one item per cache line in the inner loop, use all of the items:

int itemsPerCacheLine = CacheLineSize / sizeof(elemType);

for(i=0;i<N;i++) {
   for(j=0;j<N;j += itemsPerCacheLine ) {
      for(jj=0;jj<itemsPerCacheLine; jj++) {
         dest[i][j+jj] = 0;
      }
      for( k=0;k<N;k++) {
         for(jj=0;jj<itemsPerCacheLine; jj++) {
            dest[i][j+jj] += src1[i][k] * src2[k][j+jj];
         }
      }
   }
}

If the cache line size is 64 bytes, and we are operating on 32 bit (4 byte) floats, then there are 16 items per cache line. And the number of cache misses via just this simple transformation is reduced approximately 16-fold.

Fancier transformations operate on 2D tiles, optimize for multiple caches (L1, L2, TLB), and so on.

Some results of googling "cache blocking":

http://stumptown.cc.gt.atl.ga.us/cse6230-hpcta-fa11/slides/11a-matmul-goto.pdf

http://software.intel.com/en-us/articles/cache-blocking-techniques

A nice video animation of an optimized cache blocking algorithm.

http://www.youtube.com/watch?v=IFWgwGMMrh0

Loop tiling is very closely related:

http://en.wikipedia.org/wiki/Loop_tiling

Answered by Rafael Baptista

Processors today work with many levels of cascading memory areas. So the CPU will have a bunch of memory that is on the CPU chip itself, with very fast access to it. There are different levels of cache, each one slower to access (and larger) than the next, until you get to system memory, which is not on the CPU and is relatively much slower to access.

Logically, to the CPU's instruction set you just refer to memory addresses in a giant virtual address space. When you access a single memory address the CPU will go fetch it. In the old days it would fetch just that single address. But today the CPU will fetch a bunch of memory around the bit you asked for, and copy it into the cache. It assumes that if you asked for a particular address, it is highly likely that you are going to ask for an address nearby very soon. For example if you were copying a buffer you would read and write from consecutive addresses - one right after the other.

So today when you fetch an address, the CPU checks the first level of cache to see if it has already read that address into cache. If it doesn't find it, that is a cache miss and it has to go out to the next level of cache to find it, until it eventually has to go out to main memory.

Cache friendly code tries to keep accesses close together in memory so that you minimize cache misses.

So an example would be: imagine you wanted to copy a giant two-dimensional table. It is organized with each row consecutive in memory, one row following right after the other.

If you copied the elements one row at a time from left to right - that would be cache friendly. If you decided to copy the table one column at a time, you would copy the exact same amount of memory - but it would be cache unfriendly.

Answered by Olof Forshell

It needs to be clarified that not only the data should be cache-friendly; it is just as important for the code itself. This is in addition to branch prediction, instruction reordering, avoiding actual divisions and other techniques.

Typically the denser the code, the fewer cache lines will be required to store it. This results in more cache lines being available for data.

The code should not call functions all over the place as they typically will require one or more cache lines of their own, resulting in fewer cache lines for data.

A function should begin at a cache-line-alignment-friendly address. Though there are (gcc) compiler switches for this, be aware that if the functions are very short it might be wasteful for each one to occupy an entire cache line. For example, if three of the most often used functions fit inside one 64 byte cache line, this is less wasteful than if each one has its own line and results in two cache lines less available for other usage. A typical alignment value could be 32 or 16.

So spend some extra time to make the code dense. Test different constructs, compile and review the generated code size and profile.

Answered by Olof Forshell

As @Marc Claesen mentioned, one of the ways to write cache-friendly code is to exploit the structure in which our data is stored. In addition to that, another way to write cache-friendly code is to change the way our data is stored, and then write new code to access the data stored in this new structure.

This makes sense in the context of how database systems linearize the tuples of a table and store them. There are two basic ways to store the tuples of a table, i.e. row store and column store. In a row store, as the name suggests, the tuples are stored row-wise. Let's suppose a table named Product being stored has 3 attributes, i.e. int32_t key, char name[56] and int32_t price, so the total size of a tuple is 64 bytes.

We can simulate a very basic row store query execution in main memory by creating an array of Product structs with size N, where N is the number of rows in the table. Such a memory layout is also called an array of structs. So the struct for Product can be like:

struct Product
{
   int32_t key;
   char name[56];
   int32_t price;
};

/* create an array of structs */
Product* table = new Product[N];
/* now load this array of structs, from a file etc. */

Similarly we can simulate a very basic column store query execution in main memory by creating 3 arrays of size N, one array for each attribute of the Product table. Such a memory layout is also called a struct of arrays. So the 3 arrays for each attribute of Product can be like:

/* create separate arrays for each attribute */
int32_t* key = new int32_t[N];
char* name = new char[56*N];
int32_t* price = new int32_t[N];
/* now load these arrays, from a file etc. */

Now after loading both the array of structs (row layout) and the 3 separate arrays (column layout), we have a row store and a column store of our table Product present in memory.

Now we move on to the cache-friendly code part. Suppose that the workload on our table is such that we have an aggregation query on the price attribute, such as:

SELECT SUM(price)
FROM PRODUCT

For the row store we can convert the above SQL query into

int sum = 0;
for (int i=0; i<N; i++)
   sum = sum + table[i].price;

For the column store we can convert the above SQL query into

int sum = 0;
for (int i=0; i<N; i++)
   sum = sum + price[i];

The code for the column store would be faster than the code for the row layout in this query, as the query requires only a subset of the attributes, and in the column layout we are doing just that, i.e. only accessing the price column.

Suppose that the cache line size is 64 bytes.

In the case of the row layout, when a cache line is read, the price value of only 1 tuple (cacheline_size / product_struct_size = 64/64 = 1) is read, because our struct is 64 bytes and fills the whole cache line, so with the row layout a cache miss occurs for every tuple.

In the case of the column layout, when a cache line is read, the price values of 16 tuples (cacheline_size / price_int_size = 64/4 = 16) are read, because 16 contiguous price values stored in memory are brought into the cache, so with the column layout a cache miss occurs only for every sixteenth tuple.

So the column layout will be faster in the case of the given query, and is faster in such aggregation queries over a subset of the columns of the table. You can try out such an experiment for yourself using the data from the TPC-H benchmark, and compare the run times for both layouts. The Wikipedia article on column-oriented database systems is also good.

So in database systems, if the query workload is known beforehand, we can store our data in layouts which suit the queries in the workload, and access the data from these layouts. In the case of the above example, we created a column layout and changed our code to compute the sum so that it became cache-friendly.

Answered by Tuntable

Be aware that caches do not just cache contiguous memory. They have multiple lines (at least 4), so discontiguous and overlapping memory can often be stored just as efficiently.

What is missing from all the above examples is measured benchmarks. There are many myths about performance. Unless you measure it, you do not know. Do not complicate your code unless you have a measured improvement.

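In that spirit, here is a minimal timing sketch (single run, no statistical rigour; std::chrono only) that can be used to compare, for example, the row-wise and column-wise image loops shown earlier, here with a flattened 1D buffer:

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Times a callable once and returns milliseconds. Real measurements should
// repeat runs, warm up first, and look at the distribution of results.
template <typename F>
double time_ms(F&& f)
{
    auto start = std::chrono::steady_clock::now();
    f();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main()
{
    const std::size_t height = 4096, width = 4096;
    std::vector<unsigned char> image(height * width, 1);
    unsigned long long sink = 0;   // printed afterwards so the loops are not removed

    double row_ms = time_ms([&] {
        for (std::size_t y = 0; y < height; ++y)
            for (std::size_t x = 0; x < width; ++x)
                sink += image[y * width + x];
    });

    double col_ms = time_ms([&] {
        for (std::size_t x = 0; x < width; ++x)
            for (std::size_t y = 0; y < height; ++y)
                sink += image[y * width + x];
    });

    std::printf("row-wise: %.2f ms, column-wise: %.2f ms (sink=%llu)\n",
                row_ms, col_ms, sink);
}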