C++11 introduced a standardized memory model. What does it mean? And how is it going to affect C++ programming?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/6319146/

Date: 2020-08-28 19:54:14 · Source: igfitidea

Tags: c++, multithreading, c++11, language-lawyer, memory-model

Asked by Nawaz

C++11 introduced a standardized memory model, but what exactly does that mean? And how is it going to affect C++ programming?

This article (by Gavin Clarke, who quotes Herb Sutter) says that:

The memory model means that C++ code now has a standardized library to call regardless of who made the compiler and on what platform it's running. There's a standard way to control how different threads talk to the processor's memory.

"When you are talking about splitting [code] across different cores that's in the standard, we are talking about the memory model. We are going to optimize it without breaking the following assumptions people are going to make in the code," Sutter said.

Well, I can memorize this and similar paragraphs available online (as I've had my own memory model since birth :P) and can even post them as an answer to questions asked by others, but to be honest, I don't exactly understand this.

C++ programmers used to develop multi-threaded applications even before, so how does it matter if it's POSIX threads, or Windows threads, or C++11 threads? What are the benefits? I want to understand the low-level details.

I also get this feeling that the C++11 memory model is somehow related to C++11 multi-threading support, as I often see these two together. If it is, how exactly? Why should they be related?

As I don't know how the internals of multi-threading work, and what memory model means in general, please help me understand these concepts. :-)

Answered by Nemo

First, you have to learn to think like a Language Lawyer.

The C++ specification does not make reference to any particular compiler, operating system, or CPU. It makes reference to an abstract machine that is a generalization of actual systems. In the Language Lawyer world, the job of the programmer is to write code for the abstract machine; the job of the compiler is to actualize that code on a concrete machine. By coding rigidly to the spec, you can be certain that your code will compile and run without modification on any system with a compliant C++ compiler, whether today or 50 years from now.

The abstract machine in the C++98/C++03 specification is fundamentally single-threaded. So it is not possible to write multi-threaded C++ code that is "fully portable" with respect to the spec. The spec does not even say anything about the atomicity of memory loads and stores or the order in which loads and stores might happen, never mind things like mutexes.

Of course, you can write multi-threaded code in practice for particular concrete systems – like pthreads or Windows. But there is no standard way to write multi-threaded code for C++98/C++03.

The abstract machine in C++11 is multi-threaded by design. It also has a well-defined memory model; that is, it says what the compiler may and may not do when it comes to accessing memory.

Consider the following example, where a pair of global variables are accessed concurrently by two threads:

           Global
           int x, y;

Thread 1            Thread 2
x = 17;             cout << y << " ";
y = 37;             cout << x << endl;

What might Thread 2 output?

Under C++98/C++03, this is not even Undefined Behavior; the question itself is meaningless because the standard does not contemplate anything called a "thread".

Under C++11, the result is Undefined Behavior, because loads and stores need not be atomic in general. Which may not seem like much of an improvement... And by itself, it's not.

But with C++11, you can write this:

           Global
           atomic<int> x, y;

Thread 1                 Thread 2
x.store(17);             cout << y.load() << " ";
y.store(37);             cout << x.load() << endl;

Now things get much more interesting. First of all, the behavior here is defined. Thread 2 could now print 0 0 (if it runs before Thread 1), 37 17 (if it runs after Thread 1), or 0 17 (if it runs after Thread 1 assigns to x but before it assigns to y).

What it cannot print is 37 0, because the default mode for atomic loads/stores in C++11 is to enforce sequential consistency. This just means all loads and stores must be "as if" they happened in the order you wrote them within each thread, while operations among threads can be interleaved however the system likes. So the default behavior of atomics provides both atomicity and ordering for loads and stores.
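
As a rough, compilable sketch of the example above (the run_once helper and its thread bodies are my own illustration, not part of the original answer), the two threads can be expressed with std::thread and the default sequentially consistent operations; the pair returned is what Thread 2 observed, y first, then x:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// One execution of the example: Thread 1 stores 17 then 37; Thread 2 loads
// y then x. Under the default (sequentially consistent) ordering, the pair
// (37, 0) can never be observed.
std::pair<int, int> run_once() {
    std::atomic<int> x{0}, y{0};
    int r_y = -1, r_x = -1;
    std::thread t1([&] { x.store(17); y.store(37); });
    std::thread t2([&] { r_y = y.load(); r_x = x.load(); });
    t1.join();
    t2.join();
    return {r_y, r_x};  // (y seen, x seen)
}
```

Running this in a loop will show the 0 0, 37 17, and 0 17 outcomes interleave from run to run, but never 37 0.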

Now, on a modern CPU, ensuring sequential consistency can be expensive. In particular, the compiler is likely to emit full-blown memory barriers between every access here. But if your algorithm can tolerate out-of-order loads and stores; i.e., if it requires atomicity but not ordering; i.e., if it can tolerate 37 0 as output from this program, then you can write this:

           Global
           atomic<int> x, y;

Thread 1                            Thread 2
x.store(17,memory_order_relaxed);   cout << y.load(memory_order_relaxed) << " ";
y.store(37,memory_order_relaxed);   cout << x.load(memory_order_relaxed) << endl;

The more modern the CPU, the more likely this is to be faster than the previous example.
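
A classic case where relaxed ordering really is sufficient is a shared event counter: each increment must be atomic (no lost updates), but no thread infers anything from the order of other threads' increments. A minimal sketch (the count_events helper is illustrative, not from the answer):

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Counts n_threads * increments_per_thread events using relaxed atomic
// increments. Atomicity alone guarantees the final total; no ordering
// (and hence no memory barriers) is required.
int count_events(int n_threads, int increments_per_thread) {
    std::atomic<int> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t)
        workers.emplace_back([&] {
            for (int i = 0; i < increments_per_thread; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& w : workers) w.join();
    return counter.load();
}
```

The same code with plain `int` would be a data race; with `memory_order_seq_cst` it would be correct but potentially slower on weakly ordered hardware.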

Finally, if you just need to keep particular loads and stores in order, you can write:

           Global
           atomic<int> x, y;

Thread 1                            Thread 2
x.store(17,memory_order_release);   cout << y.load(memory_order_acquire) << " ";
y.store(37,memory_order_release);   cout << x.load(memory_order_acquire) << endl;

This takes us back to the ordered loads and stores – so 37 0 is no longer a possible output – but it does so with minimal overhead. (In this trivial example, the result is the same as full-blown sequential consistency; in a larger program, it would not be.)
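
The canonical use of release/acquire is publishing ordinary (non-atomic) data through an atomic flag. A sketch under my own naming (publish_and_observe is not from the answer): the release store publishes the plain write to payload, and the acquire load that sees the flag set is guaranteed to also see the payload.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// The release store on `ready` synchronizes-with the acquire load that
// reads true, so the plain write to `payload` happens-before the plain
// read. The consumer is therefore guaranteed to return 42.
int publish_and_observe() {
    int payload = 0;                 // ordinary data, not atomic
    std::atomic<bool> ready{false};
    int seen = -1;
    std::thread producer([&] {
        payload = 42;                                  // plain write
        ready.store(true, std::memory_order_release);  // publish it
    });
    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire)) {}  // spin on flag
        seen = payload;  // no data race: ordered after the plain write
    });
    producer.join();
    consumer.join();
    return seen;
}
```

With relaxed ordering on the flag instead, this would be a data race on payload and the consumer could legally observe 0.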

Of course, if the only outputs you want to see are 0 0 or 37 17, you can just wrap a mutex around the original code. But if you have read this far, I bet you already know how that works, and this answer is already longer than I intended :-).

So, bottom line. Mutexes are great, and C++11 standardizes them. But sometimes for performance reasons you want lower-level primitives (e.g., the classic double-checked locking pattern). The new standard provides high-level gadgets like mutexes and condition variables, and it also provides low-level gadgets like atomic types and the various flavors of memory barrier. So now you can write sophisticated, high-performance concurrent routines entirely within the language specified by the standard, and you can be certain your code will compile and run unchanged on both today's systems and tomorrow's.
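
As a sketch of the double-checked locking pattern mentioned above, here is one C++11-correct formulation combining the high-level gadget (a mutex) with the low-level one (an atomic pointer). Widget and get_instance are hypothetical names for illustration:

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

struct Widget { int value = 7; };  // hypothetical lazily-created object

std::atomic<Widget*> instance{nullptr};
std::mutex init_mutex;

// C++11-correct double-checked locking: the acquire load synchronizes
// with the release store, so the Widget's construction is fully visible
// before any thread can see the non-null pointer.
Widget* get_instance() {
    Widget* p = instance.load(std::memory_order_acquire);  // fast path
    if (p == nullptr) {
        std::lock_guard<std::mutex> lock(init_mutex);
        p = instance.load(std::memory_order_relaxed);      // re-check under lock
        if (p == nullptr) {
            p = new Widget();  // leaked intentionally: lives for the program
            instance.store(p, std::memory_order_release);  // publish
        }
    }
    return p;
}
```

Before C++11, this pattern was famously impossible to write portably; with the standardized memory model, it is a few lines.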

Although to be frank, unless you are an expert and working on some serious low-level code, you should probably stick to mutexes and condition variables. That's what I intend to do.

For more on this stuff, see this blog post.

Answered by Ahmed Nassar

I will just give the analogy with which I understand memory consistency models (or memory models, for short). It is inspired by Leslie Lamport's seminal paper "Time, Clocks, and the Ordering of Events in a Distributed System". The analogy is apt and has fundamental significance, but may be overkill for many people. However, I hope it provides a mental image (a pictorial representation) that facilitates reasoning about memory consistency models.

Let's view the histories of all memory locations in a space-time diagram in which the horizontal axis represents the address space (i.e., each memory location is represented by a point on that axis) and the vertical axis represents time (we will see that, in general, there is not a universal notion of time). The history of values held by each memory location is, therefore, represented by a vertical column at that memory address. Each value change is due to one of the threads writing a new value to that location. By a memory image, we will mean the aggregate/combination of values of all memory locations observable at a particular time by a particular thread.

Quoting from "A Primer on Memory Consistency and Cache Coherence"

The intuitive (and most restrictive) memory model is sequential consistency (SC) in which a multithreaded execution should look like an interleaving of the sequential executions of each constituent thread, as if the threads were time-multiplexed on a single-core processor.

That global memory order can vary from one run of the program to another and may not be known beforehand. The characteristic feature of SC is the set of horizontal slices in the address-space-time diagram representing planes of simultaneity (i.e., memory images). On a given plane, all of its events (or memory values) are simultaneous. There is a notion of Absolute Time, in which all threads agree on which memory values are simultaneous. In SC, at every time instant, there is only one memory image shared by all threads. That is, at every instant of time, all processors agree on the memory image (i.e., the aggregate content of memory). Not only does this imply that all threads view the same sequence of values for all memory locations, but also that all processors observe the same combinations of values of all variables. This is the same as saying all memory operations (on all memory locations) are observed in the same total order by all threads.

In relaxed memory models, each thread will slice up address-space-time in its own way, the only restriction being that slices of each thread shall not cross each other because all threads must agree on the history of every individual memory location (of course, slices of different threads may, and will, cross each other). There is no universal way to slice it up (no privileged foliation of address-space-time). Slices do not have to be planar (or linear). They can be curved and this is what can make a thread read values written by another thread out of the order they were written in. Histories of different memory locations may slide (or get stretched) arbitrarily relative to each other when viewed by any particular thread. Each thread will have a different sense of which events (or, equivalently, memory values) are simultaneous. The set of events (or memory values) that are simultaneous to one thread are not simultaneous to another. Thus, in a relaxed memory model, all threads still observe the same history (i.e., sequence of values) for each memory location. But they may observe different memory images (i.e., combinations of values of all memory locations). Even if two different memory locations are written by the same thread in sequence, the two newly written values may be observed in different order by other threads.

[Picture from Wikipedia]

Readers familiar with Einstein's Special Theory of Relativity will notice what I am alluding to. Translating Minkowski's words into the memory models realm: address space and time are shadows of address-space-time. In this case, each observer (i.e., thread) will project shadows of events (i.e., memory stores/loads) onto his own world-line (i.e., his time axis) and his own plane of simultaneity (his address-space axis). Threads in the C++11 memory model correspond to observers that are moving relative to each other in special relativity. Sequential consistency corresponds to the Galilean space-time (i.e., all observers agree on one absolute order of events and a global sense of simultaneity).

The resemblance between memory models and special relativity stems from the fact that both define a partially-ordered set of events, often called a causal set. Some events (i.e., memory stores) can affect (but not be affected by) other events. A C++11 thread (or observer in physics) is no more than a chain (i.e., a totally ordered set) of events (e.g., memory loads and stores to possibly different addresses).

In relativity, some order is restored to the seemingly chaotic picture of partially ordered events, since the only temporal ordering that all observers agree on is the ordering among "timelike" events (i.e., those events that are in principle connectible by any particle going slower than the speed of light in a vacuum). Only the timelike related events are invariantly ordered. (Time in Physics, Craig Callender.)

In C++11 memory model, a similar mechanism (the acquire-release consistency model) is used to establish these local causality relations.

To provide a definition of memory consistency and a motivation for abandoning SC, I will quote from "A Primer on Memory Consistency and Cache Coherence"

For a shared memory machine, the memory consistency model defines the architecturally visible behavior of its memory system. The correctness criterion for a single processor core partitions behavior between "one correct result" and "many incorrect alternatives". This is because the processor's architecture mandates that the execution of a thread transforms a given input state into a single well-defined output state, even on an out-of-order core. Shared memory consistency models, however, concern the loads and stores of multiple threads and usually allow many correct executions while disallowing many (more) incorrect ones. The possibility of multiple correct executions is due to the ISA allowing multiple threads to execute concurrently, often with many possible legal interleavings of instructions from different threads.

Relaxed or weak memory consistency models are motivated by the fact that most memory orderings in strong models are unnecessary. If a thread updates ten data items and then a synchronization flag, programmers usually do not care if the data items are updated in order with respect to each other but only that all data items are updated before the flag is updated (usually implemented using FENCE instructions). Relaxed models seek to capture this increased ordering flexibility and preserve only the orders that programmers "require" to get both higher performance and correctness of SC. For example, in certain architectures, FIFO write buffers are used by each core to hold the results of committed (retired) stores before writing the results to the caches. This optimization enhances performance but violates SC. The write buffer hides the latency of servicing a store miss. Because stores are common, being able to avoid stalling on most of them is an important benefit. For a single-core processor, a write buffer can be made architecturally invisible by ensuring that a load to address A returns the value of the most recent store to A even if one or more stores to A are in the write buffer. This is typically done by either bypassing the value of the most recent store to A to the load from A, where "most recent" is determined by program order, or by stalling a load of A if a store to A is in the write buffer. When multiple cores are used, each will have its own bypassing write buffer. Without write buffers, the hardware is SC, but with write buffers, it is not, making write buffers architecturally visible in a multicore processor.

Store-store reordering may happen if a core has a non-FIFO write buffer that lets stores depart in a different order than the order in which they entered. This might occur if the first store misses in the cache while the second hits or if the second store can coalesce with an earlier store (i.e., before the first store). Load-load reordering may also happen on dynamically-scheduled cores that execute instructions out of program order. That can behave the same as reordering stores on another core (Can you come up with an example interleaving between two threads?). Reordering an earlier load with a later store (a load-store reordering) can cause many incorrect behaviors, such as loading a value after releasing the lock that protects it (if the store is the unlock operation). Note that store-load reorderings may also arise due to local bypassing in the commonly implemented FIFO write buffer, even with a core that executes all instructions in program order.

Because cache coherence and memory consistency are sometimes confused, it is instructive to also have this quote:

Unlike consistency, cache coherence is neither visible to software nor required. Coherence seeks to make the caches of a shared-memory system as functionally invisible as the caches in a single-core system. Correct coherence ensures that a programmer cannot determine whether and where a system has caches by analyzing the results of loads and stores. This is because correct coherence ensures that the caches never enable new or different functional behavior (programmers may still be able to infer likely cache structure using timing information). The main purpose of cache coherence protocols is maintaining the single-writer-multiple-readers (SWMR) invariant for every memory location. An important distinction between coherence and consistency is that coherence is specified on a per-memory location basis, whereas consistency is specified with respect to all memory locations.

Continuing with our mental picture, the SWMR invariant corresponds to the physical requirement that there be at most one particle located at any one location but there can be an unlimited number of observers of any location.

Answered by eran

This is now a multiple-year-old question, but being very popular, it's worth mentioning a fantastic resource for learning about the C++11 memory model. I see no point in summing up Herb Sutter's talk in order to make this yet another full answer, but given that this is the guy who actually wrote the standard, I think the talk is well worth watching.

Herb Sutter has a three hour long talk about the C++11 memory model titled "atomic<> Weapons", available on the Channel9 site - part 1 and part 2. The talk is pretty technical, and covers the following topics:

  1. Optimizations, Races, and the Memory Model
  2. Ordering – What: Acquire and Release
  3. Ordering – How: Mutexes, Atomics, and/or Fences
  4. Other Restrictions on Compilers and Hardware
  5. Code Gen & Performance: x86/x64, IA64, POWER, ARM
  6. Relaxed Atomics

The talk doesn't elaborate on the API, but rather on the reasoning, background, under the hood and behind the scenes (did you know relaxed semantics were added to the standard only because POWER and ARM do not support synchronized load efficiently?).

Answered by Puppy

It means that the standard now defines multi-threading, and it defines what happens in the context of multiple threads. Of course, people used varying implementations, but that's like asking why we should have a std::string when we could all be using a home-rolled string class.

When you're talking about POSIX threads or Windows threads, then this is a bit of an illusion as actually you're talking about x86 threads, as it's a hardware function to run concurrently. The C++0x memory model makes guarantees, whether you're on x86, or ARM, or MIPS, or anything else you can come up with.

Answered by ritesh

For languages not specifying a memory model, you are writing code for the language and the memory model specified by the processor architecture. The processor may choose to re-order memory accesses for performance. So, if your program has data races (a data race is when it's possible for multiple cores / hyper-threads to access the same memory concurrently) then your program is not cross platform because of its dependence on the processor memory model. You may refer to the Intel or AMD software manuals to find out how the processors may re-order memory accesses.

Very importantly, locks (and concurrency semantics with locking) are typically implemented in a cross platform way... So if you are using standard locks in a multithreaded program with no data races then you don't have to worry about cross platform memory models.
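
To illustrate that point, here is a sketch of the lock-based approach, where a plain int is safe precisely because every access happens under the mutex (locked_count is an illustrative helper name, not from the answer):

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Plain (non-atomic) counter protected by a std::mutex. The lock provides
// all needed ordering and visibility portably, regardless of the
// underlying processor's memory model.
int locked_count(int n_threads, int increments_per_thread) {
    int counter = 0;  // plain int: safe because every access is locked
    std::mutex m;
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t)
        workers.emplace_back([&] {
            for (int i = 0; i < increments_per_thread; ++i) {
                std::lock_guard<std::mutex> lock(m);
                ++counter;
            }
        });
    for (auto& w : workers) w.join();
    return counter;
}
```

No memory-order annotations appear anywhere; the mutex's acquire/release behavior is implied by the standard.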

Interestingly, Microsoft's C++ compilers give volatile acquire/release semantics, a C++ extension to deal with the lack of a memory model in C++: http://msdn.microsoft.com/en-us/library/12a04hfd(v=vs.80).aspx. However, given that Windows runs on x86 / x64 only, that's not saying much (the Intel and AMD memory models make it easy and efficient to implement acquire/release semantics in a language).

Answered by ninjalj

If you use mutexes to protect all your data, you really shouldn't need to worry. Mutexes have always provided sufficient ordering and visibility guarantees.

Now, if you used atomics, or lock-free algorithms, you need to think about the memory model. The memory model describes precisely when atomics provide ordering and visibility guarantees, and provides portable fences for hand-coded guarantees.

Previously, atomics would be done using compiler intrinsics, or some higher level library. Fences would have been done using CPU-specific instructions (memory barriers).
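
The standardized replacement for those CPU-specific barriers is std::atomic_thread_fence. A sketch of a release/acquire fence pair (fence_publish is an illustrative name, not from the answer): a release fence before a relaxed store pairs with an acquire fence after a relaxed load that reads the stored value, giving the same publication guarantee as release/acquire operations on the flag itself.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Fence-based publication: the release fence, combined with the acquire
// fence in the reader that observes ready == true, orders the plain write
// to payload before the plain read, portably and without CPU-specific
// barrier instructions.
int fence_publish() {
    int payload = 0;
    std::atomic<bool> ready{false};
    int seen = -1;
    std::thread producer([&] {
        payload = 42;
        std::atomic_thread_fence(std::memory_order_release);  // portable barrier
        ready.store(true, std::memory_order_relaxed);
    });
    std::thread consumer([&] {
        while (!ready.load(std::memory_order_relaxed)) {}     // spin on flag
        std::atomic_thread_fence(std::memory_order_acquire);  // pairs with release fence
        seen = payload;
    });
    producer.join();
    consumer.join();
    return seen;
}
```

On x86 these fences typically compile to nothing beyond compiler-reordering restrictions; on ARM or POWER they emit the appropriate hardware barriers.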

Answered by Mike Spear

The above answers get at the most fundamental aspects of the C++ memory model. In practice, most uses of std::atomic<> "just work", at least until the programmer over-optimizes (e.g., by trying to relax too many things).

There is one place where mistakes are still common: sequence locks. There is an excellent and easy-to-read discussion of the challenges at https://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf. Sequence locks are appealing because the reader avoids writing to the lock word. The following code is based on Figure 1 of the above technical report, and it highlights the challenges when implementing sequence locks in C++:

atomic<uint64_t> seq; // seqlock representation
int data1, data2;     // this data will be protected by seq

void reader() {
    int r1, r2;
    unsigned seq0, seq1;
    while (true) {
        seq0 = seq;
        r1 = data1; // INCORRECT! Data Race!
        r2 = data2; // INCORRECT!
        seq1 = seq;

        // if the lock didn't change while I was reading, and
        // the lock wasn't held while I was reading, then my
        // reads should be valid
        if (seq0 == seq1 && !(seq0 & 1))
            break;
    }
    use(r1, r2);
}

void writer(int new_data1, int new_data2) {
    unsigned seq0 = seq;
    while (true) {
        if ((!(seq0 & 1)) && seq.compare_exchange_weak(seq0, seq0 + 1))
            break; // atomically moving the lock from even to odd is an acquire
    }
    data1 = new_data1;
    data2 = new_data2;
    seq = seq0 + 2; // release the lock by increasing its value to even
}

As unintuitive as it seems at first, data1 and data2 need to be atomic<>. If they are not atomic, then they could be read (in reader()) at the exact same time as they are written (in writer()). According to the C++ memory model, this is a race even if reader() never actually uses the data. In addition, if they are not atomic, then the compiler can cache the first read of each value in a register. Obviously you wouldn't want that... you want to re-read in each iteration of the while loop in reader().

It is also not sufficient to make them atomic<> and access them with memory_order_relaxed. The reason for this is that the reads of seq (in reader()) only have acquire semantics. In simple terms, if X and Y are memory accesses, X precedes Y, X is not an acquire or release, and Y is an acquire, then the compiler can reorder Y before X. If Y was the second read of seq, and X was a read of data, such a reordering would break the lock implementation.

The paper gives a few solutions. The one with the best performance today is probably the one that uses an atomic_thread_fence with memory_order_relaxed before the second read of the seqlock. In the paper, it's Figure 6. I'm not reproducing the code here, because anyone who has read this far really ought to read the paper. It is more precise and complete than this post.

The last issue is that it might be unnatural to make the data variables atomic. If you can't in your code, then you need to be very careful, because casting from non-atomic to atomic is only legal for primitive types. C++20 is supposed to add atomic_ref<>, which will make this problem easier to resolve.

To summarize: even if you think you understand the C++ memory model, you should be very careful before rolling your own sequence locks.

Answered by curiousguy

C and C++ used to be defined by an execution trace of a well formed program.

Now they are half defined by an execution trace of a program, and half a posteriori by many orderings on synchronisation objects.

Meaning that these language definitions make no sense at all, as there is no logical method to mix these two approaches. In particular, the destruction of a mutex or atomic variable is not well defined.
