
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/4557979/

Date: 2020-08-28 15:44:48 | Source: igfitidea

When to use volatile with multi threading?

Tags: c++, multithreading, concurrency, atomic, volatile

Asked by David Preston

If there are two threads accessing a global variable, then many tutorials say to make the variable volatile to prevent the compiler from caching it in a register, which would keep it from being updated correctly. However, two threads both accessing a shared variable is something that calls for protection via a mutex, isn't it? But in that case, between the thread locking and releasing the mutex, the code is in a critical section where only that one thread can access the variable; in which case the variable doesn't need to be volatile?

So what, then, is the use/purpose of volatile in a multi-threaded program?

Answered by John Dibling

Short & quick answer: volatile is (nearly) useless for platform-agnostic, multithreaded application programming. It does not provide any synchronization, it does not create memory fences, nor does it ensure the order of execution of operations. It does not make operations atomic. It does not make your code magically thread-safe. volatile may be the single most misunderstood facility in all of C++. See this, this and this for more information about volatile.

On the other hand, volatile does have some use that may not be so obvious. It can be used in much the same way one would use const: to help the compiler show you where you might be making a mistake in accessing some shared resource in a non-protected way. This use is discussed by Alexandrescu in this article. However, this is basically using the C++ type system in a way that is often viewed as a contrivance and can evoke Undefined Behavior.
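
To make that trick concrete, here is a minimal sketch (my own illustration, not the article's code; the name LockingPtr follows Alexandrescu's article). Declaring the shared object volatile makes direct member calls a compile error; the only way in is through a LockingPtr, which takes the mutex and then const_casts the volatile qualifier away. As the paragraph above notes, casting away volatile on a genuinely volatile object and then accessing it is technically Undefined Behavior:

```cpp
#include <mutex>

// The shared object is declared volatile, so calling its (non-volatile)
// member functions directly does not compile. LockingPtr acquires the mutex
// and strips the volatile qualifier, so access implies holding the lock.
class Counter {
public:
    void increment() { ++value_; }      // only reachable while the lock is held
    int  get() const { return value_; }
private:
    int value_ = 0;
};

template <typename T>
class LockingPtr {
public:
    LockingPtr(volatile T& obj, std::mutex& mtx)
        : ptr_(const_cast<T*>(&obj)), lock_(mtx) {}
    T* operator->() { return ptr_; }
private:
    T* ptr_;                            // declared first: initialized before lock_
    std::lock_guard<std::mutex> lock_;
};

volatile Counter shared_counter;        // direct shared_counter.increment() won't compile
std::mutex counter_mutex;

int locked_increment() {
    LockingPtr<Counter> p(shared_counter, counter_mutex);
    p->increment();
    return p->get();
}
```

Here forgetting to take the lock becomes a type error rather than a latent race, which is the whole point of the trick.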

volatile was specifically intended to be used when interfacing with memory-mapped hardware, signal handlers, and the setjmp machine code instruction. This makes volatile directly applicable to systems-level programming rather than normal applications-level programming.
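
For example, a sketch of the signal-handler case (handler and flag names are my own): volatile std::sig_atomic_t is one of the few things the standard permits a signal handler to write and the interrupted code to read.

```cpp
#include <csignal>

// A flag set asynchronously from a signal handler. volatile forces a real
// memory store in the handler and a real memory load in the polling code.
volatile std::sig_atomic_t got_signal = 0;

void on_signal(int) {
    got_signal = 1;              // asynchronous write from the handler
}

int wait_for_signal_demo() {
    std::signal(SIGINT, on_signal);
    std::raise(SIGINT);          // stand-in for an asynchronous signal
    return got_signal;           // volatile read: cannot be cached in a register
}
```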

The 2003 C++ Standard does not say that volatile applies any kind of Acquire or Release semantics on variables. In fact, the Standard is completely silent on all matters of multithreading. However, specific platforms do apply Acquire and Release semantics on volatile variables.

[Update for C++11]


The C++11 Standard now does acknowledge multithreading directly in the memory model and the language, and it provides library facilities to deal with it in a platform-independent way. However, the semantics of volatile still have not changed. volatile is still not a synchronization mechanism. Bjarne Stroustrup says as much in TCPPPL4E:

Do not use volatile except in low-level code that deals directly with hardware.

Do not assume volatile has special meaning in the memory model. It does not. It is not -- as in some later languages -- a synchronization mechanism. To get synchronization, use atomic, a mutex, or a condition_variable.
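
A brief sketch of what that advice looks like in practice (my own example, assuming only C++11 library facilities): a plain bool guarded by a mutex and signalled with a condition_variable, with no volatile anywhere.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// The tools the quote names, instead of volatile. `ready` is a plain bool,
// which is safe because every access happens while `m` is held.
std::mutex m;
std::condition_variable cv;
bool ready = false;
int payload = 0;

void produce(int value) {
    {
        std::lock_guard<std::mutex> lock(m);
        payload = value;
        ready = true;
    }
    cv.notify_one();                      // wake the waiting consumer
}

int consume() {
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [] { return ready; }); // blocks; no busy-wait, no volatile
    return payload;
}

int demo() {
    int seen = 0;
    std::thread consumer([&] { seen = consume(); });
    produce(42);
    consumer.join();
    return seen;
}
```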

[/End update]

The above all applies to the C++ language itself, as defined by the 2003 Standard (and now the 2011 Standard). Some specific platforms, however, do add additional functionality or restrictions to what volatile does. For example, in MSVC 2010 (at least), Acquire and Release semantics do apply to certain operations on volatile variables. From the MSDN:

When optimizing, the compiler must maintain ordering among references to volatile objects as well as references to other global objects. In particular,

A write to a volatile object (volatile write) has Release semantics; a reference to a global or static object that occurs before a write to a volatile object in the instruction sequence will occur before that volatile write in the compiled binary.

A read of a volatile object (volatile read) has Acquire semantics; a reference to a global or static object that occurs after a read of volatile memory in the instruction sequence will occur after that volatile read in the compiled binary.


However, you might take note of the fact that if you follow the above link, there is some debate in the comments as to whether or not Acquire/Release semantics actually apply in this case.

Answered by zeuxcg

(Editor's note: in C++11, volatile is not the right tool for this job and still has data-race UB. Use std::atomic<bool> with std::memory_order_relaxed loads/stores to do this without UB. On real implementations it will compile to the same asm as volatile. I added an answer with more detail, which also addresses the misconception in comments that weakly-ordered memory might be a problem for this use-case: all real-world CPUs have coherent shared memory, so volatile will work for this on real C++ implementations. But still don't do it.

Some discussion in comments seems to be talking about other use-cases where you would need something stronger than relaxed atomics. This answer already points out that volatile gives you no ordering.)



Volatile is occasionally useful, for the following reason: this code:

/* global */ bool flag = false;

while (!flag) {}

is optimized by gcc to:

if (!flag) { while (true) {} }

This is obviously incorrect if the flag is written to by the other thread. Note that without this optimization the synchronization mechanism probably works (depending on the other code, some memory barriers may be needed) - there is no need for a mutex in a 1-producer / 1-consumer scenario.
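
As a sketch of that 1-producer / 1-consumer handoff, written with C++11 atomics (the portable way to express the memory barriers mentioned above) rather than raw volatile: the release store pairs with the acquire load, so the non-atomic payload is published safely.

```cpp
#include <atomic>
#include <thread>

// 1 producer / 1 consumer without a mutex: release/acquire on `ready`
// provides the ordering, so the plain write to `data` is visible once
// the consumer observes ready == true.
int data = 0;
std::atomic<bool> ready{false};

void producer() {
    data = 42;                                       // plain, non-atomic write
    ready.store(true, std::memory_order_release);    // publish
}

int consumer() {
    while (!ready.load(std::memory_order_acquire)) {}  // spin until published
    return data;                                        // guaranteed to see 42
}

int demo() {
    int seen = 0;
    std::thread t([&] { seen = consumer(); });
    producer();
    t.join();
    return seen;
}
```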

Otherwise the volatile keyword is too weird to be usable - it does not provide any memory-ordering guarantees with respect to either volatile or non-volatile accesses, and does not provide any atomic operations - i.e. you get no help from the compiler with the volatile keyword except disabled register caching.

Answered by Peter Cordes

In C++11, normally never use volatile for threading, only for MMIO.

But TL:DR, it does "work" sort of like atomic with mo_relaxed on hardware with coherent caches (i.e. everything); it is sufficient to stop compilers from keeping vars in registers. atomic doesn't need memory barriers to create atomicity or inter-thread visibility, only to make the current thread wait before/after an operation to create ordering between this thread's accesses to different variables. mo_relaxed never needs any barriers, just a load, store, or RMW.

For roll-your-own atomics with volatile (and inline asm for barriers) in the bad old days before C++11 std::atomic, volatile was the only good way to get some things to work. But it depended on a lot of assumptions about how implementations worked and was never guaranteed by any standard.

For example the Linux kernel still uses its own hand-rolled atomics with volatile, but only supports a few specific C implementations (GNU C, clang, and maybe ICC). Partly that's because of GNU C extensions and inline asm syntax and semantics, but also because it depends on some assumptions about how compilers work.

It's almost always the wrong choice for new projects; you can use std::atomic (with std::memory_order_relaxed) to get a compiler to emit the same efficient machine code you could with volatile. std::atomic with mo_relaxed obsoletes volatile for threading purposes. (Except maybe to work around missed-optimization bugs with atomic<double> on some compilers.)

The internal implementation of std::atomic on mainstream compilers (like gcc and clang) does not just use volatile internally; compilers directly expose atomic load, store, and RMW builtin functions. (e.g. the GNU C __atomic builtins, which operate on "plain" objects.)



Volatile is usable in practice (but don't do it)

That said, volatile is usable in practice for things like an exit_now flag on all(?) existing C++ implementations on real CPUs, because of how CPUs work (coherent caches) and shared assumptions about how volatile should work. But not much else, and it is not recommended. The purpose of this answer is to explain how existing CPUs and C++ implementations actually work. If you don't care about that, all you need to know is that std::atomic with mo_relaxed obsoletes volatile for threading.

(The ISO C++ standard is pretty vague on it, just saying that volatile accesses should be evaluated strictly according to the rules of the C++ abstract machine, not optimized away. Given that real implementations use the machine's memory address-space to model C++ address-space, this means volatile reads and assignments have to compile to load/store instructions that access the object-representation in memory.)



As another answer points out, an exit_now flag is a simple case of inter-thread communication that doesn't need any synchronization: it's not publishing that array contents are ready or anything like that. Just a store that's noticed promptly by a not-optimized-away load in another thread.

    // global
    bool exit_now = false;

    // in one thread
    while (!exit_now) { do_stuff; }

    // in another thread, or signal handler in this thread
    exit_now = true;

Without volatile or atomic, the as-if rule and the assumption of no data-race UB allow a compiler to optimize it into asm that only checks the flag once, before entering (or not) an infinite loop. This is exactly what happens in real life for real compilers. (And they usually optimize away much of do_stuff, because the loop never exits, so any later code that might have used the result is not reachable if we enter the loop.)

    // Optimizing compilers transform the loop into asm like this
    if (!exit_now) {        // check once before entering loop
        while(1) do_stuff;  // infinite loop
    }

Multithreading program stuck in optimized mode but runs normally in -O0 is an example (with a description of GCC's asm output) of how exactly this happens with GCC on x86-64. Also, MCU programming - C++ O2 optimization breaks while loop on electronics.SE shows another example.

We normally want aggressive optimizations that CSE and hoist loads out of loops, including for global variables.

Before C++11, volatile bool exit_now was one way to make this work as intended (on normal C++ implementations). But in C++11, data-race UB still applies to volatile, so it's not actually guaranteed by the ISO standard to work everywhere, even assuming HW with coherent caches.

Note that for wider types, volatile gives no guarantee of lack of tearing. I ignored that distinction here for bool because it's a non-issue on normal implementations. But that's also part of why volatile is still subject to data-race UB instead of being equivalent to relaxed atomic.

Note that "as intended" doesn't mean the thread doing exit_now waits for the other thread to actually exit. Or even that it waits for the volatile exit_now = true store to be globally visible before continuing to later operations in this thread. (atomic<bool> with the default mo_seq_cst would make it wait before any later seq_cst loads, at least. On many ISAs you'd just get a full barrier after the store.)

C++11 provides a non-UB way that compiles the same

A "keep running" or "exit now" flag should use std::atomic<bool> flag with mo_relaxed.

Using

  • flag.store(true, std::memory_order_relaxed)
  • while( !flag.load(std::memory_order_relaxed) ) { ... }

will give you the exact same asm (with no expensive barrier instructions) that you'd get from volatile flag.
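
For completeness, here is a runnable sketch of such a flag (my own example). The worker loop exits only because the relaxed load is re-executed on every iteration; with a hoisted load (the non-atomic, non-volatile behavior shown earlier), join() would never return.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Relaxed-atomic exit flag: no data-race UB, and on mainstream compilers the
// loop compiles to a plain re-loading test, same as a volatile flag would.
std::atomic<bool> exit_now{false};

long run_worker() {
    long iterations = 0;
    while (!exit_now.load(std::memory_order_relaxed)) {
        ++iterations;   // stand-in for do_stuff
    }
    return iterations;  // reached only because the load is not hoisted
}

bool demo() {
    long iters = -1;
    std::thread worker([&] { iters = run_worker(); });
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    exit_now.store(true, std::memory_order_relaxed);
    worker.join();      // returns promptly: the worker saw the store
    return iters >= 0;
}
```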

As well as no tearing, atomic also gives you the ability to store in one thread and load in another without UB, so the compiler can't hoist the load out of a loop. (The assumption of no data-race UB is what allows the aggressive optimizations we want for non-atomic, non-volatile objects.) This feature of atomic<T> is pretty much the same as what volatile does for pure loads and pure stores.

atomic<T> also makes += and so on into atomic RMW operations (significantly more expensive than an atomic load into a temporary, operate, then a separate atomic store). If you don't want an atomic RMW, write your code with a local temporary.
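
A small illustration of that difference (my own sketch): the first function is a single atomic RMW; the second is two cheaper relaxed operations, but the increment itself is not atomic, so a concurrent update could be lost between the load and the store.

```cpp
#include <atomic>

std::atomic<int> counter{0};

void atomic_increment() {
    counter += 1;   // one atomic RMW (e.g. a `lock add` instruction on x86)
}

void nonatomic_update() {
    int tmp = counter.load(std::memory_order_relaxed);  // separate atomic load
    tmp += 1;                                           // local, non-atomic work
    counter.store(tmp, std::memory_order_relaxed);      // separate atomic store
}
```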

With the default seq_cst ordering you'd get from while(!flag), it also adds ordering guarantees with respect to non-atomic accesses, and to other atomic accesses.

(In theory, the ISO C++ standard doesn't rule out compile-time optimization of atomics. But in practice compilers don't, because there's no way to control when that wouldn't be ok. There are a few cases where even volatile atomic<T> might not be enough control over optimization of atomics if compilers did optimize, so for now compilers don't. See Why don't compilers merge redundant std::atomic writes? Note that wg21/p0062 recommends against using volatile atomic in current code to guard against optimization of atomics.)



volatile does actually work for this on real CPUs (but still don't use it)

even with weakly-ordered memory models (non-x86). But don't actually use it; use atomic<T> with mo_relaxed instead!! The point of this section is to address misconceptions about how real CPUs work, not to justify volatile. If you're writing lockless code, you probably care about performance. Understanding caches and the costs of inter-thread communication is usually important for good performance.

Real CPUs have coherent caches / shared memory: after a store from one core becomes globally visible, no other core can load a stale value. (See also Myths Programmers Believe about CPU Caches, which talks some about Java volatiles, equivalent to C++ atomic<T> with seq_cst memory order.)

When I say load, I mean an asm instruction that accesses memory. That's what a volatile access ensures, and it is not the same thing as the lvalue-to-rvalue conversion of a non-atomic / non-volatile C++ variable (e.g. local_tmp = flag or while(!flag)).

The only thing you need to defeat is compile-time optimizations that don't reload at all after the first check. Any load + check on each iteration is sufficient, without any ordering. Without synchronization between this thread and the main thread, it's not meaningful to talk about when exactly the store happened, or the ordering of the load with respect to other operations in the loop. All that matters is when it becomes visible to this thread. When you see the exit_now flag set, you exit. Inter-core latency on a typical x86 Xeon can be something like 40ns between separate physical cores.



In theory: C++ threads on hardware without coherent caches

I don't see any way this could be remotely efficient with just pure ISO C++, without requiring the programmer to do explicit flushes in the source code.

In theory you could have a C++ implementation on a machine that wasn't like this, requiring compiler-generated explicit flushes to make things visible to other threads on other cores. (Or for reads to not use a maybe-stale copy.) The C++ standard doesn't make this impossible, but C++'s memory model is designed around being efficient on coherent shared-memory machines. E.g. the C++ standard even talks about "read-read coherence", "write-read coherence", etc. One note in the standard even points out the connection to hardware:

http://eel.is/c++draft/intro.races#19

[ Note: The four preceding coherence requirements effectively disallow compiler reordering of atomic operations to a single object, even if both operations are relaxed loads. This effectively makes the cache coherence guarantee provided by most hardware available to C++ atomic operations.— end note ]


There's no mechanism for a release store to only flush itself and a few select address-ranges: it would have to sync everything, because it wouldn't know what other threads might want to read if their acquire-load saw this release-store (forming a release-sequence that establishes a happens-before relationship across threads, guaranteeing that earlier non-atomic operations done by the writing thread are now safe to read -- unless it did further writes to them after the release store...). Or compilers would have to be really smart to prove that only a few cache lines needed flushing.

Related: my answer on Is mov + mfence safe on NUMA? goes into detail about the non-existence of x86 systems without coherent shared memory. Also related: Loads and stores reordering on ARM, for more about loads/stores to the same location.

There are, I think, clusters with non-coherent shared memory, but they're not single-system-image machines. Each coherency domain runs a separate kernel, so you can't run threads of a single C++ program across it. Instead you run separate instances of the program (each with its own address space: pointers in one instance aren't valid in the other).

To get them to communicate with each other via explicit flushes, you'd typically use MPI or other message-passing API to make the program specify which address ranges need flushing.



Real hardware doesn't run std::thread across cache-coherency boundaries:

Some asymmetric ARM chips exist, with a shared physical address space but not inner-shareable cache domains. So, not coherent. (e.g. a comment thread about an A8 core and a Cortex-M3, like the TI Sitara AM335x).

But different kernels would run on those cores, not a single system image that could run threads across both cores. I'm not aware of any C++ implementations that run std::thread threads across CPU cores without coherent caches.

For ARM specifically, GCC and clang generate code assuming all threads run in the same inner-shareable domain. In fact, the ARMv7 ISA manual says

This architecture (ARMv7) is written with an expectation that all processors using the same operating system or hypervisor are in the same Inner Shareable shareability domain

So non-coherent shared memory between separate domains is only a thing for explicit system-specific use of shared memory regions for communication between different processes under different kernels.

See also this CoreCLR discussion about code-gen using dmb ish (Inner Shareable barrier) vs. dmb sy (System) memory barriers in that compiler.

I make the assertion that no C++ implementation for any other ISA runs std::thread across cores with non-coherent caches. I don't have proof that no such implementation exists, but it seems highly unlikely. Unless you're targeting a specific exotic piece of HW that works that way, your thinking about performance should assume MESI-like cache coherency between all threads. (Preferably use atomic<T> in ways that guarantee correctness, though!)



Coherent caches make it simple

But on a multi-core system with coherent caches, implementing a release-store just means ordering commit into cache for this thread's stores, not doing any explicit flushing. (https://preshing.com/20120913/acquire-and-release-semantics/ and https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/). (And an acquire-load means ordering access to cache in the other core.)

A memory-barrier instruction just blocks the current thread's loads and/or stores until the store buffer drains; that always happens as fast as possible on its own. (Does a memory barrier ensure that the cache coherence has been completed? addresses this misconception.) So if you don't need ordering, just prompt visibility in other threads, mo_relaxed is fine. (And so is volatile, but don't do that.)

See also C/C++11 mappings to processors

Fun fact: on x86, every asm store is a release-store because the x86 memory model is basically seq-cst plus a store buffer (with store forwarding).



Semi-related re: store buffer, global visibility, and coherency: C++11 guarantees very little. Most real ISAs (except PowerPC) do guarantee that all threads can agree on the order of appearance of two stores by two other threads. (In formal computer-architecture memory-model terminology, they're "multi-copy atomic".)

Another misconception is that memory-fence asm instructions are needed to flush the store buffer so other cores can see our stores at all. Actually the store buffer is always trying to drain itself (commit to L1d cache) as fast as possible, otherwise it would fill up and stall execution. What a full barrier / fence does is stall the current thread until the store buffer is drained, so our later loads appear in the global order after our earlier stores.

(x86's strongly-ordered asm memory model means that volatile on x86 may end up giving you closer to mo_acq_rel, except that compile-time reordering with non-atomic variables can still happen. But most non-x86 ISAs have weakly-ordered memory models, so volatile and relaxed are about as weak as mo_relaxed allows.)

Answered by Anu Siril

#include <iostream>
#include <thread>
#include <unistd.h>
using namespace std;

bool checkValue = false;  // plain non-atomic shared bool: a data race, i.e. UB

int main()
{
    std::thread writer([&](){
            sleep(2);
            checkValue = true;
            std::cout << "Value of checkValue set to " << checkValue << std::endl;
        });

    std::thread reader([&](){
            while(!checkValue);  // at -O3 this load may be hoisted: infinite loop
        });

    writer.join();
    reader.join();
}

Once an interviewer, who also believed that volatile is useless, argued with me that optimisation wouldn't cause any issues, and was referring to different cores having separate cache lines and all that (didn't really understand what he was exactly referring to). But when this piece of code is compiled with -O3 on g++ (g++ -O3 thread.cpp -lpthread), it shows undefined behaviour. Basically, if the value gets set before the while check, it works fine; if not, it goes into a loop without bothering to fetch the value (which was actually changed by the other thread). Basically, I believe the value of checkValue only gets fetched once into a register and never gets checked again under the highest level of optimisation. If it's set to true before the fetch, it works fine; if not, it goes into a loop. Please correct me if I am wrong.

Answered by ctrl-alt-delor

You need volatile and possibly locking.

volatile tells the optimiser that the value can change asynchronously, thus

volatile bool flag = false;

while (!flag) {
    /*do something*/
}

will read flag every time around the loop.

If you turn optimisation off, or make every variable volatile, a program will behave the same but slower. volatile just means 'I know you may have just read it and know what it says, but if I say read it, then read it.'

Locking is a part of the program. So, by the way, if you are implementing semaphores then, among other things, they must be volatile. (Don't try it, it is hard, will probably need a little assembler or the new atomic stuff, and it has already been done.)