C: GCC memory barrier __sync_synchronize vs asm volatile("": : :"memory")

Disclaimer: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/19965076/

Date: 2020-09-02 07:57:44  Source: igfitidea

Tags: c, gcc

Asked by user964970

asm volatile("": : :"memory") is often used as a memory barrier (e.g. as seen in the Linux kernel barrier macro).

This sounds similar to what the GCC builtin __sync_synchronize does.

Are these two similar?

If not, what are the differences, and when would one be used over the other?

Answered by Leeor

There's a significant difference: the first option (inline asm) does nothing at runtime. No instruction is emitted for it, and the CPU never sees it. It only acts at compile time, telling the compiler not to move loads or stores across this point (in either direction) as part of its optimizations. It's called a SW (software) barrier.

The second barrier (the __sync_synchronize builtin) translates into a HW barrier, probably a fence instruction (mfence/sfence) if you're on x86, or its equivalent on other architectures. The CPU also performs various optimizations at runtime, the most important being out-of-order execution. This instruction tells it that loads and stores can't pass this point and must be observed on the correct side of the sync point.
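
A minimal sketch of the two barriers side by side, assuming GCC or Clang. The names compiler_barrier, publish_sw, publish_hw, payload and ready are hypothetical and exist only for this illustration. Compiling with gcc -O2 -S and comparing the two functions should show no extra instruction for the first one and a fence (or an equivalent serializing instruction) for the second; the exact instruction depends on the target and the compiler version.

```c
/* Compiler-only (SW) barrier: emits no instruction, only constrains the compiler. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* Hypothetical shared variables, for illustration only. */
int payload;
int ready;

/* The compiler must emit the store to payload before the store to ready,
 * but the CPU may still reorder them if its memory model allows it. */
void publish_sw(int value)
{
    payload = value;
    compiler_barrier();
    ready = 1;
}

/* __sync_synchronize() is a full barrier for both the compiler and the CPU. */
void publish_hw(int value)
{
    payload = value;
    __sync_synchronize();
    ready = 1;
}
```

Both functions leave the same values in memory when called from a single thread; the difference only matters to another core or device observing the two stores.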

Here's another good explanation:

Types of Memory Barriers

As mentioned above, both compilers and processors can optimize the execution of instructions in a way that necessitates the use of a memory barrier. A memory barrier that affects both the compiler and the processor is a hardware memory barrier, and a memory barrier that only affects the compiler is a software memory barrier.

In addition to hardware and software memory barriers, a memory barrier can be restricted to memory reads, memory writes, or both. A memory barrier that affects both reads and writes is a full memory barrier.

There is also a class of memory barrier that is specific to multi-processor environments. The name of these memory barriers are prefixed with "smp". On a multi-processor system, these barriers are hardware memory barriers and on uni-processor systems, they are software memory barriers.

The barrier() macro is the only software memory barrier, and it is a full memory barrier. All other memory barriers in the Linux kernel are hardware barriers. A hardware memory barrier is an implied software barrier.

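The "smp" behaviour described in the quote can be sketched with a conditional macro. This is an illustration under assumptions, not the kernel's actual definition: MY_CONFIG_SMP, my_smp_mb, producer, consumer and the shared variables are hypothetical names, and GCC-style builtins are assumed.

```c
/* Software (compiler-only) barrier, same shape as the barrier() macro. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* Hypothetical switch: define MY_CONFIG_SMP when building for multi-processor. */
#ifdef MY_CONFIG_SMP
#define my_smp_mb() __sync_synchronize()   /* hardware full barrier */
#else
#define my_smp_mb() compiler_barrier()     /* uni-processor: software barrier only */
#endif

int shared_data;
int shared_flag;

/* Write the data first, then raise the flag. */
void producer(int value)
{
    shared_data = value;
    my_smp_mb();
    shared_flag = 1;
}

/* Returns 1 and fills *out if the flag was set, 0 otherwise. */
int consumer(int *out)
{
    if (!shared_flag)
        return 0;
    my_smp_mb();
    *out = shared_data;
    return 1;
}
```

Building with -DMY_CONFIG_SMP selects the hardware barrier; without it the same source falls back to the compiler-only barrier.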

An example of when a SW barrier is useful: consider the following code:

for (i = 0; i < N; ++i) {
    a[i]++;
}

This simple loop, compiled with optimizations, would most likely be unrolled and vectorized. Here is the assembly that gcc 4.8.0 -O3 generated, using packed (vector) operations:

400420:       66 0f 6f 00             movdqa (%rax),%xmm0
400424:       48 83 c0 10             add    $0x10,%rax
400428:       66 0f fe c1             paddd  %xmm1,%xmm0
40042c:       66 0f 7f 40 f0          movdqa %xmm0,0xfffffffffffffff0(%rax)
400431:       48 39 d0                cmp    %rdx,%rax
400434:       75 ea                   jne    400420 <main+0x30>

However, when adding your inline assembly on each iteration, gcc is not permitted to change the order of the operations past the barrier, so it can't group them, and the assembly becomes the scalar version of the loop:

400418:       83 00 01                addl   $0x1,(%rax)
40041b:       48 83 c0 04             add    $0x4,%rax
40041f:       48 39 d0                cmp    %rdx,%rax
400422:       75 f4                   jne    400418 <main+0x28>
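
The answer does not show the loop with the barrier added. A minimal sketch of the two variants, assuming GCC-style inline asm; the function and macro names are hypothetical. Comparing the output of gcc -O3 -S for both functions is one way to check whether the second loop stays scalar with a given compiler version.

```c
#include <stddef.h>

/* Compiler-only barrier, as in the question. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* Plain loop: the compiler is free to unroll and vectorize it. */
void increment_all(int *a, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        a[i]++;
}

/* Same loop with a barrier per iteration: each store must be emitted
 * before the next load, so the iterations cannot be grouped. */
void increment_all_with_barrier(int *a, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        a[i]++;
        compiler_barrier();
    }
}
```

Both functions compute the same result; only the generated code differs.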

However, when the CPU executes this code, it's permitted to reorder the operations "under the hood", as long as it does not break the memory ordering model. This means the operations can be performed out of order (if the CPU supports that, as most do these days). A HW fence would have prevented that.

Answered by Ivar Svendsen

A comment on the usefulness of SW-only barriers:

On some micro-controllers, and other embedded platforms, you may have multitasking, but no cache system or cache latency, and hence no HW barrier instructions. So you need to do things like SW spin-locks. The SW barrier prevents compiler optimizations (read/write combining and reordering) in these algorithms.

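One possible shape for such a SW spin-lock is Peterson's algorithm for two tasks, sketched below. The answer does not say which algorithm its author had in mind, so this is an assumption, and it only holds under further assumptions: a single in-order core with no cache or store-buffer effects visible between tasks, exactly two tasks with ids 0 and 1, and GCC-style inline asm. All names are hypothetical.

```c
/* Compiler-only barrier: keeps the compiler from reordering or caching
 * the accesses to the lock variables below. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* Peterson's algorithm state for two tasks (ids 0 and 1). */
int interested[2];
int turn;

void sw_lock(int self)
{
    int other = 1 - self;

    interested[self] = 1;   /* announce intent first */
    compiler_barrier();
    turn = other;           /* then yield priority to the other task */
    compiler_barrier();

    /* Spin while the other task wants the lock and has priority.
     * The barrier in the body forces both variables to be re-read. */
    while (interested[other] && turn == other)
        compiler_barrier();

    compiler_barrier();     /* keep the critical section after the lock */
}

void sw_unlock(int self)
{
    compiler_barrier();     /* keep the critical section before the release */
    interested[self] = 0;
    compiler_barrier();
}
```

Without the barriers, the compiler could merge or reorder the two stores in sw_lock or hoist the loads out of the spin loop, which is the read/write combining and reordering the answer refers to. On a multi-core or out-of-order target this sketch would need HW barriers instead.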