C++: difference between rdtscp, rdtsc with a memory clobber, and cpuid / rdtsc?

Question

Asked by Steve Lorimer

Assume we're trying to use the tsc for performance monitoring and we want to prevent instruction reordering.

These are our options:

1: rdtscp is a serializing call. It prevents reordering around the call to rdtscp.

__asm__ __volatile__("rdtscp; "         // serializing read of tsc
                     "shl $32,%%rdx; "  // shift higher 32 bits stored in rdx up
                     "or %%rdx,%%rax"   // and or onto rax
                     : "=a"(tsc)        // output to tsc variable
                     :
                     : "%rcx", "%rdx"); // rcx and rdx are clobbered

However, rdtscp is only available on newer CPUs, so on older CPUs we have to fall back to rdtsc. But rdtsc is non-serializing, so using it alone will not prevent the CPU from reordering it.
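
Whether rdtscp is available can be checked at run time. The following is a minimal sketch, assuming GCC or Clang on x86-64 and the `__get_cpuid` helper from `<cpuid.h>`; the function name `has_rdtscp` is hypothetical and not part of the original question. It reads CPUID leaf 0x80000001 and tests EDX bit 27, which is the documented RDTSCP feature flag.

```cpp
#include <cpuid.h>  // GCC/Clang wrapper around the CPUID instruction

// Hypothetical helper: returns true if the CPU reports RDTSCP support
// (CPUID leaf 0x80000001, EDX bit 27).
bool has_rdtscp() {
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    // __get_cpuid returns 0 when the requested leaf is not supported.
    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx)) {
        return false;
    }
    return (edx & (1u << 27)) != 0;
}
```

A caller could use the result once at startup to choose between an rdtscp-based and an rdtsc-based timing routine.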

So we can use either of these two options to prevent reordering:

2: This is a call to cpuid and then rdtsc. cpuid is a serializing call.

volatile int dont_remove __attribute__((unused)); // volatile to stop optimizing
unsigned tmp;
__cpuid(0, tmp, tmp, tmp, tmp);                   // cpuid is a serialising call
dont_remove = tmp;                                // prevent optimizing out cpuid

__asm__ __volatile__("rdtsc; "          // read of tsc
                     "shl $32,%%rdx; "  // shift higher 32 bits stored in rdx up
                     "or %%rdx,%%rax"   // and or onto rax
                     : "=a"(tsc)        // output to tsc
                     :
                     : "%rcx", "%rdx"); // rcx and rdx are clobbered

3: This is a call to rdtsc with memory in the clobber list, which prevents reordering.

__asm__ __volatile__("rdtsc; "          // read of tsc
                     "shl $32,%%rdx; "  // shift higher 32 bits stored in rdx up
                     "or %%rdx,%%rax"   // and or onto rax
                     : "=a"(tsc)        // output to tsc
                     :
                     : "%rcx", "%rdx", "memory"); // rcx and rdx are clobbered
                                                  // memory to prevent reordering

My understanding for the 3rd option is as follows:

Making the call __volatile__ prevents the optimizer from removing the asm or moving it across any instructions that could need the results (or change the inputs) of the asm. However, it could still move it with respect to unrelated operations. So __volatile__ is not enough.

Tell the compiler that memory is being clobbered (: "memory"). The "memory" clobber means that GCC cannot make any assumptions about memory contents remaining the same across the asm, and thus will not reorder around it.

So my questions are:

1: Is my understanding of __volatile__ and "memory" correct?
2: Do the second two calls do the same thing?
3: Using "memory" looks much simpler than using another serializing instruction. Why would anyone use the 2nd option over the 3rd option?

Answer 1

Accepted answer by janneb

As mentioned in a comment, there's a difference between a compiler barrier and a processor barrier. volatile and memory in the asm statement act as a compiler barrier, but the processor is still free to reorder instructions.
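
The distinction can be sketched in code. This is an illustrative example only, assuming GCC or Clang on x86-64 with SSE2; the names `compiler_barrier`, `load_fence`, and `publish_and_read` are hypothetical. The first function emits no machine instruction and only constrains the compiler, while the second emits a real lfence instruction that the processor has to honor.

```cpp
#include <emmintrin.h>  // _mm_lfence (SSE2)

// Compiler barrier: generates no instruction. It only tells the compiler
// that memory may have changed, so memory accesses are not moved across it.
inline void compiler_barrier() {
    __asm__ __volatile__("" ::: "memory");
}

// Processor barrier: emits an lfence instruction, which the CPU itself
// must respect when ordering execution.
inline void load_fence() {
    _mm_lfence();
}

// The store must be emitted before the barrier and the load after it;
// the compiler cannot simply assume the value it just wrote is unchanged.
int publish_and_read(int* slot) {
    *slot = 42;
    compiler_barrier();
    return *slot;
}
```

Inspecting the generated assembly (for example with `-S`) should show nothing at the point of `compiler_barrier()` and an `lfence` at the point of `load_fence()`.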

Processor barriers are special instructions that must be given explicitly, e.g. rdtscp, cpuid, memory fence instructions (mfence, lfence, ...) etc.

As an aside, while using cpuid as a barrier before rdtsc is common, it can also be very bad from a performance perspective, since virtual machine platforms often trap and emulate the cpuid instruction in order to impose a common set of CPU features across multiple machines in a cluster (to ensure that live migration works). Thus it's better to use one of the memory fence instructions.

The Linux kernel uses mfence;rdtsc on AMD platforms and lfence;rdtsc on Intel. If you don't want to bother with distinguishing between these, mfence;rdtsc works on both, although it's slightly slower, as mfence is a stronger barrier than lfence.

Edit 2019-11-25: As of Linux kernel 5.4, lfence is used to serialize rdtsc on both Intel and AMD. See this commit "x86: Remove X86_FEATURE_MFENCE_RDTSC": https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be261ffce6f13229dad50f59c5e491f933d3167f
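
As a rough illustration of the lfence;rdtsc pattern, here is a minimal sketch using compiler intrinsics instead of inline asm. It assumes GCC or Clang on x86-64 and the `<x86intrin.h>` header; the function name `fenced_rdtsc` is hypothetical, and this is not the kernel's actual implementation.

```cpp
#include <cstdint>
#include <x86intrin.h>  // __rdtsc, _mm_lfence

// Hypothetical helper: read the TSC with an lfence in front of it, so the
// read does not execute before earlier instructions have completed.
inline std::uint64_t fenced_rdtsc() {
    _mm_lfence();                        // wait for prior instructions
    const std::uint64_t tsc = __rdtsc(); // read the time-stamp counter
    _mm_lfence();                        // keep later instructions from starting before the read
    return tsc;
}
```

The trailing fence is an optional extra in this sketch; it is only needed when later instructions must not begin before the counter has been read.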

Answer 2

Answer by Pranjal Verma

You can use it as shown below:

asm volatile (
    "CPUID\n\t"           /* serialize */
    "RDTSC\n\t"           /* read the clock */
    "mov %%edx, %0\n\t"
    "mov %%eax, %1\n\t"
    : "=r" (cycles_high), "=r" (cycles_low)
    :
    : "%rax", "%rbx", "%rcx", "%rdx");

/* Call the function to benchmark */

asm volatile (
    "RDTSCP\n\t"          /* read the clock */
    "mov %%edx, %0\n\t"
    "mov %%eax, %1\n\t"
    "CPUID\n\t"           /* serialize */
    : "=r" (cycles_high1), "=r" (cycles_low1)
    :
    : "%rax", "%rbx", "%rcx", "%rdx");

In the code above, the first CPUID call implements a barrier to avoid out-of-order execution of the instructions above and below the RDTSC instruction. With this method we avoid calling a CPUID instruction in between the reads of the real-time registers.

The first RDTSC then reads the timestamp register and the value is stored in memory. Then the code that we want to measure is executed. The RDTSCP instruction reads the timestamp register for the second time and guarantees that the execution of all the code we wanted to measure is completed. The two "mov" instructions that follow store the edx and eax register values into memory. Finally, a CPUID call guarantees that a barrier is implemented again, so that no instruction coming afterwards can be executed before CPUID itself.
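
The same sequence can be wrapped in a reusable function. The following is a sketch under stated assumptions, not the answer author's code: it assumes GCC or Clang on x86-64, a CPU that supports rdtscp, and the `__rdtsc` / `__rdtscp` intrinsics from `<x86intrin.h>`. The names `cpuid_barrier` and `measure_cycles` are hypothetical.

```cpp
#include <cstdint>
#include <x86intrin.h>  // __rdtsc, __rdtscp

// Hypothetical helper: execute CPUID (leaf 0) purely for its serializing
// effect. The "memory" clobber also makes it a compiler barrier.
inline void cpuid_barrier() {
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                         :
                         : "memory");
}

// Hypothetical helper: returns the TSC difference around work(), following
// the CPUID; RDTSC ... RDTSCP; CPUID ordering described above.
template <typename F>
std::uint64_t measure_cycles(F&& work) {
    cpuid_barrier();                          // serialize before the first read
    const std::uint64_t start = __rdtsc();    // first timestamp
    __asm__ __volatile__("" ::: "memory");    // keep memory accesses of work() after the read
    work();                                   // code under test
    __asm__ __volatile__("" ::: "memory");    // keep memory accesses of work() before the read
    unsigned aux = 0;                         // receives IA32_TSC_AUX, unused here
    const std::uint64_t stop = __rdtscp(&aux); // waits for earlier instructions to execute
    cpuid_barrier();                          // serialize after the second read
    return stop - start;
}
```

It could be called as `measure_cycles([] { /* code to benchmark */ })`. The result includes the overhead of the timing instructions themselves, so in practice that overhead is usually measured with an empty body and subtracted.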
