C++ 如何在 GCC x86 中使用 RDTSC 计算时钟周期?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/9887839/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-27 13:22:27  来源:igfitidea点击:

How to count clock cycles with RDTSC in GCC x86?

c++cgccx86rdtsc

提问by Johan R?de

With Visual Studio I can read the clock cycle count from the processor as shown below. How do I do the same thing with GCC?

使用 Visual Studio,我可以从处理器读取时钟周期计数,如下所示。我如何用 GCC 做同样的事情?

#ifdef _MSC_VER             // Compiler: Microsoft Visual Studio

    #ifdef _M_IX86                      // Processor: x86

        inline uint64_t clockCycleCount()
        {
            uint64_t c;
            __asm {
                cpuid       // serialize processor
                rdtsc       // read time stamp counter
                mov dword ptr [c + 0], eax
                mov dword ptr [c + 4], edx
            }
            return c;
        }

    #elif defined(_M_X64)               // Processor: x64

        extern "C" unsigned __int64 __rdtsc();
        #pragma intrinsic(__rdtsc)
        inline uint64_t clockCycleCount()
        {
            return __rdtsc();
        }

    #endif

#endif

回答by Evan Shaw

The other answers work, but you can avoid inline assembly by using GCC's __rdtscintrinsic, available by including x86intrin.h.

其他答案有效,但您可以通过使用 GCC 的__rdtsc内在函数来避免内联汇编,包括x86intrin.h.

It is defined at: gcc/config/i386/ia32intrin.h:

它定义于gcc/config/i386/ia32intrin.h

/* rdtsc */
extern __inline unsigned long long
__attribute__((__gnu_inline__, __always_inline__, __artificial__))
__rdtsc (void)
{
  return __builtin_ia32_rdtsc ();
}

回答by Andrew Tomazos

On recent versions of Linux gettimeofday will incorporate nanosecond timings.

在最新版本的 Linux 上,gettimeofday 将包含纳秒计时。

If you really want to call RDTSC you can use the following inline assembly:

如果你真的想调用 RDTSC,你可以使用以下内联程序集:

http://www.mcs.anl.gov/~kazutomo/rdtsc.html

http://www.mcs.anl.gov/~kazutomo/rdtsc.html

#if defined(__i386__)

static __inline__ unsigned long long rdtsc(void)
{
    unsigned long long int x;
    __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
    return x;
}

#elif defined(__x86_64__)

static __inline__ unsigned long long rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}

#endif

回答by Peter Cordes

Update: reposted and updated this answeron a more canonical question. I'll probably delete this at some point once we sort out which question to use as the duplicate target for closing all the similar rdtscquestions.

更新:在一个更规范的问题上重新发布并更新了这个答案。一旦我们整理出哪个问题用作关闭所有类似rdtsc问题的重复目标,我可能会在某个时候删除它。



You don't need and shouldn't use inline asm for this. There's no benefit; compilers have built-ins for rdtscand rdtscp, and (at least these days) all define a __rdtscintrinsic if you include the right headers. https://gcc.gnu.org/wiki/DontUseInlineAsm

您不需要也不应该为此使用内联 asm。没有任何好处;编译器内置了rdtscand rdtscp,并且(至少现在)__rdtsc如果包含正确的头文件,它们都会定义一个内在函数。 https://gcc.gnu.org/wiki/DontUseInlineAsm

Unfortunately MSVC disagrees with everyone else about which header to use for non-SIMD intrinsics. (Intel's intriniscs guide says#include <immintrin.h>for this, but with gcc and clang the non-SIMD intrinsics are mostly in x86intrin.h.)

不幸的是,对于非 SIMD 内在函数使用哪个标头,MSVC 不同意其他所有人的意见。(英特尔的内部函数指南#include <immintrin.h>对此进行了说明,但是对于 gcc 和 clang,非 SIMD 内部函数主要在x86intrin.h.)

#ifdef _MSC_VER
#include <intrin.h>
#else
#include <x86intrin.h>
#endif

// optional wrapper if you don't want to just use __rdtsc() everywhere
inline
unsigned long long readTSC() {
    // _mm_lfence();  // optionally wait for earlier insns to retire before reading the clock
    return __rdtsc();
    // _mm_lfence();  // optionally block later instructions until rdtsc retires
}

Compiles with all 4 of the major compilers: gcc/clang/ICC/MSVC, for 32 or 64-bit.See the results on the Godbolt compiler explorer.

与所有 4 个主要编译器一起编译:gcc/clang/ICC/MSVC,适用于 32 位或 64 位。在 Godbolt 编译器浏览器上查看结果

For more about using lfenceto improve repeatability of rdtsc, see @HadiBrais' answer on clflush to invalidate cache line via C function.

有关lfence用于提高 的可重复性的更多信息rdtsc,请参阅 @HadiBrais 关于clflush to invalidate cache line via C function的回答。

See also Is LFENCE serializing on AMD processors?(TL:DR yes with Spectre mitigation enabled, otherwise kernels leave the relevant MSR unset.)

另请参阅LFENCE 是否在 AMD 处理器上进行序列化?(TL:DR 是启用 Spectre 缓解,否则内核会保留相关的 MSR 未设置。)



rdtsccounts referencecycles, not CPU core clock cycles

rdtsc计数参考周期,而不是 CPU 内核时钟周期

It counts at a fixed frequency regardless of turbo / power-saving, so if you want uops-per-clock analysis, use performance counters. rdtscis exactly correlated with wall-clock time (except for system clock adjustments, so it's basically steady_clock). It ticks at the CPU's rated frequency, i.e. the advertised sticker frequency.

无论涡轮增压/节能如何,它都以固定频率计数,因此如果您想要每时钟 uops 分析,请使用性能计数器。 rdtsc与挂钟时间完全相关(系统时钟调整除外,因此基本上是steady_clock)。它在 CPU 的额定频率下滴答作响,即广告标贴频率。

If you use it for microbenchmarking, include a warm-up period first to make sure your CPU is already at max clock speed before you start timing. Or better, use a library that gives you access to hardware performance counters, or a trick like perf stat for part of programif your timed region is long enough that you can attach a perf stat -p PID. You usually will still want to avoid CPU frequency shifts during your microbenchmark, though.

如果您将它用于微基准测试,请先包括一个预热期,以确保您的 CPU 在开始计时之前已经处于最大时钟速度。或者更好的,使用一个库,使您可以访问硬件性能计数器,或一招类似PERF的统计为计划的一部分,如果你的计时区是足够长的时间,你可以附上perf stat -p PID。不过,您通常仍希望在微基准测试期间避免 CPU 频率偏移。

It's also not guaranteed that the TSCs of all cores are in sync. So if your thread migrates to another CPU core between __rdtsc(), there can be an extra skew. (Most OSes attempt to sync the TSCs of all cores, though.) If you're using rdtscdirectly, you probably want to pin your program or thread to a core, e.g. with taskset -c 0 ./myprogramon Linux.

也不保证所有内核的 TSC 同步。因此,如果您的线程迁移到 之间的另一个 CPU 内核__rdtsc(),则可能会出现额外的偏差。(不过,大多数操作系统尝试同步所有内核的 TSC。)如果您rdtsc直接使用,您可能希望将程序或线程固定到内核,例如taskset -c 0 ./myprogram在 Linux 上。



How good is the asm from using the intrinsic?

使用内在函数的 asm 有多好?

It's at least as good as anything you could do with inline asm.

它至少和你可以用内联 asm 做的任何事情一样好。

A non-inline version of it compiles MSVC for x86-64 like this:

它的非内联版本为 x86-64 编译 MSVC,如下所示:

unsigned __int64 readTSC(void) PROC                             ; readTSC
    rdtsc
    shl     rdx, 32                             ; 00000020H
    or      rax, rdx
    ret     0
  ; return in RAX

For 32-bit calling conventions that return 64-bit integers in edx:eax, it's just rdtsc/ret. Not that it matters, you always want this to inline.

对于在 中返回 64 位整数的 32 位调用约定edx:eax,它只是rdtsc/ ret。这并不重要,你总是希望它内联。

In a test caller that uses it twice and subtracts to time an interval:

在使用它两次并减去时间间隔的测试调用者中:

uint64_t time_something() {
    uint64_t start = readTSC();
    // even when empty, back-to-back __rdtsc() don't optimize away
    return readTSC() - start;
}

All 4 compilers make pretty similar code. This is GCC's 32-bit output:

所有 4 个编译器都生成非常相似的代码。这是 GCC 的 32 位输出:

# gcc8.2 -O3 -m32
time_something():
    push    ebx               # save a call-preserved reg: 32-bit only has 3 scratch regs
    rdtsc
    mov     ecx, eax
    mov     ebx, edx          # start in ebx:ecx
      # timed region (empty)

    rdtsc
    sub     eax, ecx
    sbb     edx, ebx          # edx:eax -= ebx:ecx

    pop     ebx
    ret                       # return value in edx:eax

This is MSVC's x86-64 output (with name-demangling applied). gcc/clang/ICC all emit identical code.

这是 MSVC 的 x86-64 输出(应用了名称拆分)。gcc/clang/ICC 都发出相同的代码。

# MSVC 19  2017  -Ox
unsigned __int64 time_something(void) PROC                            ; time_something
    rdtsc
    shl     rdx, 32                  ; high <<= 32
    or      rax, rdx
    mov     rcx, rax                 ; missed optimization: lea rcx, [rdx+rax]
                                     ; rcx = start
     ;; timed region (empty)

    rdtsc
    shl     rdx, 32
    or      rax, rdx                 ; rax = end

    sub     rax, rcx                 ; end -= start
    ret     0
unsigned __int64 time_something(void) ENDP                            ; time_something

All 4 compilers use or+movinstead of leato combine the low and high halves into a different register. I guess it's kind of a canned sequence that they fail to optimize.

所有 4 个编译器都使用or+mov而不是lea将低半和高半组合到不同的寄存器中。我想这是他们未能优化的一种固定序列。

But writing it in inline asm yourself is hardly better. You'd deprive the compiler of the opportunity to ignore the high 32 bits of the result in EDX, if you're timing such a short interval that you only keep a 32-bit result. Or if the compiler decides to store the start time to memory, it could just use two 32-bit stores instead of shift/or / mov. If 1 extra uop as part of your timing bothers you, you'd better write your whole microbenchmark in pure asm.

但是自己用 inline asm 写也好不到哪里去。如果您的时间间隔如此短以至于只保留 32 位结果,那么您将剥夺编译器忽略 EDX 中结果的高 32 位的机会。或者,如果编译器决定将开始时间存储到内存中,它可以只使用两个 32 位存储而不是 shift/或 /mov。如果 1 个额外的 uop 作为时间的一部分困扰着您,您最好用纯 asm 编写整个微基准测试。

回答by a3nm

On Linux with gcc, I use the following:

在带有 的 Linux 上gcc,我使用以下内容:

/* define this somewhere */
#ifdef __i386
__inline__ uint64_t rdtsc() {
  uint64_t x;
  __asm__ volatile ("rdtsc" : "=A" (x));
  return x;
}
#elif __amd64
__inline__ uint64_t rdtsc() {
  uint64_t a, d;
  __asm__ volatile ("rdtsc" : "=a" (a), "=d" (d));
  return (d<<32) | a;
}
#endif

/* now, in your function, do the following */
uint64_t t;
t = rdtsc();
// ... the stuff that you want to time ...
t = rdtsc() - t;
// t now contains the number of cycles elapsed