C++ 如何在 GCC x86 中使用 RDTSC 计算时钟周期?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/9887839/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
How to count clock cycles with RDTSC in GCC x86?
提问by Johan R?de
With Visual Studio I can read the clock cycle count from the processor as shown below. How do I do the same thing with GCC?
使用 Visual Studio,我可以从处理器读取时钟周期计数,如下所示。我如何用 GCC 做同样的事情?
#ifdef _MSC_VER // Compiler: Microsoft Visual Studio
#ifdef _M_IX86 // Processor: x86
inline uint64_t clockCycleCount()
{
uint64_t c;
__asm {
cpuid // serialize processor
rdtsc // read time stamp counter
mov dword ptr [c + 0], eax
mov dword ptr [c + 4], edx
}
return c;
}
#elif defined(_M_X64) // Processor: x64
extern "C" unsigned __int64 __rdtsc();
#pragma intrinsic(__rdtsc)
inline uint64_t clockCycleCount()
{
return __rdtsc();
}
#endif
#endif
回答by Evan Shaw
The other answers work, but you can avoid inline assembly by using GCC's __rdtsc
intrinsic, available by including x86intrin.h
.
其他答案有效,但您可以通过使用 GCC 的__rdtsc
内在函数来避免内联汇编,包括x86intrin.h
.
It is defined at: gcc/config/i386/ia32intrin.h
:
它定义于gcc/config/i386/ia32intrin.h
:
/* rdtsc */
extern __inline unsigned long long
__attribute__((__gnu_inline__, __always_inline__, __artificial__))
__rdtsc (void)
{
return __builtin_ia32_rdtsc ();
}
回答by Andrew Tomazos
On recent versions of Linux gettimeofday will incorporate nanosecond timings.
在最新版本的 Linux 上,gettimeofday 将包含纳秒计时。
If you really want to call RDTSC you can use the following inline assembly:
如果你真的想调用 RDTSC,你可以使用以下内联程序集:
http://www.mcs.anl.gov/~kazutomo/rdtsc.html
http://www.mcs.anl.gov/~kazutomo/rdtsc.html
#if defined(__i386__)
static __inline__ unsigned long long rdtsc(void)
{
unsigned long long int x;
__asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
return x;
}
#elif defined(__x86_64__)
static __inline__ unsigned long long rdtsc(void)
{
unsigned hi, lo;
__asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}
#endif
回答by Peter Cordes
Update: reposted and updated this answeron a more canonical question. I'll probably delete this at some point once we sort out which question to use as the duplicate target for closing all the similar rdtsc
questions.
更新:在一个更规范的问题上重新发布并更新了这个答案。一旦我们整理出哪个问题用作关闭所有类似rdtsc
问题的重复目标,我可能会在某个时候删除它。
You don't need and shouldn't use inline asm for this. There's no benefit; compilers have built-ins for rdtsc
and rdtscp
, and (at least these days) all define a __rdtsc
intrinsic if you include the right headers. https://gcc.gnu.org/wiki/DontUseInlineAsm
您不需要也不应该为此使用内联 asm。没有任何好处;编译器内置了rdtsc
and rdtscp
,并且(至少现在)__rdtsc
如果包含正确的头文件,它们都会定义一个内在函数。 https://gcc.gnu.org/wiki/DontUseInlineAsm
Unfortunately MSVC disagrees with everyone else about which header to use for non-SIMD intrinsics. (Intel's intriniscs guide says#include <immintrin.h>
for this, but with gcc and clang the non-SIMD intrinsics are mostly in x86intrin.h
.)
不幸的是,对于非 SIMD 内在函数使用哪个标头,MSVC 不同意其他所有人的意见。(英特尔的内部函数指南#include <immintrin.h>
对此进行了说明,但是对于 gcc 和 clang,非 SIMD 内部函数主要在x86intrin.h
.)
#ifdef _MSC_VER
#include <intrin.h>
#else
#include <x86intrin.h>
#endif
// optional wrapper if you don't want to just use __rdtsc() everywhere
inline
unsigned long long readTSC() {
// _mm_lfence(); // optionally wait for earlier insns to retire before reading the clock
return __rdtsc();
// _mm_lfence(); // optionally block later instructions until rdtsc retires
}
Compiles with all 4 of the major compilers: gcc/clang/ICC/MSVC, for 32 or 64-bit.See the results on the Godbolt compiler explorer.
与所有 4 个主要编译器一起编译:gcc/clang/ICC/MSVC,适用于 32 位或 64 位。在 Godbolt 编译器浏览器上查看结果。
For more about using lfence
to improve repeatability of rdtsc
, see @HadiBrais' answer on clflush to invalidate cache line via C function.
有关lfence
用于提高 的可重复性的更多信息rdtsc
,请参阅 @HadiBrais 关于clflush to invalidate cache line via C function的回答。
See also Is LFENCE serializing on AMD processors?(TL:DR yes with Spectre mitigation enabled, otherwise kernels leave the relevant MSR unset.)
另请参阅LFENCE 是否在 AMD 处理器上进行序列化?(TL:DR 是启用 Spectre 缓解,否则内核会保留相关的 MSR 未设置。)
rdtsc
counts referencecycles, not CPU core clock cycles
rdtsc
计数参考周期,而不是 CPU 内核时钟周期
It counts at a fixed frequency regardless of turbo / power-saving, so if you want uops-per-clock analysis, use performance counters. rdtsc
is exactly correlated with wall-clock time (except for system clock adjustments, so it's basically steady_clock
). It ticks at the CPU's rated frequency, i.e. the advertised sticker frequency.
无论涡轮增压/节能如何,它都以固定频率计数,因此如果您想要每时钟 uops 分析,请使用性能计数器。 rdtsc
与挂钟时间完全相关(系统时钟调整除外,因此基本上是steady_clock
)。它在 CPU 的额定频率下滴答作响,即广告标贴频率。
If you use it for microbenchmarking, include a warm-up period first to make sure your CPU is already at max clock speed before you start timing. Or better, use a library that gives you access to hardware performance counters, or a trick like perf stat for part of programif your timed region is long enough that you can attach a perf stat -p PID
. You usually will still want to avoid CPU frequency shifts during your microbenchmark, though.
如果您将它用于微基准测试,请先包括一个预热期,以确保您的 CPU 在开始计时之前已经处于最大时钟速度。或者更好的,使用一个库,使您可以访问硬件性能计数器,或一招类似PERF的统计为计划的一部分,如果你的计时区是足够长的时间,你可以附上perf stat -p PID
。不过,您通常仍希望在微基准测试期间避免 CPU 频率偏移。
- std::chrono::clock, hardware clock and cycle count
- Getting cpu cycles using RDTSC - why does the value of RDTSC always increase?
- Lost Cycles on Intel? An inconsistency between rdtsc and CPU_CLK_UNHALTED.REF_TSC
- std::chrono::clock,硬件时钟和周期计数
- 使用 RDTSC 获取 cpu 周期 - 为什么 RDTSC 的值总是增加?
- 英特尔的周期丢失?rdtsc 和 CPU_CLK_UNHALTED.REF_TSC 之间的不一致
It's also not guaranteed that the TSCs of all cores are in sync. So if your thread migrates to another CPU core between __rdtsc()
, there can be an extra skew. (Most OSes attempt to sync the TSCs of all cores, though.) If you're using rdtsc
directly, you probably want to pin your program or thread to a core, e.g. with taskset -c 0 ./myprogram
on Linux.
也不保证所有内核的 TSC 同步。因此,如果您的线程迁移到 之间的另一个 CPU 内核__rdtsc()
,则可能会出现额外的偏差。(不过,大多数操作系统尝试同步所有内核的 TSC。)如果您rdtsc
直接使用,您可能希望将程序或线程固定到内核,例如taskset -c 0 ./myprogram
在 Linux 上。
How good is the asm from using the intrinsic?
使用内在函数的 asm 有多好?
It's at least as good as anything you could do with inline asm.
它至少和你可以用内联 asm 做的任何事情一样好。
A non-inline version of it compiles MSVC for x86-64 like this:
它的非内联版本为 x86-64 编译 MSVC,如下所示:
unsigned __int64 readTSC(void) PROC ; readTSC
rdtsc
shl rdx, 32 ; 00000020H
or rax, rdx
ret 0
; return in RAX
For 32-bit calling conventions that return 64-bit integers in edx:eax
, it's just rdtsc
/ret
. Not that it matters, you always want this to inline.
对于在 中返回 64 位整数的 32 位调用约定edx:eax
,它只是rdtsc
/ ret
。这并不重要,你总是希望它内联。
In a test caller that uses it twice and subtracts to time an interval:
在使用它两次并减去时间间隔的测试调用者中:
uint64_t time_something() {
uint64_t start = readTSC();
// even when empty, back-to-back __rdtsc() don't optimize away
return readTSC() - start;
}
All 4 compilers make pretty similar code. This is GCC's 32-bit output:
所有 4 个编译器都生成非常相似的代码。这是 GCC 的 32 位输出:
# gcc8.2 -O3 -m32
time_something():
push ebx # save a call-preserved reg: 32-bit only has 3 scratch regs
rdtsc
mov ecx, eax
mov ebx, edx # start in ebx:ecx
# timed region (empty)
rdtsc
sub eax, ecx
sbb edx, ebx # edx:eax -= ebx:ecx
pop ebx
ret # return value in edx:eax
This is MSVC's x86-64 output (with name-demangling applied). gcc/clang/ICC all emit identical code.
这是 MSVC 的 x86-64 输出(应用了名称拆分)。gcc/clang/ICC 都发出相同的代码。
# MSVC 19 2017 -Ox
unsigned __int64 time_something(void) PROC ; time_something
rdtsc
shl rdx, 32 ; high <<= 32
or rax, rdx
mov rcx, rax ; missed optimization: lea rcx, [rdx+rax]
; rcx = start
;; timed region (empty)
rdtsc
shl rdx, 32
or rax, rdx ; rax = end
sub rax, rcx ; end -= start
ret 0
unsigned __int64 time_something(void) ENDP ; time_something
All 4 compilers use or
+mov
instead of lea
to combine the low and high halves into a different register. I guess it's kind of a canned sequence that they fail to optimize.
所有 4 个编译器都使用or
+mov
而不是lea
将低半和高半组合到不同的寄存器中。我想这是他们未能优化的一种固定序列。
But writing it in inline asm yourself is hardly better. You'd deprive the compiler of the opportunity to ignore the high 32 bits of the result in EDX, if you're timing such a short interval that you only keep a 32-bit result. Or if the compiler decides to store the start time to memory, it could just use two 32-bit stores instead of shift/or / mov. If 1 extra uop as part of your timing bothers you, you'd better write your whole microbenchmark in pure asm.
但是自己用 inline asm 写也好不到哪里去。如果您的时间间隔如此短以至于只保留 32 位结果,那么您将剥夺编译器忽略 EDX 中结果的高 32 位的机会。或者,如果编译器决定将开始时间存储到内存中,它可以只使用两个 32 位存储而不是 shift/或 /mov。如果 1 个额外的 uop 作为时间的一部分困扰着您,您最好用纯 asm 编写整个微基准测试。
回答by a3nm
On Linux with gcc
, I use the following:
在带有 的 Linux 上gcc
,我使用以下内容:
/* define this somewhere */
#ifdef __i386
__inline__ uint64_t rdtsc() {
uint64_t x;
__asm__ volatile ("rdtsc" : "=A" (x));
return x;
}
#elif __amd64
__inline__ uint64_t rdtsc() {
uint64_t a, d;
__asm__ volatile ("rdtsc" : "=a" (a), "=d" (d));
return (d<<32) | a;
}
#endif
/* now, in your function, do the following */
uint64_t t;
t = rdtsc();
// ... the stuff that you want to time ...
t = rdtsc() - t;
// t now contains the number of cycles elapsed