如何从 C++ 获取 x86_64 中的 CPU 周期数?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/13772567/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
How to get the CPU cycle count in x86_64 from C++?
提问by user997112
I saw this post on SO which contains C code to get the latest CPU Cycle count:
我在 SO 上看到了这篇文章,其中包含用于获取最新 CPU 周期计数的 C 代码:
CPU Cycle count based profiling in C/C++ Linux x86_64
C/C++ Linux x86_64 中基于 CPU 周期计数的分析
Is there a way I can use this code in C++ (windows and linux solutions welcome)? Although written in C (and C being a subset of C++) I am not too certain if this code would work in a C++ project and if not, how to translate it?
有没有办法在 C++ 中使用此代码(欢迎使用 Windows 和 linux 解决方案)?尽管用 C 编写(并且 C 是 C++ 的子集),但我不太确定这段代码是否可以在 C++ 项目中运行,如果不能,如何翻译它?
I am using x86-64
我正在使用 x86-64
EDIT2:
编辑2:
Found this function but cannot get VS2010 to recognise the assembler. Do I need to include anything? (I believe I have to swap uint64_t
to long long
for windows....?)
找到了这个函数但是无法让VS2010识别汇编程序。我需要包括任何东西吗?(我相信我有交换uint64_t
到long long
了窗户......?)
static inline uint64_t get_cycles()
{
uint64_t t;
__asm volatile ("rdtsc" : "=A"(t));
return t;
}
EDIT3:
编辑3:
From above code I get the error:
从上面的代码我得到错误:
"error C2400: inline assembler syntax error in 'opcode'; found 'data type'"
“错误 C2400:‘操作码’中的内联汇编语法错误;找到‘数据类型’”
Could someone please help?
有人可以帮忙吗?
回答by Mysticial
Starting from GCC 4.5 and later, the __rdtsc()
intrinsicis now supported by both MSVC and GCC.
从GCC 4.5开始,后来,在__rdtsc()
固有现在由两个MSVC和海湾合作委员会的支持。
But the include that's needed is different:
但是需要的包含是不同的:
#ifdef _WIN32
#include <intrin.h>
#else
#include <x86intrin.h>
#endif
Here's the original answer before GCC 4.5.
这是 GCC 4.5 之前的原始答案。
Pulled directly out of one of my projects:
直接从我的一个项目中拉出来:
#include <stdint.h>
// Windows
#ifdef _WIN32
#include <intrin.h>
uint64_t rdtsc(){
return __rdtsc();
}
// Linux/GCC
#else
uint64_t rdtsc(){
unsigned int lo,hi;
__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
return ((uint64_t)hi << 32) | lo;
}
#endif
This GNU C Extended asmtells the compiler:
这个GNU C Extended asm告诉编译器:
volatile
: the outputs aren't a pure function of the inputs (so it has to re-run every time, not reuse an old result)."=a"(lo)
and"=d"(hi)
: the output operands are fixed registers: EAX and EDX. (x86 machine constraints). The x86rdtsc
instruction puts its 64-bit result in EDX:EAX, so letting the compiler pick an output with"=r"
wouldn't work: there's no way to ask the CPU for the result to go anywhere else.((uint64_t)hi << 32) | lo
- zero-extend both 32-bit halves to 64-bit (because lo and hi areunsigned
), and logically shift + OR them together into a single 64-bit C variable. In 32-bit code, this is just a reinterpretation; the values still just stay in a pair of 32-bit registers. In 64-bit code you typically get an actual shift + OR asm instructions, unless the high half optimizes away.
volatile
:输出不是输入的纯函数(因此每次都必须重新运行,而不是重用旧结果)。"=a"(lo)
和"=d"(hi)
:输出操作数是固定寄存器:EAX 和 EDX。(x86 机器限制)。x86rdtsc
指令将其 64 位结果放在 EDX:EAX 中,因此让编译器选择输出"=r"
是行不通的:无法要求 CPU 将结果传送到其他任何地方。((uint64_t)hi << 32) | lo
- 将两个 32 位的一半都零扩展到 64 位(因为 lo 和 hi 是unsigned
),并在逻辑上将它们 + OR 一起移到一个 64 位 C 变量中。在 32 位代码中,这只是重新解释;这些值仍然只保留在一对 32 位寄存器中。在 64 位代码中,您通常会得到一个实际的 shift + OR asm 指令,除非高半部分被优化掉。
(editor's note: this could probably be more efficient if you used unsigned long
instead of unsigned int
. Then the compiler would know that lo
was already zero-extended into RAX. It wouldn't know that the upper half was zero, so |
and +
are equivalent if it wanted to merge a different way. The intrinsic should in theory give you the best of both worlds as far as letting the optimizer do a good job.)
(编者注:这很可能是更有效的,如果你使用unsigned long
的不是unsigned int
那么编译器会知道。lo
已经零扩展到RAX它不会知道,上半部分是零,所以,|
和+
是等价的,如果它想以不同的方式合并。理论上,内在应该为您提供两全其美的方法,让优化器做得很好。)
https://gcc.gnu.org/wiki/DontUseInlineAsmif you can avoid it. But hopefully this section is useful if you need to understand old code that uses inline asm so you can rewrite it with intrinsics. See also https://stackoverflow.com/tags/inline-assembly/info
https://gcc.gnu.org/wiki/DontUseInlineAsm如果可以避免的话。但是如果您需要了解使用内联 asm 的旧代码,那么希望本节很有用,以便您可以使用内在函数重写它。另请参阅https://stackoverflow.com/tags/inline-assembly/info
回答by Peter Cordes
Your inline asm is broken for x86-64. "=A"
in 64-bit mode lets the compiler pick eitherRAX or RDX, not EDX:EAX. See this Q&A for more
对于 x86-64,您的内联汇编已损坏。 "=A"
在64位模式下让编译器挑要么RAX或RDX,不EDX:EAX。查看此问答了解更多
You don't need inline asm for this. There's no benefit; compilers have built-ins for rdtsc
and rdtscp
, and (at least these days) all define a __rdtsc
intrinsic if you include the right headers. But unlike almost all other cases (https://gcc.gnu.org/wiki/DontUseInlineAsm), there's no serious downside to asm, as long as you're using a good and safe implementation like @Mysticial's.
你不需要内联 asm 这个。没有任何好处;编译器内置了rdtsc
and rdtscp
,并且(至少现在)__rdtsc
如果包含正确的头文件,它们都会定义一个内在函数。但与几乎所有其他情况(https://gcc.gnu.org/wiki/DontUseInlineAsm)不同,只要您使用像@Mysticial 的.
(One minor advantage to asm is if you want to time a small interval that's certainly going to be less than 2^32 counts, you can ignore the high half of the result. Compilers coulddo that optimization for you with a uint32_t time_low = __rdtsc()
intrinsic, but in practice they sometimes still waste instructions doing shift / OR.)
(asm 的一个小优势是,如果您想对肯定会小于 2^32 计数的小间隔进行计时,则可以忽略结果的高半部分。编译器可以使用uint32_t time_low = __rdtsc()
内在函数为您进行优化,但在练习他们有时仍然会浪费指令做轮班/或。)
Unfortunately MSVC disagrees with everyone else about which header to use for non-SIMD intrinsics.
不幸的是,对于非 SIMD 内在函数使用哪个标头,MSVC 不同意其他所有人的意见。
Intel's intriniscs guidesays _rdtsc
(with one underscore) is in <immintrin.h>
, but that doesn't work on gcc and clang. They only define SIMD intrinsics in <immintrin.h>
, so we're stuck with <intrin.h>
(MSVC) vs. <x86intrin.h>
(everything else, including recent ICC). For compat with MSVC, and Intel's documentation, gcc and clang define both the one-underscore and two-underscore versions of the function.
英特尔的 intriniscs 指南说_rdtsc
(带有一个下划线)是 in <immintrin.h>
,但这不适用于 gcc 和 clang。他们只在 中定义 SIMD 内在函数<immintrin.h>
,因此我们坚持使用<intrin.h>
(MSVC) 与<x86intrin.h>
(其他所有内容,包括最近的 ICC)。为了与 MSVC 和 Intel 的文档兼容,gcc 和 clang 定义了该函数的单下划线和双下划线版本。
Fun fact: the double-underscore version returns an unsigned 64-bit integer, while Intel documents _rdtsc()
as returning (signed) __int64
.
有趣的事实:双下划线版本返回一个无符号的 64 位整数,而英特尔记录_rdtsc()
为返回 (signed) __int64
。
// valid C99 and C++
#include <stdint.h> // <cstdint> is preferred in C++, but stdint.h works.
#ifdef _MSC_VER
# include <intrin.h>
#else
# include <x86intrin.h>
#endif
// optional wrapper if you don't want to just use __rdtsc() everywhere
inline
uint64_t readTSC() {
// _mm_lfence(); // optionally wait for earlier insns to retire before reading the clock
uint64_t tsc = __rdtsc();
// _mm_lfence(); // optionally block later instructions until rdtsc retires
return tsc;
}
// requires a Nehalem or newer CPU. Not Core2 or earlier. IDK when AMD added it.
inline
uint64_t readTSCp() {
unsigned dummy;
return __rdtscp(&dummy); // waits for earlier insns to retire, but allows later to start
}
Compiles with all 4 of the major compilers: gcc/clang/ICC/MSVC, for 32 or 64-bit.See the results on the Godbolt compiler explorer, including a couple test callers.
与所有 4 个主要编译器一起编译:gcc/clang/ICC/MSVC,适用于 32 位或 64 位。查看Godbolt 编译器资源管理器上的结果,包括几个测试调用者。
These intrinsics were new in gcc4.5 (from 2010) and clang3.5 (from 2014). gcc4.4 and clang 3.4 on Godbolt don't compile this, but gcc4.5.3 (April 2011) does. You might see inline asm in old code, but you can and should replace it with __rdtsc()
. Compilers over a decade old usually make slower code than gcc6, gcc7, or gcc8, and have less useful error messages.
这些内在函数是 gcc4.5(从 2010 年开始)和 clang3.5(从 2014 年开始)中的新内容。Godbolt 上的 gcc4.4 和 clang 3.4 不编译这个,但 gcc4.5.3(2011 年 4 月)可以。您可能会在旧代码中看到内联 asm,但您可以并且应该将其替换为__rdtsc()
. 十多年以前的编译器生成的代码通常比 gcc6、gcc7 或 gcc8 慢,而且错误消息的用处也较少。
The MSVC intrinsic has (I think) existed far longer, because MSVC never supported inline asm for x86-64. ICC13 has __rdtsc
in immintrin.h
, but doesn't have an x86intrin.h
at all. More recent ICC have x86intrin.h
, at least the way Godbolt installs them for Linux they do.
MSVC 内在的(我认为)存在的时间要长得多,因为 MSVC 从未支持 x86-64 的内联 asm。ICC13 有__rdtsc
in immintrin.h
,但根本没有x86intrin.h
。最近的 ICC 有x86intrin.h
,至少是 Godbolt 为 Linux 安装它们的方式。
You might want to define them as signed long long
, especially if you want to subtract them and convert to float. int64_t
-> float/double is more efficient than uint64_t
on x86 without AVX512. Also, small negative results could be possible because of CPU migrations if TSCs aren't perfectly synced, and that probably makes more sense than huge unsigned numbers.
您可能希望将它们定义为 signedlong long
,特别是如果您想减去它们并转换为浮点数。 int64_t
-> float/double 比uint64_t
没有 AVX512 的 x86更有效。此外,如果 TSC 没有完全同步,可能会因为 CPU 迁移而导致小的负面结果,这可能比巨大的无符号数字更有意义。
BTW, clang also has a portable __builtin_readcyclecounter()
which works on any architecture. (Always returns zero on architectures without a cycle counter.) See the clang/LLVM language-extension docs
顺便说一句,clang 也有一个__builtin_readcyclecounter()
适用于任何架构的便携式。(在没有循环计数器的架构上总是返回零。)请参阅clang/LLVM 语言扩展文档
For more about using lfence
(or cpuid
) to improve repeatability of rdtsc
and control exactly which instructions are / aren't in the timed interval by blocking out-of-order execution, see @HadiBrais' answer on clflush to invalidate cache line via C functionand the comments for an example of the difference it makes.
有关使用lfence
(或cpuid
)rdtsc
通过阻止乱序执行来提高重复性和精确控制哪些指令在/不在定时间隔内的更多信息,请参阅@HadiBrais 对clflush的回答以通过 C 函数和评论其差异的示例。
See also Is LFENCE serializing on AMD processors?(TL:DR yes with Spectre mitigation enabled, otherwise kernels leave the relevant MSR unset so you should use cpuid
to serialize.) It's always been defined as partially-serializing on Intel.
另请参阅LFENCE 是否在 AMD 处理器上进行序列化?(TL:DR 是启用 Spectre 缓解,否则内核会保留相关的 MSR 未设置,因此您应该用于cpuid
序列化。)它在 Intel 上始终被定义为部分序列化。
How to Benchmark Code Execution Times on Intel? IA-32 and IA-64 Instruction Set Architectures, an Intel white-paper from 2010.
如何在英特尔上对代码执行时间进行基准测试?IA-32 和 IA-64 指令集架构,英特尔 2010 年的白皮书。
rdtsc
counts referencecycles, not CPU core clock cycles
rdtsc
计数参考周期,而不是 CPU 内核时钟周期
It counts at a fixed frequency regardless of turbo / power-saving, so if you want uops-per-clock analysis, use performance counters. rdtsc
is exactly correlated with wall-clock time (not counting system clock adjustments, so it's a perfect time source for steady_clock
).
无论涡轮增压/节能如何,它都以固定频率计数,因此如果您想要每时钟 uops 分析,请使用性能计数器。 rdtsc
与挂钟时间完全相关(不计算系统时钟调整,因此它是 的完美时间源steady_clock
)。
The TSC frequency used to always be equal to the CPU's rated frequency, i.e. the advertised sticker frequency. In some CPUs it's merely close, e.g. 2592 MHz on an i7-6700HQ 2.6 GHz Skylake, or 4008MHz on a 4000MHz i7-6700k. On even newer CPUs like i5-1035 Ice Lake, TSC = 1.5 GHz, base = 1.1 GHz, so disabling turbo won't even approximately work for TSC = core cycles on those CPUs.
TSC 频率过去总是等于 CPU 的额定频率,即广告标贴频率。在某些 CPU 中,它只是接近,例如 i7-6700HQ 2.6 GHz Skylake 上的 2592 MHz,或 4000MHz i7-6700k 上的 4008MHz。在 i5-1035 Ice Lake 等更新的 CPU 上,TSC = 1.5 GHz,base = 1.1 GHz,因此禁用 Turbo 甚至不适用于 TSC = 核心周期在这些 CPU 上。
If you use it for microbenchmarking, include a warm-up period first to make sure your CPU is already at max clock speed before you start timing. (And optionally disable turbo and tell your OS to prefer max clock speed to avoid CPU frequency shifts during your microbenchmark).
Microbenchmarking is hard: see Idiomatic way of performance evaluation?for other pitfalls.
如果您将它用于微基准测试,请先包括一个预热期,以确保您的 CPU 在开始计时之前已经处于最大时钟速度。(并且可以选择禁用 Turbo 并告诉您的操作系统更喜欢最大时钟速度以避免在您的微基准测试期间 CPU 频率偏移)。
微基准测试很难:看到性能评估的惯用方法吗?对于其他陷阱。
Instead of TSC at all, you can use a library that gives you access to hardware performance counters. The complicated but low-overhead way is to program perf counters and use rdmsr
in user-space, or simpler ways include tricks like perf stat for part of programif your timed region is long enough that you can attach a perf stat -p PID
.
您可以使用一个可以访问硬件性能计数器的库来代替 TSC。复杂的,但低开销的方法是程序PERF柜台和使用rdmsr
用户空间,或者更简单的方法包括像花样PERF的统计为计划的一部分,如果你的计时区是足够长的时间,你可以附上perf stat -p PID
。
You usually will still want to keep the CPU clock fixed for microbenchmarks, though, unless you want to see how different loads will get Skylake to clock down when memory-bound or whatever. (Note that memory bandwidth / latency is mostly fixed, using a different clock than the cores. At idle clock speed, an L2 or L3 cache miss takes many fewer core clock cycles.)
但是,您通常仍希望为微基准测试保持 CPU 时钟固定不变,除非您想了解不同的负载将如何在内存受限或其他情况下使 Skylake 时钟下降。(请注意,内存带宽/延迟大多是固定的,使用与内核不同的时钟。在空闲时钟速度下,L2 或 L3 缓存未命中所需的内核时钟周期要少得多。)
- Negative clock cycle measurements with back-to-back rdtsc?the history of RDTSC: originally CPUs didn't do power-saving, so the TSC was both real-time and core clocks. Then it evolved through various barely-useful steps into its current form of a useful low-overhead timesource decoupled from core clock cycles (
constant_tsc
), which doesn't stop when the clock halts (nonstop_tsc
). Also some tips, e.g. don't take the mean time, take the median (there will be very high outliers). - std::chrono::clock, hardware clock and cycle count
- Getting cpu cycles using RDTSC - why does the value of RDTSC always increase?
- Lost Cycles on Intel? An inconsistency between rdtsc and CPU_CLK_UNHALTED.REF_TSC
- measuring code execution times in C using RDTSC instructionlists some gotchas, including SMI (system-management interrupts) which you can't avoid even in kernel mode with
cli
), and virtualization ofrdtsc
under a VM. And of course basic stuff like regular interrupts being possible, so repeat your timing many times and throw away outliers. Determine TSC frequency on Linux. Programatically querying the TSC frequency is hard and maybe not possible, especially in user-space, or may give a worse result than calibrating it. Calibrating it using another known time-source takes time. See that question for more about how hard it is to convert TSC to nanoseconds (and that it would be nice if you could ask the OS what the conversion ratio is, because the OS already did it at bootup).
If you're microbenchmarking with RDTSC for tuning purposes, your best bet is to just use ticks and skip even trying to convert to nanoseconds.Otherwise, use a high-resolution library time function like
std::chrono
orclock_gettime
. See faster equivalent of gettimeofdayfor some discussion / comparison of timestamp functions, or reading a shared timestamp from memory to avoidrdtsc
entirely if your precision requirement is low enough for a timer interrupt or thread to update it.See also Calculate system time using rdtscabout finding the crystal frequency and multiplier.
- 背靠背 rdtsc 的负时钟周期测量?RDTSC的历史:原来CPU是不做省电的,所以TSC既是实时时钟又是核心时钟。然后它通过各种几乎没有用的步骤演变成当前形式的有用的低开销时间源,与核心时钟周期分离 (
constant_tsc
),当时钟停止时它不会停止 (nonstop_tsc
)。还有一些提示,例如不要取平均时间,取中位数(会有非常高的异常值)。 - std::chrono::clock,硬件时钟和周期计数
- 使用 RDTSC 获取 cpu 周期 - 为什么 RDTSC 的值总是增加?
- 英特尔的周期丢失?rdtsc 和 CPU_CLK_UNHALTED.REF_TSC 之间的不一致
- 使用 RDTSC 指令测量 C 中的代码执行时间列出了一些问题,包括即使在内核模式下也无法避免的 SMI(系统管理中断
cli
),以及虚拟rdtsc
机下的虚拟化。当然,像常规中断这样的基本东西是可能的,所以多次重复你的计时并扔掉异常值。 确定 Linux 上的 TSC 频率。 以编程方式查询 TSC 频率很困难,而且可能是不可能的,尤其是在用户空间中,或者可能给出比校准更糟糕的结果。使用另一个已知时间源对其进行校准需要时间。有关将 TSC 转换为纳秒的难度的更多信息,请参阅该问题(如果您可以询问操作系统转换率是多少,因为操作系统已经在启动时进行了转换)。
如果您使用 RDTSC 进行微基准测试以进行调整,那么最好的办法是只使用刻度并跳过甚至尝试转换为纳秒。否则,请使用高分辨率库时间函数,如
std::chrono
或clock_gettime
。有关时间戳函数的一些讨论/比较,请参阅更快的 gettimeofday 等效项,或者从内存中读取共享时间戳以rdtsc
完全避免如果您的精度要求足够低以供定时器中断或线程更新它。另请参阅使用 rdtsc 计算系统时间,了解有关查找晶体频率和乘数的信息。
CPU TSC fetch operation especially in multicore-multi-processor environmentsays that Nehalem and newer have the TSC synced and locked together for all cores in a package(along with the invariant = constant and nonstop TSC feature). See @amdn's answer there for some good info about multi-socket sync.
CPU TSC 获取操作,尤其是在多核-多处理器环境中,表示Nehalem 和更新版本将 TSC 同步并锁定到一个包中的所有内核(以及不变 = 恒定和不间断 TSC 功能)。有关多套接字同步的一些好信息,请参阅@amdn 的回答。
(And apparently usually reliable even for modern multi-socket systems as long as they have that feature, see @amdn's answer on the linked question, and more details below.)
(而且显然即使对于现代多路系统来说通常也是可靠的,只要它们具有该功能,请参阅@amdn 对链接问题的回答,以及下面的更多详细信息。)
CPUID features relevant to the TSC
与 TSC 相关的 CPUID 功能
Using the names that Linux /proc/cpuinfo
uses for the CPU features, and other aliases for the same feature that you'll also find.
使用Linux/proc/cpuinfo
用于 CPU features的名称,以及您还会发现的相同功能的其他别名。
tsc
- the TSC exists andrdtsc
is supported. Baseline for x86-64.rdtscp
-rdtscp
is supported.tsc_deadline_timer
CPUID.01H:ECX.TSC_Deadline[bit 24] = 1
- local APIC can be programmed to fire an interrupt when the TSC reaches a value you put inIA32_TSC_DEADLINE
. Enables "tickless" kernels, I think, sleeping until the next thing that's supposed to happen.constant_tsc
: IntelCPUID.80000007H:EDX[8]
aka invariant TSC: The TSC ticks at constant frequency regardless of turbo / idle changes in core clock speed.Intel since at least Core 2, maybe earlier. Without this, RDTSC doescount core clock cycles.nonstop_tsc
: The TSC keeps ticking even in deep sleep states like ACPI C6 where the core is mostly powered down in a low-power state until the next interrupt.No separate CPUID feature bit; on Intel and AMD the same invariant TSC CPUID bit implies both
constant_tsc
andnonstop_tsc
features. See Linux's x86/kernel/cpu/intel.c detection code, andamd.c
was similar. I didn't check the Linux's.c
files for other vendors, maybe there are some where it has to unset nonstop. Some CPUs Saltwell/Silvermont/Airmont even keep TSC ticking in ACPI S3 full-system sleep:nonstop_tsc_s3
.I searched but didn't find anywhere that Linux unsets the internal
X86_FEATURE_NONSTOP_TSC
feature bit so IDK if there are any CPUs with one but not the other, but presumably they are separate features for a reason. I think older Linux kernels didn't have a separate name, justconstant_tsc
.tsc_adjust
:CPUID.(EAX=07H, ECX=0H):EBX.TSC_ADJUST (bit 1)
TheIA32_TSC_ADJUST
MSR is available, allowing OSes to set an offset that's added to the TSC whenrdtsc
orrdtscp
reads it. This allows effectively changing the TSC on some/all cores without desyncing it across logical cores. (Which would happen if software set the TSC to a new absolute value on each core; it's very hard to get the relevant WRMSR instruction executed at the same cycle on every core.)
tsc
- TSC 存在并rdtsc
受支持。x86-64 的基线。rdtscp
-rdtscp
支持。tsc_deadline_timer
CPUID.01H:ECX.TSC_Deadline[bit 24] = 1
- 本地 APIC 可以编程为当 TSC 达到您输入的值时触发中断IA32_TSC_DEADLINE
。启用“无滴答”内核,我认为,休眠直到应该发生的下一件事情。constant_tsc
:英特尔CPUID.80000007H:EDX[8]
又名不变 TSC:TSC 以恒定频率滴答作响,而不管核心时钟速度的 turbo/idle 变化如何。英特尔至少从 Core 2 开始,也许更早。没有这个,RDTSC会计算核心时钟周期。nonstop_tsc
:即使在像 ACPI C6 这样的深度睡眠状态下,TSC 也会保持滴答作响,在这种状态下,内核大部分时间处于低功耗状态,直到下一个中断。没有单独的 CPUID 特征位;在 Intel 和 AMD 上,相同的不变 TSC CPUID 位暗示了两者
constant_tsc
和nonstop_tsc
特性。参见Linux 的 x86/kernel/cpu/intel.c 检测代码,与此amd.c
类似。我没有检查.c
其他供应商的 Linux文件,也许有些地方必须不停地取消设置。一些 CPU Saltwell/Silvermont/Airmont 甚至在 ACPI S3 全系统睡眠中保持 TSC 滴答作响:nonstop_tsc_s3
.我搜索了但没有找到任何地方 Linux 取消设置内部
X86_FEATURE_NONSTOP_TSC
功能位,所以 IDK 如果有任何 CPU 有一个而不是另一个,但大概是出于某种原因它们是单独的功能。我认为较旧的 Linux 内核没有单独的名称,只有constant_tsc
.tsc_adjust
:CPUID.(EAX=07H, ECX=0H):EBX.TSC_ADJUST (bit 1)
该IA32_TSC_ADJUST
MSR是可用的,允许操作系统设置一个偏移量的增加时,TSCrdtsc
或rdtscp
读取它。这允许有效地更改某些/所有内核上的 TSC,而无需跨逻辑内核对其进行去同步。(如果软件在每个内核上将 TSC 设置为新的绝对值,就会发生这种情况;很难在每个内核上以相同的周期执行相关的 WRMSR 指令。)
constant_tsc
and nonstop_tsc
together make the TSC usable as a timesource for things like clock_gettime
in user-space. (But OSes like Linux only use RDTSC to interpolate between ticks of a slower clock maintained with NTP, updating the scale / offset factors in timer interrupts. See On a cpu with constant_tsc and nonstop_tsc, why does my time drift?) On even older CPUs that don't support deep sleep states or frequency scaling, TSC as a timesource may still be usable
constant_tsc
并nonstop_tsc
一起使 TSC 可用作clock_gettime
用户空间等事物的时间源。(但像 Linux 这样的操作系统只使用 RDTSC 在用 NTP 维护的较慢时钟的滴答声之间进行插值,更新定时器中断中的比例/偏移因子。参见在带有 constant_tsc 和 nonstop_tsc 的 cpu 上,为什么我的时间会漂移?)在更旧的 CPU 上不支持深度睡眠状态或频率缩放,TSC 作为时间源可能仍然可用
The comments in the Linux source codealso indicate that constant_tsc
/ nonstop_tsc
features (on Intel) implies "It is also reliable across cores and sockets. (but not across cabinets - we turn it off in that case explicitly.)"
Linux 源代码中的注释还表明constant_tsc
/nonstop_tsc
特性(在英特尔上)暗示“它在内核和套接字之间也是可靠的。(但不是跨机柜 - 在这种情况下我们明确地将其关闭。)”
The "across sockets" part is not accurate. In general, an invariant TSC only guarantees that the TSC is synchronized between cores within the same socket. On an Intel forum thread, Martin Dixon (Intel) points out that TSC invariance does not imply cross-socket synchronization. That requires the platform vendor to distribute RESET synchronously to all sockets.Apparently platform vendors doin practice do that, given the above Linux kernel comment. Answers on CPU TSC fetch operation especially in multicore-multi-processor environmentalso agree that all sockets on a single motherboard should start out in sync.
“跨插座”部分不准确。通常,不变 TSC 仅保证 TSC 在同一插槽内的内核之间同步。在英特尔论坛主题上,Martin Dixon(英特尔)指出TSC 不变性并不意味着跨套接字同步。这要求平台供应商将 RESET 同步分发到所有套接字。显然,平台厂商都在实践中做到这一点,在上述的Linux内核评论。关于CPU TSC 获取操作的答案,尤其是在多核多处理器环境中,也同意单个主板上的所有插槽都应该同步开始。
On a multi-socket shared memory system, there is no direct way to check whether the TSCs in all the cores are synced. The Linux kernel, by default performs boot-time and run-time checks to make sure that TSC can be used as a clock source. These checks involve determining whether the TSC is synced. The output of the command dmesg | grep 'clocksource'
would tell you whether the kernel is using TSC as the clock source, which would only happen if the checks have passed. But even then, this would not be definitive proof that the TSC is synced across all sockets of the system.The kernel paramter tsc=reliable
can be used to tell the kernel that it can blindly use the TSC as the clock source without doing any checks.
在多路共享内存系统上,没有直接的方法来检查所有内核中的 TSC 是否同步。Linux 内核默认执行启动时和运行时检查,以确保 TSC 可用作时钟源。这些检查涉及确定 TSC 是否同步。命令的输出dmesg | grep 'clocksource'
会告诉您内核是否使用 TSC 作为时钟源,这只有在检查通过时才会发生。但即便如此,这也不是 TSC 在系统的所有套接字上同步的明确证据。内核参数tsc=reliable
可以用来告诉内核它可以不做任何检查而盲目地使用TSC作为时钟源。
There are two cases where cross-socket TSCs are commonly NOT in sync: (1) hotplugging a CPU, and (2) when the sockets are spread out across different boards connected by extended node controllers.
有两种情况下,跨插槽 TSC 通常不同步:(1) 热插拔 CPU,以及 (2) 当插槽分布在由扩展节点控制器连接的不同板上时。
An OS or hypervisor that changes the TSC directly instead of using the TSC_ADJUST offset can de-sync them, so in user-space it might not always be safe to assume that CPU migrations won't leave you reading a different clock. (This is why rdtscp
produces a core-ID as an extra output, so you can detect when start/end times come from different clocks. It might have been introduced before the invariant TSC feature, or maybe they just wanted to account for every possibility.)
直接更改 TSC 而不是使用 TSC_ADJUST 偏移量的操作系统或管理程序可以取消同步它们,因此在用户空间中,假设 CPU 迁移不会让您读取不同的时钟可能并不总是安全的。(这就是为什么rdtscp
产生一个 core-ID 作为额外输出,这样你就可以检测何时开始/结束时间来自不同的时钟。它可能是在不变 TSC 功能之前引入的,或者他们只是想考虑每一种可能性。 )
If you're using rdtsc
directly, you may want to pin your program or thread to a core, e.g. with taskset -c 0 ./myprogram
on Linux. Whether you need it for the TSC or not, CPU migration will normally lead to a lot of cache misses and mess up your test anyway, as well as taking extra time. (Although so will an interrupt).
如果您rdtsc
直接使用,您可能希望将您的程序或线程固定到一个核心,例如taskset -c 0 ./myprogram
在 Linux 上。无论 TSC 是否需要它,CPU 迁移通常会导致大量缓存未命中,并且无论如何都会弄乱您的测试,并且需要额外的时间。(尽管中断也会如此)。
How efficient is the asm from using the intrinsic?
asm 使用内在函数的效率如何?
It's about as good as you'd get from @Mysticial's GNU C inline asm, or better because it knows the upper bits of RAX are zeroed. The main reason you'd want to keep inline asm is for compat with crusty old compilers.
它和你从@Mysticial 的 GNU C 内联汇编中得到的一样好,或者更好,因为它知道 RAX 的高位被清零。您想要保持内联 asm 的主要原因是为了与硬朗的旧编译器兼容。
A non-inline version of the readTSC
function itself compiles with MSVC for x86-64 like this:
readTSC
函数本身的非内联版本使用 MSVC for x86-64 编译,如下所示:
unsigned __int64 readTSC(void) PROC ; readTSC
rdtsc
shl rdx, 32 ; 00000020H
or rax, rdx
ret 0
; return in RAX
For 32-bit calling conventions that return 64-bit integers in edx:eax
, it's just rdtsc
/ret
. Not that it matters, you always want this to inline.
对于在 中返回 64 位整数的 32 位调用约定edx:eax
,它只是rdtsc
/ ret
。这并不重要,你总是希望它内联。
In a test caller that uses it twice and subtracts to time an interval:
在使用它两次并减去时间间隔的测试调用者中:
uint64_t time_something() {
uint64_t start = readTSC();
// even when empty, back-to-back __rdtsc() don't optimize away
return readTSC() - start;
}
All 4 compilers make pretty similar code. This is GCC's 32-bit output:
所有 4 个编译器都生成非常相似的代码。这是 GCC 的 32 位输出:
# gcc8.2 -O3 -m32
time_something():
push ebx # save a call-preserved reg: 32-bit only has 3 scratch regs
rdtsc
mov ecx, eax
mov ebx, edx # start in ebx:ecx
# timed region (empty)
rdtsc
sub eax, ecx
sbb edx, ebx # edx:eax -= ebx:ecx
pop ebx
ret # return value in edx:eax
This is MSVC's x86-64 output (with name-demangling applied). gcc/clang/ICC all emit identical code.
这是 MSVC 的 x86-64 输出(应用了名称拆分)。gcc/clang/ICC 都发出相同的代码。
# MSVC 19 2017 -Ox
unsigned __int64 time_something(void) PROC ; time_something
rdtsc
shl rdx, 32 ; high <<= 32
or rax, rdx
mov rcx, rax ; missed optimization: lea rcx, [rdx+rax]
; rcx = start
;; timed region (empty)
rdtsc
shl rdx, 32
or rax, rdx ; rax = end
sub rax, rcx ; end -= start
ret 0
unsigned __int64 time_something(void) ENDP ; time_something
All 4 compilers use or
+mov
instead of lea
to combine the low and high halves into a different register. I guess it's kind of a canned sequence that they fail to optimize.
所有 4 个编译器都使用or
+mov
而不是lea
将低半和高半组合到不同的寄存器中。我想这是他们未能优化的一种固定序列。
But writing a shift/lea in inline asm yourself is hardly better. You'd deprive the compiler of the opportunity to ignore the high 32 bits of the result in EDX, if you're timing such a short interval that you only keep a 32-bit result. Or if the compiler decides to store the start time to memory, it could just use two 32-bit stores instead of shift/or / mov. If 1 extra uop as part of your timing bothers you, you'd better write your whole microbenchmark in pure asm.
但是自己在内联汇编中编写 shift/lea 也好不到哪里去。如果您的时间间隔如此短以至于只保留 32 位结果,那么您将剥夺编译器忽略 EDX 中结果的高 32 位的机会。或者,如果编译器决定将开始时间存储到内存中,它可以只使用两个 32 位存储而不是 shift/或 /mov。如果 1 个额外的 uop 作为时间的一部分困扰着您,您最好用纯 asm 编写整个微基准测试。
However, we can maybe get the best of both worlds with a modified version of @Mysticial's code:
但是,我们可以使用@Mysticial 代码的修改版本来两全其美:
// More efficient than __rdtsc() in some case, but maybe worse in others
uint64_t rdtsc(){
// long and uintptr_t are 32-bit on the x32 ABI (32-bit pointers in 64-bit mode), so #ifdef would be better if we care about this trick there.
unsigned long lo,hi; // let the compiler know that zero-extension to 64 bits isn't required
__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
return ((uint64_t)hi << 32) + lo;
// + allows LEA or ADD instead of OR
}
On Godbolt, this does sometimes give better asm than __rdtsc()
for gcc/clang/ICC, but other times it tricks compilers into using an extra register to save lo and hi separately, so clang can optimize into ((end_hi-start_hi)<<32) + (end_lo-start_lo)
. Hopefully if there's real register pressure, compilers will combine earlier. (gcc and ICC still save lo/hi separately, but don't optimize as well.)
在 Godbolt 上,这有时确实比__rdtsc()
gcc/clang/ICC提供更好的 asm ,但有时它会诱使编译器使用额外的寄存器来分别保存 lo 和 hi,因此 clang 可以优化为((end_hi-start_hi)<<32) + (end_lo-start_lo)
. 希望如果有真正的寄存器压力,编译器会更早地结合起来。(gcc 和 ICC 仍然分别保存 lo/hi,但也不优化。)
But 32-bit gcc8 makes a mess of it, compiling even just the rdtsc()
function itself with an actual add/adc
with zeros instead of just returning the result in edx:eax like clang does. (gcc6 and earlier do ok with |
instead of +
, but definitely prefer the __rdtsc()
intrinsic if you care about 32-bit code-gen from gcc).
但是 32 位 gcc8 把它弄得一团糟,甚至只是rdtsc()
用一个add/adc
带零的实际编译函数本身,而不是像 clang 那样只在 edx:eax 中返回结果。(gcc6 和更早版本可以使用|
而不是+
,但__rdtsc()
如果您关心来自 gcc 的 32 位代码生成,则绝对更喜欢内在的)。
回答by Jerry Coffin
VC++ uses an entirely different syntax for inline assembly -- but only in the 32-bit versions. The 64-bit compiler doesn't support inline assembly at all.
VC++ 对内联汇编使用完全不同的语法——但仅限于 32 位版本。64 位编译器根本不支持内联汇编。
In this case, that's probably just as well -- rdtsc
has (at least) two major problem when it comes to timing code sequences. First (like most instructions) it can be executed out of order, so if you're trying to time a short sequence of code, the rdtsc
before and after that code might both be executed before it, or both after it, or what have you (I am fairly sure the two will always execute in order with respect to each other though, so at least the difference will never be negative).
在这种情况下,这可能也一样——rdtsc
在时序代码序列方面(至少)有两个主要问题。首先(像大多数指令一样)它可以乱序执行,所以如果你试图对一小段代码进行计时,那么rdtsc
之前和之后的代码可能都在它之前执行,或者都在它之后执行,或者你有什么(我相当确定这两者将始终按照彼此的顺序执行,因此至少差异永远不会是负数)。
Second, on a multi-core (or multiprocessor) system, one rdtsc might execute on one core/processor and the other on a different core/processor. In such a case, a negative result isentirely possible.
其次,在多核(或多处理器)系统上,一个 rdtsc 可能在一个核/处理器上执行,另一个在不同的核/处理器上执行。在这种情况下,一个负的结果是完全可能的。
Generally speaking, if you want a precise timer under Windows, you're going to be better off using QueryPerformanceCounter
.
一般来说,如果你想在 Windows 下有一个精确的计时器,你最好使用QueryPerformanceCounter
.
If you really insist on using rdtsc
, I believe you'll have to do it in a separate module written entirely in assembly language (or use a compiler intrinsic), then linked with your C or C++. I've never written that code for 64-bit mode, but in 32-bit mode it looks something like this:
如果您真的坚持使用rdtsc
,我相信您必须在完全用汇编语言编写的单独模块中进行操作(或使用内部编译器),然后与您的 C 或 C++ 链接。我从未为 64 位模式编写过该代码,但在 32 位模式下它看起来像这样:
xor eax, eax
cpuid
xor eax, eax
cpuid
xor eax, eax
cpuid
rdtsc
; save eax, edx
; code you're going to time goes here
xor eax, eax
cpuid
rdtsc
I know this looks strange, but it's actually right. You execute CPUID because it's a serializing instruction (can't be executed out of order) and is available in user mode. You execute it three times before you start timing because Intel documents the fact that the first execution can/will run at a different speed than the second (and what they recommend is three, so three it is).
我知道这看起来很奇怪,但实际上是正确的。您执行 CPUID 是因为它是一个序列化指令(不能乱序执行)并且在用户模式下可用。您在开始计时之前执行它三次,因为英特尔记录了这样一个事实,即第一次执行可以/将以与第二次不同的速度运行(他们推荐的是三,所以是三)。
Then you execute your code under test, another cpuid to force serialization, and the final rdtsc to get the time after the code finished.
然后你执行你的测试代码,另一个 cpuid 强制序列化,最后的 rdtsc 获取代码完成后的时间。
Along with that, you want to use whatever means your OS supplies to force this all to run on one process/core. In most cases, you also want to force the code alignment -- changes in alignment can lead to fairly substantial differences in execution spee.
除此之外,您想使用操作系统提供的任何方式来强制这一切都在一个进程/核心上运行。在大多数情况下,您还希望强制代码对齐——对齐的更改可能导致执行速度的相当大的差异。
Finally you want to execute it a number of times -- and it's always possible it'll get interrupted in the middle of things (e.g., a task switch), so you need to be prepared for the possibility of an execution taking quite a bit longer than the rest -- e.g., 5 runs that take ~40-43 clock cycles apiece, and a sixth that takes 10000+ clock cycles. Clearly, in the latter case, you just throw out the outlier -- it's not from your code.
最后,您想多次执行它——并且它总是有可能在事情中间被中断(例如,任务切换),因此您需要为执行需要相当多的时间做好准备比其他时间长——例如,5 次运行每次需要大约 40-43 个时钟周期,第六次运行需要 10000+ 个时钟周期。显然,在后一种情况下,您只需丢弃异常值——它不是来自您的代码。
Summary: managing to execute the rdtsc instruction itself is (almost) the least of your worries. There's quite a bit more you needto do before you can get results from rdtsc
that will actually mean anything.
总结:设法执行 rdtsc 指令本身(几乎)是您最不担心的。在获得结果之前,您还需要做更多的事情,rdtsc
这实际上意味着什么。
回答by Nik Bougalis
For Windows, Visual Studio provides a convenient "compiler intrinsic" (i.e. a special function, which the compiler understands) that executes the RDTSC instruction for you and gives you back the result:
对于 Windows,Visual Studio 提供了一个方便的“编译器内部函数”(即编译器理解的特殊函数),它为您执行 RDTSC 指令并返回结果:
unsigned __int64 __rdtsc(void);