如何从 C++ 获取 x86_64 中的 CPU 周期数?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/13772567/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-27 17:42:13  来源:igfitidea点击:

How to get the CPU cycle count in x86_64 from C++?

c++cperformancex86rdtsc

提问by user997112

I saw this post on SO which contains C code to get the latest CPU Cycle count:

我在 SO 上看到了这篇文章,其中包含用于获取最新 CPU 周期计数的 C 代码:

CPU Cycle count based profiling in C/C++ Linux x86_64

C/C++ Linux x86_64 中基于 CPU 周期计数的分析

Is there a way I can use this code in C++ (windows and linux solutions welcome)? Although written in C (and C being a subset of C++) I am not too certain if this code would work in a C++ project and if not, how to translate it?

有没有办法在 C++ 中使用此代码(欢迎使用 Windows 和 linux 解决方案)?尽管用 C 编写(并且 C 是 C++ 的子集),但我不太确定这段代码是否可以在 C++ 项目中运行,如果不能,如何翻译它?

I am using x86-64

我正在使用 x86-64

EDIT2:

编辑2:

Found this function but cannot get VS2010 to recognise the assembler. Do I need to include anything? (I believe I have to swap uint64_tto long longfor windows....?)

找到了这个函数但是无法让VS2010识别汇编程序。我需要包括任何东西吗?(我相信我有交换uint64_tlong long了窗户......?)

static inline uint64_t get_cycles()
{
  uint64_t t;
  __asm volatile ("rdtsc" : "=A"(t));
  return t;
}

EDIT3:

编辑3:

From above code I get the error:

从上面的代码我得到错误:

"error C2400: inline assembler syntax error in 'opcode'; found 'data type'"

“错误 C2400:‘操作码’中的内联汇编语法错误;找到‘数据类型’”

Could someone please help?

有人可以帮忙吗?

回答by Mysticial

Starting from GCC 4.5 and later, the __rdtsc()intrinsicis now supported by both MSVC and GCC.

从GCC 4.5开始,后来,__rdtsc()固有现在由两个MSVC和海湾合作委员会的支持。

But the include that's needed is different:

但是需要的包含是不同的:

#ifdef _WIN32
#include <intrin.h>
#else
#include <x86intrin.h>
#endif


Here's the original answer before GCC 4.5.

这是 GCC 4.5 之前的原始答案。

Pulled directly out of one of my projects:

直接从我的一个项目中拉出来:

#include <stdint.h>

//  Windows
#ifdef _WIN32

#include <intrin.h>
uint64_t rdtsc(){
    return __rdtsc();
}

//  Linux/GCC
#else

uint64_t rdtsc(){
    unsigned int lo,hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}

#endif

This GNU C Extended asmtells the compiler:

这个GNU C Extended asm告诉编译器:

  • volatile: the outputs aren't a pure function of the inputs (so it has to re-run every time, not reuse an old result).
  • "=a"(lo)and "=d"(hi): the output operands are fixed registers: EAX and EDX. (x86 machine constraints). The x86 rdtscinstruction puts its 64-bit result in EDX:EAX, so letting the compiler pick an output with "=r"wouldn't work: there's no way to ask the CPU for the result to go anywhere else.
  • ((uint64_t)hi << 32) | lo- zero-extend both 32-bit halves to 64-bit (because lo and hi are unsigned), and logically shift + OR them together into a single 64-bit C variable. In 32-bit code, this is just a reinterpretation; the values still just stay in a pair of 32-bit registers. In 64-bit code you typically get an actual shift + OR asm instructions, unless the high half optimizes away.
  • volatile:输出不是输入的纯函数(因此每次都必须重新运行,而不是重用旧结果)。
  • "=a"(lo)"=d"(hi):输出操作数是固定寄存器:EAX 和 EDX。(x86 机器限制)。x86rdtsc指令将其 64 位结果放在 EDX:EAX 中,因此让编译器选择输出"=r"是行不通的:无法要求 CPU 将结果传送到其他任何地方。
  • ((uint64_t)hi << 32) | lo- 将两个 32 位的一半都零扩展到 64 位(因为 lo 和 hi 是unsigned),并在逻辑上将它们 + OR 一起移到一个 64 位 C 变量中。在 32 位代码中,这只是重新解释;这些值仍然只保留在一对 32 位寄存器中。在 64 位代码中,您通常会得到一个实际的 shift + OR asm 指令,除非高半部分被优化掉。

(editor's note: this could probably be more efficient if you used unsigned longinstead of unsigned int. Then the compiler would know that lowas already zero-extended into RAX. It wouldn't know that the upper half was zero, so |and +are equivalent if it wanted to merge a different way. The intrinsic should in theory give you the best of both worlds as far as letting the optimizer do a good job.)

(编者注:这很可能是更有效的,如果你使用unsigned long的不是unsigned int那么编译器会知道。lo已经零扩展到RAX它不会知道,上半部分是零,所以,|+是等价的,如果它想以不同的方式合并。理论上,内在应该为您提供两全其美的方法,让优化器做得很好。)

https://gcc.gnu.org/wiki/DontUseInlineAsmif you can avoid it. But hopefully this section is useful if you need to understand old code that uses inline asm so you can rewrite it with intrinsics. See also https://stackoverflow.com/tags/inline-assembly/info

https://gcc.gnu.org/wiki/DontUseInlineAsm如果可以避免的话。但是如果您需要了解使用内联 asm 的旧代码,那么希望本节很有用,以便您可以使用内在函数重写它。另请参阅https://stackoverflow.com/tags/inline-assembly/info

回答by Peter Cordes

Your inline asm is broken for x86-64. "=A"in 64-bit mode lets the compiler pick eitherRAX or RDX, not EDX:EAX. See this Q&A for more

对于 x86-64,您的内联汇编已损坏。 "=A"在64位模式下让编译器挑要么RAX或RDX,不EDX:EAX。查看此问答了解更多



You don't need inline asm for this. There's no benefit; compilers have built-ins for rdtscand rdtscp, and (at least these days) all define a __rdtscintrinsic if you include the right headers. But unlike almost all other cases (https://gcc.gnu.org/wiki/DontUseInlineAsm), there's no serious downside to asm, as long as you're using a good and safe implementation like @Mysticial's.

你不需要内联 asm 这个。没有任何好处;编译器内置了rdtscand rdtscp,并且(至少现在)__rdtsc如果包含正确的头文件,它们都会定义一个内在函数。但与几乎所有其他情况(https://gcc.gnu.org/wiki/DontUseInlineAsm)不同,只要您使用像@Mysticial 的.

(One minor advantage to asm is if you want to time a small interval that's certainly going to be less than 2^32 counts, you can ignore the high half of the result. Compilers coulddo that optimization for you with a uint32_t time_low = __rdtsc()intrinsic, but in practice they sometimes still waste instructions doing shift / OR.)

(asm 的一个小优势是,如果您想对肯定会小于 2^32 计数的小间隔进行计时,则可以忽略结果的高半部分。编译器可以使用uint32_t time_low = __rdtsc()内在函数为您进行优化,但在练习他们有时仍然会浪费指令做轮班/或。)



Unfortunately MSVC disagrees with everyone else about which header to use for non-SIMD intrinsics.

不幸的是,对于非 SIMD 内在函数使用哪个标头,MSVC 不同意其他所有人的意见。

Intel's intriniscs guidesays _rdtsc(with one underscore) is in <immintrin.h>, but that doesn't work on gcc and clang. They only define SIMD intrinsics in <immintrin.h>, so we're stuck with <intrin.h>(MSVC) vs. <x86intrin.h>(everything else, including recent ICC). For compat with MSVC, and Intel's documentation, gcc and clang define both the one-underscore and two-underscore versions of the function.

英特尔的 intriniscs 指南_rdtsc(带有一个下划线)是 in <immintrin.h>,但这不适用于 gcc 和 clang。他们只在 中定义 SIMD 内在函数<immintrin.h>,因此我们坚持使用<intrin.h>(MSVC) 与<x86intrin.h>(其他所有内容,包括最近的 ICC)。为了与 MSVC 和 Intel 的文档兼容,gcc 和 clang 定义了该函数的单下划线和双下划线版本。

Fun fact: the double-underscore version returns an unsigned 64-bit integer, while Intel documents _rdtsc()as returning (signed) __int64.

有趣的事实:双下划线版本返回一个无符号的 64 位整数,而英特尔记录_rdtsc()为返回 (signed) __int64

// valid C99 and C++

#include <stdint.h>  // <cstdint> is preferred in C++, but stdint.h works.

#ifdef _MSC_VER
# include <intrin.h>
#else
# include <x86intrin.h>
#endif

// optional wrapper if you don't want to just use __rdtsc() everywhere
inline
uint64_t readTSC() {
    // _mm_lfence();  // optionally wait for earlier insns to retire before reading the clock
    uint64_t tsc = __rdtsc();
    // _mm_lfence();  // optionally block later instructions until rdtsc retires
    return tsc;
}

// requires a Nehalem or newer CPU.  Not Core2 or earlier.  IDK when AMD added it.
inline
uint64_t readTSCp() {
    unsigned dummy;
    return __rdtscp(&dummy);  // waits for earlier insns to retire, but allows later to start
}

Compiles with all 4 of the major compilers: gcc/clang/ICC/MSVC, for 32 or 64-bit.See the results on the Godbolt compiler explorer, including a couple test callers.

与所有 4 个主要编译器一起编译:gcc/clang/ICC/MSVC,适用于 32 位或 64 位。查看Godbolt 编译器资源管理器上的结果,包括几个测试调用者。

These intrinsics were new in gcc4.5 (from 2010) and clang3.5 (from 2014). gcc4.4 and clang 3.4 on Godbolt don't compile this, but gcc4.5.3 (April 2011) does. You might see inline asm in old code, but you can and should replace it with __rdtsc(). Compilers over a decade old usually make slower code than gcc6, gcc7, or gcc8, and have less useful error messages.

这些内在函数是 gcc4.5(从 2010 年开始)和 clang3.5(从 2014 年开始)中的新内容。Godbolt 上的 gcc4.4 和 clang 3.4 不编译这个,但 gcc4.5.3(2011 年 4 月)可以。您可能会在旧代码中看到内联 asm,但您可以并且应该将其替换为__rdtsc(). 十多年以前的编译器生成的代码通常比 gcc6、gcc7 或 gcc8 慢,而且错误消息的用处也较少。

The MSVC intrinsic has (I think) existed far longer, because MSVC never supported inline asm for x86-64. ICC13 has __rdtscin immintrin.h, but doesn't have an x86intrin.hat all. More recent ICC have x86intrin.h, at least the way Godbolt installs them for Linux they do.

MSVC 内在的(我认为)存在的时间要长得多,因为 MSVC 从未支持 x86-64 的内联 asm。ICC13 有__rdtscin immintrin.h,但根本没有x86intrin.h。最近的 ICC 有x86intrin.h,至少是 Godbolt 为 Linux 安装它们的方式。

You might want to define them as signed long long, especially if you want to subtract them and convert to float. int64_t-> float/double is more efficient than uint64_ton x86 without AVX512. Also, small negative results could be possible because of CPU migrations if TSCs aren't perfectly synced, and that probably makes more sense than huge unsigned numbers.

您可能希望将它们定义为 signedlong long,特别是如果您想减去它们并转换为浮点数。 int64_t-> float/double 比uint64_t没有 AVX512 的 x86更有效。此外,如果 TSC 没有完全同步,可能会因为 CPU 迁移而导致小的负面结果,这可能比巨大的无符号数字更有意义。



BTW, clang also has a portable __builtin_readcyclecounter()which works on any architecture. (Always returns zero on architectures without a cycle counter.) See the clang/LLVM language-extension docs

顺便说一句,clang 也有一个__builtin_readcyclecounter()适用于任何架构的便携式。(在没有循环计数器的架构上总是返回零。)请参阅clang/LLVM 语言扩展文档



For more about using lfence(or cpuid) to improve repeatability of rdtscand control exactly which instructions are / aren't in the timed interval by blocking out-of-order execution, see @HadiBrais' answer on clflush to invalidate cache line via C functionand the comments for an example of the difference it makes.

有关使用lfence(或cpuidrdtsc通过阻止乱序执行来提高重复性和精确控制哪些指令在/不在定时间隔内的更多信息,请参阅@HadiBrais 对clflush的回答以通过 C 函数和评论其差异的示例。

See also Is LFENCE serializing on AMD processors?(TL:DR yes with Spectre mitigation enabled, otherwise kernels leave the relevant MSR unset so you should use cpuidto serialize.) It's always been defined as partially-serializing on Intel.

另请参阅LFENCE 是否在 AMD 处理器上进行序列化?(TL:DR 是启用 Spectre 缓解,否则内核会保留相关的 MSR 未设置,因此您应该用于cpuid序列化。)它在 Intel 上始终被定义为部分序列化。

How to Benchmark Code Execution Times on Intel? IA-32 and IA-64 Instruction Set Architectures, an Intel white-paper from 2010.

如何在英特尔上对代码执行时间进行基准测试?IA-32 和 IA-64 指令集架构,英特尔 2010 年的白皮书。



rdtsccounts referencecycles, not CPU core clock cycles

rdtsc计数参考周期,而不是 CPU 内核时钟周期

It counts at a fixed frequency regardless of turbo / power-saving, so if you want uops-per-clock analysis, use performance counters. rdtscis exactly correlated with wall-clock time (not counting system clock adjustments, so it's a perfect time source for steady_clock).

无论涡轮增压/节能如何,它都以固定频率计数,因此如果您想要每时钟 uops 分析,请使用性能计数器。 rdtsc与挂钟时间完全相关(不计算系统时钟调整,因此它是 的完美时间源steady_clock)。

The TSC frequency used to always be equal to the CPU's rated frequency, i.e. the advertised sticker frequency. In some CPUs it's merely close, e.g. 2592 MHz on an i7-6700HQ 2.6 GHz Skylake, or 4008MHz on a 4000MHz i7-6700k. On even newer CPUs like i5-1035 Ice Lake, TSC = 1.5 GHz, base = 1.1 GHz, so disabling turbo won't even approximately work for TSC = core cycles on those CPUs.

TSC 频率过去总是等于 CPU 的额定频率,即广告标贴频率。在某些 CPU 中,它只是接近,例如 i7-6700HQ 2.6 GHz Skylake 上的 2592 MHz,或 4000MHz i7-6700k 上的 4008MHz。在 i5-1035 Ice Lake 等更新的 CPU 上,TSC = 1.5 GHz,base = 1.1 GHz,因此禁用 Turbo 甚至不适用于 TSC = 核心周期在这些 CPU 上。

If you use it for microbenchmarking, include a warm-up period first to make sure your CPU is already at max clock speed before you start timing. (And optionally disable turbo and tell your OS to prefer max clock speed to avoid CPU frequency shifts during your microbenchmark).
Microbenchmarking is hard: see Idiomatic way of performance evaluation?for other pitfalls.

如果您将它用于微基准测试,请先包括一个预热期,以确保您的 CPU 在开始计时之前已经处于最大时钟速度。(并且可以选择禁用 Turbo 并告诉您的操作系统更喜欢最大时钟速度以避免在您的微基准测试期间 CPU 频率偏移)。
微基准测试很难:看到性能评估的惯用方法吗?对于其他陷阱。

Instead of TSC at all, you can use a library that gives you access to hardware performance counters. The complicated but low-overhead way is to program perf counters and use rdmsrin user-space, or simpler ways include tricks like perf stat for part of programif your timed region is long enough that you can attach a perf stat -p PID.

您可以使用一个可以访问硬件性能计数器的库来代替 TS​​C。复杂的,但低开销的方法是程序PERF柜台和使用rdmsr用户空间,或者更简单的方法包括像花样PERF的统计为计划的一部分,如果你的计时区是足够长的时间,你可以附上perf stat -p PID

You usually will still want to keep the CPU clock fixed for microbenchmarks, though, unless you want to see how different loads will get Skylake to clock down when memory-bound or whatever. (Note that memory bandwidth / latency is mostly fixed, using a different clock than the cores. At idle clock speed, an L2 or L3 cache miss takes many fewer core clock cycles.)

但是,您通常仍希望为微基准测试保持 CPU 时钟固定不变,除非您想了解不同的负载将如何在内存受限或其他情况下使 Skylake 时钟下降。(请注意,内存带宽/延迟大多是固定的,使用与内核不同的时钟。在空闲时钟速度下,L2 或 L3 缓存未命中所需的内核时钟周期要少得多。)

CPU TSC fetch operation especially in multicore-multi-processor environmentsays that Nehalem and newer have the TSC synced and locked together for all cores in a package(along with the invariant = constant and nonstop TSC feature). See @amdn's answer there for some good info about multi-socket sync.

CPU TSC 获取操作,尤其是在多核-多处理器环境中,表示Nehalem 和更新版本将 TSC 同步并锁定到一个包中的所有内核(以及不变 = 恒定和不间断 TSC 功能)。有关多套接字同步的一些好信息,请参阅@amdn 的回答。

(And apparently usually reliable even for modern multi-socket systems as long as they have that feature, see @amdn's answer on the linked question, and more details below.)

(而且显然即使对于现代多路系统来说通常也是可靠的,只要它们具有该功能,请参阅@amdn 对链接问题的回答,以及下面的更多详细信息。)



CPUID features relevant to the TSC

与 TSC 相关的 CPUID 功能

Using the names that Linux /proc/cpuinfouses for the CPU features, and other aliases for the same feature that you'll also find.

使用Linux/proc/cpuinfo用于 CPU features的名称,以及您还会发现的相同功能的其他别名。

  • tsc- the TSC exists and rdtscis supported. Baseline for x86-64.
  • rdtscp- rdtscpis supported.
  • tsc_deadline_timerCPUID.01H:ECX.TSC_Deadline[bit 24] = 1- local APIC can be programmed to fire an interrupt when the TSC reaches a value you put in IA32_TSC_DEADLINE. Enables "tickless" kernels, I think, sleeping until the next thing that's supposed to happen.
  • constant_tsc: Intel CPUID.80000007H:EDX[8]aka invariant TSC: The TSC ticks at constant frequency regardless of turbo / idle changes in core clock speed.Intel since at least Core 2, maybe earlier. Without this, RDTSC doescount core clock cycles.

  • nonstop_tsc: The TSC keeps ticking even in deep sleep states like ACPI C6 where the core is mostly powered down in a low-power state until the next interrupt.

    No separate CPUID feature bit; on Intel and AMD the same invariant TSC CPUID bit implies both constant_tscand nonstop_tscfeatures. See Linux's x86/kernel/cpu/intel.c detection code, and amd.cwas similar. I didn't check the Linux's .cfiles for other vendors, maybe there are some where it has to unset nonstop. Some CPUs Saltwell/Silvermont/Airmont even keep TSC ticking in ACPI S3 full-system sleep: nonstop_tsc_s3.

    I searched but didn't find anywhere that Linux unsets the internal X86_FEATURE_NONSTOP_TSCfeature bit so IDK if there are any CPUs with one but not the other, but presumably they are separate features for a reason. I think older Linux kernels didn't have a separate name, just constant_tsc.

  • tsc_adjust: CPUID.(EAX=07H, ECX=0H):EBX.TSC_ADJUST (bit 1)The IA32_TSC_ADJUSTMSR is available, allowing OSes to set an offset that's added to the TSC when rdtscor rdtscpreads it. This allows effectively changing the TSC on some/all cores without desyncing it across logical cores. (Which would happen if software set the TSC to a new absolute value on each core; it's very hard to get the relevant WRMSR instruction executed at the same cycle on every core.)

  • tsc- TSC 存在并rdtsc受支持。x86-64 的基线。
  • rdtscp-rdtscp支持。
  • tsc_deadline_timerCPUID.01H:ECX.TSC_Deadline[bit 24] = 1- 本地 APIC 可以编程为当 TSC 达到您输入的值时触发中断IA32_TSC_DEADLINE。启用“无滴答”内核,我认为,休眠直到应该发生的下一件事情。
  • constant_tsc:英特尔CPUID.80000007H:EDX[8]又名不变 TSC:TSC 以恒定频率滴答作响,而不管核心时钟速度的 turbo/idle 变化如何。英特尔至少从 Core 2 开始,也许更早。没有这个,RDTSC计算核心时钟周期。

  • nonstop_tsc:即使在像 ACPI C6 这样的深度睡眠状态下,TSC 也会保持滴答作响,在这种状态下,内核大部分时间处于低功耗状态,直到下一个中​​断。

    没有单独的 CPUID 特征位;在 Intel 和 AMD 上,相同的不变 TSC CPUID 位暗示了两者constant_tscnonstop_tsc特性。参见Linux 的 x86/kernel/cpu/intel.c 检测代码,与此amd.c类似。我没有检查.c其他供应商的 Linux文件,也许有些地方必须不停地取消设置。一些 CPU Saltwell/Silvermont/Airmont 甚至在 ACPI S3 全系统睡眠中保持 TSC 滴答作响:nonstop_tsc_s3.

    我搜索了但没有找到任何地方 Linux 取消设置内部X86_FEATURE_NONSTOP_TSC功能位,所以 IDK 如果有任何 CPU 有一个而不是另一个,但大概是出于某种原因它们是单独的功能。我认为较旧的 Linux 内核没有单独的名称,只有constant_tsc.

  • tsc_adjustCPUID.(EAX=07H, ECX=0H):EBX.TSC_ADJUST (bit 1)IA32_TSC_ADJUSTMSR是可用的,允许操作系统设置一个偏移量的增加时,TSCrdtscrdtscp读取它。这允许有效地更改某些/所有内核上的 TSC,而无需跨逻辑内核对其进行去同步。(如果软件在每个内核上将 TSC 设置为新的绝对值,就会发生这种情况;很难在每个内核上以相同的周期执行相关的 WRMSR 指令。)

constant_tscand nonstop_tsctogether make the TSC usable as a timesource for things like clock_gettimein user-space. (But OSes like Linux only use RDTSC to interpolate between ticks of a slower clock maintained with NTP, updating the scale / offset factors in timer interrupts. See On a cpu with constant_tsc and nonstop_tsc, why does my time drift?) On even older CPUs that don't support deep sleep states or frequency scaling, TSC as a timesource may still be usable

constant_tscnonstop_tsc一起使 TSC 可用作clock_gettime用户空间等事物的时间源。(但像 Linux 这样的操作系统只使用 RDTSC 在用 NTP 维护的较慢时钟的滴答声之间进行插值,更新定时器中断中的比例/偏移因子。参见在带有 constant_tsc 和 nonstop_tsc 的 cpu 上,为什么我的时间会漂移?)在更旧的 CPU 上不支持深度睡眠状态或频率缩放,TSC 作为时间源可能仍然可用

The comments in the Linux source codealso indicate that constant_tsc/ nonstop_tscfeatures (on Intel) implies "It is also reliable across cores and sockets. (but not across cabinets - we turn it off in that case explicitly.)"

Linux 源代码中注释还表明constant_tsc/nonstop_tsc特性(在英特尔上)暗示“它在内核和套接字之间也是可靠的。(但不是跨机柜 - 在这种情况下我们明确地将其关闭。)

The "across sockets" part is not accurate. In general, an invariant TSC only guarantees that the TSC is synchronized between cores within the same socket. On an Intel forum thread, Martin Dixon (Intel) points out that TSC invariance does not imply cross-socket synchronization. That requires the platform vendor to distribute RESET synchronously to all sockets.Apparently platform vendors doin practice do that, given the above Linux kernel comment. Answers on CPU TSC fetch operation especially in multicore-multi-processor environmentalso agree that all sockets on a single motherboard should start out in sync.

“跨插座”部分不准确。通常,不变 TSC 仅保证 TSC 在同一插槽内的内核之间同步。在英特尔论坛主题上,Martin Dixon(英特尔)指出TSC 不变性并不意味着跨套接字同步。这要求平台供应商将 RESET 同步分发到所有套接字。显然,平台厂商在实践中做到这一点,在上述的Linux内核评论。关于CPU TSC 获取操作的答案,尤其是在多核多处理器环境中,也同意单个主板上的所有插槽都应该同步开始。

On a multi-socket shared memory system, there is no direct way to check whether the TSCs in all the cores are synced. The Linux kernel, by default performs boot-time and run-time checks to make sure that TSC can be used as a clock source. These checks involve determining whether the TSC is synced. The output of the command dmesg | grep 'clocksource'would tell you whether the kernel is using TSC as the clock source, which would only happen if the checks have passed. But even then, this would not be definitive proof that the TSC is synced across all sockets of the system.The kernel paramter tsc=reliablecan be used to tell the kernel that it can blindly use the TSC as the clock source without doing any checks.

在多路共享内存系统上,没有直接的方法来检查所有内核中的 TSC 是否同步。Linux 内核默认执行启动时和运行时检查,以确保 TSC 可用作时钟源。这些检查涉及确定 TSC 是否同步。命令的输出dmesg | grep 'clocksource'会告诉您内核是否使用 TSC 作为时钟源,这只有在检查通过时才会发生。但即便如此,这也不是 TSC 在系统的所有套接字上同步的明确证据。内核参数tsc=reliable可以用来告诉内核它可以不做任何检查而盲目地使用TSC作为时钟源。

There are two cases where cross-socket TSCs are commonly NOT in sync: (1) hotplugging a CPU, and (2) when the sockets are spread out across different boards connected by extended node controllers.

有两种情况下,跨插槽 TSC 通常不同步:(1) 热插拔 CPU,以及 (2) 当插槽分布在由扩展节点控制器连接的不同板上时。

An OS or hypervisor that changes the TSC directly instead of using the TSC_ADJUST offset can de-sync them, so in user-space it might not always be safe to assume that CPU migrations won't leave you reading a different clock. (This is why rdtscpproduces a core-ID as an extra output, so you can detect when start/end times come from different clocks. It might have been introduced before the invariant TSC feature, or maybe they just wanted to account for every possibility.)

直接更改 TSC 而不是使用 TSC_ADJUST 偏移量的操作系统或管理程序可以取消同步它们,因此在用户空间中,假设 CPU 迁移不会让您读取不同的时钟可能并不总是安全的。(这就是为什么rdtscp产生一个 core-ID 作为额外输出,这样你就可以检测何时开始/结束时间来自不同的时钟。它可能是在不变 TSC 功能之前引入的,或者他们只是想考虑每一种可能性。 )

If you're using rdtscdirectly, you may want to pin your program or thread to a core, e.g. with taskset -c 0 ./myprogramon Linux. Whether you need it for the TSC or not, CPU migration will normally lead to a lot of cache misses and mess up your test anyway, as well as taking extra time. (Although so will an interrupt).

如果您rdtsc直接使用,您可能希望将您的程序或线程固定到一个核心,例如taskset -c 0 ./myprogram在 Linux 上。无论 TSC 是否需要它,CPU 迁移通常会导致大量缓存未命中,并且无论如何都会弄乱您的测试,并且需要额外的时间。(尽管中断也会如此)。



How efficient is the asm from using the intrinsic?

asm 使用内在函数的效率如何?

It's about as good as you'd get from @Mysticial's GNU C inline asm, or better because it knows the upper bits of RAX are zeroed. The main reason you'd want to keep inline asm is for compat with crusty old compilers.

它和你从@Mysticial 的 GNU C 内联汇编中得到的一样好,或者更好,因为它知道 RAX 的高位被清零。您想要保持内联 asm 的主要原因是为了与硬朗的旧编译器兼容。

A non-inline version of the readTSCfunction itself compiles with MSVC for x86-64 like this:

readTSC函数本身的非内联版本使用 MSVC for x86-64 编译,如下所示:

unsigned __int64 readTSC(void) PROC                             ; readTSC
    rdtsc
    shl     rdx, 32                             ; 00000020H
    or      rax, rdx
    ret     0
  ; return in RAX

For 32-bit calling conventions that return 64-bit integers in edx:eax, it's just rdtsc/ret. Not that it matters, you always want this to inline.

对于在 中返回 64 位整数的 32 位调用约定edx:eax,它只是rdtsc/ ret。这并不重要,你总是希望它内联。

In a test caller that uses it twice and subtracts to time an interval:

在使用它两次并减去时间间隔的测试调用者中:

uint64_t time_something() {
    uint64_t start = readTSC();
    // even when empty, back-to-back __rdtsc() don't optimize away
    return readTSC() - start;
}

All 4 compilers make pretty similar code. This is GCC's 32-bit output:

所有 4 个编译器都生成非常相似的代码。这是 GCC 的 32 位输出:

# gcc8.2 -O3 -m32
time_something():
    push    ebx               # save a call-preserved reg: 32-bit only has 3 scratch regs
    rdtsc
    mov     ecx, eax
    mov     ebx, edx          # start in ebx:ecx
      # timed region (empty)

    rdtsc
    sub     eax, ecx
    sbb     edx, ebx          # edx:eax -= ebx:ecx

    pop     ebx
    ret                       # return value in edx:eax

This is MSVC's x86-64 output (with name-demangling applied). gcc/clang/ICC all emit identical code.

这是 MSVC 的 x86-64 输出(应用了名称拆分)。gcc/clang/ICC 都发出相同的代码。

# MSVC 19  2017  -Ox
unsigned __int64 time_something(void) PROC                            ; time_something
    rdtsc
    shl     rdx, 32                  ; high <<= 32
    or      rax, rdx
    mov     rcx, rax                 ; missed optimization: lea rcx, [rdx+rax]
                                     ; rcx = start
     ;; timed region (empty)

    rdtsc
    shl     rdx, 32
    or      rax, rdx                 ; rax = end

    sub     rax, rcx                 ; end -= start
    ret     0
unsigned __int64 time_something(void) ENDP                            ; time_something

All 4 compilers use or+movinstead of leato combine the low and high halves into a different register. I guess it's kind of a canned sequence that they fail to optimize.

所有 4 个编译器都使用or+mov而不是lea将低半和高半组合到不同的寄存器中。我想这是他们未能优化的一种固定序列。

But writing a shift/lea in inline asm yourself is hardly better. You'd deprive the compiler of the opportunity to ignore the high 32 bits of the result in EDX, if you're timing such a short interval that you only keep a 32-bit result. Or if the compiler decides to store the start time to memory, it could just use two 32-bit stores instead of shift/or / mov. If 1 extra uop as part of your timing bothers you, you'd better write your whole microbenchmark in pure asm.

但是自己在内联汇编中编写 shift/lea 也好不到哪里去。如果您的时间间隔如此短以至于只保留 32 位结果,那么您将剥夺编译器忽略 EDX 中结果的高 32 位的机会。或者,如果编译器决定将开始时间存储到内存中,它可以只使用两个 32 位存储而不是 shift/或 /mov。如果 1 个额外的 uop 作为时间的一部分困扰着您,您最好用纯 asm 编写整个微基准测试。

However, we can maybe get the best of both worlds with a modified version of @Mysticial's code:

但是,我们可以使用@Mysticial 代码的修改版本来两全其美:

// More efficient than __rdtsc() in some case, but maybe worse in others
uint64_t rdtsc(){
    // long and uintptr_t are 32-bit on the x32 ABI (32-bit pointers in 64-bit mode), so #ifdef would be better if we care about this trick there.

    unsigned long lo,hi;  // let the compiler know that zero-extension to 64 bits isn't required
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) + lo;
    // + allows LEA or ADD instead of OR
}

On Godbolt, this does sometimes give better asm than __rdtsc()for gcc/clang/ICC, but other times it tricks compilers into using an extra register to save lo and hi separately, so clang can optimize into ((end_hi-start_hi)<<32) + (end_lo-start_lo). Hopefully if there's real register pressure, compilers will combine earlier. (gcc and ICC still save lo/hi separately, but don't optimize as well.)

在 Godbolt 上,这有时确实比__rdtsc()gcc/clang/ICC提供更好的 asm ,但有时它会诱使编译器使用额外的寄存器来分别保存 lo 和 hi,因此 clang 可以优化为((end_hi-start_hi)<<32) + (end_lo-start_lo). 希望如果有真正的寄存器压力,编译器会更早地结合起来。(gcc 和 ICC 仍然分别保存 lo/hi,但也不优化。)

But 32-bit gcc8 makes a mess of it, compiling even just the rdtsc()function itself with an actual add/adcwith zeros instead of just returning the result in edx:eax like clang does. (gcc6 and earlier do ok with |instead of +, but definitely prefer the __rdtsc()intrinsic if you care about 32-bit code-gen from gcc).

但是 32 位 gcc8 把它弄得一团糟,甚至只是rdtsc()用一个add/adc带零的实际编译函数本身,而不是像 clang 那样只在 edx:eax 中返回结果。(gcc6 和更早版本可以使用|而不是+,但__rdtsc()如果您关心来自 gcc 的 32 位代码生成,则绝对更喜欢内在的)。

回答by Jerry Coffin

VC++ uses an entirely different syntax for inline assembly -- but only in the 32-bit versions. The 64-bit compiler doesn't support inline assembly at all.

VC++ 对内联汇编使用完全不同的语法——但仅限于 32 位版本。64 位编译器根本不支持内联汇编。

In this case, that's probably just as well -- rdtschas (at least) two major problem when it comes to timing code sequences. First (like most instructions) it can be executed out of order, so if you're trying to time a short sequence of code, the rdtscbefore and after that code might both be executed before it, or both after it, or what have you (I am fairly sure the two will always execute in order with respect to each other though, so at least the difference will never be negative).

在这种情况下,这可能也一样——rdtsc在时序代码序列方面(至少)有两个主要问题。首先(像大多数指令一样)它可以乱序执行,所以如果你试图对一小段代码进行计时,那么rdtsc之前和之后的代码可能都在它之前执行,或者都在它之后执行,或者你有什么(我相当确定这两者将始终按照彼此的顺序执行,因此至少差异永远不会是负数)。

Second, on a multi-core (or multiprocessor) system, one rdtsc might execute on one core/processor and the other on a different core/processor. In such a case, a negative result isentirely possible.

其次,在多核(或多处理器)系统上,一个 rdtsc 可能在一个核/处理器上执行,另一个在不同的核/处理器上执行。在这种情况下,一个负的结果完全可能的。

Generally speaking, if you want a precise timer under Windows, you're going to be better off using QueryPerformanceCounter.

一般来说,如果你想在 Windows 下有一个精确的计时器,你最好使用QueryPerformanceCounter.

If you really insist on using rdtsc, I believe you'll have to do it in a separate module written entirely in assembly language (or use a compiler intrinsic), then linked with your C or C++. I've never written that code for 64-bit mode, but in 32-bit mode it looks something like this:

如果您真的坚持使用rdtsc,我相信您必须在完全用汇编语言编写的单独模块中进行操作(或使用内部编译器),然后与您的 C 或 C++ 链接。我从未为 64 位模式编写过该代码,但在 32 位模式下它看起来像这样:

   xor eax, eax
   cpuid
   xor eax, eax
   cpuid
   xor eax, eax
   cpuid
   rdtsc
   ; save eax, edx

   ; code you're going to time goes here

   xor eax, eax
   cpuid
   rdtsc

I know this looks strange, but it's actually right. You execute CPUID because it's a serializing instruction (can't be executed out of order) and is available in user mode. You execute it three times before you start timing because Intel documents the fact that the first execution can/will run at a different speed than the second (and what they recommend is three, so three it is).

我知道这看起来很奇怪,但实际上是正确的。您执行 CPUID 是因为它是一个序列化指令(不能乱序执行)并且在用户模式下可用。您在开始计时之前执行它三次,因为英特尔记录了这样一个事实,即第一次执行可以/将以与第二次不同的速度运行(他们推荐的是三,所以是三)。

Then you execute your code under test, another cpuid to force serialization, and the final rdtsc to get the time after the code finished.

然后你执行你的测试代码,另一个 cpuid 强制序列化,最后的 rdtsc 获取代码完成后的时间。

Along with that, you want to use whatever means your OS supplies to force this all to run on one process/core. In most cases, you also want to force the code alignment -- changes in alignment can lead to fairly substantial differences in execution spee.

除此之外,您想使用操作系统提供的任何方式来强制这一切都在一个进程/核心上运行。在大多数情况下,您还希望强制代码对齐——对齐的更改可能导致执行速度的相当大的差异。

Finally you want to execute it a number of times -- and it's always possible it'll get interrupted in the middle of things (e.g., a task switch), so you need to be prepared for the possibility of an execution taking quite a bit longer than the rest -- e.g., 5 runs that take ~40-43 clock cycles apiece, and a sixth that takes 10000+ clock cycles. Clearly, in the latter case, you just throw out the outlier -- it's not from your code.

最后,您想多次执行它——并且它总是有可能在事情中间被中断(例如,任务切换),因此您需要为执行需要相当多的时间做好准备比其他时间长——例如,5 次运行每次需要大约 40-43 个时钟周期,第六次运行需要 10000+ 个时钟周期。显然,在后一种情况下,您只需丢弃异常值——它不是来自您的代码。

Summary: managing to execute the rdtsc instruction itself is (almost) the least of your worries. There's quite a bit more you needto do before you can get results from rdtscthat will actually mean anything.

总结:设法执行 rdtsc 指令本身(几乎)是您最不担心的。在获得结果之前,您还需要做更多的事情,rdtsc这实际上意味着什么。

回答by Nik Bougalis

For Windows, Visual Studio provides a convenient "compiler intrinsic" (i.e. a special function, which the compiler understands) that executes the RDTSC instruction for you and gives you back the result:

对于 Windows,Visual Studio 提供了一个方便的“编译器内部函数”(即编译器理解的特殊函数),它为您执行 RDTSC 指令并返回结果:

unsigned __int64 __rdtsc(void);