Linux 使用 RDTSC 获取 cpu 周期 - 为什么 RDTSC 的值总是增加?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/8602336/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Getting cpu cycles using RDTSC - why does the value of RDTSC always increase?
提问by user1106106
I want to get the CPU cycles at a specific point. I use this function at that point:
我想在特定点获得 CPU 周期。我当时使用这个函数:
static __inline__ unsigned long long rdtsc(void)
{
    unsigned long long int x;
    __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));  // 0x0f 0x31 is the RDTSC opcode
    // broken for 64-bit builds; don't copy this code
    return x;
}
(editor's note: "=A" is wrong for x86-64; it picks either RDX or RAX. Only in 32-bit mode will it pick the EDX:EAX output you want. See How to get the CPU cycle count in x86_64 from C++?.)
(编者注:"=A" 对 x86-64 来说是错误的;它会选取 RDX 或 RAX 之一。只有在 32 位模式下它才会选取你想要的 EDX:EAX 输出。参见 How to get the CPU cycle count in x86_64 from C++?。)
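As a sketch of what the editor's note recommends (this is an addition, not part of the original question; the name rdtsc_u64 is chosen here for illustration), a version that is correct in both 32-bit and 64-bit builds reads the two 32-bit halves explicitly instead of relying on "=A":
作为编者注所建议做法的示意(这是补充内容,并非原问题的一部分;rdtsc_u64 这个名字是这里为举例而取的),一个在 32 位和 64 位构建中都正确的版本会显式读取两个 32 位半部,而不是依赖 "=A":

```c
#include <stdint.h>

/* RDTSC always puts the low 32 bits of the counter in EAX and the
 * high 32 bits in EDX, so ask the compiler for them separately and
 * combine them ourselves. Works in both 32- and 64-bit builds. */
static inline uint64_t rdtsc_u64(void)
{
    uint32_t lo, hi;
    __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}
```

On GCC and Clang you can also simply use the __rdtsc() intrinsic from <x86intrin.h>, which does the same thing.
在 GCC 和 Clang 上也可以直接使用 <x86intrin.h> 中的 __rdtsc() 内建函数,效果相同。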
The problem is that it always returns an increasing number (in every run). It's as if it is referring to the absolute time.
问题是它返回的数字总是增加(在每次运行中)。就好像它指的是绝对时间。
Am I using the function incorrectly?
我是否用错了这个函数?
采纳答案by Damon
As long as your thread stays on the same CPU core, the RDTSC instruction will keep returning an increasing number until it wraps around. For a 2GHz CPU, this happens after 292 years, so it is not a real issue. You probably won't see it happen. If you expect to live that long, make sure your computer reboots, say, every 50 years.
只要您的线程保持在同一个 CPU 内核上,RDTSC 指令就会不断返回递增的数字,直到它回绕为止。对于 2GHz CPU,这会在 292 年后发生,因此这不是真正的问题。你可能不会看到它发生。如果您希望活那么久,请确保您的计算机重新启动,例如每 50 年一次。
The problem with RDTSC is that you have no guarantee that it starts at the same point in time on all cores of an elderly multicore CPU, and no guarantee that it starts at the same point in time on all CPUs on an elderly multi-CPU board.
Modern systems usually do not have such problems, but the problem can also be worked around on older systems by setting a thread's affinity so it only runs on one CPU. This is not good for application performance, so one should not generally do it, but for measuring ticks, it's just fine.
RDTSC 的问题在于,您无法保证它在旧式多核 CPU 的所有内核上从同一时间点开始计数,也无法保证它在旧式多 CPU 主板的所有 CPU 上从同一时间点开始计数。
现代系统通常没有这样的问题,但在较旧的系统上也可以通过设置线程亲和性、让线程只在一个 CPU 上运行来绕过这个问题。这对应用程序性能不利,因此一般不应该这样做,但用于测量时钟周期数则没有问题。
(Another "problem" is that many people use RDTSC for measuring time, which is not what it does, but you wrote that you want CPU cycles, so that is fine. If you do use RDTSC to measure time, you may have surprises when power saving or hyperboost or whatever the multitude of frequency-changing techniques are called kicks in. For actual time, the clock_gettime syscall is surprisingly good under Linux.)
(另一个"问题"是很多人用 RDTSC 来测量时间,而这并不是它的用途,但既然你写的是想要 CPU 周期,那就没关系。如果你确实用 RDTSC 来测量时间,那么当省电、睿频加速或其他各种变频技术生效时,你可能会遇到意外。要测量实际时间,Linux 下的 clock_gettime 系统调用出奇地好用。)
I would just write rdtsc inside the asm statement, which works just fine for me and is more readable than some obscure hex code. Assuming it's the correct hex code (and since it does not crash and does return an ever-increasing number, it seems so), your code is good.
我会直接在 asm 语句里写 rdtsc,这对我来说效果很好,而且比晦涩的十六进制机器码更具可读性。假设那段十六进制代码是正确的(既然它既不崩溃,又返回不断增加的数字,看起来确实如此),您的代码就没有问题。
If you want to measure the number of ticks a piece of code takes, you want a tick difference; you just need to subtract two values of the ever-increasing counter. Something like uint64_t t0 = rdtsc(); ... uint64_t t1 = rdtsc() - t0;
Note that if very accurate measurements isolated from surrounding code are necessary, you need to serialize, that is, stall the pipeline, prior to calling rdtsc (or use rdtscp, which is only supported on newer processors). The one serializing instruction that can be used at every privilege level is cpuid.
如果你想测量一段代码占用的时钟周期数,你需要的是一个差值,只需将这个不断增加的计数器的两个读数相减即可。类似 uint64_t t0 = rdtsc(); ... uint64_t t1 = rdtsc() - t0;
注意,如果需要与周围代码隔离的非常精确的测量,就需要在调用 rdtsc 之前进行序列化,也就是清空流水线(或者使用仅在较新处理器上支持的 rdtscp)。在任何特权级别都可以使用的序列化指令是 cpuid。
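A minimal sketch of the serialized measurement described above (an addition for illustration; the names rdtsc_begin/rdtsc_end are chosen here, and this assumes an x86-64 CPU that supports RDTSCP):
上面描述的序列化测量的一个最小示意(这是为举例而补充的内容;rdtsc_begin/rdtsc_end 这两个名字是这里取的,并假设 x86-64 CPU 支持 RDTSCP):

```c
#include <stdint.h>

/* CPUID before RDTSC: no earlier instruction can still be in flight
 * when the counter is read. */
static inline uint64_t rdtsc_begin(void)
{
    uint32_t lo, hi;
    __asm__ volatile ("cpuid\n\t"
                      "rdtsc"
                      : "=a"(lo), "=d"(hi)
                      : "a"(0)
                      : "ebx", "ecx", "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* RDTSCP waits for all earlier instructions to retire before reading
 * the counter; the trailing CPUID keeps later instructions from being
 * hoisted above the read. */
static inline uint64_t rdtsc_end(void)
{
    uint32_t lo, hi, eax = 0;
    __asm__ volatile ("rdtscp" : "=a"(lo), "=d"(hi) : : "ecx", "memory");
    __asm__ volatile ("cpuid" : "+a"(eax) : : "ebx", "ecx", "edx", "memory");
    return ((uint64_t)hi << 32) | lo;
}
```

Usage follows the t0/t1 pattern from the answer: uint64_t t0 = rdtsc_begin(); /* code under test */ uint64_t ticks = rdtsc_end() - t0;
用法遵循答案中的 t0/t1 模式:uint64_t t0 = rdtsc_begin(); /* 被测代码 */ uint64_t ticks = rdtsc_end() - t0;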
In reply to the further question in the comment:
回复评论中的进一步问题:
The TSC starts at zero when you turn on the computer (and the BIOS resets all counters on all CPUs to the same value, though some BIOSes a few years ago did not do so reliably).
当您打开计算机时,TSC 从零开始(BIOS 会把所有 CPU 上的计数器重置为相同的值,不过几年前的某些 BIOS 在这一点上做得并不可靠)。
Thus, from your program's point of view, the counter started "some unknown time in the past", and it always increases with every clock tick the CPU sees. Therefore if you execute the instruction returning that counter now and any time later in a different process, it will return a greater value (unless the CPU was suspended or turned off in between). Different runs of the same program get bigger numbers, because the counter keeps growing. Always.
因此,从您的程序的角度来看,计数器从“过去的某个未知时间”开始,并且它总是随着 CPU 看到的每个时钟滴答而增加。因此,如果您现在和以后在不同进程中执行返回该计数器的指令,它将返回一个更大的值(除非 CPU 在这期间被挂起或关闭)。同一程序的不同运行得到更大的数字,因为计数器不断增长。总是。
Now, clock_gettime(CLOCK_PROCESS_CPUTIME_ID) is a different matter. This is the CPU time that the OS has given to the process. It starts at zero when your process starts. A new process starts at zero, too. Thus, two processes running one after the other will get very similar or identical numbers, not ever-growing ones.
现在,clock_gettime(CLOCK_PROCESS_CPUTIME_ID) 是另一回事。这是操作系统分配给进程的 CPU 时间。当您的进程启动时,它从零开始。新进程同样从零开始。因此,先后运行的两个进程会得到非常相似甚至相同的数字,而不是不断增长的数字。
clock_gettime(CLOCK_MONOTONIC_RAW) is closer to how RDTSC works (and on some older systems is implemented with it). It returns a value that ever increases. Nowadays, this is typically a HPET. However, this is really time, and not ticks. If your computer goes into a low power state (e.g. running at 1/2 normal frequency), it will still advance at the same pace.
clock_gettime(CLOCK_MONOTONIC_RAW) 更接近 RDTSC 的工作方式(在一些较旧的系统上就是用它实现的)。它返回一个不断增加的值。如今,其底层通常是 HPET。然而,它测量的确实是时间,而不是周期数。如果您的计算机进入低功耗状态(例如以正常频率的 1/2 运行),它仍会以相同的速度前进。
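A small sketch contrasting the two clocks discussed above (an addition for illustration; the helper name clock_ns is chosen here, and this assumes Linux/glibc):
一个对比上面讨论的两种时钟的小示意(这是为举例而补充的内容;辅助函数名 clock_ns 是这里取的,并假设运行在 Linux/glibc 上):

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <time.h>

/* Read a POSIX clock as a single nanosecond count, or -1 on error.
 * CLOCK_PROCESS_CPUTIME_ID starts near zero for a fresh process;
 * CLOCK_MONOTONIC_RAW has an arbitrary epoch but only ever increases. */
static int64_t clock_ns(clockid_t id)
{
    struct timespec ts;
    if (clock_gettime(id, &ts) != 0)
        return -1;
    return (int64_t)ts.tv_sec * 1000000000 + ts.tv_nsec;
}
```

To time a section of code in real time, independent of frequency scaling, subtract two CLOCK_MONOTONIC_RAW readings: int64_t start = clock_ns(CLOCK_MONOTONIC_RAW); /* work */ int64_t elapsed = clock_ns(CLOCK_MONOTONIC_RAW) - start;
要测量一段代码的实际耗时(不受变频影响),可将两个 CLOCK_MONOTONIC_RAW 读数相减:int64_t start = clock_ns(CLOCK_MONOTONIC_RAW); /* 工作 */ int64_t elapsed = clock_ns(CLOCK_MONOTONIC_RAW) - start;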
回答by Brendan
There's lots of confusing and/or wrong information about the TSC out there, so I thought I'd try to clear some of it up.
关于 TSC 有很多令人困惑和/或错误的信息,所以我想我会尝试清除其中的一些信息。
When Intel first introduced the TSC (in original Pentium CPUs) it was clearly documented to count cycles (and not time). However, back then CPUs mostly ran at a fixed frequency, so some people ignored the documented behaviour and used it to measure time instead (most notably, Linux kernel developers). Their code broke in later CPUs that don't run at a fixed frequency (due to power management, etc). Around that time other CPU manufacturers (AMD, Cyrix, Transmeta, etc) were confused and some implemented TSC to measure cycles and some implemented it so it measured time, and some made it configurable (via an MSR).
当 Intel 首次引入 TSC(在最初的 Pentium CPU 中)时,它被清楚地记录为计数周期(而不是时间)。然而,当时的 CPU 大多以固定频率运行,所以有些人忽略了记录的行为,而是用它来测量时间(最著名的是 Linux 内核开发人员)。他们的代码在不以固定频率运行的后来的 CPU 中被破坏(由于电源管理等)。大约在那个时候,其他 CPU 制造商(AMD、Cyrix、Transmeta 等)感到困惑,一些实施了 TSC 来测量周期,一些实施它以测量时间,还有一些使其可配置(通过 MSR)。
Then "multi-chip" systems became more common for servers; and even later multi-core was introduced. This led to minor differences between TSC values on different cores (due to different startup times); but more importantly it also led to major differences between TSC values on different CPUs caused by CPUs running at different speeds (due to power management and/or other factors).
然后“多芯片”系统在服务器中变得更加普遍;甚至后来引入了多核。这导致不同内核上的 TSC 值之间存在细微差异(由于不同的启动时间);但更重要的是,由于 CPU 以不同的速度运行(由于电源管理和/或其他因素),这也导致了不同 CPU 上 TSC 值之间的重大差异。
People who were trying to use it the wrong way from the start (those who used it to measure time and not cycles) complained a lot, and eventually convinced CPU manufacturers to standardise on making the TSC measure time and not cycles.
从一开始就试图错误地使用它的人(用它来测量时间而不是周期的人)抱怨很多,并最终说服 CPU 制造商标准化使 TSC 测量时间而不是周期。
Of course this was a mess - e.g. it takes a lot of code just to determine what the TSC actually measures if you support all 80x86 CPUs; and different power management technologies (including things like SpeedStep, but also things like sleep states) may affect the TSC in different ways on different CPUs; so AMD introduced a "TSC invariant" flag in CPUID to tell the OS that the TSC can be used to measure time correctly.
当然,这是一团糟——例如,如果您支持所有 80x86 CPU,就需要大量代码来确定 TSC 实际测量的内容;不同的电源管理技术(包括 SpeedStep 之类的东西,还有睡眠状态之类的东西)可能会在不同的 CPU 上以不同的方式影响 TSC;因此 AMD 在 CPUID 中引入了“TSC 不变”标志,以告诉操作系统 TSC 可用于正确测量时间。
All recent Intel and AMD CPUs have been like this for a while now - TSC counts time and doesn't measure cycles at all. This means that if you want to measure cycles, you have to use (model-specific) performance monitoring counters. Unfortunately the performance monitoring counters are an even worse mess (due to their model-specific nature and convoluted configuration).
一段时间以来,所有最近的 Intel 和 AMD CPU 都是这样的 - TSC 计算时间,根本不测量周期。这意味着如果您想测量周期,您必须使用(特定于模型的)性能监控计数器。不幸的是,性能监控计数器更糟(由于它们的模型特定性质和复杂的配置)。
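On Linux, the practical way to use those model-specific counters without programming them yourself is the perf subsystem. A hedged sketch (an addition for illustration; the helper names are chosen here, and counting may fail if /proc/sys/kernel/perf_event_paranoid is too restrictive or no PMU is available):
在 Linux 上,使用这些特定于型号的计数器而无需自己编程的实际做法是 perf 子系统。一个带保留的示意(这是为举例而补充的内容;辅助函数名是这里取的,如果 /proc/sys/kernel/perf_event_paranoid 设置过严或没有可用的 PMU,计数可能会失败):

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Open a per-thread hardware cycle counter via perf_event_open.
 * Returns a file descriptor, or -1 on failure. */
static int perf_open_cycles(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

/* Count the cycles consumed by fn(); returns 0 if counting failed. */
static uint64_t count_cycles(void (*fn)(void))
{
    int fd = perf_open_cycles();
    uint64_t cycles = 0;
    if (fd < 0)
        return 0;
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    fn();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    if (read(fd, &cycles, sizeof(cycles)) != sizeof(cycles))
        cycles = 0;
    close(fd);
    return cycles;
}

/* Example workload to measure. */
static void spin_a_bit(void)
{
    for (volatile int i = 0; i < 100000; i++)
        ;
}
```

Unlike the TSC, the result really is cycles, so it shrinks and grows with the actual frequency the core ran at.
与 TSC 不同,这里的结果确实是周期数,因此它会随内核实际运行的频率而增减。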
回答by galois
Good answers already, and Damon already mentioned this in a way in his answer, but I'll add this from the actual x86 manual (volume 2, 4-301) entry for RDTSC:
已经有了很好的答案,Damon 在他的回答中也从某种角度提到了这一点,但我还是想补充一下实际 x86 手册(第 2 卷,4-301)中 RDTSC 条目的内容:
Loads the current value of the processor's time-stamp counter (a 64-bit MSR) into the EDX:EAX registers. The EDX register is loaded with the high-order 32 bits of the MSR and the EAX register is loaded with the low-order 32 bits. (On processors that support the Intel 64 architecture, the high-order 32 bits of each of RAX and RDX are cleared.)
The processor monotonically increments the time-stamp counter MSR every clock cycle and resets it to 0 whenever the processor is reset. See "Time Stamp Counter" in Chapter 17 of the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3B, for specific details of the time stamp counter behavior.
将处理器时间戳计数器(一个 64 位 MSR)的当前值加载到 EDX:EAX 寄存器中。EDX 寄存器装入该 MSR 的高 32 位,EAX 寄存器装入低 32 位。(在支持 Intel 64 架构的处理器上,RAX 和 RDX 各自的高 32 位被清零。)
处理器在每个时钟周期单调递增时间戳计数器 MSR,并在处理器复位时将其复位为 0。有关时间戳计数器行为的具体细节,参见 Intel 64 and IA-32 Architectures Software Developer's Manual 第 3B 卷第 17 章的"Time Stamp Counter"。