Is Linux clock_gettime() adequate for submicrosecond timing?

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow original URL: http://stackoverflow.com/questions/7935518/

Date: 2020-08-05 06:54:31  Source: igfitidea

Tags: linux, performance, ubuntu, profiling

Question by Crashworks

I need a high-resolution timer for the embedded profiler in the Linux build of our application. Our profiler measures scopes as small as individual functions, so it needs a timer precision of better than 25 nanoseconds.

Previously our implementation used inline assembly and the rdtsc operation to query the high-frequency timer from the CPU directly, but this is problematic and requires frequent recalibration.

So I tried using the clock_gettime function instead to query CLOCK_PROCESS_CPUTIME_ID. The docs allege this gives me nanosecond timing, but I found that the overhead of a single call to clock_gettime() was over 250 ns. That makes it impossible to time events 100 ns long, and having such high overhead on the timer function seriously drags down app performance, distorting the profiles to the point of uselessness. (We have hundreds of thousands of profiling nodes per second.)

Is there a way to call clock_gettime() that has less than ¼ μs overhead? Or is there some other way that I can reliably get the timestamp counter with <25 ns overhead? Or am I stuck with using rdtsc?

Below is the code I used to time clock_gettime().

#include <stdio.h>
#include <time.h>
#include <sys/time.h>

// calls gettimeofday() to return wall-clock time in seconds:
static double Get_FloatTime()
{
    timeval tv;
    gettimeofday( &tv, NULL );
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

enum { TESTRUNS = 1024*1024*4 };

int main()
{
    // time the high-frequency timer against the wall clock
    double fa = Get_FloatTime();
    timespec spec;
    clock_getres( CLOCK_PROCESS_CPUTIME_ID, &spec );
    printf( "CLOCK_PROCESS_CPUTIME_ID resolution: %ld sec %ld nano\n",
            (long)spec.tv_sec, (long)spec.tv_nsec );
    for ( int i = 0 ; i < TESTRUNS ; ++i )
    {
        clock_gettime( CLOCK_PROCESS_CPUTIME_ID, &spec );
    }
    double fb = Get_FloatTime();
    printf( "clock_gettime %d iterations : %.6f msec %.3f microsec / call\n",
            TESTRUNS, ( fb - fa ) * 1000.0, (( fb - fa ) * 1000000.0) / TESTRUNS );
    // and so on for CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_THREAD_CPUTIME_ID.
    return 0;
}

Results:

CLOCK_PROCESS_CPUTIME_ID resolution: 0 sec 1 nano
clock_gettime 8388608 iterations : 3115.784947 msec 0.371 microsec / call
CLOCK_MONOTONIC resolution: 0 sec 1 nano
clock_gettime 8388608 iterations : 2505.122119 msec 0.299 microsec / call
CLOCK_REALTIME resolution: 0 sec 1 nano
clock_gettime 8388608 iterations : 2456.186031 msec 0.293 microsec / call
CLOCK_THREAD_CPUTIME_ID resolution: 0 sec 1 nano
clock_gettime 8388608 iterations : 2956.633930 msec 0.352 microsec / call

This is on a standard Ubuntu kernel. The app is a port of a Windows app (where our rdtsc inline assembly works just fine).

Addendum:

Does x86-64 GCC have some intrinsic equivalent to __rdtsc(), so I can at least avoid inline assembly?

Accepted answer by David Schwartz

No. You'll have to use platform-specific code to do it. On x86 and x86-64, you can use 'rdtsc' to read the Time Stamp Counter.

Just port the rdtsc assembly you're using.

#include <stdint.h>

static __inline__ uint64_t rdtsc(void) {
  uint32_t lo, hi;
  /* cpuid is a serializing instruction: it stops earlier instructions from
     being reordered past the rdtsc below. eax = 0 selects cpuid leaf 0. */
  __asm__ __volatile__ (
      "xorl %%eax,%%eax \n\t"
      "cpuid"
      ::: "%rax", "%rbx", "%rcx", "%rdx");
  /* We cannot use "=A", since this would use %rax on x86_64 and return only the lower 32 bits of the TSC */
  __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
  return (uint64_t)hi << 32 | lo;
}
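The addendum asks whether GCC has an intrinsic for this. GCC and Clang on x86 do expose __rdtsc() and __rdtscp() through <x86intrin.h>, so the inline assembly above can be avoided. The following is a minimal sketch, assuming an x86-64 target whose CPU supports RDTSCP (look for the rdtscp flag in /proc/cpuinfo); read_tsc and read_tsc_ordered are hypothetical wrapper names. RDTSCP is not fully serializing like cpuid, but it does wait for earlier instructions to execute before reading the counter, and it is usually much cheaper than a cpuid+rdtsc pair.

```c
#include <stdint.h>
#include <x86intrin.h> /* GCC/Clang header exposing __rdtsc() and __rdtscp() */

/* Plain, unordered read: cheapest, but the CPU may execute it out of
   order relative to the surrounding instructions. */
static inline uint64_t read_tsc(void)
{
    return __rdtsc();
}

/* Ordered read: RDTSCP waits until all earlier instructions have executed
   before sampling the counter. It also stores IA32_TSC_AUX into aux
   (on Linux the kernel puts the CPU and NUMA node number there). */
static inline uint64_t read_tsc_ordered(void)
{
    unsigned int aux; /* receives IA32_TSC_AUX; unused here */
    return __rdtscp(&aux);
}
```

Both functions return raw TSC ticks, not nanoseconds; turning them into time still requires knowing the TSC frequency, which is the calibration problem the question describes.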

Answer by MartyTPS

You are calling clock_gettime with a control parameter, which means the API has to branch through an if-else tree to see what kind of time you want. I know you can't avoid that with this call, but see if you can dig into the system code and call whatever the kernel eventually calls directly. Also, I note that your measurement includes the loop overhead (the i++ and the conditional branch).
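The loop-overhead point can be addressed by timing an empty loop of the same length and subtracting it. This is a minimal sketch, assuming CLOCK_MONOTONIC as the reference clock; now_ns and clock_call_ns are hypothetical helper names. The result is still an estimate, since out-of-order execution can overlap the loop bookkeeping with the call itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Current CLOCK_MONOTONIC time in nanoseconds. */
static int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000 + ts.tv_nsec;
}

/* Estimated cost of one clock_gettime(id) call with the cost of an
   identical empty loop subtracted. The volatile counter stops the
   compiler from deleting the empty baseline loop. */
static double clock_call_ns(clockid_t id, long iterations)
{
    struct timespec ts;
    volatile long i;

    int64_t t0 = now_ns();
    for (i = 0; i < iterations; ++i)
        ; /* baseline: loop overhead only */
    int64_t t1 = now_ns();
    for (i = 0; i < iterations; ++i)
        clock_gettime(id, &ts);
    int64_t t2 = now_ns();

    double baseline = (double)(t1 - t0) / iterations;
    double measured = (double)(t2 - t1) / iterations;
    return measured - baseline;
}
```

Calling clock_call_ns(CLOCK_PROCESS_CPUTIME_ID, 1000000) gives a per-call figure that, unlike the question's numbers, excludes the i++ and branch cost.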

Answer by Brian Cain

I need a high-resolution timer for the embedded profiler in the Linux build of our application. Our profiler measures scopes as small as individual functions, so it needs a timer precision of better than 25 nanoseconds.

Have you considered oprofile or perf? You can use the performance counter hardware on your CPU to get profiling data without adding instrumentation to the code itself, and you can view the data per function or even per line of code. The "only" drawback is that it measures CPU time consumed rather than wall-clock time, so it's not appropriate for every investigation.

Answer by user2548100

Give clockid_t CLOCK_MONOTONIC_RAW a try?

CLOCK_MONOTONIC_RAW (since Linux 2.6.28; Linux-specific) Similar to CLOCK_MONOTONIC, but provides access to a raw hardware-based time that is not subject to NTP adjustments or the incremental adjustments performed by adjtime(3).

From Man7.org
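A minimal sketch of reading CLOCK_MONOTONIC_RAW as a nanosecond count, assuming Linux with glibc; raw_now_ns is a hypothetical helper name. Because this clock is not slewed by NTP, the difference between two readings reflects the raw hardware rate. Whether it is served from the fast user-space path varies by kernel version, so its per-call cost is worth measuring on the target system.

```c
#define _GNU_SOURCE /* exposes the Linux-specific CLOCK_MONOTONIC_RAW */
#include <stdint.h>
#include <time.h>

/* Current CLOCK_MONOTONIC_RAW time in nanoseconds, or -1 if the clock
   is unavailable (it requires Linux 2.6.28 or later). */
static int64_t raw_now_ns(void)
{
    struct timespec ts;
    if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts) != 0)
        return -1;
    return (int64_t)ts.tv_sec * 1000000000 + ts.tv_nsec;
}
```

A typical use is int64_t t0 = raw_now_ns(); followed by the code under test and int64_t dt = raw_now_ns() - t0;.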

Answer by ChrisWue

I ran some benchmarks on my system, a quad-core Xeon E5645 with a constant TSC running kernel 3.2.54, and the results were:

clock_gettime(CLOCK_MONOTONIC_RAW)       100ns/call
clock_gettime(CLOCK_MONOTONIC)           25ns/call
clock_gettime(CLOCK_REALTIME)            25ns/call
clock_gettime(CLOCK_PROCESS_CPUTIME_ID)  400ns/call
rdtsc (implementation @DavidSchwartz)    600ns/call

So it looks like, on a reasonably modern system, the rdtsc approach from the accepted answer is the worst route to go down.
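A minimal sketch of a harness for reproducing this kind of comparison on your own machine, assuming Linux with glibc; ns_per_call and report_all are hypothetical names. Absolute numbers depend on the CPU, kernel, clocksource, and virtualization, so they should not be expected to match the figures above.

```c
#define _GNU_SOURCE /* for CLOCK_MONOTONIC_RAW */
#include <stdio.h>
#include <time.h>

/* Average cost in ns of one clock_gettime(id) call over n calls,
   measured against CLOCK_MONOTONIC. Loop overhead is included. */
static double ns_per_call(clockid_t id, long n)
{
    struct timespec ts, t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < n; ++i)
        clock_gettime(id, &ts);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double total = (double)(t1.tv_sec - t0.tv_sec) * 1e9
                 + (double)(t1.tv_nsec - t0.tv_nsec);
    return total / n;
}

/* Print a per-clock table in the same shape as the results above. */
static void report_all(long n)
{
    static const struct { clockid_t id; const char *name; } clocks[] = {
        { CLOCK_MONOTONIC_RAW,      "CLOCK_MONOTONIC_RAW" },
        { CLOCK_MONOTONIC,          "CLOCK_MONOTONIC" },
        { CLOCK_REALTIME,           "CLOCK_REALTIME" },
        { CLOCK_PROCESS_CPUTIME_ID, "CLOCK_PROCESS_CPUTIME_ID" },
    };
    for (size_t i = 0; i < sizeof clocks / sizeof clocks[0]; ++i)
        printf("clock_gettime(%s)  %.1f ns/call\n",
               clocks[i].name, ns_per_call(clocks[i].id, n));
}
```

Calling report_all(10000000) from main prints one line per clock.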

Answer by BeeOnRope

It's hard to give a globally applicable answer because the hardware and software implementations vary widely.

However, yes, most modern platforms will have a suitable clock_gettime call that is implemented purely in user space using the VDSO mechanism, and will reliably take something like 20 to 30 nanoseconds to complete.
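One way to see the VDSO at work is to compare glibc's clock_gettime(), which uses the VDSO fast path when the kernel provides one for the requested clock, with the same request made as an explicit system call via syscall(SYS_clock_gettime, ...). This is a minimal sketch, assuming x86-64 Linux with glibc 2.16 or later (for getauxval); vdso_mapped, mono_ns and compare_paths are hypothetical helper names.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/auxv.h>

/* 1 if the kernel mapped a VDSO into this process, 0 otherwise. */
static int vdso_mapped(void)
{
    return getauxval(AT_SYSINFO_EHDR) != 0;
}

static int64_t mono_ns(void)
{
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return (int64_t)t.tv_sec * 1000000000 + t.tv_nsec;
}

/* Time n calls of clock_gettime(id) through glibc (VDSO when available)
   and n calls forced through the kernel, storing ns/call for each path. */
static void compare_paths(clockid_t id, long n, double *lib_ns, double *sys_ns)
{
    struct timespec ts;
    int64_t t0 = mono_ns();
    for (long i = 0; i < n; ++i)
        clock_gettime(id, &ts);
    int64_t t1 = mono_ns();
    for (long i = 0; i < n; ++i)
        syscall(SYS_clock_gettime, id, &ts);
    int64_t t2 = mono_ns();
    *lib_ns = (double)(t1 - t0) / n;
    *sys_ns = (double)(t2 - t1) / n;
}
```

If lib_ns comes out far below sys_ns, that clock ID is being served from user space; if the two are close, it is probably falling back to a real system call.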

Internally, this is using rdtsc or rdtscp for the fine-grained portion of the timekeeping, plus adjustments to keep it in sync with wall-clock time (depending on the clock you choose) and a multiplication to convert from whatever units rdtsc has on your platform to nanoseconds.
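The multiplication step is typically done in fixed point as ns = (cycles * mult) >> shift, with mult = (10^9 << shift) / tsc_hz, so no division happens on the fast path. The sketch below is a simplified model of that idea, not the kernel's actual code: TSC_SHIFT, tsc_mult and cycles_to_ns are hypothetical names, and the real implementation also adds a base time and scales only the cycle delta since its last update.

```c
#include <stdint.h>

enum { TSC_SHIFT = 32 }; /* hypothetical fixed-point shift */

/* Fixed-point multiplier converting cycles at tsc_hz into nanoseconds. */
static uint64_t tsc_mult(uint64_t tsc_hz)
{
    return (1000000000ULL << TSC_SHIFT) / tsc_hz;
}

/* The product must fit in 64 bits, so delta_cycles has to stay small
   (a few seconds' worth at GHz rates with this shift). */
static uint64_t cycles_to_ns(uint64_t delta_cycles, uint64_t mult)
{
    return (delta_cycles * mult) >> TSC_SHIFT;
}
```

For a hypothetical 2.4 GHz TSC, tsc_mult(2400000000) is 1789569706, and 2,400,000 cycles (about 1 ms) converts to 999,999 ns after truncation.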

Not all of the clocks offered by clock_gettime will implement this fast method, and it's not always obvious which ones do. Usually CLOCK_MONOTONIC is a good option, but you should test this on your own system.
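When a clock turns out slow, one system property worth checking is the kernel's active clocksource. The fast user-space path relies on a clocksource that can be read cheaply from user space, such as tsc; with sources like hpet or acpi_pm, clock_gettime is commonly much slower. A minimal sketch, assuming the standard Linux sysfs path below is present; current_clocksource is a hypothetical helper name.

```c
#include <stdio.h>
#include <string.h>

/* Read the kernel's active clocksource (e.g. "tsc", "hpet", "acpi_pm")
   into buf. Returns 0 on success, -1 if the file cannot be read. */
static int current_clocksource(char *buf, size_t len)
{
    FILE *f = fopen("/sys/devices/system/clocksource/clocksource0/current_clocksource", "r");
    if (!f)
        return -1;
    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0'; /* strip the trailing newline */
    return 0;
}
```

The same file can also be inspected from a shell with cat; the available alternatives are listed in available_clocksource in the same directory.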
