Why is clock_gettime so erratic?

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must do so under the same license and attribute it to the original authors (not the translator). Original: http://stackoverflow.com/questions/6814792/


Tags: linux, time, profiling

Asked by sinharaj

Intro

  • Section Old Question contains the initial question (Further Investigation and Conclusion have been added since).

  • Skip to the section Further Investigation below for a detailed comparison of the different timing methods (rdtsc, clock_gettime and QueryThreadCycleTime).

  • I believe the erratic behaviour of CGT can be attributed to either a buggy kernel or a buggy CPU (see section Conclusion).

  • The code used for testing is at the bottom of this question (see section Appendix).

  • Apologies for the length.



Old Question

In short: I am using clock_gettime to measure the execution time of many code segments. I am seeing very inconsistent measurements between separate runs, and the method has an extremely high standard deviation compared to the other methods (see Explanation below).

Question: Is there a reason why clock_gettime would give such inconsistent measurements compared to other methods? Is there an alternative method with the same resolution that accounts for thread idle time?

Explanation: I am trying to profile a number of small parts of C code. The execution time of each code segment is no more than a couple of microseconds. In a single run, each code segment executes some hundreds of times, which produces runs × hundreds of measurements.

I also have to measure only the time the thread actually spends executing (which is why rdtsc is not suitable), and I need a high resolution (which is why times is not suitable).

I've tried the following methods:

  • rdtsc (on Linux and Windows),

  • clock_gettime (with CLOCK_THREAD_CPUTIME_ID; on Linux), and

  • QueryThreadCycleTime (on Windows).

Methodology: The analysis was performed over 25 runs. In each run, every code segment repeats 101 times, so I have 25 × 101 = 2525 measurements. I then look at a histogram of the measurements and calculate some basic statistics (mean, std. dev., median, mode, min, and max).

I do not show here how I measured the 'similarity' of the three methods; it simply involved a basic comparison of the proportion of time spent in each code segment ('proportion' meaning that the times are normalised). I then look at the raw differences in these proportions. This comparison showed that 'rdtsc', 'QTCT', and 'CGT' all measure the same proportions when averaged over the 25 runs. However, the results below show that 'CGT' has a very large standard deviation, which makes it unusable in my use case.

Results:

A comparison of clock_gettime with rdtsc for the same code segment (25 runs of 101 measurements = 2525 readings):

  • clock_gettime:

    • 1881 measurements of 11 ns,
    • 595 measurements were (distributed almost normally) between 3369 and 3414 ns,
    • 2 measurements of 11680 ns,
    • 1 measurement of 1506022 ns, and
    • the rest is between 900 and 5000 ns.

    • Min: 11 ns

    • Max: 1506022 ns
    • Mean: 1471.862 ns
    • Median: 11 ns
    • Mode: 11 ns
    • Stddev: 29991.034
  • rdtsc (note: no context switches occurred during this run, but when one does happen, it usually shows up as a single measurement of roughly 30000 ticks):

    • 1178 measurements between 274 and 325 ticks,
    • 306 measurements between 326 and 375 ticks,
    • 910 measurements between 376 and 425 ticks,
    • 129 measurements between 426 and 990 ticks,
    • 1 measurement of 1240 ticks, and
    • 1 measurement of 1256 ticks.

    • Min: 274 ticks

    • Max: 1256 ticks
    • Mean: 355.806 ticks
    • Median: 333 ticks
    • Mode: 376 ticks
    • Stddev: 83.896
Discussion:

  • rdtsc gives very similar results on both Linux and Windows. It has an acceptable standard deviation--it is actually quite consistent/stable. However, it does not account for thread idle time, so context switches make the measurements erratic (on Windows I have observed this quite often: a code segment averaging 1000 ticks or so will occasionally take ~30000 ticks--almost certainly because of pre-emption).

  • QueryThreadCycleTime gives very consistent measurements--i.e. a much lower standard deviation than rdtsc. When no context switches happen, this method is almost identical to rdtsc.

  • clock_gettime, on the other hand, produces extremely inconsistent results (not just between runs, but also between measurements). The standard deviations are extreme compared to rdtsc.

I hope the statistics are sound. But what could cause such a discrepancy between the two methods? Of course, there is caching, CPU/core migration, and so on, but none of that should produce differences of this magnitude between 'rdtsc' and 'clock_gettime'. What is going on?



Further Investigation

I have investigated this a bit further. I have done two things:

  1. Measured the overhead of just calling clock_gettime(CLOCK_THREAD_CPUTIME_ID, &t) (see code 1 in the Appendix), and

  2. called clock_gettime in a plain loop and stored the readings into an array (see code 2 in the Appendix). I then measure the delta times (the difference between successive readings, which should roughly correspond to the overhead of a clock_gettime call).

I have measured it on two different computers with two different Linux Kernel versions:

CGT:

  1. CPU: Core 2 Duo L9400 @ 1.86GHz

    Kernel: Linux 2.6.40-4.fc15.i686 #1 SMP Fri Jul 29 18:54:39 UTC 2011 i686 i686 i386

    Results:

    • Estimated clock_gettime overhead: between 690 and 710 ns
    • Delta times:

      • Average: 815.22 ns
      • Median: 713 ns
      • Mode: 709 ns
      • Min: 698 ns
      • Max: 23359 ns
      • Histogram (left-out ranges have frequencies of 0):

              Range       |  Frequency
        ------------------+-----------
          697 < x ≤ 800   ->     78111  <-- cached?
          800 < x ≤ 1000  ->     16412
         1000 < x ≤ 1500  ->         3
         1500 < x ≤ 2000  ->      4836  <-- uncached?
         2000 < x ≤ 3000  ->       305
         3000 < x ≤ 5000  ->       161
         5000 < x ≤ 10000 ->       105
        10000 < x ≤ 15000 ->        53
        15000 < x ≤ 20000 ->         8
        20000 < x         ->         5
        
  2. CPU: 4 × Dual Core AMD Opteron Processor 275

    Kernel: Linux 2.6.26-2-amd64 #1 SMP Sun Jun 20 20:16:30 UTC 2010 x86_64 GNU/Linux

    Results:

    • Estimated clock_gettime overhead: between 279 and 283 ns
    • Delta times:

      • Average: 320.00 ns
      • Median: 1 ns
      • Mode: 1 ns
      • Min: 1 ns
      • Max: 3495529 ns
      • Histogram (left-out ranges have frequencies of 0):

              Range         |  Frequency
        --------------------+-----------
                  x ≤ 1     ->     86738  <-- cached?
            282 < x ≤ 300   ->     13118  <-- uncached?
            300 < x ≤ 440   ->        78
           2000 < x ≤ 5000  ->        52
           5000 < x ≤ 30000 ->         5
        3000000 < x         ->         8
        

RDTSC:

Related code: rdtsc_delta.c and rdtsc_overhead.c.

  1. CPU: Core 2 Duo L9400 @ 1.86GHz

    Kernel: Linux 2.6.40-4.fc15.i686 #1 SMP Fri Jul 29 18:54:39 UTC 2011 i686 i686 i386

    Results:

    • Estimated overhead: between 39-42 ticks
    • Delta times:

      • Average: 52.46 ticks
      • Median: 42 ticks
      • Mode: 42 ticks
      • Min: 35 ticks
      • Max: 28700 ticks
      • Histogram (left-out ranges have frequencies of 0):

              Range       |  Frequency
        ------------------+-----------
           34 < x ≤ 35    ->     16240  <-- cached?
           41 < x ≤ 42    ->     63585  <-- uncached? (small difference)
           48 < x ≤ 49    ->     19779  <-- uncached?
           49 < x ≤ 120   ->       195
         3125 < x ≤ 5000  ->       144
         5000 < x ≤ 10000 ->        45
        10000 < x ≤ 20000 ->         9
        20000 < x         ->         2
        
  2. CPU: 4 × Dual Core AMD Opteron Processor 275

    Kernel: Linux 2.6.26-2-amd64 #1 SMP Sun Jun 20 20:16:30 UTC 2010 x86_64 GNU/Linux

    Results:

    • Estimated overhead: between 13.7-17.0 ticks
    • Delta times:

      • Average: 35.44 ticks
      • Median: 16 ticks
      • Mode: 16 ticks
      • Min: 14 ticks
      • Max: 16372 ticks
      • Histogram (left-out ranges have frequencies of 0):

              Range       |  Frequency
        ------------------+-----------
           13 < x ≤ 14    ->       192
           14 < x ≤ 21    ->     78172  <-- cached?
           21 < x ≤ 50    ->     10818
           50 < x ≤ 103   ->     10624  <-- uncached?
         5825 < x ≤ 6500  ->        88
         6500 < x ≤ 8000  ->        88
         8000 < x ≤ 10000 ->        11
        10000 < x ≤ 15000 ->         4
        15000 < x ≤ 16372 ->         2
        

QTCT:

Related code: qtct_delta.c and qtct_overhead.c.

  1. CPU: Core 2 6700 @ 2.66GHz

    Kernel: Windows 7 64-bit

    Results:

    • Estimated overhead: between 890-940 ticks
    • Delta times:

      • Average: 1057.30 ticks
      • Median: 890 ticks
      • Mode: 890 ticks
      • Min: 880 ticks
      • Max: 29400 ticks
      • Histogram (left-out ranges have frequencies of 0):

              Range       |  Frequency
        ------------------+-----------
          879 < x ≤ 890   ->     71347  <-- cached?
          895 < x ≤ 1469  ->       844
         1469 < x ≤ 1600  ->     27613  <-- uncached?
         1600 < x ≤ 2000  ->        55
         2000 < x ≤ 4000  ->        86
         4000 < x ≤ 8000  ->        43
         8000 < x ≤ 16000 ->        10
        16000 < x         ->         1
        


Conclusion

I believe the answer to my question is a buggy clock_gettime implementation on one of my machines (the one with AMD CPUs and an old Linux kernel).

The CGT results from the AMD machine with the old kernel show some extreme readings. Looking at the delta times, the most frequent delta is 1 ns, which would mean a call to clock_gettime took less than a nanosecond! It also produced a number of extraordinarily large deltas (of more than 3000000 ns)! This looks like erroneous behaviour. (Unaccounted-for core migrations, maybe?)

Remarks:

  • The overhead of CGT and QTCT is quite big.

  • It is also difficult to account for their overhead, because CPU caching seems to make quite a big difference.

  • Maybe sticking to RDTSC, locking the process to one core, and assigning real-time priority is the most accurate way to tell how many cycles a piece of code used...



Appendix

Code 1: clock_gettime_overhead.c

#include <time.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h> /* for atoi */

/* Compiled & executed with:

    gcc clock_gettime_overhead.c -O0 -lrt -o clock_gettime_overhead
    ./clock_gettime_overhead 100000
*/

int main(int argc, char **args) {
    struct timespec tstart, tend, dummy;
    int n, N;
    N = atoi(args[1]);
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &tstart);
    for (n = 0; n < N; ++n) {
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
    }
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &tend);
    printf("Estimated overhead: %lld ns\n",
            ((int64_t) tend.tv_sec * 1000000000 + (int64_t) tend.tv_nsec
                    - ((int64_t) tstart.tv_sec * 1000000000
                            + (int64_t) tstart.tv_nsec)) / N / 10);
    return 0;
}

Code 2: clock_gettime_delta.c

#include <time.h>
#include <stdio.h>
#include <stdint.h>

/* Compiled & executed with:

    gcc clock_gettime_delta.c -O0 -lrt -o clock_gettime_delta
    ./clock_gettime_delta > results
*/

#define N 100000

int main(int argc, char **args) {
    struct timespec sample, results[N];
    int n;
    for (n = 0; n < N; ++n) {
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &sample);
        results[n] = sample;
    }
    printf("%s\t%s\n", "Absolute time", "Delta");
    for (n = 1; n < N; ++n) {
        printf("%lld\t%lld\n",
               (int64_t) results[n].tv_sec * 1000000000 + 
                   (int64_t)results[n].tv_nsec,
               (int64_t) results[n].tv_sec * 1000000000 + 
                   (int64_t) results[n].tv_nsec - 
                   ((int64_t) results[n-1].tv_sec * 1000000000 + 
                        (int64_t)results[n-1].tv_nsec));
    }
    return 0;
}

Code 3: rdtsc.h

#include <stdint.h>
#if defined(_MSC_VER)
#   include <intrin.h> /* for __rdtsc() */
#endif

static uint64_t rdtsc() {
#if defined(__GNUC__)
#   if defined(__i386__)
    uint64_t x;
    __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
    return x;
#   elif defined(__x86_64__)
    uint32_t hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)lo) | ((uint64_t)hi << 32);
#   else
#       error Unsupported architecture.
#   endif
#elif defined(_MSC_VER)
    return __rdtsc();
#else
#   error Other compilers not supported...
#endif
}

Code 4: rdtsc_delta.c

#include <stdio.h>
#include <stdint.h>
#include "rdtsc.h"

/* Compiled & executed with:

    gcc rdtsc_delta.c -O0 -o rdtsc_delta
    ./rdtsc_delta > rdtsc_delta_results

Windows:

    cl -Od rdtsc_delta.c
    rdtsc_delta.exe > windows_rdtsc_delta_results
*/

#define N 100000

int main(int argc, char **args) {
    uint64_t results[N];
    int n;
    for (n = 0; n < N; ++n) {
        results[n] = rdtsc();
    }
    printf("%s\t%s\n", "Absolute time", "Delta");
    for (n = 1; n < N; ++n) {
        printf("%llu\t%llu\n", (unsigned long long)results[n],
               (unsigned long long)(results[n] - results[n-1]));
    }
    return 0;
}

Code 5: rdtsc_overhead.c

#include <time.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h> /* for atoi */
#include "rdtsc.h"

/* Compiled & executed with:

    gcc rdtsc_overhead.c -O0 -lrt -o rdtsc_overhead
    ./rdtsc_overhead 1000000 > rdtsc_overhead_results

Windows:

    cl -Od rdtsc_overhead.c
    rdtsc_overhead.exe 1000000 > windows_rdtsc_overhead_results
*/

int main(int argc, char **args) {
    uint64_t tstart, tend, dummy;
    int n, N;
    N = atoi(args[1]);
    tstart = rdtsc();
    for (n = 0; n < N; ++n) {
        dummy = rdtsc();
        dummy = rdtsc();
        dummy = rdtsc();
        dummy = rdtsc();
        dummy = rdtsc();
        dummy = rdtsc();
        dummy = rdtsc();
        dummy = rdtsc();
        dummy = rdtsc();
        dummy = rdtsc();
    }
    tend = rdtsc();
    printf("%G\n", (double)(tend - tstart)/N/10);
    return 0;
}

Code 6: qtct_delta.c

#include <stdio.h>
#include <stdint.h>
#include <Windows.h>

/* Compiled & executed with:

    cl -Od qtct_delta.c
    qtct_delta.exe > windows_qtct_delta_results
*/

#define N 100000

int main(int argc, char **args) {
    uint64_t ticks, results[N];
    int n;
    for (n = 0; n < N; ++n) {
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
        results[n] = ticks;
    }
    printf("%s\t%s\n", "Absolute time", "Delta");
    for (n = 1; n < N; ++n) {
        printf("%llu\t%llu\n", (unsigned long long)results[n],
               (unsigned long long)(results[n] - results[n-1]));
    }
    return 0;
}

Code 7: qtct_overhead.c

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h> /* for atoi */
#include <Windows.h>

/* Compiled & executed with:

    cl -Od qtct_overhead.c
    qtct_overhead.exe 1000000
*/

int main(int argc, char **args) {
    uint64_t tstart, tend, ticks;
    int n, N;
    N = atoi(args[1]);
    QueryThreadCycleTime(GetCurrentThread(), &tstart);
    for (n = 0; n < N; ++n) {
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
    }
    QueryThreadCycleTime(GetCurrentThread(), &tend);
    printf("%G\n", (double)(tend - tstart)/N/10);
    return 0;
}

Answered by TomH

Well, as CLOCK_THREAD_CPUTIME_ID is implemented using rdtsc, it will likely suffer from the same problems. The manual page for clock_gettime says:

The CLOCK_PROCESS_CPUTIME_ID and CLOCK_THREAD_CPUTIME_ID clocks are realized on many platforms using timers from the CPUs (TSC on i386, AR.ITC on Itanium). These registers may differ between CPUs and as a consequence these clocks may return bogus results if a process is migrated to another CPU.

Which sounds like it might explain your problems? Maybe you should lock your process to one CPU to get stable results?

Answered by Mike Dunlavey

When you have a highly skewed distribution that cannot go negative, you're going to see large discrepancies between mean, median, and mode. The standard deviation is fairly meaningless for such a distribution.

It's usually a good idea to log-transform it. That will make it "more normal".