C: How do I measure program execution time on an ARM Cortex-A8 processor?

Question

Asked by HaggarTheHorrible

I'm using an ARM Cortex-A8 based processor, the i.MX515, running the Ubuntu 9.10 Linux distribution. I'm running a very large application written in C, and I use gettimeofday() to measure how long my application takes.

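#include <sys/time.h>

int main(void)
{
    struct timeval start, end;

    gettimeofday(&start, NULL);   /* timestamp before the work */
    /* .... application code .... */
    gettimeofday(&end, NULL);     /* timestamp after the work */

    return 0;
}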

This method was sufficient to see which blocks of my application were taking how much time. But now that I'm trying to optimize my code very thoroughly, the gettimeofday() method shows a lot of fluctuation between successive runs (before and after my optimizations), so I can't determine the actual execution times, and hence the impact of my improvements.

Can anyone suggest what I should do?

If the answer is to access the cycle counter (an idea suggested on the ARM website for the Cortex-M3), can anyone point me to some code that shows the steps I have to follow to access the timer registers on the Cortex-A8?

If this method is not very accurate then please suggest some alternatives.

Thanks

Follow ups

Follow up 1: I wrote the following program with CodeSourcery. The executable was generated, but when I tried running it on the board I got an "Illegal instruction" message :(

#include <stdio.h>
#include <stdint.h>

static inline unsigned int get_cyclecount (void)
{
    unsigned int value;
    // Read CCNT Register
    asm volatile ("MRC p15, 0, %0, c9, c13, 0\t\n": "=r"(value));
    return value;
}

static inline void init_perfcounters (int32_t do_reset, int32_t enable_divider)
{
    // in general enable all counters (including cycle counter)
    int32_t value = 1;

    // perform reset:
    if (do_reset)
    {
        value |= 2;     // reset all counters to zero.
        value |= 4;     // reset cycle counter to zero.
    }

    if (enable_divider)
        value |= 8;     // enable "by 64" divider for CCNT.

    value |= 16;

    // program the performance-counter control-register:
    asm volatile ("MCR p15, 0, %0, c9, c12, 0\t\n" :: "r"(value));

    // enable all counters:
    asm volatile ("MCR p15, 0, %0, c9, c12, 1\t\n" :: "r"(0x8000000f));

    // clear overflows:
    asm volatile ("MCR p15, 0, %0, c9, c12, 3\t\n" :: "r"(0x8000000f));
}

int main()
{
    /* enable user-mode access to the performance counter*/
    asm ("MCR p15, 0, %0, C9, C14, 0\n\t" :: "r"(1));

    /* disable counter overflow interrupts (just in case)*/
    asm ("MCR p15, 0, %0, C9, C14, 2\n\t" :: "r"(0x8000000f));

    init_perfcounters (1, 0);

    // measure the counting overhead:
    unsigned int overhead = get_cyclecount();
    overhead = get_cyclecount() - overhead;

    unsigned int t = get_cyclecount();

    // do some stuff here..
    printf("\nHello World!!");

    t = get_cyclecount() - t;

    printf ("function took exactly %u cycles (including function call) ", t - overhead);

    get_cyclecount();

    return 0;
}

Follow up 2: I wrote to Freescale for support, and they sent back the following reply and a program (I did not understand much of it):

Here is what we can help you with right now: I am attaching a code example that sends a stream using the UART. From your code, it seems that you are not initializing the MPU correctly.

#include <stdio.h>
#include <stdlib.h>

#define BIT13 0x02000

#define R32   volatile unsigned long *
#define R16   volatile unsigned short *
#define R8   volatile unsigned char *

#define reg32_UART1_USR1     (*(R32)(0x73FBC094))
#define reg32_UART1_UTXD     (*(R32)(0x73FBC040))

#define reg16_WMCR         (*(R16)(0x73F98008))
#define reg16_WSR              (*(R16)(0x73F98002))

#define AIPS_TZ1_BASE_ADDR             0x70000000
#define IOMUXC_BASE_ADDR               AIPS_TZ1_BASE_ADDR+0x03FA8000

typedef unsigned long  U32;
typedef unsigned short U16;
typedef unsigned char  U8;


void serv_WDOG()
{
    reg16_WSR = 0x5555;
    reg16_WSR = 0xAAAA;
}


void outbyte(char ch)
{
    while( !(reg32_UART1_USR1 & BIT13)  );

    reg32_UART1_UTXD = ch ;
}


void _init()
{

}



void pause(int time) 
{
    int i;

    for ( i=0 ; i < time ;  i++);

} 


void led()
{

//Write to Data register [DR]

    *(R32)(0x73F88000) = 0x00000040;  // 1 --> GPIO 2_6 
    pause(500000);

    *(R32)(0x73F88000) = 0x00000000;  // 0 --> GPIO 2_6 
    pause(500000);


}

void init_port_for_led()
{


//GPIO 2_6   [73F8_8000] EIM_D22  (AC11)    DIAG_LED_GPIO
//ALT1 mode
//IOMUXC_SW_MUX_CTL_PAD_EIM_D22  [+0x0074]
//MUX_MODE [2:0]  = 001: Select mux mode: ALT1 mux port: GPIO[6] of instance: gpio2.

 // IOMUXC control for GPIO2_6

*(R32)(IOMUXC_BASE_ADDR + 0x74) = 0x00000001; 

//Write to DIR register [DIR]

*(R32)(0x73F88004) = 0x00000040;  // 1 : GPIO 2_6  - output

*(R32)(0x83FDA090) = 0x00003001;
*(R32)(0x83FDA090) = 0x00000007;


}

int main ()
{
  int k = 0x12345678 ;

    reg16_WMCR = 0 ;                        // disable watchdog
    init_port_for_led() ;

    while(1)
    {
        printf("Hello word %x\n\r", k ) ;
        serv_WDOG() ;
        led() ;

    }

    return(1) ;
}

Answer 1

Answered by Nils Pipenbrinck

Accessing the performance counters isn't difficult, but you have to enable them from kernel-mode. By default the counters are disabled.

In a nutshell, you have to execute the following two lines inside the kernel. Either put them in a loadable module or just add them somewhere in the board-init code:

  /* enable user-mode access to the performance counter*/
  asm ("MCR p15, 0, %0, C9, C14, 0\n\t" :: "r"(1));

  /* disable counter overflow interrupts (just in case)*/
  asm ("MCR p15, 0, %0, C9, C14, 2\n\t" :: "r"(0x8000000f));

Once you have done this, the cycle counter will start incrementing for each cycle. Overflows of the register will go unnoticed and don't cause any problems (except that they might mess up your measurements).

Now you want to access the cycle-counter from the user-mode:

We start with a function that reads the register:

static inline unsigned int get_cyclecount (void)
{
  unsigned int value;
  // Read CCNT Register
  asm volatile ("MRC p15, 0, %0, c9, c13, 0\t\n": "=r"(value));  
  return value;
}

And you most likely want to reset and set the divider as well:

static inline void init_perfcounters (int32_t do_reset, int32_t enable_divider)
{
  // in general enable all counters (including cycle counter)
  int32_t value = 1;

  // perform reset:
  if (do_reset)
  {
    value |= 2;     // reset all counters to zero.
    value |= 4;     // reset cycle counter to zero.
  } 

  if (enable_divider)
    value |= 8;     // enable "by 64" divider for CCNT.

  value |= 16;

  // program the performance-counter control-register:
  asm volatile ("MCR p15, 0, %0, c9, c12, 0\t\n" :: "r"(value));  

  // enable all counters:  
  asm volatile ("MCR p15, 0, %0, c9, c12, 1\t\n" :: "r"(0x8000000f));  

  // clear overflows:
  asm volatile ("MCR p15, 0, %0, c9, c12, 3\t\n" :: "r"(0x8000000f));
}

do_reset will set the cycle counter to zero. Easy as that.

enable_divider will enable the 1/64 cycle divider. Without this flag set you'll be measuring each cycle. With it enabled, the counter is incremented once every 64 cycles. This is useful if you want to measure long durations that would otherwise cause the counter to overflow.

How to use it:

  // init counters:
  init_perfcounters (1, 0); 

  // measure the counting overhead:
  unsigned int overhead = get_cyclecount();
  overhead = get_cyclecount() - overhead;    

  unsigned int t = get_cyclecount();

  // do some stuff here..
  call_my_function();

  t = get_cyclecount() - t;

  printf ("function took exactly %u cycles (including function call) ", t - overhead);

Should work on all Cortex-A8 CPUs.

Oh - and some notes:

Using these counters, you'll measure the exact time between the two calls to get_cyclecount(), including everything spent in other processes or in the kernel. There is no way to restrict the measurement to your process or to a single thread.

Also, calling get_cyclecount() isn't free. It compiles to a single asm instruction, but moves from the co-processor stall the entire ARM pipeline. The overhead is quite high and can skew your measurement. Fortunately the overhead is also fixed, so you can measure it and subtract it from your timings.

In my example I did that for every measurement. Don't do this in practice. Sooner or later an interrupt will occur between the two calls and skew your measurements even further. I suggest that you measure the overhead a couple of times on an idle system, ignore all outliers and use a fixed constant instead.

Answer 2

Answered by Praveen S

You need to profile your code with performance analysis tools before and after your optimizations.

acct is both a command-line tool and a function that you can use to monitor your resources. You can search for more on how to use it and how to view the data file that acct generates.

I will update this post with other opensource performance analysis tools.

gprof is another such tool. Please check its documentation.

Answer 3

Answered by Badmanton Casio

To expand on the answer by Nils now that a couple of years have passed: an easy way to access these counters is to build the kernel with gator. It then reports counter values for use with Streamline, which is ARM's performance analysis tool.

It will display each function on a timeline (giving you a high-level overview of how your system is performing), showing you exactly how long it took to execute, along with % CPU that it has taken up. You can compare this with charts of each counter that you've set it up to collect and follow CPU intensive tasks down to source code level.

Streamline works with all the Cortex-A series processors.

Answer 4

Answered by Digikata

I've worked with a toolchain for ARM7 that had an instruction-level simulator. Running apps in it could give timings for individual lines and/or asm instructions. That was great for micro-optimizing a given routine. That approach probably isn't appropriate for optimizing a whole app or a whole system, though.
