linux perf: how to interpret and find hotspots

Notice: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original: http://stackoverflow.com/questions/7031210/

Time: 2020-08-05 05:39:43  Source: igfitidea

c++ linux performance profiling perf

Asked by milianw

I tried out Linux's perf utility today and am having trouble interpreting its results. I'm used to valgrind's callgrind, which is of course a totally different approach from perf's sampling-based method.

What I did:

perf record -g -p $(pidof someapp)
perf report -g -n

Now I see something like this:

+     16.92%  kdevelop  libsqlite3.so.0.8.6               [.] 0x3fe57
+     10.61%  kdevelop  libQtGui.so.4.7.3                 [.] 0x81e344
+      7.09%  kdevelop  libc-2.14.so                      [.] 0x85804
+      4.96%  kdevelop  libQtGui.so.4.7.3                 [.] 0x265b69
+      3.50%  kdevelop  libQtCore.so.4.7.3                [.] 0x18608d
+      2.68%  kdevelop  libc-2.14.so                      [.] memcpy
+      1.15%  kdevelop  [kernel.kallsyms]                 [k] copy_user_generic_string
+      0.90%  kdevelop  libQtGui.so.4.7.3                 [.] QTransform::translate(double, double)
+      0.88%  kdevelop  libc-2.14.so                      [.] __libc_malloc
+      0.85%  kdevelop  libc-2.14.so                      [.] memcpy
...

Ok, these functions might be slow, but how do I find out where they are getting called from? As all these hotspots lie in external libraries I see no way to optimize my code.

Basically I am looking for some kind of callgraph annotated with accumulated cost, where my functions have a higher inclusive sampling cost than the library functions I call.

Is this possible with perf? If so - how?

Note: I found out that "E" unwraps the callgraph and gives somewhat more information. But the callgraph is often not deep enough and/or terminates randomly without giving information about how much time was spent where. Example:

-     10.26%  kate  libkatepartinterfaces.so.4.6.0  [.] Kate::TextLoader::readLine(int&...
     Kate::TextLoader::readLine(int&, int&)                                            
     Kate::TextBuffer::load(QString const&, bool&, bool&)                              
     KateBuffer::openFile(QString const&)                                              
     KateDocument::openFile()                                                          
     0x7fe37a81121c

Could it be an issue that I'm running on 64 bit? See also: http://lists.fedoraproject.org/pipermail/devel/2010-November/144952.html (I'm not using Fedora, but it seems to apply to all 64-bit systems).

Accepted answer by Martin Gerhardy

You should give hotspot a try: https://www.kdab.com/hotspot-gui-linux-perf-profiler/

It's available on github: https://github.com/KDAB/hotspot

For example, it can generate flame graphs for you.

[Figure: flamegraph]

Answer by Mike Dunlavey

Unless your program has very few functions and hardly ever calls a system function or I/O, profilers that sample the program counter won't tell you much, as you're discovering. In fact, the well-known profiler gprof was created specifically to try to address the uselessness of self-time-only profiling (not that it succeeded).

What actually works is something that samples the call stack (thereby finding out where the calls are coming from), on wall-clock time (thereby including I/O time), and reports by line or by instruction (thereby pinpointing the function calls that you should investigate, not just the functions they live in).

Furthermore, the statistic you should look for is percent of time on stack, not number of calls, not average inclusive function time. Especially not "self time". If a call instruction (or a non-call instruction) is on the stack 38% of the time, then if you could get rid of it, how much would you save? 38%! Pretty simple, no?
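
To make the "percent of time on stack" statistic concrete, here is a minimal Python sketch. The sample stacks and function names are made up for illustration (real samples would come from a stack-sampling profiler), and percent_on_stack is a hypothetical helper, not part of any tool.

```python
from collections import Counter

def percent_on_stack(samples):
    """Return {frame: percent of samples whose stack contains that frame}."""
    counts = Counter()
    for stack in samples:
        # set() so that a recursive frame is counted once per sample
        counts.update(set(stack))
    total = len(samples)
    return {frame: 100.0 * n / total for frame, n in counts.items()}

# Hypothetical stack samples, outermost caller first.
samples = [
    ["main", "load_file", "read_line", "memcpy"],
    ["main", "load_file", "read_line"],
    ["main", "load_file", "sqlite_query"],
    ["main", "paint"],
]

# Print the frames ordered by how often they were on the stack.
for frame, pct in sorted(percent_on_stack(samples).items(),
                         key=lambda kv: -kv[1]):
    print(f"{pct:6.1f}%  {frame}")
```

In this made-up data load_file is on the stack in 3 of 4 samples, so removing that call could save at most about 75% of the sampled time, no matter how little "self time" load_file has.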

An example of such a profiler is Zoom.

There are more issues to be understood on this subject.

Added: @caf got me hunting for the perf info, and since you included the command-line argument -g it does collect stack samples. Then you can get a call-tree report. Then if you make sure you're sampling on wall-clock time (so you get wait time as well as CPU time) then you've got almost what you need.

Answer by Mike Dunlavey

Ok, these functions might be slow, but how do I find out where they are getting called from? As all these hotspots lie in external libraries I see no way to optimize my code.

Are you sure that your application someapp (and possibly its dependent libraries) is built with the gcc option -fno-omit-frame-pointer? Something like this:

g++ -m64 -fno-omit-frame-pointer -g main.cpp

Answer by milianw

As of Linux 3.7, perf is finally able to use DWARF information to generate the callgraph:

perf record --call-graph dwarf -- yourapp
perf report -g graph --no-children

Neat, but the curses GUI is horrible compared to VTune, KCacheGrind or similar... I recommend trying out FlameGraphs instead, which is a pretty neat visualization: http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
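
As a rough illustration of what feeds a flame graph: the FlameGraph scripts work on "folded" stacks, one line per unique stack with frames joined by semicolons and followed by a sample count. The Python sketch below folds a list of already-parsed stacks; the sample data and the fold_stacks helper are assumptions for illustration, and in practice the folding is done by the stackcollapse scripts that ship with FlameGraph.

```python
from collections import Counter

def fold_stacks(samples):
    """Collapse identical stacks into 'frame;frame;frame count' lines."""
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]

# Hypothetical stack samples, outermost caller first.
samples = [
    ["main", "load_file", "read_line"],
    ["main", "load_file", "read_line"],
    ["main", "paint"],
]

for line in fold_stacks(samples):
    print(line)
```

Each output line becomes one tower in the flame graph, with its width proportional to the count.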

Note: In the report step, -g graph makes the output use easy-to-understand "relative to total" percentages, rather than "relative to parent" numbers. --no-children shows only self cost rather than inclusive cost, a feature that I also find invaluable.
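
The difference between these numbers can be sketched on a handful of made-up stack samples. This is only a simplified model of the idea, not how perf computes its report: self_and_inclusive and caller_breakdown are hypothetical helpers, and "of_parent" here means relative to the hits of the function being broken down.

```python
from collections import Counter

def self_and_inclusive(samples):
    """Return (self_pct, inclusive_pct): percent of all samples per function."""
    total = len(samples)
    # Self cost: the function is the innermost frame of the sample.
    self_hits = Counter(stack[-1] for stack in samples)
    # Inclusive cost: the function appears anywhere in the sampled stack.
    incl_hits = Counter()
    for stack in samples:
        incl_hits.update(set(stack))
    to_pct = lambda hits: {f: 100.0 * n / total for f, n in hits.items()}
    return to_pct(self_hits), to_pct(incl_hits)

def caller_breakdown(samples, func):
    """Split the self cost of func by its immediate caller."""
    total = len(samples)
    callers = Counter(stack[-2] for stack in samples
                      if len(stack) > 1 and stack[-1] == func)
    hits = sum(callers.values())
    return {caller: {"of_total": 100.0 * n / total,
                     "of_parent": 100.0 * n / hits}
            for caller, n in callers.items()}

# Hypothetical samples, outermost caller first (10 in total).
samples = ([["main", "load", "memcpy"]] * 3 + [["main", "paint", "memcpy"]]
           + [["main", "load"]] * 4 + [["main"]] * 2)

self_pct, incl_pct = self_and_inclusive(samples)
print("self:     ", self_pct)
print("inclusive:", incl_pct)
print("memcpy callers:", caller_breakdown(samples, "memcpy"))
```

Here memcpy has 40% self cost; of that, the calls from load account for 30% of all samples ("relative to total") but 75% of memcpy's own hits ("relative to parent").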

If you have a recent perf and an Intel CPU, also try out the LBR unwinder, which has much better performance and produces far smaller result files:

perf record --call-graph lbr -- yourapp

The downside here is that the call stack depth is more limited compared to the default DWARF unwinder configuration.

Answer by Ali

You can get a very detailed, source-level report with perf annotate; see "Source level analysis with perf annotate". It will look something like this (shamelessly stolen from the website):

------------------------------------------------
 Percent |   Source code & Disassembly of noploop
------------------------------------------------
         :
         :
         :
         :   Disassembly of section .text:
         :
         :   08048484 <main>:
         :   #include <string.h>
         :   #include <unistd.h>
         :   #include <sys/time.h>
         :
         :   int main(int argc, char **argv)
         :   {
    0.00 :    8048484:       55                      push   %ebp
    0.00 :    8048485:       89 e5                   mov    %esp,%ebp
[...]
    0.00 :    8048530:       eb 0b                   jmp    804853d <main+0xb9>
         :                           count++;
   14.22 :    8048532:       8b 44 24 2c             mov    0x2c(%esp),%eax
    0.00 :    8048536:       83 c0 01                add    $0x1,%eax
   14.78 :    8048539:       89 44 24 2c             mov    %eax,0x2c(%esp)
         :           memcpy(&tv_end, &tv_now, sizeof(tv_now));
         :           tv_end.tv_sec += strtol(argv[1], NULL, 10);
         :           while (tv_now.tv_sec < tv_end.tv_sec ||
         :                  tv_now.tv_usec < tv_end.tv_usec) {
         :                   count = 0;
         :                   while (count < 100000000UL)
   14.78 :    804853d:       8b 44 24 2c             mov    0x2c(%esp),%eax
   56.23 :    8048541:       3d ff e0 f5 05          cmp    $0x5f5e0ff,%eax
    0.00 :    8048546:       76 ea                   jbe    8048532 <main+0xae>
[...]

Don't forget to pass the -fno-omit-frame-pointer and -ggdb flags when you compile your code.