Linux perf: how to interpret and find hotspots

Question

Asked by milianw

I tried out Linux's perf utility today and am having trouble interpreting its results. I'm used to valgrind's callgrind, which is of course a totally different approach from perf's sampling-based method.

What I did:

perf record -g -p $(pidof someapp)
perf report -g -n

Now I see something like this:

+     16.92%  kdevelop  libsqlite3.so.0.8.6               [.] 0x3fe57
+     10.61%  kdevelop  libQtGui.so.4.7.3                 [.] 0x81e344
+      7.09%  kdevelop  libc-2.14.so                      [.] 0x85804
+      4.96%  kdevelop  libQtGui.so.4.7.3                 [.] 0x265b69
+      3.50%  kdevelop  libQtCore.so.4.7.3                [.] 0x18608d
+      2.68%  kdevelop  libc-2.14.so                      [.] memcpy
+      1.15%  kdevelop  [kernel.kallsyms]                 [k] copy_user_generic_string
+      0.90%  kdevelop  libQtGui.so.4.7.3                 [.] QTransform::translate(double, double)
+      0.88%  kdevelop  libc-2.14.so                      [.] __libc_malloc
+      0.85%  kdevelop  libc-2.14.so                      [.] memcpy
...

Ok, these functions might be slow, but how do I find out where they are getting called from? As all these hotspots lie in external libraries I see no way to optimize my code.

Basically I am looking for some kind of callgraph annotated with accumulated cost, where my functions have a higher inclusive sampling cost than the library functions I call.

Is this possible with perf? If so, how?

Note: I found out that "E" expands the call graph and gives somewhat more information. But the call graph is often not deep enough and/or terminates randomly without telling me how much time was spent where. Example:

-     10.26%  kate  libkatepartinterfaces.so.4.6.0  [.] Kate::TextLoader::readLine(int&...
     Kate::TextLoader::readLine(int&, int&)                                            
     Kate::TextBuffer::load(QString const&, bool&, bool&)                              
     KateBuffer::openFile(QString const&)                                              
     KateDocument::openFile()                                                          
     0x7fe37a81121c

Could it be an issue that I'm running on 64-bit? See also: http://lists.fedoraproject.org/pipermail/devel/2010-November/144952.html (I'm not using Fedora, but it seems to apply to all 64-bit systems).

Answer 1

Accepted answer by Martin Gerhardy

You should give hotspot a try: https://www.kdab.com/hotspot-gui-linux-perf-profiler/

It's available on github: https://github.com/KDAB/hotspot

It can, for example, generate flame graphs for you.

Answer 2

Answer by Mike Dunlavey

Unless your program has very few functions and hardly ever calls a system function or does I/O, profilers that sample the program counter won't tell you much, as you're discovering. In fact, the well-known profiler gprof was created specifically to try to address the uselessness of self-time-only profiling (not that it succeeded).

What actually works is something that samples the call stack (thereby finding out where the calls are coming from), on wall-clock time (thereby including I/O time), and reports by line or by instruction (thereby pinpointing the function calls you should investigate, not just the functions they live in).

Furthermore, the statistic you should look for is the percent of time on stack: not the number of calls, not average inclusive function time, and especially not "self time". If a call instruction (or a non-call instruction) is on the stack 38% of the time, then if you could get rid of it, how much would you save? 38%! Pretty simple, no?
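To make "percent of time on stack" concrete, here is a minimal C sketch that computes it from a handful of stack samples, next to the self-time figure the answer warns against. It works at function granularity for brevity (the answer argues for lines or call sites), and the struct stack_sample type and both function names are hypothetical, not part of perf or any other profiler.

```c
#include <stddef.h>
#include <string.h>

/* One sampled call stack (hypothetical layout): frames[0] is the leaf,
   where the program counter was; frames[depth - 1] is the outermost caller. */
struct stack_sample {
    const char *frames[8];
    size_t depth;
};

/* Percent of samples in which `fn` appears anywhere on the stack
   (inclusive cost). A sample is counted once even if `fn` recurses. */
double percent_on_stack(const struct stack_sample *s, size_t n, const char *fn) {
    size_t hits = 0;
    for (size_t i = 0; i < n; i++) {
        for (size_t d = 0; d < s[i].depth; d++) {
            if (strcmp(s[i].frames[d], fn) == 0) {
                hits++;
                break; /* count this sample only once */
            }
        }
    }
    return n ? 100.0 * (double)hits / (double)n : 0.0;
}

/* Percent of samples in which `fn` is the leaf frame (self cost). */
double percent_self(const struct stack_sample *s, size_t n, const char *fn) {
    size_t hits = 0;
    for (size_t i = 0; i < n; i++)
        if (s[i].depth > 0 && strcmp(s[i].frames[0], fn) == 0)
            hits++;
    return n ? 100.0 * (double)hits / (double)n : 0.0;
}
```

If two of four samples are memcpy called from a function load_file, then load_file is on the stack 50% of the time yet has 0% self time, which is exactly how your own code looks in a flat profile dominated by library functions.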

An example of such a profiler is Zoom.

There are more issues to be understood on this subject.

Added: @caf got me hunting for the perf info, and since you included the command-line argument -g, it does collect stack samples. Then you can get a call-tree report. If you also make sure you're sampling on wall-clock time (so you get wait time as well as CPU time), then you've got almost what you need.
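Why wall-clock time matters can be shown without a profiler: time spent blocked (here a nanosleep standing in for I/O) shows up in wall-clock time but barely at all in process CPU time, so a sampler driven by CPU time (such as perf's default cycles event, which only fires while the task is running) never sees it. A minimal C sketch, assuming a POSIX system with clock_gettime; struct elapsed and time_blocking_wait are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <time.h>

/* Hypothetical result type: elapsed wall-clock vs. CPU seconds. */
struct elapsed {
    double wall_s;
    double cpu_s;
};

static double seconds_between(struct timespec a, struct timespec b) {
    return (double)(b.tv_sec - a.tv_sec) + (double)(b.tv_nsec - a.tv_nsec) / 1e9;
}

/* Blocks for `ms` milliseconds (a stand-in for waiting on I/O) and
   reports how much wall-clock and process CPU time that cost. */
struct elapsed time_blocking_wait(long ms) {
    struct timespec req = { ms / 1000, (ms % 1000) * 1000000L };
    struct timespec rem, w0, w1, c0, c1;

    clock_gettime(CLOCK_MONOTONIC, &w0);
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &c0);
    /* Resume the sleep with the remaining time if a signal interrupts it. */
    while (nanosleep(&req, &rem) == -1 && errno == EINTR)
        req = rem;
    clock_gettime(CLOCK_MONOTONIC, &w1);
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &c1);

    struct elapsed e = { seconds_between(w0, w1), seconds_between(c0, c1) };
    return e;
}
```

Calling time_blocking_wait(50) reports at least about 0.05 s of wall time but only a tiny amount of CPU time, which is the waiting a CPU-only profile silently drops.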

Answer 3

Answer by Mike Dunlavey

Ok, these functions might be slow, but how do I find out where they are getting called from? As all these hotspots lie in external libraries I see no way to optimize my code.

Are you sure that your application someapp (and possibly its dependent libraries) is built with the gcc option -fno-omit-frame-pointer? Something like this:

g++ -m64 -fno-omit-frame-pointer -g main.cpp
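As a hypothetical workload for trying this out, the C sketch below spends nearly all of its time inside libc's memcpy, so a flat profile blames memcpy, while a call graph can charge those samples to the two callers that actually matter. The buffer sizes and function names are made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical buffers: 1 MiB each, zero-initialized. */
static char src_buf[1 << 20];
static char dst_buf[1 << 20];

/* Many small 64-byte copies; the compiler may inline these at -O2. */
size_t copy_small_chunks(int rounds) {
    size_t copied = 0;
    for (int r = 0; r < rounds; r++) {
        for (size_t off = 0; off < sizeof src_buf; off += 64) {
            memcpy(dst_buf + off, src_buf + off, 64);
            copied += 64;
        }
    }
    return copied;
}

/* Few large copies, which typically end up as real calls into libc. */
size_t copy_whole_buffer(int rounds) {
    size_t copied = 0;
    for (int r = 0; r < rounds; r++) {
        memcpy(dst_buf, src_buf, sizeof src_buf);
        copied += sizeof src_buf;
    }
    return copied;
}
```

Call both functions in a loop from a main, build with something like gcc -O2 -g -fno-omit-frame-pointer, then run perf record -g ./a.out and perf report -g. One caveat: with frame-pointer unwinding, a leaf library routine that sets up no frame of its own (hand-written memcpy implementations often don't) can hide its immediate caller from the chain, so --call-graph dwarf may attribute those samples more reliably.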

Answer 4

Answer by milianw

Since Linux 3.7, perf is finally able to use DWARF information to generate the call graph:

perf record --call-graph dwarf -- yourapp
perf report -g graph --no-children

Neat, but the curses GUI is horrible compared to VTune, KCacheGrind or similar... I recommend trying out FlameGraphs instead, which is a pretty neat visualization: http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html

Note: In the report step, -g graph makes the output show easy-to-understand percentages "relative to total" rather than "relative to parent". --no-children shows only self cost rather than inclusive cost, a feature I also find invaluable.
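The two percentage styles, and the split between self and inclusive cost, can be sketched over a tiny made-up call tree. This is a hypothetical C model of what perf report aggregates from -g samples, not perf's own data structures; "relative to parent" corresponds to perf's fractal mode.

```c
#include <stddef.h>

/* A node in a hypothetical aggregated call tree. `self` counts samples
   whose leaf frame was this node. */
struct call_node {
    const char *name;
    unsigned self;
    struct call_node *children[4];
    size_t nchildren;
};

/* Inclusive ("children") samples: self plus everything called below. */
unsigned inclusive(const struct call_node *n) {
    unsigned sum = n->self;
    for (size_t i = 0; i < n->nchildren; i++)
        sum += inclusive(n->children[i]);
    return sum;
}

/* Share of all samples spent in `n` and its callees ("relative to total"). */
double pct_of_total(const struct call_node *n, unsigned total) {
    return total ? 100.0 * inclusive(n) / total : 0.0;
}

/* Share of the parent's inclusive samples ("relative to parent"). */
double pct_of_parent(const struct call_node *n, const struct call_node *parent) {
    unsigned p = inclusive(parent);
    return p ? 100.0 * inclusive(n) / p : 0.0;
}
```

With main (60 self samples) calling load (10 self samples), which calls memcpy (30 self samples), memcpy is 30% relative to total but 75% relative to its parent load, whose 40 inclusive samples include memcpy's 30.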

If you have a recent perf and an Intel CPU, also try out the LBR unwinder, which has much better performance and produces far smaller result files:

perf record --call-graph lbr -- yourapp

The downside here is that the call stack depth is more limited than with the default DWARF unwinder configuration.

Answer 5

Answer by Ali

You can get a very detailed, source-level report with perf annotate; see "Source level analysis with perf annotate". It will look something like this (shamelessly stolen from the website):

------------------------------------------------
 Percent |   Source code & Disassembly of noploop
------------------------------------------------
         :
         :
         :
         :   Disassembly of section .text:
         :
         :   08048484 <main>:
         :   #include <string.h>
         :   #include <unistd.h>
         :   #include <sys/time.h>
         :
         :   int main(int argc, char **argv)
         :   {
    0.00 :    8048484:       55                      push   %ebp
    0.00 :    8048485:       89 e5                   mov    %esp,%ebp
[...]
    0.00 :    8048530:       eb 0b                   jmp    804853d <main+0xb9>
         :                           count++;
   14.22 :    8048532:       8b 44 24 2c             mov    0x2c(%esp),%eax
    0.00 :    8048536:       83 c0 01                add    $0x1,%eax
   14.78 :    8048539:       89 44 24 2c             mov    %eax,0x2c(%esp)
         :           memcpy(&tv_end, &tv_now, sizeof(tv_now));
         :           tv_end.tv_sec += strtol(argv[1], NULL, 10);
         :           while (tv_now.tv_sec < tv_end.tv_sec ||
         :                  tv_now.tv_usec < tv_end.tv_usec) {
         :                   count = 0;
         :                   while (count < 100000000UL)
   14.78 :    804853d:       8b 44 24 2c             mov    0x2c(%esp),%eax
   56.23 :    8048541:       3d ff e0 f5 05          cmp    $0x5f5e0ff,%eax
    0.00 :    8048546:       76 ea                   jbe    8048532 <main+0xae>
[...]
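To experiment with perf annotate yourself, here is a self-contained C reconstruction of the noploop program excerpted above. The parts hidden by [...] are assumptions: this sketch re-reads the clock after each round, and it compares (seconds, microseconds) pairs lexicographically, because the excerpt's condition would treat 6 s + 0 µs as earlier than 5 s + 999999 µs. The helper name before is hypothetical, and volatile keeps the counting loop from being optimized away.

```c
#define _DEFAULT_SOURCE 1
#include <string.h>
#include <sys/time.h>

/* Nonzero while `now` is strictly earlier than `end`. */
static int before(const struct timeval *now, const struct timeval *end) {
    return now->tv_sec < end->tv_sec ||
           (now->tv_sec == end->tv_sec && now->tv_usec < end->tv_usec);
}

/* Busy-loops for roughly `seconds` seconds, counting to 100000000 per
   round, and returns how many rounds completed. */
unsigned long noploop(long seconds) {
    struct timeval tv_now, tv_end;
    unsigned long rounds = 0;

    gettimeofday(&tv_now, NULL);
    memcpy(&tv_end, &tv_now, sizeof(tv_now));
    tv_end.tv_sec += seconds;
    while (before(&tv_now, &tv_end)) {
        volatile unsigned long count = 0; /* volatile: keep the loop */
        while (count < 100000000UL)
            count++;
        rounds++;
        gettimeofday(&tv_now, NULL); /* assumption: elided in the excerpt */
    }
    return rounds;
}
```

Call noploop(strtol(argv[1], NULL, 10)) from main, build with gcc -g -fno-omit-frame-pointer, then run perf record ./noploop 5 followed by perf annotate noploop to get per-instruction percentages like the ones shown.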

Don't forget to pass the -fno-omit-frame-pointer and -ggdb flags when you compile your code.
