How can I profile C++ code running on Linux?

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow.

Original question: http://stackoverflow.com/questions/375913/
Asked by Gabriel Isenberg
I have a C++ application, running on Linux, which I'm in the process of optimizing. How can I pinpoint which areas of my code are running slowly?
Accepted answer by Mike Dunlavey
If your goal is to use a profiler, use one of the suggested ones.
However, if you're in a hurry and you can manually interrupt your program under the debugger while it's being subjectively slow, there's a simple way to find performance problems.
Just halt it several times, and each time look at the call stack. If there is some code that is wasting some percentage of the time, 20% or 50% or whatever, that is the probability that you will catch it in the act on each sample. So, that is roughly the percentage of samples on which you will see it. There is no educated guesswork required. If you do have a guess as to what the problem is, this will prove or disprove it.
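For example, a minimal session of this technique with gdb might look like the following (myprog is a placeholder name for your binary):

gdb -p $(pidof myprog)   # attach to the running process; gdb halts it
(gdb) bt                 # look at the call stack
(gdb) continue           # let it run, then hit Ctrl+C and run bt again

Whatever shows up on several of those backtraces is where the time is going.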
You may have multiple performance problems of different sizes. If you clean out any one of them, the remaining ones will take a larger percentage, and be easier to spot, on subsequent passes. This magnification effect, when compounded over multiple problems, can lead to truly massive speedup factors.
Caveat: Programmers tend to be skeptical of this technique unless they've used it themselves. They will say that profilers give you this information, but that is only true if they sample the entire call stack, and then let you examine a random set of samples. (The summaries are where the insight is lost.) Call graphs don't give you the same information, because
- They don't summarize at the instruction level, and
- They give confusing summaries in the presence of recursion.
They will also say it only works on toy programs, when actually it works on any program, and it seems to work better on bigger programs, because they tend to have more problems to find. They will say it sometimes finds things that aren't problems, but that is only true if you see something once. If you see a problem on more than one sample, it is real.
P.S. This can also be done on multi-thread programs if there is a way to collect call-stack samples of the thread pool at a point in time, as there is in Java.
P.P.S. As a rough generality, the more layers of abstraction you have in your software, the more likely you are to find that that is the cause of performance problems (and the opportunity to get speedup).
Added: It might not be obvious, but the stack sampling technique works equally well in the presence of recursion. The reason is that the time that would be saved by removal of an instruction is approximated by the fraction of samples containing it, regardless of the number of times it may occur within a sample.
Another objection I often hear is: "It will stop someplace random, and it will miss the real problem". This comes from having a prior concept of what the real problem is. A key property of performance problems is that they defy expectations. Sampling tells you something is a problem, and your first reaction is disbelief. That is natural, but you can be sure if it finds a problem it is real, and vice-versa.
Added: Let me make a Bayesian explanation of how it works. Suppose there is some instruction I (a call or otherwise) which is on the call stack some fraction f of the time (and thus costs that much). For simplicity, suppose we don't know what f is, but assume it is one of 0.1, 0.2, 0.3, ... 0.9, 1.0, and that the prior probability of each of these possibilities is 0.1, so all of these costs are equally likely a priori.
Then suppose we take just 2 stack samples, and we see instruction I on both samples, designated observation o=2/2. This gives us new estimates of the frequency f of I, according to this:
Prior
P(f=x)   x     P(o=2/2|f=x)   P(o=2/2&&f=x)   P(o=2/2&&f>=x)   P(f>=x|o=2/2)
0.1      1.0   1              0.1             0.1              0.25974026
0.1      0.9   0.81           0.081           0.181            0.47012987
0.1      0.8   0.64           0.064           0.245            0.636363636
0.1      0.7   0.49           0.049           0.294            0.763636364
0.1      0.6   0.36           0.036           0.33             0.857142857
0.1      0.5   0.25           0.025           0.355            0.922077922
0.1      0.4   0.16           0.016           0.371            0.963636364
0.1      0.3   0.09           0.009           0.38             0.987012987
0.1      0.2   0.04           0.004           0.384            0.997402597
0.1      0.1   0.01           0.001           0.385            1

P(o=2/2) = 0.385
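To spell out how the table is computed (a reading aid I am adding, not part of the original answer): each sample independently lands on I with probability x, so seeing it on two out of two samples has likelihood x^2, and the last column is the upper cumulative sum of the joint probabilities, normalized by P(o=2/2):

P(o{=}2/2 \mid f{=}x) = x^2, \qquad P(f \ge x \mid o{=}2/2) = \frac{\sum_{x' \ge x} P(f{=}x')\, x'^2}{\sum_{x'} P(f{=}x')\, x'^2}

For example, P(f \ge 0.5 \mid o{=}2/2) = 0.355 / 0.385 \approx 0.92, matching the table.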
The last column says that, for example, the probability that f >= 0.5 is 92%, up from the prior assumption of 60%.
Suppose the prior assumptions are different. Suppose we assume P(f=0.1) is 0.991 (nearly certain), and all the other possibilities are almost impossible (0.001). In other words, our prior certainty is that I is cheap. Then we get:
Prior
P(f=x)   x     P(o=2/2|f=x)   P(o=2/2&&f=x)   P(o=2/2&&f>=x)   P(f>=x|o=2/2)
0.001    1.0   1              0.001           0.001            0.072727273
0.001    0.9   0.81           0.00081         0.00181          0.131636364
0.001    0.8   0.64           0.00064         0.00245          0.178181818
0.001    0.7   0.49           0.00049         0.00294          0.213818182
0.001    0.6   0.36           0.00036         0.0033           0.24
0.001    0.5   0.25           0.00025         0.00355          0.258181818
0.001    0.4   0.16           0.00016         0.00371          0.269818182
0.001    0.3   0.09           0.00009         0.0038           0.276363636
0.001    0.2   0.04           0.00004         0.00384          0.279272727
0.991    0.1   0.01           0.00991         0.01375          1

P(o=2/2) = 0.01375
Now it says P(f >= 0.5) is 26%, up from the prior assumption of 0.6%. So Bayes allows us to update our estimate of the probable cost of I. If the amount of data is small, it doesn't tell us accurately what the cost is, only that it is big enough to be worth fixing.
Yet another way to look at it is called the Rule of Succession. If you flip a coin 2 times, and it comes up heads both times, what does that tell you about the probable weighting of the coin? The respected way to answer is to say that it's a Beta distribution, with average value (number of hits + 1) / (number of tries + 2) = (2+1)/(2+2) = 75%.
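For reference, a short sketch of the standard derivation behind that formula (my addition, not part of the original answer): with a uniform prior, the posterior after h heads in n flips is a Beta distribution, and its mean is Laplace's rule of succession.

f \sim \mathrm{Beta}(1,1) \implies f \mid (h \text{ heads in } n) \sim \mathrm{Beta}(h+1,\; n-h+1)

\mathbb{E}[f \mid h, n] = \frac{h+1}{n+2} = \frac{2+1}{2+2} = 0.75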
(The key is that we see I more than once. If we only see it once, that doesn't tell us much except that f > 0.)
So, even a very small number of samples can tell us a lot about the cost of instructions that it sees. (And it will see them with a frequency, on average, proportional to their cost: if n samples are taken and f is the cost, then I will appear on nf +/- sqrt(nf(1-f)) samples. Example: n=10, f=0.3 gives 3 +/- 1.4 samples.)
Added: To give an intuitive feel for the difference between measuring and random stack sampling: there are profilers now that sample the stack, even on wall-clock time, but what comes out is measurements (or hot path, or hot spot, from which a "bottleneck" can easily hide). What they don't show you (and they easily could) is the actual samples themselves. And if your goal is to find the bottleneck, the number of them you need to see is, on average, 2 divided by the fraction of time it takes. So if it takes 30% of time, 2/0.3 = 6.7 samples, on average, will show it, and the chance that 20 samples will show it at least twice is 99.2%.
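The 99.2% figure follows from the binomial distribution, reading "show it" as "appear on at least two of the 20 samples" (a check I am adding, consistent with the two-sightings rule above):

1 - 0.7^{20} - 20 \cdot 0.3 \cdot 0.7^{19} \approx 1 - 0.0008 - 0.0068 = 0.992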
Here is an off-the-cuff illustration of the difference between examining measurements and examining stack samples. The bottleneck could be one big blob like this, or numerous small ones; it makes no difference.
Measurement is horizontal; it tells you what fraction of time specific routines take. Sampling is vertical. If there is any way to avoid what the whole program is doing at that moment, and if you see it on a second sample, you've found the bottleneck. That's what makes the difference - seeing the whole reason for the time being spent, not just how much.
Answer by Nazgob
I assume you're using GCC. The standard solution would be to profile with gprof.
Be sure to add -pg to the compilation flags before profiling:
cc -o myprog myprog.c utils.c -g -pg
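Then run the program and feed the gmon.out file it writes on exit back to gprof; a typical session (file names follow the compile line above) looks like:

./myprog                            # run normally; writes gmon.out on exit
gprof myprog gmon.out > report.txt  # flat profile plus call graph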
I haven't tried it yet but I've heard good things about google-perftools. It is definitely worth a try.
Related question here.
A few other buzzwords, if gprof does not do the job for you: Valgrind, Intel VTune, Sun DTrace.
Answer by Ajay
You can use Valgrind with the following options:
valgrind --tool=callgrind ./(Your binary)
It will generate a file called callgrind.out.x. You can then use the kcachegrind tool to read this file. It will give you a graphical analysis of things, with results such as which lines cost how much.
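If you'd rather stay in the terminal, callgrind_annotate (shipped with Valgrind) prints the same data as text; the x suffix in the file name is the process ID:

callgrind_annotate callgrind.out.1234   # per-function costs as text; 1234 = the PID from your run
kcachegrind callgrind.out.1234          # or browse the call graph visually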
Answer by Ajay
I would use Valgrind and Callgrind as a base for my profiling tool suite. What is important to know is that Valgrind is basically a Virtual Machine:
(wikipedia) Valgrind is in essence a virtual machine using just-in-time (JIT) compilation techniques, including dynamic recompilation. Nothing from the original program ever gets run directly on the host processor. Instead, Valgrind first translates the program into a temporary, simpler form called Intermediate Representation (IR), which is a processor-neutral, SSA-based form. After the conversion, a tool (see below) is free to do whatever transformations it would like on the IR, before Valgrind translates the IR back into machine code and lets the host processor run it.
Callgrind is a profiler built upon that. The main benefit is that you don't have to run your application for hours to get a reliable result. Even a one-second run is sufficient to get rock-solid, reliable results, because Callgrind is a non-probing profiler.
Another tool built upon Valgrind is Massif. I use it to profile heap memory usage. It works great. What it does is give you snapshots of memory usage -- detailed information on what holds what percentage of memory, and who put it there. Such information is available at different points in time of the application run.
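A minimal Massif session might look like this (ms_print is the text viewer that ships with Valgrind; the output file name ends in the PID):

valgrind --tool=massif ./myprog   # writes massif.out.<pid>
ms_print massif.out.1234          # heap usage over time, with detailed snapshots; 1234 = the PID from the run above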
Answer by Will
Newer kernels (e.g. the latest Ubuntu kernels) come with the new 'perf' tools (apt-get install linux-tools), AKA perf_events.
These come with classic sampling profilers (man page) as well as the awesome timechart! The important thing is that these tools can do system profiling and not just process profiling - they can show the interaction between threads, processes and the kernel, and let you understand the scheduling and I/O dependencies between processes.
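A typical first session might be (the -g flag records call graphs; ./myprog stands in for your binary):

perf record -g ./myprog   # sample the program, capturing call stacks
perf report               # interactive, per-function breakdown of the samples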
Answer by Rob_before_edits
This is a response to Nazgob's Gprof answer.
I've been using Gprof the last couple of days and have already found three significant limitations, one of which I've not seen documented anywhere else (yet):
- It doesn't work properly on multi-threaded code, unless you use a workaround.
- The call graph gets confused by function pointers. Example: I have a function called multithread() which enables me to multi-thread a specified function over a specified array (both passed as arguments). Gprof, however, views all calls to multithread() as equivalent for the purposes of computing time spent in children. Since some functions I pass to multithread() take much longer than others, my call graphs are mostly useless. (To those wondering if threading is the issue here: no, multithread() can optionally, and did in this case, run everything sequentially on the calling thread only.)
- It says here that "... the number-of-calls figures are derived by counting, not sampling. They are completely accurate...". Yet I find my call graph giving me 5345859132+784984078 as call stats for my most-called function, where the first number is supposed to be direct calls, and the second recursive calls (which are all from itself). Since this implied I had a bug, I put 64-bit counters into the code and did the same run again. My counts: 5345859132 direct, and 78094395406 self-recursive calls. There are a lot of digits there, so I'll point out that the recursive calls I measure are 78bn, versus 784m from Gprof: a factor of 100 difference. Both runs were single-threaded and unoptimised code, one compiled with -g and the other with -pg.
This was GNU Gprof (GNU Binutils for Debian) 2.18.0.20080103, running under 64-bit Debian Lenny, if that helps anyone.
Answer by Tõnu Samuel
The answer about running valgrind --tool=callgrind is not quite complete without some options. We usually do not want to profile 10 minutes of slow startup time under Valgrind; we want to profile our program when it is doing some task.
So this is what I recommend. Run the program first:
valgrind --tool=callgrind --dump-instr=yes -v --instr-atstart=no ./binary > tmp
Now, when it is working and we want to start profiling, we should run this in another window:
callgrind_control -i on
This turns profiling on. To turn it off and stop the whole task we might use:
callgrind_control -k
Now we have some files named callgrind.out.* in the current directory. To see the profiling results, use:
kcachegrind callgrind.out.*
In the next window, I recommend clicking on the "Self" column header; otherwise it shows "main()" as the most time-consuming task. "Self" shows how much time each function itself took, not including its dependents.
Answer by seo
These are the two methods I use for speeding up my code:
For CPU-bound applications:
- Use a profiler in DEBUG mode to identify questionable parts of your code
- Then switch to RELEASE mode and comment out the questionable sections of your code (stub it with nothing) until you see changes in performance.
For I/O-bound applications:
- Use a profiler in RELEASE mode to identify questionable parts of your code.
N.B.
If you don't have a profiler, use the poor man's profiler. Hit pause while debugging your application. Most developer suites will break into assembly with commented line numbers. You're statistically likely to land in a region that is eating most of your CPU cycles.
For CPU, the reason for profiling in DEBUG mode is that if you try profiling in RELEASE mode, the compiler is going to reduce math, vectorize loops, and inline functions, which tends to glob your code into an un-mappable mess when it's assembled. An un-mappable mess means your profiler will not be able to clearly identify what is taking so long, because the assembly may not correspond to the source code under optimization. If you need the performance (e.g. timing-sensitive behavior) of RELEASE mode, disable debugger features as needed to keep usable performance.
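One middle ground I'd suggest (my addition, not the answerer's prescription) is to profile an optimized build that keeps debug symbols and frame pointers, so samples can still be mapped back to source:

g++ -O2 -g -fno-omit-frame-pointer -o myprog myprog.cpp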
For I/O-bound applications, the profiler can still identify I/O operations in RELEASE mode, because I/O operations are either externally linked to a shared library (most of the time) or, in the worst case, result in a sys-call interrupt vector (which is also easily identifiable by the profiler).
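As a quick cross-check for the I/O-bound case (my suggestion, not part of the original answer), strace can summarize where system-call time goes:

strace -c ./myprog   # prints a table of syscall counts and cumulative time on exit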
Answer by Ehsan
Use Valgrind, callgrind and kcachegrind:
valgrind --tool=callgrind ./(Your binary)
generates callgrind.out.x. Read it using kcachegrind.
Use gprof (add -pg):
cc -o myprog myprog.c utils.c -g -pg
(not so good for multi-threads, function pointers)
Use google-perftools:
It uses time sampling; I/O and CPU bottlenecks are revealed.
Intel VTune is the best (free for educational purposes).
Others: AMD CodeAnalyst (since replaced by AMD CodeXL), OProfile, 'perf' tools (apt-get install linux-tools).
Answer by fwyzard
For single-threaded programs you can use igprof, The Ignominous Profiler: https://igprof.org/.
It is a sampling profiler, along the lines of the... long... answer by Mike Dunlavey, which will gift-wrap the results in a browsable call-stack tree, annotated with the time or memory spent in each function, either cumulative or per-function.