How much overhead is there in calling a function in C++?

Note: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/144993/
Asked by Obediah Stane
A lot of literature talks about using inline functions to "avoid the overhead of a function call". However I haven't seen quantifiable data. What is the actual overhead of a function call i.e. what sort of performance increase do we achieve by inlining functions?
Answered by Eclipse
On most architectures, the cost consists of saving all (or some, or none) of the registers to the stack, pushing the function arguments to the stack (or putting them in registers), incrementing the stack pointer and jumping to the beginning of the new code. Then when the function is done, you have to restore the registers from the stack. This webpage has a description of what's involved in the various calling conventions.
Most C++ compilers are smart enough now to inline functions for you. The inline keyword is just a hint to the compiler. Some will even do inlining across translation units where they decide it's helpful.
Answered by nedruod
There's the technical answer and the practical answer. The practical answer is that it will almost never matter, and in the very rare case it does, the only way you'll know is through actual profiled tests.
The technical answer, which your literature refers to, is generally not relevant due to compiler optimizations. But if you're still interested, it is well described by Josh.
As far as a "percentage" goes, you'd have to know how expensive the function itself was. Outside of the cost of the called function there is no percentage, because you are comparing against a zero-cost operation. For inlined code there is no cost; the processor just moves to the next instruction. The downside to inlining is a larger code size, which manifests its costs in a different way than the stack construction/tear-down costs.
Answered by PSkocik
I made a simple benchmark against a simple increment function:
inc.c:
typedef unsigned long ulong;
ulong inc(ulong x){
    return x+1;
}
main.c:
#include <stdio.h>
#include <stdlib.h>
typedef unsigned long ulong;
#ifdef EXTERN
ulong inc(ulong);
#else
static inline ulong inc(ulong x){
return x+1;
}
#endif
int main(int argc, char** argv){
    if (argc < 1+1)
        return 1;
    ulong i, sum = 0, cnt;
    cnt = atoi(argv[1]);
    for(i=0;i<cnt;i++){
        sum+=inc(i);
    }
    printf("%lu\n", sum);
    return 0;
}
Running it with a billion iterations on my Intel(R) Core(TM) i5 CPU M 430 @ 2.27GHz gave me:
- 1.4 seconds for the inlined version
- 4.4 seconds for the regularly linked version
(It appears to fluctuate by up to 0.2 s, but I'm too lazy to calculate proper standard deviations, nor do I care for them.)
This suggests that the overhead of function calls on this computer is about 3 nanoseconds.
The fastest operation I measured was about 0.3 ns, so that would suggest a function call costs about 9 primitive ops, to put it very simplistically.
This overhead increases by about another 2 ns per call (total call time about 6 ns) for functions called through a PLT (functions in a shared library).
Answered by Mecki
Your question is one of those questions that has no answer one could call the "absolute truth". The overhead of a normal function call depends on three factors:
The CPU. The overhead of x86, PPC, and ARM CPUs varies a lot and even if you just stay with one architecture, the overhead also varies quite a bit between an Intel Pentium 4, Intel Core 2 Duo and an Intel Core i7. The overhead might even vary noticeably between an Intel and an AMD CPU, even if both run at the same clock speed, since factors like cache sizes, caching algorithms, memory access patterns and the actual hardware implementation of the call opcode itself can have a huge influence on the overhead.
The ABI (Application Binary Interface). Even with the same CPU, there often exist different ABIs that specify how function calls pass parameters (via registers, via stack, or via a combination of both) and where and how stack frame initialization and clean-up takes place. All this has an influence on the overhead. Different operating systems may use different ABIs for the same CPU; e.g. Linux, Windows and Solaris may all three use a different ABI for the same CPU.
The Compiler. Strictly following the ABI is only important if functions are called between independent code units, e.g. if an application calls a function of a system library or a user library calls a function of another user library. As long as functions are "private", not visible outside a certain library or binary, the compiler may "cheat". It may not strictly follow the ABI but instead use shortcuts that lead to faster function calls. E.g. it may pass parameters in register instead of using the stack or it may skip stack frame setup and clean-up completely if not really necessary.
If you want to know the overhead for a specific combination of the three factors above, e.g. for Intel Core i5 on Linux using GCC, your only way to get this information is benchmarking the difference between two implementations, one using function calls and one where you copy the code directly into the caller; this way you force inlining for sure, since the inline statement is only a hint and does not always lead to inlining.
However, the real question here is: Does the exact overhead really matter? One thing is for sure: A function call always has an overhead. It may be small, it may be big, but it is for sure existent. And no matter how small it is if a function is called often enough in a performance critical section, the overhead will matter to some degree. Inlining rarely makes your code slower, unless you terribly overdo it; it will make the code bigger though. Today's compilers are pretty good at deciding themselves when to inline and when not, so you hardly ever have to rack your brain about it.
Personally, I ignore inlining during development completely, until I have a more or less usable product that I can profile. Only if profiling tells me that a certain function is called really often, and within a performance-critical section of the application, will I consider "force-inlining" that function.
So far my answer is very generic; it applies to C as much as it applies to C++ and Objective-C. As a closing word let me say something about C++ in particular: virtual methods are doubly indirect function calls, which means they have a higher function call overhead than normal function calls and also cannot be inlined. Non-virtual methods might be inlined by the compiler or not, but even if they are not inlined, they are still significantly faster than virtual ones, so you should not make methods virtual unless you really plan to override them or have them overridden.
Answered by Mark Ransom
The amount of overhead will depend on the compiler, CPU, etc. The percentage overhead will depend on the code you're inlining. The only way to know is to take your code and profile it both ways - that's why there's no definitive answer.
Answered by Don Neufeld
For very small functions inlining makes sense, because the (small) cost of the function call is significant relative to the (very small) cost of the function body. For most functions over a few lines it's not a big win.
Answered by Larry OBrien
It's worth pointing out that an inlined function increases the size of the calling function, and anything that increases the size of a function may have a negative effect on caching. If you're right at a boundary, "just one more wafer-thin mint" of inlined code might have a dramatically negative effect on performance.
If you're reading literature that's warning about "the cost of a function call," I'd suggest it may be older material that doesn't reflect modern processors. Unless you're in the embedded world, the era in which C is a "portable assembly language" has essentially passed. A large amount of the ingenuity of the chip designers in the past decade (say) has gone into all sorts of low-level complexities that can differ radically from the way things worked "back in the day."
Answered by doug65536
Modern CPUs are very fast (obviously!). Almost every operation involved with calls and argument passing is a full-speed instruction (indirect calls might be slightly more expensive, mostly the first time through a loop).
Function call overhead is so small that only loops that call functions can make call overhead relevant.
Therefore, when we talk about (and measure) function call overhead today, we are usually really talking about the overhead of not being able to hoist common subexpressions out of loops. If a function has to do a bunch of (identical) work every time it is called, the compiler would be able to "hoist" it out of the loop and do it once if it was inlined. When not inlined, the code will probably just go ahead and repeat the work you told it to!
Inlined functions seem impossibly faster not because of call and argument overhead, but because of common subexpressions that can be hoisted out of the function.
Example:
Foo::result_type MakeMeFaster()
{
    Foo t = 0;
    for (auto i = 0; i < 1000; ++i)
        t += CheckOverhead(SomethingUnpredictible());
    return t.result();
}

Foo CheckOverhead(int i)
{
    auto n = CalculatePi_1000_digits();
    return i * n;
}
An optimizer can see through this foolishness and do:
Foo::result_type MakeMeFaster()
{
    Foo t = 0;
    auto _hidden_optimizer_tmp = CalculatePi_1000_digits();
    for (auto i = 0; i < 1000; ++i)
        t += SomethingUnpredictible() * _hidden_optimizer_tmp;
    return t.result();
}
It seems like the call overhead is impossibly reduced, because the optimizer really has hoisted a big chunk of the function out of the loop (the CalculatePi_1000_digits call). The compiler would need to be able to prove that CalculatePi_1000_digits always returns the same result, but good optimizers can do that.
Answered by vrdhn
There is a great concept called 'register shadowing', which allows passing values (up to 6?) through registers (on the CPU) instead of the stack (memory). Also, depending on the function and the variables used within it, the compiler may decide that frame-management code is simply not required!
Also, even a C++ compiler may do 'tail recursion optimization': if A() calls B(), and after calling B() A just returns, the compiler will reuse the stack frame!
Of course, all of this can be done only if the program sticks to the semantics of the standard (see pointer aliasing and its effect on optimizations).
Answered by Uri
There are a few issues here.
If you have a smart enough compiler, it will do some automatic inlining for you even if you did not specify inline. On the other hand, there are many things that cannot be inlined.
If the function is virtual, then of course you are going to pay the price that it cannot be inlined, because the target is determined at runtime. Similarly, in Java, you might be paying this price unless you indicate that the method is final.
Depending on how your code is organized in memory, you may be paying a cost in cache misses and even page misses as the code is located elsewhere. That can end up having a huge impact in some applications.