C++ std::function 与模板
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/14677997/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
std::function vs template
提问by Red XIII
Thanks to C++11 we received the std::function family of functor wrappers. Unfortunately, I keep hearing only bad things about these new additions. The most popular complaint is that they are horribly slow. I tested it and they truly suck in comparison with templates.
感谢 C++11,我们得到了 std::function 这一族函子包装器。不幸的是,我一直只听到关于这些新增内容的坏消息。最常见的说法是它们慢得可怕。我测试了一下,与模板相比,它们确实很糟糕。
#include <iostream>
#include <functional>
#include <string>
#include <chrono>

template <typename F>
float calc1(F f) { return -1.0f * f(3.3f) + 666.0f; }

float calc2(std::function<float(float)> f) { return -1.0f * f(3.3f) + 666.0f; }

int main() {
    using namespace std::chrono;
    const auto tp1 = high_resolution_clock::now();
    for (int i = 0; i < 1e8; ++i) {
        calc1([](float arg){ return arg * 0.5f; });
    }
    const auto tp2 = high_resolution_clock::now();
    const auto d = duration_cast<milliseconds>(tp2 - tp1);
    std::cout << d.count() << std::endl;
    return 0;
}
111 ms vs 1241 ms. I assume this is because templates can be nicely inlined, while std::functions cover the internals via virtual calls.
111 毫秒对 1241 毫秒。我认为这是因为模板可以很好地内联,而 std::function 则通过虚调用隐藏了内部实现。
Obviously templates have their issues as I see them:
显然,模板在我看来有其问题:
- they have to be provided as headers, which is something you might not wish to do when releasing your library as closed code,
- they may make the compilation time much longer unless an extern template-like policy is introduced,
- there is no (at least known to me) clean way of representing the requirements (concepts, anyone?) of a template, bar a comment describing what kind of functor is expected.
- 它们必须以头文件形式提供,而在以闭源方式发布库时,您可能并不希望这样做;
- 除非引入类似 extern template 的策略,否则它们可能会大大延长编译时间;
- 除了用注释描述期望何种函子之外,没有(至少据我所知)干净的方式来表达模板的约束(concepts,有人想要吗?)。
Can I thus assume that std::functions can be used as the de facto standard for passing functors, and that in places where high performance is expected templates should be used?
因此,我能否认为 std::function 可以用作传递函子的事实标准,而在需要高性能的地方则应该使用模板?
Edit:
编辑:
My compiler is the Visual Studio 2012 without CTP.
我的编译器是不带 CTP 的 Visual Studio 2012。
回答by Andy Prowl
In general, if you are facing a design situation that gives you a choice, use templates. I stressed the word design because I think what you need to focus on is the distinction between the use cases of std::function and templates, which are pretty different.
一般来说,如果您面临的设计情况给了您选择的余地,请使用模板。我强调设计这个词,是因为我认为您需要关注的是 std::function 和模板的用例之间的区别,它们是非常不同的。
In general, the choice of templates is just an instance of a wider principle: try to specify as many constraints as possible at compile-time. The rationale is simple: if you can catch an error, or a type mismatch, even before your program is generated, you won't ship a buggy program to your customer.
一般来说,模板的选择只是更广泛原则的一个实例:尝试在编译时指定尽可能多的约束。理由很简单:如果你能在你的程序生成之前发现错误或类型不匹配,你就不会向你的客户发送一个有缺陷的程序。
Moreover, as you correctly pointed out, calls to template functions are resolved statically (i.e. at compile time), so the compiler has all the necessary information to optimize and possibly inline the code (which would not be possible if the call were performed through a vtable).
此外,正如您正确指出的那样,对模板函数的调用是静态解析的(即在编译时),因此编译器拥有优化乃至内联代码所需的全部信息(如果调用是通过虚函数表进行的,这就不可能了)。
Yes, it is true that template support is not perfect, and C++11 is still lacking support for concepts; however, I don't see how std::function would save you in that respect. std::function is not an alternative to templates, but rather a tool for design situations where templates cannot be used.
是的,模板支持确实不完善,C++11 仍然缺乏对 concepts 的支持;但是,我看不出 std::function 在这方面能如何拯救你。std::function 不是模板的替代品,而是用于无法使用模板的设计场景的工具。
One such use case arises when you need to resolve a call at run-timeby invoking a callable object that adheres to a specific signature, but whose concrete type is unknown at compile-time. This is typically the case when you have a collection of callbacks of potentially different types, but which you need to invoke uniformly; the type and number of the registered callbacks is determined at run-time based on the state of your program and the application logic. Some of those callbacks could be functors, some could be plain functions, some could be the result of binding other functions to certain arguments.
当您需要在运行时解析一次调用,即调用一个遵循特定签名、但具体类型在编译时未知的可调用对象时,就会出现这样一种用例。当您有一组类型可能各不相同、却需要统一调用的回调时,通常就是这种情况;注册回调的类型和数量是在运行时根据程序状态和应用程序逻辑确定的。其中一些回调可能是函子,一些可能是普通函数,一些可能是将其他函数绑定到某些参数的结果。
std::function and std::bind also offer a natural idiom for enabling functional programming in C++, where functions are treated as objects and get naturally curried and combined to generate other functions. Although this kind of combination can be achieved with templates as well, a similar design situation normally comes together with use cases that require determining the type of the combined callable objects at run-time.
std::function 和 std::bind 还为在 C++ 中进行函数式编程提供了一种自然的习惯用法:函数被视为对象,可以自然地柯里化和组合以生成其他函数。虽然这种组合也可以用模板实现,但类似的设计场景通常伴随着需要在运行时确定组合后可调用对象类型的用例。
Finally, there are other situations where std::function is unavoidable, e.g. if you want to write recursive lambdas; however, these restrictions are dictated more by technological limitations than by conceptual distinctions, I believe.
最后,还有一些其他情况是 std::function 不可避免的,例如,如果您想编写递归 lambda;不过我认为,这些限制更多是由技术局限决定的,而不是由概念上的差异决定的。
To sum up, focus on designand try to understand what are the conceptual use cases for these two constructs. If you put them into comparison the way you did, you are forcing them into an arena they likely don't belong to.
总而言之,专注于设计并尝试理解这两个构造的概念用例是什么。如果您以这种方式将它们进行比较,那么您就是在迫使它们进入一个它们可能不属于的领域。
回答by Cassio Neri
Andy Prowl has nicely covered design issues. This is, of course, very important, but I believe the original question concerns more the performance issues related to std::function.
Andy Prowl 很好地阐述了设计方面的问题。这当然非常重要,但我相信最初的问题更多涉及与 std::function 相关的性能问题。
First of all, a quick remark on the measurement technique: the 111ms obtained for calc1 has no meaning at all. Indeed, looking at the generated assembly (or debugging the assembly code), one can see that VS2012's optimizer is clever enough to realize that the result of calling calc1 is independent of the iteration and moves the call out of the loop:
首先,快速评论一下测量技术:calc1 测得的 111ms 根本没有任何意义。确实,查看生成的汇编(或调试汇编代码)可以看到,VS2012 的优化器足够聪明,意识到调用 calc1 的结果与迭代无关,于是把调用移出了循环:
for (int i = 0; i < 1e8; ++i) {
}
calc1([](float arg){ return arg * 0.5f; });
Furthermore, it realises that calling calc1 has no visible effect and drops the call altogether. Therefore, the 111ms is the time that the empty loop takes to run. (I'm surprised that the optimizer has kept the loop.) So, be careful with time measurements in loops. This is not as simple as it might seem.
此外,它意识到调用 calc1 没有可见的副作用,于是干脆把调用整个去掉了。因此,111ms 是空循环运行所需的时间。(我很惊讶优化器居然保留了这个循环。)所以,在循环中做时间测量要小心,这并不像看起来那么简单。
As it has been pointed out, the optimizer has more trouble understanding std::function and doesn't move the call out of the loop. So 1241ms is a fair measurement for calc2.
正如已经指出的那样,优化器更难理解 std::function,因此不会将调用移出循环。所以 1241ms 对 calc2 来说是一个公平的测量。
Notice that std::function is able to store different types of callable objects. Hence, it must perform some type-erasure magic for the storage. Generally, this implies a dynamic memory allocation (by default through a call to new). It's well known that this is a quite costly operation.
请注意,std::function 能够存储不同类型的可调用对象。因此,它必须为存储执行一些类型擦除的魔法。通常,这意味着一次动态内存分配(默认情况下通过调用 new)。众所周知,这是一项相当昂贵的操作。
The standard (20.8.11.2.1/5) encourages implementations to avoid the dynamic memory allocation for small objects, which, thankfully, VS2012 does (in particular, for the original code).
标准(20.8.11.2.1/5)鼓励实现为小对象避免动态内存分配,幸好 VS2012 做到了(特别是对于原始代码)。
To get an idea of how much slower it can get when memory allocation is involved, I've changed the lambda expression to capture three floats. This makes the callable object too big to apply the small object optimization:
为了了解涉及内存分配时它会慢多少,我修改了 lambda 表达式,让它捕获三个 float。这使得可调用对象太大,无法应用小对象优化:
float a, b, c; // never mind the values
// ...
calc2([a,b,c](float arg){ return arg * 0.5f; });
For this version, the time is approximately 16000ms (compared to 1241ms for the original code).
对于此版本,时间约为 16000 毫秒(原始代码为 1241 毫秒)。
Finally, notice that the lifetime of the lambda encloses that of the std::function. In this case, rather than storing a copy of the lambda, std::function could store a "reference" to it. By "reference" I mean a std::reference_wrapper, which is easily built by the functions std::ref and std::cref. More precisely, by using:
最后,请注意 lambda 的生命周期包含了 std::function 的生命周期。在这种情况下,std::function 可以不存储 lambda 的副本,而是存储对它的"引用"。这里的"引用"指的是 std::reference_wrapper,它可以很容易地由函数 std::ref 和 std::cref 构造出来。更准确地说,通过使用:
auto func = [a,b,c](float arg){ return arg * 0.5f; };
calc2(std::cref(func));
the time decreases to approximately 1860ms.
时间减少到大约 1860 毫秒。
I wrote about that a while ago:
我前段时间写过:
http://www.drdobbs.com/cpp/efficient-use-of-lambda-expressions-and/232500059
http://www.drdobbs.com/cpp/efficient-use-of-lambda-expressions-and/232500059
As I said in the article, the arguments don't quite apply for VS2010 due to its poor support to C++11. At the time of the writing, only a beta version of VS2012 was available but its support for C++11 was already good enough for this matter.
正如我在文章中所说,由于 VS2010 对 C++11 的支持很差,因此这些论点并不完全适用于 VS2010。在撰写本文时,只有 VS2012 的 beta 版本可用,但它对 C++11 的支持已经足够好。
回答by Johan Lundberg
With Clang there's no performance difference between the two
使用 Clang 两者之间没有性能差异
Using clang (3.2, trunk 166872) (-O2 on Linux), the binaries from the two cases are actually identical.
使用 clang (3.2, trunk 166872)(Linux 上的 -O2),这两种情况的二进制文件实际上是相同的。
I'll come back to clang at the end of the post. But first, gcc 4.7.2:
我会在本帖最后回到 clang。但首先看 gcc 4.7.2:
There's already a lot of insight going on, but I want to point out that the results of the calculations of calc1 and calc2 are not the same, due to inlining etc. Compare for example the sum of all results:
已经有很多很好的见解了,但我想指出,由于内联等原因,calc1 和 calc2 的计算结果并不相同。例如,比较所有结果的总和:
float result=0;
for (int i = 0; i < 1e8; ++i) {
result+=calc2([](float arg){ return arg * 0.5f; });
}
with calc2 that becomes
calc2 变成
1.71799e+10, time spent 0.14 sec
while with calc1 it becomes
而使用 calc1 它变成
6.6435e+10, time spent 5.772 sec
that's a factor of ~40 in speed difference, and a factor of ~4 in the values. The first is a much bigger difference than what the OP posted (using Visual Studio). Actually printing out the value at the end is also a good idea to prevent the compiler from removing code with no visible result (as-if rule). Cassio Neri already said this in his answer. Note how different the results are -- one should be careful when comparing speed factors of codes that perform different calculations.
速度相差约 40 倍,数值相差约 4 倍。前者比 OP 发布的差异(使用 Visual Studio)大得多。实际上,在最后打印出数值也是一个好主意,可以防止编译器删除没有可见结果的代码(as-if 规则)。Cassio Neri 在他的回答中已经说过了。注意结果有多么不同,在比较执行不同计算的代码的速度倍数时应当小心。
Also, to be fair, comparing various ways of repeatedly calculating f(3.3) is perhaps not that interesting. If the input is constant it should not be in a loop. (It's easy for the optimizer to notice)
此外,公平地说,比较重复计算 f(3.3) 的各种方法可能并不那么有趣。如果输入是恒定的,则不应处于循环中。(优化器很容易注意到)
If I add a user-supplied value argument to calc1 and calc2, the speed factor between calc1 and calc2 comes down to a factor of 5, from 40! With Visual Studio the difference is close to a factor of 2, and with clang there is no difference (see below).
如果我给 calc1 和 calc2 加上一个由用户提供的值参数,calc1 和 calc2 之间的速度差就会从 40 倍降到 5 倍!使用 Visual Studio 时差异接近 2 倍,而使用 clang 则没有差异(见下文)。
Also, as multiplications are fast, talking about factors of slow-down is often not that interesting. A more interesting question is, how small are your functions, and are these calls the bottleneck in a real program?
此外,由于乘法速度很快,谈论减速因素通常不是那么有趣。一个更有趣的问题是,你的函数有多小,这些调用是实际程序中的瓶颈吗?
Clang:
Clang:
Clang (I used 3.2) actually produced identical binaries when I flip between calc1 and calc2 for the example code (posted below). With the original example posted in the question both are also identical but take no time at all (the loops are just completely removed as described above). With my modified example, with -O2:
当我在 calc1 和 calc2 之间切换示例代码(在下面发布)时,Clang(我使用 3.2)实际上生成了相同的二进制文件。对于问题中发布的原始示例,两者也是相同的,但根本不需要时间(如上所述,循环被完全删除)。使用我修改后的示例,使用 -O2:
Number of seconds to execute (best of 3):
执行的秒数(最好的 3):
clang: calc1: 1.4 seconds
clang: calc2: 1.4 seconds (identical binary)
gcc 4.7.2: calc1: 1.1 seconds
gcc 4.7.2: calc2: 6.0 seconds
VS2012 CTPNov calc1: 0.8 seconds
VS2012 CTPNov calc2: 2.0 seconds
VS2015 (14.0.23.107) calc1: 1.1 seconds
VS2015 (14.0.23.107) calc2: 1.5 seconds
MinGW (4.7.2) calc1: 0.9 seconds
MinGW (4.7.2) calc2: 20.5 seconds
The calculated results of all binaries are the same, and all tests were executed on the same machine. It would be interesting if someone with deeper clang or VS knowledge could comment on what optimizations may have been done.
所有二进制文件的计算结果都相同,并且所有测试都在同一台机器上执行。如果有更深入的 clang 或 VS 知识的人可以评论可能已经完成了哪些优化,那将会很有趣。
My modified test code:
我修改后的测试代码:
#include <functional>
#include <chrono>
#include <iostream>
template <typename F>
float calc1(F f, float x) {
return 1.0f + 0.002*x+f(x*1.223) ;
}
float calc2(std::function<float(float)> f,float x) {
return 1.0f + 0.002*x+f(x*1.223) ;
}
int main() {
using namespace std::chrono;
const auto tp1 = high_resolution_clock::now();
float result=0;
for (int i = 0; i < 1e8; ++i) {
result=calc1([](float arg){
return arg * 0.5f;
},result);
}
const auto tp2 = high_resolution_clock::now();
const auto d = duration_cast<milliseconds>(tp2 - tp1);
std::cout << d.count() << std::endl;
std::cout << result<< std::endl;
return 0;
}
Update:
更新:
Added vs2015. I also noticed that there are double->float conversions in calc1,calc2. Removing them does not change the conclusion for visual studio (both are a lot faster but the ratio is about the same).
添加了 vs2015。我还注意到在 calc1,calc2 中有 double->float 转换。删除它们不会改变 Visual Studio 的结论(两者都快得多,但比率大致相同)。
回答by Pete Becker
Different isn't the same.
不同的东西本来就不一样。
It's slower because it does things that a template can't do. In particular, it lets you call, from the same code, any function that can be called with the given argument types and whose return type is convertible to the given return type.
它更慢,因为它做了模板做不到的事情。特别是,它允许你从同一段代码中调用任何可以用给定参数类型调用、且返回类型可以转换为给定返回类型的函数。
void eval(const std::function<int(int)>& f) {
std::cout << f(3);
}
int f1(int i) {
return i;
}
float f2(double d) {
return d;
}
int main() {
std::function<int(int)> fun(f1);
eval(fun);
fun = f2;
eval(fun);
return 0;
}
Note that the same function object, fun, is being passed to both calls to eval. It holds two different functions.
请注意,同一个函数对象 fun 被传递给了对 eval 的两次调用。它先后持有两个不同的函数。
If you don't need to do that, then you should not use std::function.
如果你不需要这样做,那么你就不应该使用 std::function。
回答by TheAgitator
You already have some good answers here, so I'm not going to contradict them. In short, comparing std::function to templates is like comparing virtual functions to ordinary functions. You should never "prefer" virtual functions to functions; rather, you use virtual functions when they fit the problem, moving decisions from compile time to run time. The idea is that rather than having to solve the problem using a bespoke solution (like a jump-table) you use something that gives the compiler a better chance of optimizing for you. It also helps other programmers if you use a standard solution.
你在这里已经有了一些很好的答案,所以我不打算反驳它们。简而言之,将 std::function 与模板进行比较,就像将虚函数与普通函数进行比较。你永远不应该"偏爱"虚函数,而是在它适合问题时才使用虚函数,把决策从编译期转移到运行期。这个思路是:你不必用定制方案(比如跳转表)来解决问题,而是使用能让编译器有更好机会为你优化的东西。如果你使用标准方案,对其他程序员也有帮助。
回答by greggo
This answer is intended to contribute, to the set of existing answers, what I believe to be a more meaningful benchmark for the runtime cost of std::function calls.
这个答案旨在为现有答案集做出贡献,我认为这是 std::function 调用的运行时成本的更有意义的基准。
The std::function mechanism should be recognized for what it provides: any callable entity can be converted to a std::function of appropriate signature. Suppose you have a library that fits a surface to a function defined by z = f(x,y); you can write it to accept a std::function<double(double,double)>, and the user of the library can easily convert any callable entity to that, be it an ordinary function, a method of a class instance, a lambda, or anything that is supported by std::bind.
std::function 机制应当因其提供的能力而得到认可:任何可调用实体都可以转换为具有适当签名的 std::function。假设您有一个库,它将曲面拟合为由 z = f(x,y) 定义的函数,您可以把它写成接受一个 std::function<double(double,double)>,库的用户就可以轻松地将任何可调用实体转换为它,无论是普通函数、类实例的方法、lambda,还是 std::bind 支持的任何东西。
Unlike template approaches, this works without having to recompile the library function for different cases; accordingly, little extra compiled code is needed for each additional case. It has always been possible to make this happen, but it used to require some awkward mechanisms, and the user of the library would likely need to construct an adapter around their function to make it work. std::function automatically constructs whatever adapter is needed to get a common runtime call interface for all the cases, which is a new and very powerful feature.
与模板方法不同,这种方式无需为不同情况重新编译库函数;因此,每增加一种情况几乎不需要额外的编译代码。以前也一直可以做到这一点,但需要一些笨拙的机制,库的用户很可能需要围绕自己的函数构建一个适配器才能使其工作。std::function 会自动构造所需的适配器,为所有情况提供统一的运行时调用接口,这是一个新的、非常强大的特性。
To my view, this is the most important use case for std::function as far as performance is concerned: I'm interested in the cost of calling a std::function many times after it has been constructed once, and it needs to be a situation where the compiler is unable to optimize the call by knowing the function actually being called (i.e. you need to hide the implementation in another source file to get a proper benchmark).
在我看来,就性能而言,这是 std::function 最重要的用例:我对在构造一次后多次调用 std::function 的成本感兴趣,它需要是编译器无法通过知道实际调用的函数来优化调用的情况(即,您需要将实现隐藏在另一个源文件中以获得适当的基准)。
I made the test below, similar to the OP's; but the main changes are:
我做了下面的测试,类似于 OP;但主要的变化是:
- Each case loops 1 billion times, but the std::function objects are constructed only once. I've found by looking at the output code that 'operator new' is called when constructing actual std::function calls (maybe not when they are optimized out).
- Test is split into two files to prevent undesired optimization
- My cases are: (a) function is inlined (b) function is passed by an ordinary function pointer (c) function is a compatible function wrapped as std::function (d) function is an incompatible function made compatible with a std::bind, wrapped as std::function
- 每个用例循环 10 亿次,但 std::function 对象只构造一次。通过查看输出代码我发现,构造实际的 std::function 时会调用 'operator new'(被优化掉时可能不会)。
- 测试被拆分成两个文件,以防止不希望发生的优化
- 我的用例是:(a) 函数被内联 (b) 函数通过普通函数指针传递 (c) 函数是一个签名兼容的函数,包装为 std::function (d) 函数是一个签名不兼容的函数,用 std::bind 适配后包装为 std::function
The results I get are:
我得到的结果是:
case (a) (inline) 1.3 nsec
all other cases: 3.3 nsec.
情况 (a)(内联)1.3 纳秒
所有其他情况:3.3 纳秒。
Case (d) tends to be slightly slower, but the difference (about 0.05 nsec) is absorbed in the noise.
情况 (d) 往往稍慢,但差异(约 0.05 纳秒)被噪声吸收。
Conclusion is that the std::function is comparable overhead (at call time) to using a function pointer, even when there's simple 'bind' adaptation to the actual function. The inline is 2 ns faster than the others but that's an expected tradeoff since the inline is the only case which is 'hard-wired' at run time.
结论是 std::function 的开销(在调用时)与使用函数指针的开销相当,即使在对实际函数进行简单的“绑定”适应时也是如此。内联比其他快 2 ns,但这是预期的权衡,因为内联是唯一在运行时“硬连线”的情况。
When I run johan-lundberg's code on the same machine, I'm seeing about 39 nsec per loop, but there's a lot more in the loop there, including the actual constructor and destructor of the std::function, which is probably fairly high since it involves a new and delete.
当我在同一台机器上运行 johan-lundberg 的代码时,每次循环大约是 39 纳秒,但那个循环里做的事情要多得多,包括 std::function 实际的构造和析构,这部分开销可能相当高,因为它涉及 new 和 delete。
-O2 gcc 4.8.1, to x86_64 target (core i5).
-O2 gcc 4.8.1,到 x86_64 目标(核心 i5)。
Note, the code is broken up into two files, to prevent the compiler from expanding the functions where they are called (except in the one case where it's intended to).
请注意,代码被分成两个文件,以防止编译器在调用它们的地方扩展函数(除非是在一种情况下打算这样做)。
----- first source file --------------
----- 第一个源文件--------------
#include <functional>
// simple funct
float func_half( float x ) { return x * 0.5; }
// func we can bind
float mul_by( float x, float scale ) { return x * scale; }
//
// func to call another func a zillion times.
//
float test_stdfunc( std::function<float(float)> const & func, int nloops ) {
float x = 1.0;
float y = 0.0;
for(int i =0; i < nloops; i++ ){
y += x;
x = func(x);
}
return y;
}
// same thing with a function pointer
float test_funcptr( float (*func)(float), int nloops ) {
float x = 1.0;
float y = 0.0;
for(int i =0; i < nloops; i++ ){
y += x;
x = func(x);
}
return y;
}
// same thing with inline function
float test_inline( int nloops ) {
float x = 1.0;
float y = 0.0;
for(int i =0; i < nloops; i++ ){
y += x;
x = func_half(x);
}
return y;
}
----- second source file -------------
----- 第二个源文件 -------------
#include <iostream>
#include <functional>
#include <chrono>
extern float func_half( float x );
extern float mul_by( float x, float scale );
extern float test_inline( int nloops );
extern float test_stdfunc( std::function<float(float)> const & func, int nloops );
extern float test_funcptr( float (*func)(float), int nloops );
int main() {
using namespace std::chrono;
for(int icase = 0; icase < 4; icase ++ ){
const auto tp1 = system_clock::now();
float result;
switch( icase ){
case 0:
result = test_inline( 1e9);
break;
case 1:
result = test_funcptr( func_half, 1e9);
break;
case 2:
result = test_stdfunc( func_half, 1e9);
break;
case 3:
result = test_stdfunc( std::bind( mul_by, std::placeholders::_1, 0.5), 1e9);
break;
}
const auto tp2 = high_resolution_clock::now();
const auto d = duration_cast<milliseconds>(tp2 - tp1);
std::cout << d.count() << std::endl;
std::cout << result<< std::endl;
}
return 0;
}
For those interested, here's the adaptor the compiler built to make 'mul_by' look like a float(float) - this is 'called' when the function created as bind(mul_by,_1,0.5) is called:
对于那些感兴趣的人,这是编译器构建的适配器,使 'mul_by' 看起来像一个 float(float) - 当调用创建为 bind(mul_by,_1,0.5) 的函数时,它被“调用”:
movq (%rdi), %rax ; get the std::func data
movsd 8(%rax), %xmm1 ; get the bound value (0.5)
movq (%rax), %rdx ; get the function to call (mul_by)
cvtpd2ps %xmm1, %xmm1 ; convert 0.5 to 0.5f
jmp *%rdx ; jump to the func
(so it might have been a bit faster if I'd written 0.5f in the bind...) Note that the 'x' parameter arrives in %xmm0 and just stays there.
(所以如果我在绑定中写了 0.5f 可能会快一点...)请注意,'x' 参数到达 %xmm0 并停留在那里。
Here's the code in the area where the function is constructed, prior to calling test_stdfunc - run through c++filt :
这是在调用 test_stdfunc 之前构造函数的区域中的代码 - 通过 c++filt 运行:
movl    $16, %edi
movq    $0, 32(%rsp)
call    operator new(unsigned long)      ; get 16 bytes for std::function
movsd   .LC0(%rip), %xmm1                ; get 0.5
leaq    16(%rsp), %rdi                   ; (1st parm to test_stdfunc)
movq    mul_by(float, float), (%rax)     ; store &mul_by in std::function
movl    $1000000000, %esi                ; (2nd parm to test_stdfunc)
movsd   %xmm1, 8(%rax)                   ; store 0.5 in std::function
movq    %rax, 16(%rsp)                   ; save ptr to allocated mem
;; the next two ops store pointers to generated code related to the std::function.
;; the first one points to the adaptor I showed above.
movq    std::_Function_handler<float (float), std::_Bind<float (*(std::_Placeholder<1>, double))(float, float)> >::_M_invoke(std::_Any_data const&, float), 40(%rsp)
movq    std::_Function_base::_Base_manager<std::_Bind<float (*(std::_Placeholder<1>, double))(float, float)> >::_M_manager(std::_Any_data&, std::_Any_data const&, std::_Manager_operation), 32(%rsp)
call    test_stdfunc(std::function<float (float)> const&, int)
回答by Joshua Ritterman
I found your results very interesting, so I did a bit of digging to understand what is going on. First off, as many others have said, without having the results of the computation affect the state of the program, the compiler will just optimize this away. Secondly, with a constant 3.3 given as an argument to the callback, I suspect that there will be other optimizations going on. With that in mind I changed your benchmark code a little bit.
我发现你的结果非常有趣,所以我做了一些挖掘来了解到底发生了什么。首先,正如许多人所说,如果计算结果不影响程序的状态,编译器就会直接把它优化掉。其次,将常量 3.3 作为回调的参数,我怀疑还会有其他优化。考虑到这一点,我稍微修改了你的基准测试代码。
float t = 0;
template <typename F>
float calc1(F f, float i) { return -1.0f * f(i) + 666.0f; }
float calc2(std::function<float(float)> f, float i) { return -1.0f * f(i) + 666.0f; }
int main() {
    const auto tp1 = system_clock::now();
    for (int i = 0; i < 1e8; ++i) {
        t += calc2([&](float arg){ return arg * 0.5f + t; }, i);
    }
    const auto tp2 = high_resolution_clock::now();
}
Given this change to the code I compiled with gcc 4.8 -O3 and got a time of 330ms for calc1 and 2702ms for calc2. So using the template was 8 times faster. This number looked suspect to me; a speed factor of a power of 8 often indicates that the compiler has vectorized something. When I looked at the generated code for the templates version it was clearly vectorized:
在对代码做了这个修改之后,我用 gcc 4.8 -O3 编译,calc1 的时间为 330 毫秒,calc2 为 2702 毫秒。所以使用模板快了 8 倍。这个数字在我看来很可疑:8 这种幂次的速度差通常表明编译器做了向量化。查看模板版本生成的代码,它显然被向量化了:
.L34:
cvtsi2ss %edx, %xmm0
addl $1, %edx
movaps %xmm3, %xmm5
mulss %xmm4, %xmm0
addss %xmm1, %xmm0
subss %xmm0, %xmm5
movaps %xmm5, %xmm0
addss %xmm1, %xmm0
cvtsi2sd %edx, %xmm1
ucomisd %xmm1, %xmm2
ja .L37
movss %xmm0, 16(%rsp)
Whereas the std::function version was not. This makes sense to me: with the template, the compiler knows for sure that the function will never change throughout the loop, but with the std::function being passed in it could change, and therefore cannot be vectorized.
而 std::function 版本则没有。这对我来说说得通:使用模板时,编译器可以确定该函数在整个循环中永远不会改变;而传入的 std::function 是可能改变的,因此无法向量化。
This led me to try something else to see if I could get the compiler to perform the same optimization on the std::function version. Instead of passing in a function, I make the std::function a global variable, and have this called.
这促使我尝试别的办法,看看能否让编译器对 std::function 版本执行同样的优化。我不再传入函数,而是把 std::function 做成全局变量,然后调用它。
float t = 0;
std::function<float(float)> f2 = [](float arg){ return arg * 0.5f; };
float calc3(float i) { return -1.0f * f2(i) + 666.0f; }
int main() {
    const auto tp1 = system_clock::now();
    for (int i = 0; i < 1e8; ++i) {
        t += calc3(i);
    }
    const auto tp2 = high_resolution_clock::now();
}
With this version we see that the compiler has now vectorized the code in the same way and I get the same benchmark results.
在这个版本中,我们看到编译器以同样的方式对代码进行了向量化,我也得到了相同的基准测试结果。
- template : 330ms
- std::function : 2702ms
- global std::function: 330ms
- 模板:330ms
- std::function:2702ms
- 全局 std::function: 330ms
So my conclusion is the raw speed of a std::function vs a template functor is pretty much the same. However it makes the job of the optimizer much more difficult.
所以我的结论是 std::function 与模板函子的原始速度几乎相同。然而,它使优化器的工作变得更加困难。