G++ optimization beyond -O3/-Ofast (C++)

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/14492436/

Date: 2020-08-27 18:22:45 · Source: igfitidea

G++ optimization beyond -O3/-Ofast

Tags: c++, g++, compiler-optimization

Asked by Haatschii

The Problem

We have a mid-sized program for a simulation task that we need to optimize. We have already done our best optimizing the source to the limit of our programming skills, including profiling with Gprof and Valgrind.

When finally finished, we want to run the program on several systems, probably for some months. Therefore we are really interested in pushing the optimization to the limits.

All systems will run Debian/Linux on relatively new hardware (Intel i5 or i7).

The Question

What are possible optimization options using a recent version of g++, that go beyond -O3/-Ofast?

We are also interested in costly minor optimizations that will pay off in the long run.

What we use right now

Right now we use the following g++ optimization options:

  • -Ofast: Highest "standard" optimization level. The included -ffast-math did not cause any problems in our calculations, so we decided to go for it despite the non-compliance with the standard.
  • -march=native: Enabling the use of all CPU-specific instructions.
  • -flto to allow link-time optimization across different compilation units.

Answer by Pyves

Most of the answers suggest alternative solutions, such as different compilers or external libraries, which would most likely bring a lot of rewriting or integration work. I will try to stick to what the question is asking, and focus on what can be done with GCC alone, by activating compiler flags or making minimal changes to the code, as requested by the OP. This is not a "you must do this" answer, but more a collection of GCC tweaks that have worked out well for me and that you can give a try if they are relevant in your specific context.



Warnings regarding original question

Before going into the details, a few warnings regarding the question, typically for people who will come along, read the question and say "the OP is optimising beyond O3, I should use the same flags as he does!".

  • -march=native enables usage of instructions specific to a given CPU architecture that are not necessarily available on a different architecture. The program may not work at all if run on a system with a different CPU, or may be significantly slower (as this also enables mtune=native), so be aware of this if you decide to use it. More information here.
  • -Ofast, as you stated, enables some non-standard-compliant optimisations, so it should be used with caution as well. More information here.

Other GCC flags to try out

The details for the different flags are listed here.

  • -Ofast enables -ffast-math, which in turn enables -fno-math-errno, -funsafe-math-optimizations, -ffinite-math-only, -fno-rounding-math, -fno-signaling-nans and -fcx-limited-range. You can go even further on floating point calculation optimisations by selectively adding some extra flags such as -fno-signed-zeros, -fno-trapping-math and others. These are not included in -Ofast and can give some additional performance increases on calculations, but you must check whether they actually benefit you and don't break any calculations.
  • GCC also features a large amount of other optimisation flags which aren't enabled by any "-O" options. They are listed as "experimental options that may produce broken code", so again, they should be used with caution, and their effects checked both by testing for correctness and benchmarking. Nevertheless, I do often use -frename-registers; this option has never produced unwanted results for me and tends to give a noticeable performance increase (ie. can be measured when benchmarking). This is the type of flag that is very dependent on your processor though. -funroll-loops also sometimes gives good results (and also implies -frename-registers), but it is dependent on your actual code.

PGO

GCC has Profile-Guided Optimisation features. There isn't a lot of precise GCC documentation about it, but nevertheless getting it to run is quite straightforward.

  • first compile your program with -fprofile-generate.
  • let the program run (the execution time will be significantly slower as the code is also generating profile information into .gcda files).
  • recompile the program with -fprofile-use. If your application is multi-threaded, also add the -fprofile-correction flag.

PGO with GCC can give amazing results and really significantly boost performance (I've seen a 15-20% speed increase on one of the projects I was recently working on). Obviously the issue here is to have some data that is sufficiently representative of your application's execution, which is not always available or easy to obtain.

GCC's Parallel Mode

GCC features a Parallel Mode, which was first released around the time the GCC 4.2 compiler came out.

Basically, it provides you with parallel implementations of many of the algorithms in the C++ Standard Library. To enable them globally, you just have to add the -fopenmp and the -D_GLIBCXX_PARALLEL flags to the compiler. You can also selectively enable each algorithm when needed, but this will require some minor code changes.

All the information about this parallel mode can be found here.

If you frequently use these algorithms on large data structures, and have many hardware thread contexts available, these parallel implementations can give a huge performance boost. I have only made use of the parallel implementation of sort so far, but to give a rough idea, I managed to reduce the time for sorting from 14 to 4 seconds in one of my applications (testing environment: vector of 100 million objects with a custom comparator function, on an 8-core machine).

Extra tricks

Unlike the previous sections, this part does require some small changes in the code. They are also GCC specific (some of them work on Clang as well), so compile-time macros should be used to keep the code portable across other compilers. This section contains some more advanced techniques, and should not be used if you don't have some assembly-level understanding of what's going on. Also note that processors and compilers are pretty smart nowadays, so it may be tricky to get any noticeable benefit from the functions described here.

  • GCC builtins, which are listed here. Constructs such as __builtin_expect can help the compiler do better optimisations by providing it with branch prediction information. Other constructs such as __builtin_prefetch bring data into a cache before it is accessed and can help reduce cache misses.
  • function attributes, which are listed here. In particular, you should look into the hot and cold attributes; the former will indicate to the compiler that the function is a hotspot of the program, optimise it more aggressively and place it in a special subsection of the text section for better locality; the latter will optimise the function for size and place it in another special subsection of the text section.


I hope this answer will prove useful for some developers, and I will be glad to consider any edits or suggestions.

Answer by Mikael Persson

relatively new hardware (Intel i5 or i7)

Why not invest in a copy of the Intel compiler and its high-performance libraries? It can outperform GCC on optimizations by a significant margin, typically from 10% to 30% or even more, and even more so for heavy number-crunching programs. Intel also provides a number of extensions and libraries for high-performance number-crunching (parallel) applications, if that's something you can afford to integrate into your code. It might pay off big if it ends up saving you months of running time.

We have already done our best optimizing the source to the limit of our programming skills

In my experience, the kind of micro- and nano-optimizations that you typically do with the help of a profiler tend to have a poor return on time-investment compared to macro-optimizations (streamlining the structure of the code) and, most importantly and often overlooked, memory access optimizations (e.g., locality of reference, in-order traversal, minimizing indirection, weeding out cache misses, etc.). The latter usually involves designing the memory structures to better reflect the way the memory is used (traversed). Sometimes it can be as simple as switching a container type and getting a huge performance boost from that. Often, with profilers, you get lost in the details of instruction-by-instruction optimizations, while memory layout issues don't show up and are usually missed when you forget to look at the bigger picture. It's a much better way to invest your time, and the payoffs can be huge (e.g., many O(log N) algorithms end up performing almost as slow as O(N) just because of poor memory layouts; using a linked-list or linked-tree is a typical culprit of huge performance problems compared to a contiguous storage strategy).


Answer by Red XIII

If you can afford it, try VTune. It provides MUCH more info than simple sampling (provided by gprof, as far as I know). You might also give Code Analyst a try. The latter is a decent, free piece of software, but it might not work correctly (or at all) with Intel CPUs.

Being equipped with such a tool allows you to check various measures such as cache utilization (and, basically, memory layout), which, if used to its full extent, provides a huge boost to efficiency.

When you are sure that your algorithms and structures are optimal, then you should definitely use the multiple cores on the i5 and i7. In other words, play around with different parallel programming algorithms/patterns and see if you can get a speed-up.

When you have truly parallel data (array-like structures on which you perform similar/same operations) you should give OpenCL and SIMD instructions (easier to set up) a try.

Answer by zaufi

Huh, then the final thing you may try: the ACOVEA project: Analysis of Compiler Optimizations via an Evolutionary Algorithm. As is obvious from the description, it tries a genetic algorithm to pick the best compiler options for your project (doing the compilation maaany times and checking the timing, giving feedback to the algorithm :) -- but the results could be impressive! :)

Answer by user3708067

Some notes about the currently chosen answer (I do not have enough reputation points yet to post this as a comment):

The answer says:

-fassociative-math, -freciprocal-math, -fno-signed-zeros, and -fno-trapping-math. These are not included in -Ofast and can give some additional performance increases on calculations

-fassociative-math-freciprocal-math-fno-signed-zeros,和-fno-trapping-math。这些不包括在-Ofast计算中,并且可以提供一些额外的性能提升

Perhaps this was true when the answer was posted, but the GCC documentation says that all of these are enabled by -funsafe-math-optimizations, which is enabled by -ffast-math, which is enabled by -Ofast. This can be checked with the command gcc -c -Q -Ofast --help=optimizers, which shows which optimizations are enabled by -Ofast, and confirms that all of these are enabled.


The answer also says:

other optimisation flags which aren't enabled by any "-O" options... -frename-registers

Again, the above command shows that, at least with my GCC 5.4.0, -frename-registers is enabled by default with -Ofast.

Answer by Escualo

It is difficult to answer without further detail:

  • what type of number crunching?
  • what libraries are you using?
  • what degree of parallelization?

Can you write down the part of your code which takes the longest? (Typically a tight loop)

If you are CPU bound the answer will be different than if you are IO bound.

Again, please provide further detail.

Answer by uLoop

I would recommend taking a look at the type of operations that constitute the heavy lifting, and looking for an optimized library. There are quite a lot of fast, assembly-optimized, SIMD-vectorized libraries out there for common problems (mostly math). Reinventing the wheel is often tempting, but it is usually not worth the effort if an existing solution can cover your needs. Since you have not stated what sort of simulation it is, I can only provide some examples.

http://www.yeppp.info/

http://eigen.tuxfamily.org/index.php?title=Main_Page

https://github.com/xianyi/OpenBLAS

Answer by xTrameshmen

With gcc on Intel, turn off / implement -fno-gcse (works well on gfortran) and -fno-guess-branch-probability (default in gfortran).
