Notice: this page reproduces a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/24196076/

Time: 2020-09-02 11:10:00  Source: igfitidea

Is GCC loop unrolling flag really effective?

Tags: c, gcc, gcc4.8

Asked by AndreaF

In C, I have a task where I must do multiplication, inversion, transposition, addition, etc. with huge matrices allocated as 2-dimensional arrays (arrays of arrays).

I have found the gcc flag -funroll-all-loops. If I understand correctly, this will unroll all loops automatically without any effort by the programmer.

My questions:

a) Does gcc include this kind of optimization with the various optimization flags such as -O1, -O2, etc.?

b) Do I have to use any pragmas inside my code to take advantage of loop unrolling, or are loops identified automatically?

c) Why is this option not enabled by default if unrolling increases performance?

d) What are the recommended gcc optimization flags to compile the program in the best way possible? (I must run this program optimized for a single CPU family, the same as that of the machine where I compile the code; currently I use the -march=native and -O2 flags.)

EDIT

There seems to be some controversy about unrolling, which in some cases may slow performance down. In my situation there are various functions that simply do math operations in 2 nested for loops iterating over matrix elements, for a huge number of elements. In this scenario, how could unrolling slow down or increase the performance?

Answered by etheranger

Why unroll loops?

Modern processors pipeline instructions. They like knowing what's coming next and make all sorts of fancy optimisations based on assumptions of which order the instructions should be executed.

At the end of a loop though, there are two possibilities! Either you go back to the top, or continue on. The processor makes an educated guess on which is going to happen. If it gets it right, everything is good. If not, it has to flush the pipeline and stall for a bit while it prepares for taking the other branch.

As you can imagine, unrolling a loop eliminates branches and the potential for those stalls, especially in cases where the odds are against a guess.

Imagine a loop of code that executes 3 times, then continues. If you assume (as the processor probably would) that at the end you'll repeat the loop, then 2/3 of the time you'll be correct! 1/3 of the time, though, you'll stall.

On the other hand, imagine the same situation, but the code loops 3000 times. Here, the guess is wrong only 1/3000 of the time, so unrolling probably gains very little.
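To make the idea concrete, here is a minimal sketch of what unrolling means at the source level. The function names are hypothetical and are not from the original answer; a compiler doing the transformation itself may produce something quite different. The unrolled version takes the loop branch once per four elements instead of once per element, and needs a second loop for the leftovers.

```c
#include <stddef.h>

/* Rolled version: one addition and one loop branch per element. */
double sum_rolled(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Hand-unrolled by 4: one loop branch per four elements.
 * The second loop handles the n % 4 leftover elements. */
double sum_unrolled4(const double *a, size_t n)
{
    double s = 0.0;
    size_t i = 0;
    for (; i + 3 < n; i += 4) {
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
    }
    for (; i < n; i++)
        s += a[i];
    return s;
}
```

Both functions add the elements in the same order, so they return the same value; only the number of branches and the code size differ.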

Why not unroll loops?

Part of the processor fanciness mentioned above involves loading the instructions from the executable in memory into the processor's onboard instruction cache (shortened to I-cache). This holds a limited amount of instructions which can be accessed quickly, but may stall when new instructions need to be loaded from memory.

Let's go back to the previous examples. Assume a reasonably small amount of code inside the loop takes up n bytes of I-cache. If we unroll the loop, it now takes up n * 3 bytes. A bit more, but it'll probably fit in a single cache line just fine, so your cache will be working optimally and won't need to stall reading from main memory.

The 3000-iteration loop, however, unrolls to use a whopping n * 3000 bytes of I-cache. That's going to require several reads from memory, and will probably push some other useful stuff from elsewhere in the program out of the I-cache.

So what do I do?

As you can see, unrolling provides more benefits for shorter loops but ends up trashing performance if you're intending to loop a large number of times.

Usually, a smart compiler will take a decent guess about which loops to unroll, but you can force it if you're sure you know better. How do you get to know better? The only way is to try it both ways and compare timings!

"Premature optimization is the root of all evil" -- Donald Knuth

Profile first, optimise later.

Answered by Guido

Loop unrolling works best when the compiler can determine the number of iterations of the loop at compile time, or at least upon entry to the loop. According to the GCC documentation, -funroll-loops only unrolls such loops; a loop whose iteration count is still unknown when it is entered is left alone unless you pass -funroll-all-loops. So a variable matrix size does not by itself rule out unrolling, as long as the bound is fixed by the time the loop starts.
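As a rough illustration of the kinds of loops involved, the sketch below puts three cases side by side. All names (COLS, row_sum_fixed and so on) are made up for this example; whether GCC actually unrolls any of them depends on the flags, the version and its heuristics.

```c
#include <stddef.h>

#define COLS 4  /* hypothetical fixed row length */

/* Trip count is a compile-time constant. */
double row_sum_fixed(const double *row)
{
    double s = 0.0;
    for (size_t j = 0; j < COLS; j++)
        s += row[j];
    return s;
}

/* Trip count n is unknown at compile time but fixed on loop entry. */
double row_sum_runtime(const double *row, size_t n)
{
    double s = 0.0;
    for (size_t j = 0; j < n; j++)
        s += row[j];
    return s;
}

/* Trip count is not known even on entry: the exit depends on the data. */
size_t length_until_zero(const double *v)
{
    size_t n = 0;
    while (v[n] != 0.0)
        n++;
    return n;
}
```

Comparing the assembly from gcc -S with and without the unrolling flags is one way to see what the compiler really did with each loop.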

Now to answer your questions:

a) Does gcc include this kind of optimization with the various optimization flags as -O1, -O2 etc.?

Nope, you have to explicitly set it since it may or may not make the code run faster and it usually makes the executable bigger.

b) Do I have to use any pragmas inside my code to take advantage of loop unrolling or are loops identified automatically?

No pragmas. With -funroll-loops the compiler heuristically decides which loops to unroll. If you want to force unrolling you can use -funroll-all-loops, but it usually makes the code run slower.
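Pragmas are not required, but newer compilers do offer one if you want to request unrolling for a single loop instead of the whole program. The sketch below uses #pragma GCC unroll, which needs GCC 8 or later, so it is not available in the GCC 4.8 this question is tagged with. The function name scale is hypothetical.

```c
#include <stddef.h>

/* Scale every element of v by k.
 * "#pragma GCC unroll 4" asks GCC (version 8 or later) to unroll the loop
 * that follows 4 times; older GCC versions ignore it, possibly with an
 * unknown-pragma warning. */
void scale(double *v, size_t n, double k)
{
#pragma GCC unroll 4
    for (size_t i = 0; i < n; i++)
        v[i] *= k;
}
```

The pragma only changes how the loop is compiled, not what it computes, so it is safe to add and remove while comparing timings.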

c) Why is this option not enabled by default if the unrolling increases the performance?

It doesn't always increase performance! Also, not everything is about performance. Some people actually care about having small executables since they have little memory (see: embedded systems).

d) What are the recommended gcc optimization flags to compile the program in the best way possible? (I must run this program optimized for a single CPU family, the same as that of the machine where I compile the code; currently I use the -march=native and -O2 flags.)

There's no silver bullet. You'll need to think, test and see. There is actually a theorem (the full employment theorem for compiler writers) that states that no perfect optimizing compiler can ever exist.

Did you profile your program? Profiling is a very useful skill for these things.

Source (mostly): https://gcc.gnu.org/onlinedocs/gcc-3.4.4/gcc/Optimize-Options.html

Answered by Ruslan Gerasimov

You are getting the theoretical background on the issue, but that leaves plenty of room to guess what you will get in a real run. The option does not always increase performance because the outcome depends on a variety of factors, for instance the loop implementation, its workload/body, and others.

Every piece of code is different, and if you are interested in finding the best-performing solution, it is a good idea simply to run both variants, measure their execution times and compare.

Look at this approach in the answer below to get an idea of how to measure time. In short, you just wrap your code in a loop so that the program runs for several seconds. Since you are optimizing the loops themselves, it is a good idea to write a shell script that runs your app many times.
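One possible way to wrap code in a repeat loop for timing is sketched below, using clock() from the standard library. The kernel add_matrices and the helper time_kernel are hypothetical stand-ins for your own code, and the matrices are assumed to be stored as flat row-major arrays for brevity.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical kernel under test: add matrix b into a (both rows x cols,
 * stored row-major in flat arrays). */
void add_matrices(double *a, const double *b, size_t rows, size_t cols)
{
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            a[i * cols + j] += b[i * cols + j];
}

/* Run the kernel `repeats` times and return the CPU time used, in seconds. */
double time_kernel(double *a, const double *b, size_t rows, size_t cols,
                   int repeats)
{
    clock_t start = clock();
    for (int r = 0; r < repeats; r++)
        add_matrices(a, b, rows, cols);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Build the program twice, for example with -O2 -march=native and then with -funroll-loops added, pick a repeat count large enough that each run lasts several seconds, and compare the reported times over several runs.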