C++: Why does GCC generate 15-20% faster code if I optimize for size instead of speed?

Disclaimer: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/19470873/

Date: 2020-08-27 22:52:00 | Source: igfitidea

Why does GCC generate 15-20% faster code if I optimize for size instead of speed?

Tags: c++, performance, gcc, x86-64, compiler-optimization

Asked by Ali

I first noticed in 2009 that GCC (at least on my projects and on my machines) has the tendency to generate noticeably faster code if I optimize for size (-Os) instead of speed (-O2 or -O3), and I have been wondering ever since why.


I have managed to create (rather silly) code that shows this surprising behavior and is sufficiently small to be posted here.


const int LOOP_BOUND = 200000000;

__attribute__((noinline))
static int add(const int& x, const int& y) {
    return x + y;
}

__attribute__((noinline))
static int work(int xval, int yval) {
    int sum(0);
    for (int i=0; i<LOOP_BOUND; ++i) {
        int x(xval+sum);
        int y(yval+sum);
        int z = add(x, y);
        sum += z;
    }
    return sum;
}

int main(int , char* argv[]) {
    int result = work(*argv[1], *argv[2]);
    return result;
}

If I compile it with -Os, it takes 0.38 s to execute this program, and 0.44 s if it is compiled with -O2 or -O3. These times are obtained consistently and with practically no noise (gcc 4.7.2, x86_64 GNU/Linux, Intel Core i5-3320M).


(Update: I have moved all assembly code to GitHub: the listings made the post bloated and apparently added very little value to the question, as the -fno-align-* flags have the same effect.)


Here is the generated assembly with -Os and -O2.


Unfortunately, my understanding of assembly is very limited, so I have no idea whether what I did next was correct: I grabbed the assembly for -O2 and merged all its differences into the assembly for -Os, except the .p2align lines; the result is here. This code still runs in 0.38 s, and the only difference is the .p2align stuff.


If I guess correctly, these are paddings for stack alignment. According to Why does GCC pad functions with NOPs?, it is done in the hope that the code will run faster, but apparently this optimization backfired in my case.


Is it the padding that is the culprit in this case? Why and how?


The noise it makes pretty much makes timing micro-optimizations impossible.


How can I make sure that such accidental lucky / unlucky alignments are not interfering when I do micro-optimizations (unrelated to stack alignment) on C or C++ source code?




UPDATE:


Following Pascal Cuoq's answer, I tinkered a little bit with the alignments. By passing -O2 -fno-align-functions -fno-align-loops to gcc, all .p2align directives are gone from the assembly and the generated executable runs in 0.38 s. According to the gcc documentation:


-Os enables all -O2 optimizations [but] -Os disables the following optimization flags:

  -falign-functions  -falign-jumps  -falign-loops
  -falign-labels  -freorder-blocks  -freorder-blocks-and-partition
  -fprefetch-loop-arrays


So, it pretty much seems like a (mis)alignment issue.


I am still skeptical about -march=native, as suggested in Marat Dukhan's answer. I am not convinced that it isn't just interfering with this (mis)alignment issue; it has absolutely no effect on my machine. (Nevertheless, I upvoted his answer.)




UPDATE 2:


We can take -Os out of the picture. The following times are obtained by compiling with:


  • -O2 -fno-omit-frame-pointer: 0.37 s

  • -O2 -fno-align-functions -fno-align-loops: 0.37 s

  • -S -O2, then manually moving the assembly of add() after work(): 0.37 s

  • -O2: 0.44 s


It looks to me like the distance of add() from the call site matters a lot. I have tried perf, but the output of perf stat and perf report makes very little sense to me. However, I could only get one consistent result out of it:


-O2:


 602,312,864 stalled-cycles-frontend   #    0.00% frontend cycles idle
       3,318 cache-misses
 0.432703993 seconds time elapsed
 [...]
 81.23%  a.out  a.out              [.] work(int, int)
 18.50%  a.out  a.out              [.] add(int const&, int const&) [clone .isra.0]
 [...]
       |   __attribute__((noinline))
       |   static int add(const int& x, const int& y) {
       |       return x + y;
100.00 |     lea    (%rdi,%rsi,1),%eax
       |   }
       |   ? retq
[...]
       |            int z = add(x, y);
  1.93 |    ? callq  add(int const&, int const&) [clone .isra.0]
       |            sum += z;
 79.79 |      add    %eax,%ebx

For -fno-align-*:


 604,072,552 stalled-cycles-frontend   #    0.00% frontend cycles idle
       9,508 cache-misses
 0.375681928 seconds time elapsed
 [...]
 82.58%  a.out  a.out              [.] work(int, int)
 16.83%  a.out  a.out              [.] add(int const&, int const&) [clone .isra.0]
 [...]
       |   __attribute__((noinline))
       |   static int add(const int& x, const int& y) {
       |       return x + y;
 51.59 |     lea    (%rdi,%rsi,1),%eax
       |   }
[...]
       |    __attribute__((noinline))
       |    static int work(int xval, int yval) {
       |        int sum(0);
       |        for (int i=0; i<LOOP_BOUND; ++i) {
       |            int x(xval+sum);
  8.20 |      lea    0x0(%r13,%rbx,1),%edi
       |            int y(yval+sum);
       |            int z = add(x, y);
 35.34 |    ? callq  add(int const&, int const&) [clone .isra.0]
       |            sum += z;
 39.48 |      add    %eax,%ebx
       |    }

For -fno-omit-frame-pointer:


 404,625,639 stalled-cycles-frontend   #    0.00% frontend cycles idle
      10,514 cache-misses
 0.375445137 seconds time elapsed
 [...]
 75.35%  a.out  a.out              [.] add(int const&, int const&) [clone .isra.0]
 24.46%  a.out  a.out              [.] work(int, int)
 [...]
       |   __attribute__((noinline))
       |   static int add(const int& x, const int& y) {
 18.67 |     push   %rbp
       |       return x + y;
 18.49 |     lea    (%rdi,%rsi,1),%eax
       |   const int LOOP_BOUND = 200000000;
       |
       |   __attribute__((noinline))
       |   static int add(const int& x, const int& y) {
       |     mov    %rsp,%rbp
       |       return x + y;
       |   }
 12.71 |     pop    %rbp
       |   ? retq
 [...]
       |            int z = add(x, y);
       |    ? callq  add(int const&, int const&) [clone .isra.0]
       |            sum += z;
 29.83 |      add    %eax,%ebx

It looks like we are stalling on the call to add() in the slow case.


I have examined everything that perf -e can spit out on my machine, not just the stats that are given above.


For the same executable, the stalled-cycles-frontend shows linear correlation with the execution time; I did not notice anything else that would correlate so clearly. (Comparing stalled-cycles-frontend for different executables doesn't make sense to me.)


I included the cache misses as it came up as the first comment. I examined all the cache misses that can be measured on my machine by perf, not just the ones given above. The cache misses are very very noisy and show little to no correlation with the execution times.


Accepted answer by Ali

My colleague helped me find a plausible answer to my question. He noticed the importance of the 256 byte boundary. He is not registered here and encouraged me to post the answer myself (and take all the fame).




Short answer:


Is it the padding that is the culprit in this case? Why and how?


It all boils down to alignment. Alignment can have a significant impact on performance; that is why we have the -falign-* flags in the first place.


I have submitted a (bogus?) bug report to the gcc developers. It turns out that the default behavior is "we align loops to 8 bytes by default, but try to align them to 16 bytes if we don't need to fill in over 10 bytes." Apparently, this default is not the best choice in this particular case and on my machine. Clang 3.4 (trunk) with -O3 does the appropriate alignment, and the generated code does not show this weird behavior.


Of course, if an inappropriate alignment is done, it makes things worse. An unnecessary / bad alignment just eats up bytes for no reason and potentially increases cache misses, etc.


The noise it makes pretty much makes timing micro-optimizations impossible.

How can I make sure that such accidental lucky / unlucky alignments are not interfering when I do micro-optimizations (unrelated to stack alignment) on C or C++ source codes?


Simply by telling gcc to do the right alignment:


g++ -O2 -falign-functions=16 -falign-loops=16




Long answer:


The code will run slower if:


  • an XX-byte boundary cuts add() in the middle (XX being machine dependent).

  • if the call to add() has to jump over an XX-byte boundary and the target is not aligned.

  • if add() is not aligned.

  • if the loop is not aligned.


The first 2 are beautifully visible in the codes and results that Marat Dukhan kindly posted. In this case, gcc-4.8.1 -Os (executes in 0.994 secs):


00000000004004fd <_ZL3addRKiS0_.isra.0>:
  4004fd:       8d 04 37                lea    eax,[rdi+rsi*1]
  400500:       c3                      ret

a 256-byte boundary cuts add() right in the middle, and neither add() nor the loop is aligned. Surprise, surprise, this is the slowest case!


In the case of gcc-4.7.3 -Os (executes in 0.822 secs), the 256-byte boundary only cuts into a cold section (but neither the loop nor add() is cut):


00000000004004fa <_ZL3addRKiS0_.isra.0>:
  4004fa:       8d 04 37                lea    eax,[rdi+rsi*1]
  4004fd:       c3                      ret

[...]

  40051a:       e8 db ff ff ff          call   4004fa <_ZL3addRKiS0_.isra.0>

Nothing is aligned, and the call to add() has to jump over the 256-byte boundary. This code is the second slowest.


In the case of gcc-4.6.4 -Os (executes in 0.709 secs), although nothing is aligned, the call to add() doesn't have to jump over the 256-byte boundary and the target is exactly 32 bytes away:


  4004f2:       e8 db ff ff ff          call   4004d2 <_ZL3addRKiS0_.isra.0>
  4004f7:       01 c3                   add    ebx,eax
  4004f9:       ff cd                   dec    ebp
  4004fb:       75 ec                   jne    4004e9 <_ZL4workii+0x13>

This is the fastest of all three. Why the 256-byte boundary is special on his machine, I will leave it up to him to figure out. I don't have such a processor.


Now, on my machine I don't get this 256-byte boundary effect. Only the function and loop alignment kick in on my machine. If I pass g++ -O2 -falign-functions=16 -falign-loops=16, then everything is back to normal: I always get the fastest case and the time isn't sensitive to the -fno-omit-frame-pointer flag anymore. I can pass g++ -O2 -falign-functions=32 -falign-loops=32 or any multiple of 16; the code is not sensitive to that either.


I first noticed in 2009 that gcc (at least on my projects and on my machines) has the tendency to generate noticeably faster code if I optimize for size (-Os) instead of speed (-O2 or -O3), and I have been wondering ever since why.


A likely explanation is that I had hotspots which were sensitive to the alignment, just like the one in this example. By messing with the flags (passing -Os instead of -O2), those hotspots were aligned in a lucky way by accident and the code became faster. It had nothing to do with optimizing for size: it was by sheer accident that the hotspots got aligned better. From now on, I will check the effects of alignment on my projects.


Oh, and one more thing. How can such hotspots arise, like the one shown in the example? How can the inlining of such a tiny function like add() fail?


Consider this:


// add.cpp
int add(const int& x, const int& y) {
    return x + y;
}

and in a separate file:


// main.cpp
int add(const int& x, const int& y);

const int LOOP_BOUND = 200000000;

__attribute__((noinline))
static int work(int xval, int yval) {
    int sum(0);
    for (int i=0; i<LOOP_BOUND; ++i) {
        int x(xval+sum);
        int y(yval+sum);
        int z = add(x, y);
        sum += z;
    }
    return sum;
}

int main(int , char* argv[]) {
    int result = work(*argv[1], *argv[2]);
    return result;
}

and compiled as: g++ -O2 add.cpp main.cpp.


      gcc won't inline add()!


That's all; it's that easy to unintentionally create hotspots like the one in the OP. Of course, it is partly my fault: gcc is an excellent compiler. If I compile the above as g++ -O2 -flto add.cpp main.cpp, that is, if I perform link-time optimization, the code runs in 0.19 s!


(Inlining is artificially disabled in the OP; hence, the code in the OP was 2x slower.)


Answer by Marat Dukhan

By default, compilers optimize for an "average" processor. Since different processors favor different instruction sequences, compiler optimizations enabled by -O2 might benefit the average processor but decrease performance on your particular processor (and the same applies to -Os). If you try the same example on different processors, you will find that some of them benefit from -O2 while others are more favorable to -Os optimizations.


Here are the results for time ./test 0 0 on several processors (user time reported):


Processor (System-on-Chip)             Compiler   Time (-O2)  Time (-Os)  Fastest
AMD Opteron 8350                       gcc-4.8.1    0.704s      0.896s      -O2
AMD FX-6300                            gcc-4.8.1    0.392s      0.340s      -Os
AMD E2-1800                            gcc-4.7.2    0.740s      0.832s      -O2
Intel Xeon E5405                       gcc-4.8.1    0.603s      0.804s      -O2
Intel Xeon E5-2603                     gcc-4.4.7    1.121s      1.122s       -
Intel Core i3-3217U                    gcc-4.6.4    0.709s      0.709s       -
Intel Core i3-3217U                    gcc-4.7.3    0.708s      0.822s      -O2
Intel Core i3-3217U                    gcc-4.8.1    0.708s      0.944s      -O2
Intel Core i7-4770K                    gcc-4.8.1    0.296s      0.288s      -Os
Intel Atom 330                         gcc-4.8.1    2.003s      2.007s      -O2
ARM 1176JZF-S (Broadcom BCM2835)       gcc-4.6.3    3.470s      3.480s      -O2
ARM Cortex-A8 (TI OMAP DM3730)         gcc-4.6.3    2.727s      2.727s       -
ARM Cortex-A9 (TI OMAP 4460)           gcc-4.6.3    1.648s      1.648s       -
ARM Cortex-A9 (Samsung Exynos 4412)    gcc-4.6.3    1.250s      1.250s       -
ARM Cortex-A15 (Samsung Exynos 5250)   gcc-4.7.2    0.700s      0.700s       -
Qualcomm Snapdragon APQ8060A           gcc-4.8       1.53s       1.52s      -Os

In some cases you can alleviate the effect of disadvantageous optimizations by asking gcc to optimize for your particular processor (using the options -mtune=native or -march=native):


Processor            Compiler   Time (-O2 -mtune=native) Time (-Os -mtune=native)
AMD FX-6300          gcc-4.8.1         0.340s                   0.340s
AMD E2-1800          gcc-4.7.2         0.740s                   0.832s
Intel Xeon E5405     gcc-4.8.1         0.603s                   0.803s
Intel Core i7-4770K  gcc-4.8.1         0.296s                   0.288s

Update: on an Ivy Bridge-based Core i3, three versions of gcc (4.6.4, 4.7.3, and 4.8.1) produce binaries with significantly different performance, but the assembly code has only subtle variations. So far, I have no explanation for this fact.


Assembly from gcc-4.6.4 -Os (executes in 0.709 secs):


00000000004004d2 <_ZL3addRKiS0_.isra.0>:
  4004d2:       8d 04 37                lea    eax,[rdi+rsi*1]
  4004d5:       c3                      ret

00000000004004d6 <_ZL4workii>:
  4004d6:       41 55                   push   r13
  4004d8:       41 89 fd                mov    r13d,edi
  4004db:       41 54                   push   r12
  4004dd:       41 89 f4                mov    r12d,esi
  4004e0:       55                      push   rbp
  4004e1:       bd 00 c2 eb 0b          mov    ebp,0xbebc200
  4004e6:       53                      push   rbx
  4004e7:       31 db                   xor    ebx,ebx
  4004e9:       41 8d 34 1c             lea    esi,[r12+rbx*1]
  4004ed:       41 8d 7c 1d 00          lea    edi,[r13+rbx*1+0x0]
  4004f2:       e8 db ff ff ff          call   4004d2 <_ZL3addRKiS0_.isra.0>
  4004f7:       01 c3                   add    ebx,eax
  4004f9:       ff cd                   dec    ebp
  4004fb:       75 ec                   jne    4004e9 <_ZL4workii+0x13>
  4004fd:       89 d8                   mov    eax,ebx
  4004ff:       5b                      pop    rbx
  400500:       5d                      pop    rbp
  400501:       41 5c                   pop    r12
  400503:       41 5d                   pop    r13
  400505:       c3                      ret

Assembly from gcc-4.7.3 -Os (executes in 0.822 secs):


00000000004004fa <_ZL3addRKiS0_.isra.0>:
  4004fa:       8d 04 37                lea    eax,[rdi+rsi*1]
  4004fd:       c3                      ret

00000000004004fe <_ZL4workii>:
  4004fe:       41 55                   push   r13
  400500:       41 89 f5                mov    r13d,esi
  400503:       41 54                   push   r12
  400505:       41 89 fc                mov    r12d,edi
  400508:       55                      push   rbp
  400509:       bd 00 c2 eb 0b          mov    ebp,0xbebc200
  40050e:       53                      push   rbx
  40050f:       31 db                   xor    ebx,ebx
  400511:       41 8d 74 1d 00          lea    esi,[r13+rbx*1+0x0]
  400516:       41 8d 3c 1c             lea    edi,[r12+rbx*1]
  40051a:       e8 db ff ff ff          call   4004fa <_ZL3addRKiS0_.isra.0>
  40051f:       01 c3                   add    ebx,eax
  400521:       ff cd                   dec    ebp
  400523:       75 ec                   jne    400511 <_ZL4workii+0x13>
  400525:       89 d8                   mov    eax,ebx
  400527:       5b                      pop    rbx
  400528:       5d                      pop    rbp
  400529:       41 5c                   pop    r12
  40052b:       41 5d                   pop    r13
  40052d:       c3                      ret

Assembly from gcc-4.8.1 -Os (executes in 0.994 secs):


00000000004004fd <_ZL3addRKiS0_.isra.0>:
  4004fd:       8d 04 37                lea    eax,[rdi+rsi*1]
  400500:       c3                      ret

0000000000400501 <_ZL4workii>:
  400501:       41 55                   push   r13
  400503:       41 89 f5                mov    r13d,esi
  400506:       41 54                   push   r12
  400508:       41 89 fc                mov    r12d,edi
  40050b:       55                      push   rbp
  40050c:       bd 00 c2 eb 0b          mov    ebp,0xbebc200
  400511:       53                      push   rbx
  400512:       31 db                   xor    ebx,ebx
  400514:       41 8d 74 1d 00          lea    esi,[r13+rbx*1+0x0]
  400519:       41 8d 3c 1c             lea    edi,[r12+rbx*1]
  40051d:       e8 db ff ff ff          call   4004fd <_ZL3addRKiS0_.isra.0>
  400522:       01 c3                   add    ebx,eax
  400524:       ff cd                   dec    ebp
  400526:       75 ec                   jne    400514 <_ZL4workii+0x13>
  400528:       89 d8                   mov    eax,ebx
  40052a:       5b                      pop    rbx
  40052b:       5d                      pop    rbp
  40052c:       41 5c                   pop    r12
  40052e:       41 5d                   pop    r13
  400530:       c3                      ret

Answer by Gene

I'm adding this post-accept to point out that the effects of alignment on the overall performance of programs, including big ones, have been studied. For example, this article (and I believe a version of this also appeared in CACM) shows how changes in link order and OS environment size alone were sufficient to shift performance significantly. They attribute this to the alignment of "hot loops".


This paper, titled "Producing wrong data without doing anything obviously wrong!", says that inadvertent experimental bias due to nearly uncontrollable differences in program running environments probably renders many benchmark results meaningless.


I think you're encountering a different angle on the same observation.


For performance-critical code, this is a pretty good argument for systems that assess the environment at installation or run time and choose the local best among differently optimized versions of key routines.


Answer by Pascal Cuoq

I think that you can obtain the same result as what you did:


I grabbed the assembly for -O2 and merged all its differences into the assembly for -Os except the .p2align lines:


… by using -O2 -falign-functions=1 -falign-jumps=1 -falign-loops=1 -falign-labels=1. I have been compiling everything with these options, which were faster than plain -O2 every time I bothered to measure, for 15 years.


Also, for a completely different context (including a different compiler), I noticed that the situation is similar: the option that is supposed to “optimize code size rather than speed” optimizes for code size and speed.


If I guess correctly, these are paddings for stack alignment.


No, this has nothing to do with the stack; the NOPs that are generated by default, and that the options -falign-*=1 prevent, are for code alignment.


According to Why does GCC pad functions with NOPs? it is done in the hope that the code will run faster but apparently this optimization backfired in my case.

Is it the padding that is the culprit in this case? Why and how?


It is very likely that the padding is the culprit. The reason padding is felt to be necessary, and is useful in some cases, is that code is typically fetched in lines of 16 bytes (see Agner Fog's optimization resources for the details, which vary by model of processor). Aligning a function, loop, or label on a 16-byte boundary means that the chances are statistically increased that one fewer line will be necessary to contain the function or loop. Obviously, it backfires because these NOPs reduce code density and therefore cache efficiency. In the case of loops and labels, the NOPs may even need to be executed once (when execution arrives at the loop/label normally, as opposed to from a jump).


Answer by Joshua

If your program is bound by the L1 code cache, then optimizing for size suddenly starts to pay off.


When I last checked, the compiler was not smart enough to figure this out in all cases.


In your case, -O3 probably generates code that needs two cache lines, but the -Os code fits in one cache line.


Answer by Daniel Frey

I'm by no means an expert in this area, but I seem to remember that modern processors are quite sensitive when it comes to branch prediction. The algorithms used to predict branches are (or at least were, back in the days when I wrote assembler code) based on several properties of the code, including the distance of a target and the direction.


The scenario which comes to mind is small loops. When a branch goes backwards and the distance is not too far, branch prediction optimizes for this case, as all small loops are done this way. The same rules might come into play when you swap the locations of add and work in the generated code, or when the position of both slightly changes.


That said, I have no idea how to verify that and I just wanted to let you know that this might be something you want to look into.
