C++: Why does changing 0.1f to 0 slow down performance by 10x?

Disclaimer: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license, cite the original URL, and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/9314534/

Why does changing 0.1f to 0 slow down performance by 10x?

c++, performance, visual-studio-2010, compilation, floating-point

Asked by Dragarro

Why does this bit of code,

const float x[16] = {  1.1,   1.2,   1.3,     1.4,   1.5,   1.6,   1.7,   1.8,
                       1.9,   2.0,   2.1,     2.2,   2.3,   2.4,   2.5,   2.6};
const float z[16] = {1.123, 1.234, 1.345, 156.467, 1.578, 1.689, 1.790, 1.812,
                     1.923, 2.034, 2.145,   2.256, 2.367, 2.478, 2.589, 2.690};
float y[16];
for (int i = 0; i < 16; i++)
{
    y[i] = x[i];
}

for (int j = 0; j < 9000000; j++)
{
    for (int i = 0; i < 16; i++)
    {
        y[i] *= x[i];
        y[i] /= z[i];
        y[i] = y[i] + 0.1f; // <--
        y[i] = y[i] - 0.1f; // <--
    }
}

run more than 10 times faster than the following bit (identical except where noted)?

const float x[16] = {  1.1,   1.2,   1.3,     1.4,   1.5,   1.6,   1.7,   1.8,
                       1.9,   2.0,   2.1,     2.2,   2.3,   2.4,   2.5,   2.6};
const float z[16] = {1.123, 1.234, 1.345, 156.467, 1.578, 1.689, 1.790, 1.812,
                     1.923, 2.034, 2.145,   2.256, 2.367, 2.478, 2.589, 2.690};
float y[16];
for (int i = 0; i < 16; i++)
{
    y[i] = x[i];
}

for (int j = 0; j < 9000000; j++)
{
    for (int i = 0; i < 16; i++)
    {
        y[i] *= x[i];
        y[i] /= z[i];
        y[i] = y[i] + 0; // <--
        y[i] = y[i] - 0; // <--
    }
}

when compiling with Visual Studio 2010 SP1. The optimization level was -O2 with SSE2 enabled. I haven't tested with other compilers.

Answered by Mysticial

Welcome to the world of denormalized floating-point! They can wreak havoc on performance!!!

Denormal (or subnormal) numbers are kind of a hack to get some extra values very close to zero out of the floating-point representation. Operations on denormalized floating-point can be tens to hundreds of times slower than on normalized floating-point. This is because many processors can't handle them directly and must trap and resolve them using microcode.
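
As a quick illustration, here is a minimal sketch of where the subnormal range sits, using only the standard library (the commented values assume IEEE-754 single precision):

#include <cmath>
#include <iostream>
#include <limits>

int main() {
    // Smallest normalized float vs. smallest subnormal float (IEEE-754 single precision).
    std::cout << std::numeric_limits<float>::min() << "\n";         // FLT_MIN, about 1.17549e-38
    std::cout << std::numeric_limits<float>::denorm_min() << "\n";  // about 1.4013e-45

    // Anything strictly between zero and FLT_MIN can only be represented as a subnormal.
    float f = std::numeric_limits<float>::min() / 2.0f;
    std::cout << (std::fpclassify(f) == FP_SUBNORMAL) << "\n";      // prints 1
}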

If you print out the numbers after 10,000 iterations, you will see that they have converged to different values depending on whether 0 or 0.1 is used.

Here's the test code compiled on x64:

#include <iostream>
#include <cstdlib>   // system("pause")
#include <omp.h>     // omp_get_wtime(); build with OpenMP enabled (e.g. /openmp or -fopenmp)
using namespace std;

int main() {

    double start = omp_get_wtime();

    const float x[16]={1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2.0,2.1,2.2,2.3,2.4,2.5,2.6};
    const float z[16]={1.123,1.234,1.345,156.467,1.578,1.689,1.790,1.812,1.923,2.034,2.145,2.256,2.367,2.478,2.589,2.690};
    float y[16];
    for(int i=0;i<16;i++)
    {
        y[i]=x[i];
    }
    for(int j=0;j<9000000;j++)
    {
        for(int i=0;i<16;i++)
        {
            y[i]*=x[i];
            y[i]/=z[i];
#ifdef FLOATING
            y[i]=y[i]+0.1f;
            y[i]=y[i]-0.1f;
#else
            y[i]=y[i]+0;
            y[i]=y[i]-0;
#endif

            if (j > 10000)
                cout << y[i] << "  ";
        }
        if (j > 10000)
            cout << endl;
    }

    double end = omp_get_wtime();
    cout << end - start << endl;

    system("pause");
    return 0;
}

Output:

#define FLOATING
1.78814e-007  1.3411e-007  1.04308e-007  0  7.45058e-008  6.70552e-008  6.70552e-008  5.58794e-007  3.05474e-007  2.16067e-007  1.71363e-007  1.49012e-007  1.2666e-007  1.11759e-007  1.04308e-007  1.04308e-007
1.78814e-007  1.3411e-007  1.04308e-007  0  7.45058e-008  6.70552e-008  6.70552e-008  5.58794e-007  3.05474e-007  2.16067e-007  1.71363e-007  1.49012e-007  1.2666e-007  1.11759e-007  1.04308e-007  1.04308e-007

//#define FLOATING
6.30584e-044  3.92364e-044  3.08286e-044  0  1.82169e-044  1.54143e-044  2.10195e-044  2.46842e-029  7.56701e-044  4.06377e-044  3.92364e-044  3.22299e-044  3.08286e-044  2.66247e-044  2.66247e-044  2.24208e-044
6.30584e-044  3.92364e-044  3.08286e-044  0  1.82169e-044  1.54143e-044  2.10195e-044  2.45208e-029  7.56701e-044  4.06377e-044  3.92364e-044  3.22299e-044  3.08286e-044  2.66247e-044  2.66247e-044  2.24208e-044

Note how in the second run the numbers are very close to zero.

Denormalized numbers are generally rare and thus most processors don't try to handle them efficiently.

To demonstrate that this has everything to do with denormalized numbers, if we flush denormals to zero by adding this to the start of the code:

_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);

Then the version with 0 is no longer 10x slower and actually becomes faster. (This requires that the code be compiled with SSE enabled.)

This means that rather than using these weird lower-precision almost-zero values, we just round to zero instead.
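
A minimal sketch of that flush-to-zero effect on a single operation (this assumes an x86 build where float arithmetic goes through SSE, which is the default on x64):

#include <iostream>
#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE / _MM_FLUSH_ZERO_ON

int main() {
    // volatile keeps the multiplications from being folded at compile time
    volatile float tiny = 2e-38f;     // a small but still normal float (FLT_MIN is ~1.17549e-38)
    volatile float quarter = 0.25f;

    float normal_mode = tiny * quarter;            // result (~5e-39) is subnormal

    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);    // FTZ: subnormal results become +0
    float ftz_mode = tiny * quarter;

    std::cout << normal_mode << "\n";   // a tiny non-zero value
    std::cout << ftz_mode << "\n";      // 0
}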

Timings: Core i7 920 @ 3.5 GHz:

//  Don't flush denormals to zero.
0.1f: 0.564067
0   : 26.7669

//  Flush denormals to zero.
0.1f: 0.587117
0   : 0.341406

In the end, this really has nothing to do with whether it's an integer or floating-point. The 0 or 0.1f is converted/stored into a register outside of both loops. So that has no effect on performance.

Answered by mvds

Using gcc and applying a diff to the generated assembly yields only this difference:

73c68,69
<   movss   LCPI1_0(%rip), %xmm1
---
>   movabsq $0, %rcx
>   cvtsi2ssq   %rcx, %xmm1
81d76
<   subss   %xmm1, %xmm0

The cvtsi2ssq one is indeed 10 times slower.

Apparently, the float version uses an XMM register loaded from memory, while the int version converts a real int value 0 to float using the cvtsi2ssq instruction, which takes a lot of time. Passing -O3 to gcc doesn't help. (gcc version 4.2.1.)

(Using double instead of float doesn't matter, except that it changes the cvtsi2ssq into a cvtsi2sdq.)

Update

Some extra tests show that it is not necessarily the cvtsi2ssq instruction. Once it is eliminated (using int ai=0; float a=ai; and using a instead of 0), the speed difference remains. So @Mysticial is right: the denormalized floats make the difference. This can be seen by testing values between 0 and 0.1f. The turning point in the above code is approximately at 0.00000000000000000000000000000001, when the loop suddenly takes 10 times as long.

Update<<1

A small visualisation of this interesting phenomenon:

  • Column 1: a float, divided by 2 for every iteration
  • Column 2: the binary representation of this float
  • Column 3: the time taken to sum this float 1e7 times

0.000000000000000000000000000000000100000004670110: 10111100001101110010000011100000 45 ms
0.000000000000000000000000000000000050000002335055: 10111100001101110010000101100000 43 ms
0.000000000000000000000000000000000025000001167528: 10111100001101110010000001100000 43 ms
0.000000000000000000000000000000000012500000583764: 10111100001101110010000110100000 42 ms
0.000000000000000000000000000000000006250000291882: 10111100001101110010000010100000 48 ms
0.000000000000000000000000000000000003125000145941: 10111100001101110010000100100000 43 ms
0.000000000000000000000000000000000001562500072970: 10111100001101110010000000100000 42 ms
0.000000000000000000000000000000000000781250036485: 10111100001101110010000111000000 42 ms
0.000000000000000000000000000000000000390625018243: 10111100001101110010000011000000 42 ms
0.000000000000000000000000000000000000195312509121: 10111100001101110010000101000000 43 ms
0.000000000000000000000000000000000000097656254561: 10111100001101110010000001000000 42 ms
0.000000000000000000000000000000000000048828127280: 10111100001101110010000110000000 44 ms
0.000000000000000000000000000000000000024414063640: 10111100001101110010000010000000 42 ms
0.000000000000000000000000000000000000012207031820: 10111100001101110010000100000000 42 ms
0.000000000000000000000000000000000000006103515209: 01111000011011100100001000000000 789 ms
0.000000000000000000000000000000000000003051757605: 11110000110111001000010000000000 788 ms
0.000000000000000000000000000000000000001525879503: 00010001101110010000100000000000 788 ms
0.000000000000000000000000000000000000000762939751: 00100011011100100001000000000000 795 ms
0.000000000000000000000000000000000000000381469876: 01000110111001000010000000000000 896 ms
0.000000000000000000000000000000000000000190734938: 10001101110010000100000000000000 813 ms
0.000000000000000000000000000000000000000095366768: 00011011100100001000000000000000 798 ms
0.000000000000000000000000000000000000000047683384: 00110111001000010000000000000000 791 ms
0.000000000000000000000000000000000000000023841692: 01101110010000100000000000000000 802 ms
0.000000000000000000000000000000000000000011920846: 11011100100001000000000000000000 809 ms
0.000000000000000000000000000000000000000005961124: 01111001000010000000000000000000 795 ms
0.000000000000000000000000000000000000000002980562: 11110010000100000000000000000000 835 ms
0.000000000000000000000000000000000000000001490982: 00010100001000000000000000000000 864 ms
0.000000000000000000000000000000000000000000745491: 00101000010000000000000000000000 915 ms
0.000000000000000000000000000000000000000000372745: 01010000100000000000000000000000 918 ms
0.000000000000000000000000000000000000000000186373: 10100001000000000000000000000000 881 ms
0.000000000000000000000000000000000000000000092486: 01000010000000000000000000000000 857 ms
0.000000000000000000000000000000000000000000046243: 10000100000000000000000000000000 861 ms
0.000000000000000000000000000000000000000000022421: 00001000000000000000000000000000 855 ms
0.000000000000000000000000000000000000000000011210: 00010000000000000000000000000000 887 ms
0.000000000000000000000000000000000000000000005605: 00100000000000000000000000000000 799 ms
0.000000000000000000000000000000000000000000002803: 01000000000000000000000000000000 828 ms
0.000000000000000000000000000000000000000000001401: 10000000000000000000000000000000 815 ms
0.000000000000000000000000000000000000000000000000: 00000000000000000000000000000000 42 ms
0.000000000000000000000000000000000000000000000000: 00000000000000000000000000000000 42 ms
0.000000000000000000000000000000000000000000000000: 00000000000000000000000000000000 44 ms

You can clearly see the exponent (the last 9 bits) change to its lowest value, when denormalization sets in. At that point, simple addition becomes 20 times slower.
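
A rough sketch of how such a table could be reproduced (it uses std::chrono for timing and prints the bit pattern most-significant bit first, so the bit strings are ordered differently from the listing above):

#include <bitset>
#include <chrono>
#include <cstdint>
#include <cstring>
#include <iostream>

int main() {
    volatile float sum = 0.0f;   // volatile so the timed loop isn't optimized away
    float v = 0.1f;

    for (int step = 0; step < 40; ++step) {
        v /= 2.0f;                                // column 1: halve the value every iteration

        std::uint32_t bits;
        std::memcpy(&bits, &v, sizeof bits);      // column 2: the float's raw bit pattern

        auto t0 = std::chrono::steady_clock::now();
        sum = 0.0f;
        for (int i = 0; i < 10000000; ++i)        // column 3: time 1e7 additions of this value
            sum = sum + v;
        auto t1 = std::chrono::steady_clock::now();

        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        std::cout << v << ": " << std::bitset<32>(bits) << " " << ms << " ms\n";
    }
}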

An equivalent discussion about ARM can be found in Stack Overflow question Denormalized floating point in Objective-C?.

Answered by fig

It's due to denormalized floating-point use. How to get rid of it and the performance penalty? Having scoured the Internet for ways of killing denormal numbers, it seems there is no "best" way to do this yet. I have found these methods that may work best in different environments:

  • Might not work in some GCC environments:

    // Requires #include <xmmintrin.h>
    _mm_setcsr( _mm_getcsr() | (1<<15) | (1<<6) );
    // Does both FTZ and DAZ bits. You can also use just hex value 0x8040 to do both.
    // You might also want to use the underflow mask (1<<11)
    
  • Might not work in some Visual Studio environments: 1

    // Requires #include <xmmintrin.h>
    // Requires #include <pmmintrin.h>
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
    
  • Appears to work in both GCC and Visual Studio:

    // Requires #include <fenv.h>
    fesetenv(FE_DFL_DISABLE_SSE_DENORMS_ENV);
    
  • The Intel compiler has options to disable denormals by default on modern Intel CPUs. More details here

  • Compiler switches. -ffast-math, -msse or -mfpmath=sse will disable denormals and make a few other things faster, but unfortunately also do lots of other approximations that might break your code. Test carefully! The equivalent of fast-math for the Visual Studio compiler is /fp:fast but I haven't been able to confirm whether this also disables denormals.1

Answered by German Garcia

In gcc you can enable FTZ and DAZ with this:

#include <xmmintrin.h>

#define FTZ 1
#define DAZ 1

void enableFtzDaz()
{
    int mxcsr = _mm_getcsr();

    if (FTZ) {
        mxcsr |= (1<<15) | (1<<11);
    }

    if (DAZ) {
        mxcsr |= (1<<6);
    }

    _mm_setcsr(mxcsr);
}

Also use the gcc switches: -msse -mfpmath=sse

(corresponding credits to Carl Hetherington [1])

[1] http://carlh.net/plugins/denormals.php

Answered by remcycles

Dan Neely's comment ought to be expanded into an answer:

It is not the zero constant 0.0f that is denormalized or causes a slowdown, it is the values that approach zero on each iteration of the loop. As they come closer and closer to zero, they need more precision to represent and they become denormalized. These are the y[i] values. (They approach zero because x[i]/z[i] is less than 1.0 for all i.)

The crucial difference between the slow and fast versions of the code is the statement y[i] = y[i] + 0.1f;. As soon as this line is executed on each iteration of the loop, the extra precision in the float is lost, and the denormalization needed to represent that precision is no longer needed. Afterwards, floating-point operations on y[i] remain fast because they aren't denormalized.

Why is the extra precision lost when you add 0.1f? Because floating-point numbers only have so many significant digits. Say you have enough storage for three significant digits, then 0.00001 = 1e-5, and 0.00001 + 0.1 = 0.1, at least for this example float format, because it doesn't have room to store the least significant bit in 0.10001.

In short, y[i] = y[i] + 0.1f; y[i] = y[i] - 0.1f; isn't the no-op you might think it is.
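
A small sketch of that point (the values assume IEEE-754 single precision):

#include <cmath>
#include <iostream>

int main() {
    float tiny = 1e-40f;                 // subnormal: well below FLT_MIN (~1.17549e-38)

    float a = (tiny + 0.1f) - 0.1f;      // 0.1f can't keep a 1e-40 contribution in its significand
    float b = (tiny + 0.0f) - 0.0f;      // adding zero changes nothing

    std::cout << a << "\n";                                     // 0: the subnormal value was rounded away
    std::cout << b << "\n";                                     // 1e-40: still there...
    std::cout << (std::fpclassify(b) == FP_SUBNORMAL) << "\n";  // ...and still subnormal (prints 1)
}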

Mysticial said this as well: the content of the floats matters, not just the assembly code.