Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/2637700/

Date: 2020-08-28 00:10:53  Source: igfitidea

Is it possible to roll a significantly faster version of sqrt

Tags: c++, optimization, sqrt

Asked by Mr. Boy

In an app I'm profiling, I found that in some scenarios this function is able to take over 10% of total execution time.


I've seen discussion over the years of faster sqrt implementations using sneaky floating-point trickery, but I don't know if such things are outdated on modern CPUs.


MSVC++ 2008 compiler is being used, for reference... though I'd assume sqrt is not going to add much overhead.


See also here for similar discussion on the modf function.


EDIT: for reference, this is one widely-used method, but is it actually much quicker? How many cycles is SQRT anyway these days?


Accepted answer by James

Yes, it is possible even without trickery:


1) sacrifice accuracy for speed: the sqrt algorithm is iterative, re-implement with fewer iterations.

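As a sketch of point 1: Heron's method (Newton's method applied to x² − a) with a fixed, small iteration count, where the iteration count is the accuracy/speed dial. The function name and structure here are illustrative, not from the answer:

```cpp
#include <cmath>

// Sketch: Heron's (Newton's) iteration with a deliberately small, fixed
// iteration count. The initial guess halves the binary exponent via
// frexp/ldexp, so it is within roughly a factor of 2 of the true root,
// and each iteration then doubles the number of correct digits.
double approx_sqrt(double a, int iterations = 4) {
    if (a <= 0.0) return 0.0;
    int e;
    std::frexp(a, &e);                  // a = m * 2^e, m in [0.5, 1)
    double g = std::ldexp(1.0, e / 2);  // crude guess: halve the exponent
    for (int i = 0; i < iterations; ++i)
        g = 0.5 * (g + a / g);          // Heron step: quadratic convergence
    return g;
}
```

Dropping `iterations` to 2 or 3 trades accuracy for speed, which is exactly the knob the answer describes.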

2) lookup tables: either just for the start point of the iteration, or combined with interpolation to get you all the way there.


3) caching: are you always sqrting the same limited set of values? if so, caching can work well. I've found this useful in graphics applications where the same thing is being calculated for lots of shapes the same size, so results can be usefully cached.

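Point 3 could be sketched like this (SqrtCache is a hypothetical helper, not from the answer). It only pays off when the same inputs recur often enough that a hash lookup is cheaper than recomputing:

```cpp
#include <cmath>
#include <unordered_map>

// Sketch (hypothetical helper): memoize sqrt results when the same small
// set of inputs recurs, e.g. many shapes of identical size in a renderer.
class SqrtCache {
public:
    double get(double x) {
        auto it = cache_.find(x);
        if (it != cache_.end())
            return it->second;          // cache hit: no sqrt call
        double r = std::sqrt(x);
        cache_.emplace(x, r);           // cache miss: compute once, remember
        return r;
    }
private:
    std::unordered_map<double, double> cache_;
};
```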

Answered by celion

There's a great comparison table here: http://assemblyrequired.crashworks.org/timing-square-root/


Long story short, SSE2's sqrtss is about 2x faster than FPU fsqrt, and an approximation + iteration is about 4x faster than that (8x overall).

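The "approximation + iteration" row of that table can be sketched with SSE intrinsics (x86 only; the function name is illustrative): rsqrtss gives a ~12-bit estimate of 1/sqrt(x), one Newton-Raphson step refines it to ~22 bits, and a final multiply by x recovers sqrt(x):

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86/x86-64 only)

// Sketch of approximation + iteration: rsqrtss gives a crude 1/sqrt(x),
// one Newton step refines it, and sqrt(x) = x * (1/sqrt(x)).
float fast_sqrt_sse(float x) {
    if (x <= 0.0f) return 0.0f;
    __m128 vx = _mm_set_ss(x);
    __m128 r  = _mm_rsqrt_ss(vx);          // ~12-bit estimate of 1/sqrt(x)
    float y   = _mm_cvtss_f32(r);
    y = y * (1.5f - 0.5f * x * y * y);     // one Newton-Raphson refinement
    return x * y;                          // turn 1/sqrt(x) into sqrt(x)
}
```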

Also, if you're trying to take a single-precision sqrt, make sure that's actually what you're getting. I've heard of at least one compiler that would convert the float argument to a double, call double-precision sqrt, then convert back to float.


Answered by sbi

You're very likely to gain more speed improvements by changing your algorithms than by changing their implementations: Try to call sqrt() less instead of making calls faster. (And if you think this isn't possible - the improvements for sqrt() you mention are just that: improvements of the algorithm used to calculate a square root.)

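A common instance of "call sqrt() less": when you only need to compare distances, compare squared distances instead, since sqrt is monotonic on non-negative inputs and preserves ordering. A minimal sketch (the types and names are illustrative):

```cpp
// Sketch: the cheapest sqrt is the one you never call. Comparing squared
// distances orders points the same way as comparing distances, because
// sqrt is monotonic on non-negative inputs.
struct Point { double x, y; };

double dist_sq(Point a, Point b) {
    double dx = a.x - b.x, dy = a.y - b.y;
    return dx * dx + dy * dy;           // no sqrt needed for comparisons
}

bool closer(Point p, Point a, Point b) {
    // "is a closer to p than b?" without any sqrt call
    return dist_sq(p, a) < dist_sq(p, b);
}
```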

Since it is used very often, it is likely that your standard library's implementation of sqrt() is nearly optimal for the general case. Unless you have a restricted domain (e.g., if you need less precision) where the algorithm can take some shortcuts, it's very unlikely someone will come up with an implementation that's faster.


Note that, since that function uses 10% of your execution time, even if you manage to come up with an implementation that only takes 75% of the time of std::sqrt(), this will still only bring your execution time down by 2.5%. For most applications, users wouldn't even notice this, except if they use a stopwatch to measure.


Answered by jemfinch

How accurate do you need your sqrt to be? You can get reasonable approximations very quickly: see Quake3's excellent inverse square root function for inspiration (note that the code is GPL'ed, so you may not want to integrate it directly).

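Since the original Quake3 code is GPL'ed, here is a re-derived sketch of the same bit-trick (magic constant 0x5f3759df, one Newton step, roughly 0.2% worst-case error), using memcpy instead of the original's undefined pointer aliasing:

```cpp
#include <cstdint>
#include <cstring>

// Sketch of the Quake III-style fast inverse square root, re-derived
// rather than copied (the original is GPL'ed). memcpy performs the
// float<->int bit reinterpretation without aliasing violations.
float fast_rsqrt(float x) {
    float half = 0.5f * x;
    std::uint32_t i;
    std::memcpy(&i, &x, sizeof i);      // reinterpret float bits as integer
    i = 0x5f3759df - (i >> 1);          // magic initial guess for 1/sqrt(x)
    float y;
    std::memcpy(&y, &i, sizeof y);
    y = y * (1.5f - half * y * y);      // one Newton step: ~0.2% max error
    return y;
}
```

Multiplying the result by x gives an approximate sqrt(x); a second Newton step tightens the error further at the cost of a few more multiplies.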

Answered by will

Don't know if you fixed this, but I've read about it before, and it seems that the fastest thing to do is replace the sqrt function with an inline assembly version;


you can see a description of a load of alternatives here.


The best is this snippet of magic:


// 32-bit MSVC only: a naked function (no compiler-generated prologue/epilogue)
double inline __declspec (naked) __fastcall sqrt(double n)
{
    _asm fld qword ptr [esp+4]   // load the double argument onto the x87 stack
    _asm fsqrt                   // x87 hardware square-root instruction
    _asm ret 8                   // return, popping the 8-byte argument
}
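For what it's worth, where inline assembly is unavailable (e.g. 64-bit MSVC), the hardware square-root instruction is still reachable without a library call through SSE2 intrinsics (sqrtsd rather than x87 fsqrt); a minimal sketch:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Sketch: full-precision double sqrt via the sqrtsd instruction,
// with no inline assembly required.
double sqrt_sse2(double n) {
    __m128d v = _mm_set_sd(n);
    return _mm_cvtsd_f64(_mm_sqrt_sd(v, v));  // sqrtsd: correctly rounded
}
```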

It's about 4.7x faster than the standard sqrt call with the same precision.


Answered by DanielHsH

Here is a fast way using a lookup table of only 8KB. The error is ~0.5% of the result. You can easily enlarge the table, thus reducing the error. It runs about 5 times faster than the regular sqrt()


// LUT for fast sqrt of floats. The table consists of 2 parts: half for sqrt(X) and half for sqrt(2X).
#include <math.h>

typedef unsigned int UINT32;   // assumes 32-bit unsigned int
#define TRUE  1
#define FALSE 0

const int nBitsForSQRTprecision = 11;                       // Use only the 11 most significant bits of the float's 23-bit mantissa. Using 15 bits instead produces less error but takes up more memory.
const int nUnusedBits   = 23 - nBitsForSQRTprecision;       // Number of mantissa bits we disregard
const int tableSize     = (1 << (nBitsForSQRTprecision+1)); // 2^nBits * 2 because the table has 2 halves
static short sqrtTab[tableSize]; 
static unsigned char is_sqrttab_initialized = FALSE;        // Set to TRUE once initialized

// Table of precalculated sqrt() for future fast calculation. Approximates the exact with an error of about 0.5%
// Note: To access the bits of a float in C quickly we must misuse pointers.
// More info in: http://en.wikipedia.org/wiki/Single_precision
void build_fsqrt_table(void){
    unsigned short i;
    float f;
    UINT32 *fi = (UINT32*)&f;

    if (is_sqrttab_initialized)
        return;

    const int halfTableSize = (tableSize>>1);
    for (i=0; i < halfTableSize; i++){
         *fi = 0;
         *fi = (i << nUnusedBits) | (127 << 23); // Build a float with the bit pattern i as mantissa, and an exponent of 0, stored as 127

         // Take the square root then strip the first 'nBitsForSQRTprecision' bits of the mantissa into the table
         f = sqrtf(f);
         sqrtTab[i] = (short)((*fi & 0x7fffff) >> nUnusedBits);

         // Repeat the process, this time with an exponent of 1, stored as 128
         *fi = 0;
         *fi = (i << nUnusedBits) | (128 << 23);
         f = sqrtf(f);
         sqrtTab[i+halfTableSize] = (short)((*fi & 0x7fffff) >> nUnusedBits);
    }
    is_sqrttab_initialized = TRUE;
}

// Calculation of a square root. Divide the exponent of float by 2 and sqrt() its mantissa using the precalculated table.
float fast_float_sqrt(float n){
    if (n <= 0.f) 
        return 0.f;                           // On 0 or negative return 0.
    UINT32 *num = (UINT32*)&n;
    short e;                                  // Exponent
    e = (*num >> 23) - 127;                   // In 'float' the exponent is stored with 127 added.
    *num &= 0x7fffff;                         // leave only the mantissa 

    // If the exponent is odd, we have to look it up in the second half of the lookup table, so we set the high bit.
    const int halfTableSize = (tableSize>>1);
    const int secondHalfTableIdBit = halfTableSize << nUnusedBits;
    if (e & 0x01) 
        *num |= secondHalfTableIdBit;  
    e >>= 1;                                  // Divide the exponent by two (note that in C the shift operators are sign-preserving for signed operands)

    // Do the table lookup, based on the quaternary mantissa, then reconstruct the result back into a float
    *num = ((sqrtTab[*num >> nUnusedBits]) << nUnusedBits) | ((e + 127) << 23);
    return n;
}