Fastest implementation of sine, cosine and square root in C++ (doesn't need to be much accurate)

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must follow the same license and attribute it to the original authors (not me): http://stackoverflow.com/questions/18662261/



Tags: c++, math, optimization, trigonometry

Asked by PiotrK

I have been googling this question for the past hour, but there are only pointers to the Taylor series, or sample code that is either too slow or does not compile at all. Well, most answers I've found via Google are "Google it, it's already been asked", but sadly it hasn't...

I am profiling my game on a low-end Pentium 4 and found that ~85% of execution time is spent calculating sine, cosine and square root (from the standard C++ library in Visual Studio), and this seems to be heavily CPU dependent (on my i7 the same functions take only 5% of execution time, and the game is waaaaaaaaaay faster). I cannot optimize these three functions away, nor calculate sine and cosine in a single pass (they're interdependent), but I don't need very accurate results for my simulation, so I can live with a faster approximation.

So, the question: what is the fastest way to calculate sine, cosine and square root for float in C++?

EDIT: A lookup table is more painful, as the resulting cache misses are far more costly on a modern CPU than a Taylor series. CPUs are just that fast these days, and cache is not.

I made a mistake: I thought I needed to calculate several factorials for the Taylor series, but I see now they can be implemented as constants.

So the updated question: is there any speedy optimization for square root as well?

EDIT2


I am using the square root to calculate distance, not normalization, so I can't use the fast inverse square root algorithm (as pointed out in a comment: http://en.wikipedia.org/wiki/Fast_inverse_square_root).

EDIT3


I also cannot operate on squared distances; I need the exact distance for the calculations.

Accepted answer by KeatsKelleher

The fastest way is to pre-compute values and use a table like in this example:

Create sine lookup table in C++


BUT if you insist upon computing at runtime, you can use the Taylor series expansion of sine or cosine...

Taylor series of sine: sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ...

For more on the Taylor series... http://en.wikipedia.org/wiki/Taylor_series


One of the keys to getting this to work well is pre-computing the factorials and truncating at a sensible number of terms. The factorials grow in the denominator very quickly, so you don't need to carry more than a few terms.

Also... don't compute x^n from scratch each time. E.g., multiply x^3 by x^2 to get x^5, and that by x^2 again to get x^7, building each power from the previous one.
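
As a concrete illustration of the above (a minimal sketch of my own, not code from this answer): a truncated Taylor series for sine with the reciprocal factorials precomputed as constants and each power built from the previous one.

float fastSinTaylor(float x)
{
    // Assumes x is already range-reduced to roughly [-pi, pi];
    // the truncated series degrades quickly outside that interval.
    const float inv3f = 1.0f / 6.0f;    // 1/3!
    const float inv5f = 1.0f / 120.0f;  // 1/5!
    const float inv7f = 1.0f / 5040.0f; // 1/7!

    const float x2 = x * x;   // reused by every term
    const float x3 = x2 * x;
    const float x5 = x3 * x2; // one multiply instead of pow(x, 5)
    const float x7 = x5 * x2;

    return x - x3 * inv3f + x5 * inv5f - x7 * inv7f;
}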

Answered by Vivian Miranda

First, the Taylor series is NOT the best/fastest way to implement sine/cosine. It is also not the way professional libraries implement these trigonometric functions, and knowing the best numerical implementation allows you to tweak the accuracy to gain speed more efficiently. In addition, this problem has already been extensively discussed on StackOverflow. Here is just one example.

Second, the big difference you see between old and new PCs is due to the fact that modern Intel architectures have dedicated assembly instructions to calculate elementary trigonometric functions. It is quite hard to beat them on execution speed.

Finally, let's talk about the code on your old PC. Check the GSL (GNU Scientific Library) or Numerical Recipes implementations, and you will see that they basically use the Chebyshev approximation formula.

The Chebyshev approximation converges faster, so you need to evaluate fewer terms. I won't write implementation details here because there are already very nice answers posted on StackOverflow. Check this one, for example. Just tweak the number of terms in the series to change the balance between accuracy and speed.
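
To show what "tweaking the number of terms" looks like in practice, here is a sketch of evaluating a truncated Chebyshev series with the Clenshaw recurrence (my own illustration; the coefficients c[] are assumed to come from a fit such as GSL's gsl_cheb_init):

double chebEval(const double* c, int nterms, double x) // x in [-1, 1]
{
    // Clenshaw recurrence: b_k = c_k + 2*x*b_{k+1} - b_{k+2}
    double b1 = 0.0, b2 = 0.0;
    for (int k = nterms - 1; k > 0; --k) {
        double tmp = b1;
        b1 = 2.0 * x * b1 - b2 + c[k];
        b2 = tmp;
    }
    // conventional halving of the c[0] term
    return x * b1 - b2 + 0.5 * c[0];
}

Fewer terms means faster and less accurate; more terms means the reverse.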

For this kind of problem, if you want implementation details of some special function or numerical method, you should take a look at the GSL code before any further action - GSL is THE STANDARD numerical library.

EDIT: you may improve the execution time by enabling aggressive floating-point optimization flags in gcc/icc. This will decrease the precision, but it seems that is exactly what you want.

EDIT2: You can try building a coarse sine grid and using a GSL routine (gsl_interp_cspline_periodic, for a spline with periodic boundary conditions) to spline that table. The spline reduces the error compared to linear interpolation, so you need fewer points in your table, which means fewer cache misses!
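
A rough sketch of that idea using the standard GSL spline API (the grid size and usage here are my own choices, not the answerer's):

#include <cmath>
#include <gsl/gsl_spline.h>

int main()
{
    const int n = 33; // coarse grid over one period; last point repeats the first
    double xs[n], ys[n];
    for (int i = 0; i < n; ++i) {
        xs[i] = 2.0 * M_PI * i / (n - 1);
        ys[i] = std::sin(xs[i]);
    }
    ys[n - 1] = ys[0]; // enforce exact periodicity, as the periodic spline requires

    gsl_interp_accel* acc = gsl_interp_accel_alloc();
    gsl_spline* spline = gsl_spline_alloc(gsl_interp_cspline_periodic, n);
    gsl_spline_init(spline, xs, ys, n);

    double approx = gsl_spline_eval(spline, 1.0, acc); // ~= sin(1.0)
    (void)approx;

    gsl_spline_free(spline);
    gsl_interp_accel_free(acc);
    return 0;
}

In real code you would of course build the spline once and reuse it for every call.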

Answered by johnwbyrd

Here's the guaranteed fastest possible sine function in C++:


double FastSin(double x)
{
    return 0;
}

Oh, you wanted better accuracy than |1.0|? Well, here is a sine function that is similarly fast:


double FastSin(double x)
{
    return x;
}

This answer actually does not suck when x is close to zero. For small x, sin(x) is approximately equal to x.

What, still not accurate enough for you? Well read on.


Engineers in the 1970s made some fantastic discoveries in this field, but new programmers are simply unaware that these methods exist, because they're not taught as part of standard computer science curricula.


You need to start by understanding that there is no "perfect" implementation of these functions for all applications. Therefore, superficial answers to questions like "which one is fastest" are guaranteed to be wrong.

Most people who ask this question don't understand the importance of the tradeoffs between performance and accuracy. In particular, you are going to have to make some choices regarding the accuracy of the calculations before you do anything else. How much error can you tolerate in the result? 10^-4? 10^-16?


Unless you can quantify the error in a method, don't use it. See all those random answers below mine that post a bunch of random uncommented source code, without clearly documenting the algorithm used and its maximum error across the input range? That's strictly bush league.

If you can't calculate the PRECISE maximum error of your sine function, across the entire range of the inputs, then you can't write a sine function!

No one uses the Taylor series alone to approximate transcendentals in software. Except for certain highly specific cases, Taylor series generally approach the target slowly across common input ranges.


The algorithms that your grandparents used to calculate transcendentals efficiently are collectively referred to as CORDIC, and they were simple enough to be implemented in hardware. Here is a well-documented CORDIC implementation in C. CORDIC implementations usually require a very small lookup table, but most implementations don't even require that a hardware multiplier be available. Most CORDIC implementations let you trade off performance for accuracy, including the one I linked.
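
To make the idea concrete, here is a from-scratch fixed-point sketch of rotation-mode CORDIC (my own illustration, not the code from the linked article): the inner loop uses only shifts, adds and a small arctangent table; multiplies appear only in the one-time table setup.

#include <cmath>
#include <cstdint>

const int kIters = 16;   // iterations ~ bits of precision
int32_t atanTab[kIters]; // atan(2^-i) in Q16 fixed-point radians
int32_t cordicGain;      // product of 1/sqrt(1 + 2^-2i), in Q16

void cordicInit() // call once before cordicSinCos
{
    double k = 1.0;
    for (int i = 0; i < kIters; ++i) {
        atanTab[i] = int32_t(std::atan(std::ldexp(1.0, -i)) * 65536.0);
        k /= std::sqrt(1.0 + std::ldexp(1.0, -2 * i));
    }
    cordicGain = int32_t(k * 65536.0);
}

// sin and cos of `angle` (Q16 radians, |angle| <= pi/2) using only +, - and >>.
// Assumes arithmetic right shift of negative values (true on mainstream compilers).
void cordicSinCos(int32_t angle, int32_t& s, int32_t& c)
{
    int32_t x = cordicGain, y = 0, z = angle;
    for (int i = 0; i < kIters; ++i) {
        int32_t dx = x >> i, dy = y >> i;
        if (z >= 0) { x -= dy; y += dx; z -= atanTab[i]; }
        else        { x += dy; y -= dx; z += atanTab[i]; }
    }
    c = x; // cos(angle) in Q16
    s = y; // sin(angle) in Q16
}

Angles outside [-pi/2, pi/2] need a quadrant reduction first; divide the Q16 results by 65536.0 to get back to floating point.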

There have been a lot of incremental improvements to the original CORDIC algorithms over the years. For example, last year some researchers in Japan published an article on an improved CORDIC with better rotation angles, which reduces the operations required.

If you have hardware multipliers sitting around (and you almost certainly do), or if you can't afford a lookup table like CORDIC requires, you can always use a Chebyshev polynomial to do the same thing. Chebyshev polynomials require multiplies, but this is rarely a problem on modern hardware. We like Chebyshev polynomials because they have highly predictable maximum errors for a given approximation. The maximum of the last term in a Chebyshev polynomial, across your input range, bounds the error in the result. And this error gets smaller as the number of terms gets larger. Here's one example of a Chebyshev polynomial giving a sine approximation across a huge range, ignoring the natural symmetry of the sine function and just solving the approximation problem by throwing more coefficients at it. And here's an example of estimating a sine function to within 5 ULPs. Don't know what an ULP is? You should.

We also like Chebyshev polynomials because the error in the approximation is equally distributed across the range of outputs. If you're writing audio plugins or doing digital signal processing, Chebyshev polynomials give you a cheap and predictable dithering effect "for free."


If you want to find your own Chebyshev polynomial coefficients across a specific range, many math libraries call the process of finding those coefficients a "Chebyshev fit" or something like that.
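
For illustration, here is the standard recipe such a fit uses: sample f at the Chebyshev nodes and take a cosine-weighted average (the function and names here are my own, not a particular library's API; f is assumed to live on [-1, 1]).

#include <cmath>
#include <vector>

std::vector<double> chebyshevFit(double (*f)(double), int n)
{
    const double pi = 3.14159265358979323846;
    std::vector<double> c(n, 0.0);
    for (int k = 0; k < n; ++k) {
        double sum = 0.0;
        for (int j = 0; j < n; ++j) {
            // j-th Chebyshev node in (-1, 1)
            double node = std::cos(pi * (j + 0.5) / n);
            sum += f(node) * std::cos(pi * k * (j + 0.5) / n);
        }
        c[k] = 2.0 * sum / n; // by convention c[0] is halved at evaluation time
    }
    return c;
}

To approximate sin on [-pi, pi], fit g(t) = sin(pi*t) on [-1, 1] and evaluate the truncated series with a Clenshaw recurrence like the one sketched in the previous answer.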

Square roots, then as now, are usually calculated with some variant of the Newton-Raphson algorithm, usually with a fixed number of iterations. Usually, when someone cranks out an "amazing new" algorithm for doing square roots, it's merely Newton-Raphson in disguise.
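
A minimal sketch of that pattern (my own illustration, not from any particular library): seed with an exponent-halving bit trick, then run a fixed number of Newton-Raphson steps.

#include <cstdint>
#include <cstring>

float nrSqrt(float a) // assumes a > 0
{
    // Seed: halve the biased exponent, giving a rough sqrt(a).
    // memcpy avoids the strict-aliasing problems of pointer casts.
    uint32_t i;
    std::memcpy(&i, &a, sizeof i);
    i = (i >> 1) + (127u << 22);
    float x;
    std::memcpy(&x, &i, sizeof x);

    // Fixed number of Newton steps: x <- (x + a/x) / 2.
    // Each step roughly doubles the number of correct digits.
    x = 0.5f * (x + a / x);
    x = 0.5f * (x + a / x);
    return x;
}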

Newton-Raphson, CORDIC, and Chebyshev polynomials let you trade off speed for accuracy, so the answer can be just as imprecise as you want.


Lastly, when you've finished all your fancy benchmarking and micro-optimization, make sure that your "fast" version is actually faster than the library version. Here is a typical library implementation of fsin() bounded on the domain from -pi/4 to pi/4. And it just ain't that damned slow.

One last caution: you're almost certainly using IEEE-754 math to perform your estimations, and anytime you're performing IEEE-754 math with a bunch of multiplies, some obscure engineering decisions made decades ago will come back to haunt you, in the form of roundoff errors. And those errors start small, but they get bigger, and Bigger, and BIGGER! At some point in your life, please read "What every computer scientist should know about floating point numbers" and have the appropriate amount of fear. Keep in mind that if you start writing your own transcendental functions, you'll need to benchmark and measure the ACTUAL error due to floating-point roundoff, not just the maximum theoretical error. This is not a theoretical concern; "fast math" compilation settings have bitten me in the butt on more than one project.

tl;dr: go google "sine approximation" or "cosine approximation" or "square root approximation" or "approximation theory".

Answered by BigTailWolf

For square root, there is an approach called bit shift.


An IEEE-754 float stores a biased power-of-two exponent in some of its bits and the significand in the rest. Halving the exponent approximately takes the square root of the number.

float squareRoot(float x)
{
  unsigned int i = *(unsigned int*) &x; // reinterpret the float's bits as an integer

  // add the exponent bias, then halve the whole bit pattern:
  // halving the biased exponent approximately takes the square root
  i += 127 << 23;
  i >>= 1;

  return *(float*) &i;
}

This computes an approximate square root in constant time.

Answered by milianw

Based on the idea of http://forum.devmaster.net/t/fast-and-accurate-sine-cosine/9648 and some manual rewriting to improve performance in a micro-benchmark, I ended up with the following cosine implementation, which is used in an HPC physics simulation that is bottlenecked by repeated cos calls over a large number space. It's accurate enough and much faster than a lookup table; most notably, no division is required.

#include <cmath> // std::floor, std::abs (on MSVC, M_PI may require _USE_MATH_DEFINES)

template<typename T>
inline T cos(T x) noexcept
{
    constexpr T tp = 1./(2.*M_PI);
    x *= tp;
    x -= T(.25) + std::floor(x + T(.25));
    x *= T(16.) * (std::abs(x) - T(.5));
    #if EXTRA_PRECISION
    x += T(.225) * x * (std::abs(x) - T(1.));
    #endif
    return x;
}

The Intel compiler, at least, is also smart enough to vectorize this function when it is used in a loop.

If EXTRA_PRECISION is defined, the maximum error is about 0.00109 for the range -π to π, assuming T is double (as it is usually defined in most C++ implementations). Otherwise, the maximum error is about 0.056 for the same range.
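
If you also need sine, a natural companion (my addition, not part of the original answer) uses the identity sin(x) = cos(x - π/2):

template<typename T>
inline T sin(T x) noexcept
{
    return cos(x - T(M_PI / 2)); // shift by a quarter period
}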

Answered by Peter Cordes

For x86, the hardware FP square root instructions are fast (sqrtss is sqrt Scalar Single-precision). Single precision is faster than double precision, so definitely use float instead of double for code where you can afford to use less precision.

For 32-bit code, you usually need compiler options to get it to do FP math with SSE instructions rather than x87 (e.g. -mfpmath=sse).

To get C's sqrt() or sqrtf() functions to inline as just sqrtsd or sqrtss, you need to compile with -fno-math-errno. Having math functions set errno on NaN is generally considered a design mistake, but the standard requires it. Without that option, gcc inlines it but then does a compare+branch to see if the result was NaN, and if so calls the library function so it can set errno. If your program doesn't check errno after math functions, there is no danger in using -fno-math-errno.

You don't need any of the "unsafe" parts of -ffast-math to get sqrt and some other functions to inline better or at all, but -ffast-math can make a big difference (e.g. allowing the compiler to auto-vectorize in cases where that changes the result, because FP math isn't associative).

e.g. with gcc6.3 compiling float foo(float a){ return sqrtf(a); }


foo:    # with -O3 -fno-math-errno.
    sqrtss  xmm0, xmm0
    ret

foo:   # with just -O3
    pxor    xmm2, xmm2   # clang just checks for NaN, instead of comparing against zero.
    sqrtss  xmm1, xmm0
    ucomiss xmm2, xmm0
    ja      .L8          # take the slow path if 0.0 > a
    movaps  xmm0, xmm1
    ret

.L8:                     # errno-setting path
    sub     rsp, 24
    movss   DWORD PTR [rsp+12], xmm1   # store the sqrtss result because the x86-64 SysV ABI has no call-preserved xmm regs.
    call    sqrtf                      # call sqrtf just to set errno
    movss   xmm1, DWORD PTR [rsp+12]
    add     rsp, 24
    movaps  xmm0, xmm1    # extra mov because gcc reloaded into the wrong register.
    ret

gcc's code for the NaN case seems way over-complicated; it doesn't even use the sqrtf return value! Anyway, this is the kind of mess you actually get without -fno-math-errno, for every sqrtf() call site in your program. Mostly it's just code-bloat, and none of the .L8 block will ever run when taking the sqrt of a number >= 0.0, but there are still several extra instructions in the fast path.



If you know that your input to sqrt is non-zero, you can use the fast but very approximate reciprocal sqrt instruction, rsqrtps (or rsqrtss for the scalar version). One Newton-Raphson iteration brings it up to nearly the same precision as the hardware single-precision sqrt instruction, but not quite.

sqrt(x) = x * 1/sqrt(x) for x != 0, so you can calculate a sqrt with rsqrt and a multiply. These are both fast, even on P4 (was that still relevant in 2013?).

On P4, it may be worth using rsqrt + a Newton iteration to replace a single sqrt, even if you don't need to divide anything by it.

See also an answer I wrote recently about handling zeroes when calculating sqrt(x) as x*rsqrt(x) with a Newton iteration. It includes some discussion of rounding error if you want to convert the FP value to an integer, and links to other relevant questions.
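
A sketch of that rsqrt-plus-Newton pattern with SSE intrinsics (my illustration of the technique; the zero handling is deliberately omitted, as discussed in the linked answer):

#include <immintrin.h>

float sqrtViaRsqrt(float x) // assumes x > 0
{
    __m128 vx = _mm_set_ss(x);
    __m128 r  = _mm_rsqrt_ss(vx); // ~12-bit approximation of 1/sqrt(x)

    // One Newton-Raphson step for r ~= 1/sqrt(x):
    //   r' = 0.5 * r * (3 - x * r * r)
    __m128 half  = _mm_set_ss(0.5f);
    __m128 three = _mm_set_ss(3.0f);
    __m128 rr    = _mm_mul_ss(r, r);
    r = _mm_mul_ss(_mm_mul_ss(half, r),
                   _mm_sub_ss(three, _mm_mul_ss(vx, rr)));

    // sqrt(x) = x * 1/sqrt(x)
    return _mm_cvtss_f32(_mm_mul_ss(vx, r));
}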



P4:


  • sqrtss: 23c latency, not pipelined
  • sqrtsd: 38c latency, not pipelined
  • fsqrt (x87): 43c latency, not pipelined
  • rsqrtss / mulss: 4c + 6c latency. Possibly one per 3c throughput, since they apparently don't need the same execution unit (mmx vs. fp).

  • SIMD packed versions are somewhat slower



Skylake:


  • sqrtss / sqrtps: 12c latency, one per 3c throughput
  • sqrtsd / sqrtpd: 15-16c latency, one per 4-6c throughput
  • fsqrt (x87): 14-21c latency, one per 4-7c throughput
  • rsqrtss / mulss: 4c + 4c latency. One per 1c throughput.
  • SIMD 128b vector versions are the same speed. 256b vector versions have a bit higher latency and almost half the throughput. The rsqrtss version has full performance for 256b vectors.

With a Newton iteration, the rsqrt version is not much faster, if at all.



Numbers are from Agner Fog's experimental testing. See his microarch guides to understand what makes code run fast or slow. Also see the links at the x86 tag wiki.

IDK how best to calculate sin/cos. I've read that the hardware fsin / fcos (and the only slightly slower fsincos that does both at once) are not the fastest way, but IDK what is.

Answered by Adriel Jr

QT has fast implementations of sine (qFastSin) and cosine (qFastCos) that use a lookup table with interpolation. I'm using them in my code and they are faster than std::sin/cos and precise enough for what I need (error ~= 0.01%, I would guess):

https://code.woboq.org/qt5/qtbase/src/corelib/kernel/qmath.h.html#_Z8qFastSind


#define QT_SINE_TABLE_SIZE 256


inline qreal qFastSin(qreal x)
{
   int si = int(x * (0.5 * QT_SINE_TABLE_SIZE / M_PI)); // Would be more accurate with qRound, but slower.
   qreal d = x - si * (2.0 * M_PI / QT_SINE_TABLE_SIZE);
   int ci = si + QT_SINE_TABLE_SIZE / 4;
   si &= QT_SINE_TABLE_SIZE - 1;
   ci &= QT_SINE_TABLE_SIZE - 1;
   return qt_sine_table[si] + (qt_sine_table[ci] - 0.5 * qt_sine_table[si] * d) * d;
}

inline qreal qFastCos(qreal x)
{
   int ci = int(x * (0.5 * QT_SINE_TABLE_SIZE / M_PI)); // Would be more accurate with qRound, but slower.
   qreal d = x - ci * (2.0 * M_PI / QT_SINE_TABLE_SIZE);
   int si = ci + QT_SINE_TABLE_SIZE / 4;
   si &= QT_SINE_TABLE_SIZE - 1;
   ci &= QT_SINE_TABLE_SIZE - 1;
   return qt_sine_table[si] - (qt_sine_table[ci] + 0.5 * qt_sine_table[si] * d) * d;
}

The LUT and license can be found here: https://code.woboq.org/qt5/qtbase/src/corelib/kernel/qmath.cpp.html#qt_sine_table


This pair of functions takes radian inputs. The LUT covers the entire 2π input range. The functions interpolate between table values using the difference d, with the cosine entry serving as the derivative (and, symmetrically, the sine entry in qFastCos).

Answered by Josh

Sharing my code. It's a 6th-degree polynomial, nothing special, but rearranged to avoid pow calls. On a Core i7 this is 2.3 times slower than the standard implementation, although a bit faster over the [0..2*PI] range. For an old processor this could be an alternative to the standard sin/cos.

/*
    On [-1000..+1000] range with 0.001 step average error is: +/- 0.000011, max error: +/- 0.000060
    On [-100..+100] range with 0.001 step average error is:   +/- 0.000009, max error: +/- 0.000034
    On [-10..+10] range with 0.001 step average error is:     +/- 0.000009, max error: +/- 0.000030
    Error distribution ensures there's no discontinuity.
*/

const double PI          = 3.141592653589793;
const double HALF_PI     = 1.570796326794897;
const double DOUBLE_PI   = 6.283185307179586;
const double SIN_CURVE_A = 0.0415896;
const double SIN_CURVE_B = 0.00129810625032;

double cos1(double x) {
    if (x < 0) {
        int q = -x / DOUBLE_PI;
        q += 1;
        double y = q * DOUBLE_PI;
        x = -(x - y);
    }
    if (x >= DOUBLE_PI) {
        int q = x / DOUBLE_PI;
        double y = q * DOUBLE_PI;
        x = x - y;
    }
    int s = 1;
    if (x >= PI) {
        s = -1;
        x -= PI;
    }
    if (x > HALF_PI) {
        x = PI - x;
        s = -s;
    }
    double z = x * x;
    double r = z * (z * (SIN_CURVE_A - SIN_CURVE_B * z) - 0.5) + 1.0;
    if (r > 1.0) r = r - 2.0;
    if (s > 0) return r;
    else return -r;
}

double sin1(double x) {
    return cos1(x - HALF_PI);
}

Answered by Fran x1024

This is a sine implementation that should be quite fast. It works like this:

  • it has an arithmetic implementation for square-rooting complex numbers
  • from complex analysis you know that the angle of a complex number is halved when it is square-rooted
  • you can take a complex number whose angle you already know (e.g. i, which has angle 90 degrees, or PI/2 radians)
  • then, by repeatedly square-rooting it, you get complex numbers of the form cos(90 / 2^n) + i sin(90 / 2^n)
  • from complex analysis you also know that when two complex numbers are multiplied, their angles add up
  • you can express the number k (the one you get as an argument to sin or cos) as a sum of angles 90 / 2^n and then get the resulting values by multiplying the precomputed complex numbers
  • the result will be of the form cos k + i sin k

#include <utility>

using std::make_pair;
using std::pair;

#define PI 3.14159265
#define complex pair<float, float>

/* this is the square root function: it halves the float's exponent directly, then refines with Newton iterations */

float sqrt(float a) {

    float b = a;

    int *x = (int*) (&b); // integer pointer to float b, which allows direct manipulation of the bit representation

    int c = ((*x >> 23) & 255) - 127; // extract the unbiased exponent (floats are like scientific notation: 1.111010101... * 2^n)

    if(c < 0) c = -((-c) >> 1); // ---
                                //   |--> This halves the exponent
    else c >>= 1;               // ---

    *x &= ~(255 << 23); // clear the exponent field

    *x |= (c + 127) << 23; // write the halved, re-biased exponent back in place

    for(int i = 0; i < 5; i++) b = (b + a / b) / 2; // normal Newton square root iteration, run 5 times (even 2 or 3 would probably be enough)

    return b;
}

/* this is a square root for complex numbers (I derived it on paper); you'll need it later */

complex croot(complex x) {

    float c = x.first, d = x.second;

    return make_pair(sqrt((c + sqrt(c * c + d * d)) / 2), sqrt((-c + sqrt(c * c + d * d)) / 2) * (d < 0 ? -1 : 1));
}

/* this is for multiplying complex numbers, you'll also need it later */

complex mul(complex x, complex y) {

    float a = x.first, b = x.second, c = y.first, d = y.second;

    return make_pair(a * c - b * d, a * d + b * c);
}

/* this function calculates both sine and cosine */

complex roots[24];

float angles[24];

void init() {

    complex c = make_pair(-1, 0); // first number is going to be -1

    float alpha = PI; // angle of -1 is PI

    for(int i = 0; i < 24; i++) {

        roots[i] = c; // save current c

        angles[i] = alpha; // save current angle

        c = croot(c); // root c

        alpha *= 0.5; // halve alpha
    }
}

complex cosin(float k) {

    complex r = make_pair(1, 0); // at start 1

    for(int i = 0; i < 24; i++) {

        if(k >= angles[i]) { // if current k is bigger than angle of c

            k -= angles[i]; // reduce k by that number

            r = mul(r, roots[i]); // multiply the result by c
        }
    }

    return r; // here you'll have a complex number equal to cos k + i sin k.
}

float sin(float k) {

    return cosin(k).second;
}

float cos(float k) {

    return cosin(k).first;
}

Now, if you still find it slow, you can reduce the number of iterations in the function cosin (note that the precision will be reduced).

Answered by Dhairya

I use the following CORDIC code to compute trigonometric functions in quadruple precision. The constant N determines the number of bits of precision required (for example, N=26 will give single-precision accuracy). Depending on the desired accuracy, the precomputed storage can be small and will fit in the cache. It only requires addition and multiplication operations and is also very easy to vectorize.

The algorithm pre-computes sin and cos values for 0.5^i, i=1,...,N. Then we can combine these precomputed values to compute sin and cos for any angle, up to a resolution of 0.5^N.

#include <cmath>
#include <vector>

template <class QuadReal_t>
QuadReal_t sin(const QuadReal_t a){
  const int N=128;
  static std::vector<QuadReal_t> theta;
  static std::vector<QuadReal_t> sinval;
  static std::vector<QuadReal_t> cosval;
  if(theta.size()==0){
    #pragma omp critical (QUAD_SIN)
    if(theta.size()==0){
      theta.resize(N);
      sinval.resize(N);
      cosval.resize(N);

      QuadReal_t t=1.0;
      for(int i=0;i<N;i++){
        theta[i]=t;
        t=t*0.5;
      }

      sinval[N-1]=theta[N-1];
      cosval[N-1]=1.0-sinval[N-1]*sinval[N-1]/2;
      for(int i=N-2;i>=0;i--){
        sinval[i]=2.0*sinval[i+1]*cosval[i+1];
        cosval[i]=sqrt(1.0-sinval[i]*sinval[i]);
      }
    }
  }

  QuadReal_t t=(a<0.0?-a:a);
  QuadReal_t sval=0.0;
  QuadReal_t cval=1.0;
  for(int i=0;i<N;i++){
    while(theta[i]<=t){
      QuadReal_t sval_=sval*cosval[i]+cval*sinval[i];
      QuadReal_t cval_=cval*cosval[i]-sval*sinval[i];
      sval=sval_;
      cval=cval_;
      t=t-theta[i];
    }
  }
  return (a<0.0?-sval:sval);
}