C++ floating point division vs. floating point multiplication

Note: this is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/4125033/

Date: 2020-08-28 14:35:13  Source: igfitidea

Floating point division vs floating point multiplication

Tags: c++, floating-point, micro-optimization

Asked by sum1stolemyname

Is there any (non-microoptimization) performance gain by coding


float f1 = 200.0f / 2;

compared to

float f2 = 200.0f * 0.5f;

A professor of mine told me a few years ago that floating point divisions were slower than floating point multiplications without elaborating the why.


Does this statement hold for modern PC architecture?


Update 1

With respect to a comment, please also consider this case:

float f1;
float f2 = 2.0f;
float f3 = 3.0f;
for (int i = 0; i < 1e8; i++)
{
    f1 = (i * f2 + i / f3) * 0.5f; // or divide by 2.0f, respectively
}

Update 2: Quoting from the comments:

[I want] to know what the algorithmic / architectural requirements are that cause division to be vastly more complicated in hardware than multiplication

Answered by Gabe

Yes, many CPUs can perform multiplication in 1 or 2 clock cycles but division always takes longer (although FP division is sometimes faster than integer division).


If you look at this answer, you will see that division can exceed 24 cycles.

Why does division take so much longer than multiplication? If you remember back to grade school, you may recall that multiplication can essentially be performed with many simultaneous additions. Division requires iterative subtraction that cannot be performed simultaneously so it takes longer. In fact, some FP units speed up division by performing a reciprocal approximation and multiplying by that. It isn't quite as accurate but is somewhat faster.


Answered by Peter Cordes

Be very careful with division, and avoid it when possible. For example, hoist float inverse = 1.0f / divisor; out of a loop and multiply by inverse inside the loop (if the rounding error in inverse is acceptable).

Usually 1.0/x will not be exactly representable as a float or double. It will be exact when x is a power of 2. This lets compilers optimize x / 2.0f to x * 0.5f without any change in the result.

To let the compiler do this optimization for you even when the result won't be exact (or with a runtime-variable divisor), you need options like gcc -O3 -ffast-math. Specifically, -freciprocal-math (enabled by -funsafe-math-optimizations, which is in turn enabled by -ffast-math) lets the compiler replace x / y with x * (1/y) when that's useful. Other compilers have similar options, and ICC may enable some "unsafe" optimizations by default (I think it does, but I forget).

-ffast-math is often important to allow auto-vectorization of FP loops, especially reductions (e.g. summing an array into one scalar total), because FP math is not associative. See: Why doesn't GCC optimize a*a*a*a*a*a to (a*a*a)*(a*a*a)?

Also note that C++ compilers can fold + and * into an FMA in some cases (when compiling for a target that supports it, like -march=haswell), but they can't do that with /.



Division has worse latency than multiplication or addition (or FMA) by a factor of 2 to 4 on modern x86 CPUs, and worse throughput by a factor of 6 to 40 (footnote 1; for a tight loop doing only division instead of only multiplication).

The divide / sqrt unit is not fully pipelined, for reasons explained in @NathanWhitehead's answer. The worst ratios are for 256b vectors, because (unlike other execution units) the divide unit is usually not full-width, so wide vectors have to be done in two halves. A not-fully-pipelined execution unit is so unusual that Intel CPUs have an arith.divider_active hardware performance counter to help you find code that bottlenecks on divider throughput instead of the usual front-end or execution-port bottlenecks. (Or more often, memory bottlenecks or long latency chains limiting instruction-level parallelism, causing instruction throughput to be less than ~4 per clock.)

However, FP division and sqrt on Intel and AMD CPUs (other than KNL) are implemented as a single uop, so they don't necessarily have a big throughput impact on surrounding code. The best case for division is when out-of-order execution can hide the latency, and when there are lots of multiplies and adds (or other work) that can happen in parallel with the divide.

(Integer division is microcoded as multiple uops on Intel, so it always has more impact on surrounding code than integer multiply does. There's less demand for high-performance integer division, so the hardware support isn't as fancy. Related: microcoded instructions like idiv can cause alignment-sensitive front-end bottlenecks.)

So for example, this will be really bad:


// division throughput bottleneck:
for (int i = 0; i < n; i++)
    a[i] = b[i] / scale;

// Instead, hoist the reciprocal and use this:
float inv = 1.0f / scale;
for (int i = 0; i < n; i++)
    a[i] = b[i] * inv;  // multiply (or store) throughput bottleneck

All you're doing in the loop is load/divide/store, and they're independent so it's throughput that matters, not latency.


A reduction like accumulator /= b[i] would bottleneck on divide or multiply latency, rather than throughput. But with multiple accumulators that you divide or multiply together at the end, you can hide the latency and still saturate the throughput. Note that sum += a[i] / b[i] bottlenecks on add latency or div throughput, but not div latency, because the division isn't on the critical path (the loop-carried dependency chain).



But in something like this (approximating a function like log(x) with a ratio of two polynomials), the divide can be pretty cheap:

for () {
    // (not shown: extracting the exponent / mantissa)
    float p = polynomial(b[i], 1.23, -4.56, ...);  // FMA chain for a polynomial
    float q = polynomial(b[i], 3.21, -6.54, ...);
    a[i] = p/q;
}

For log() over the range of the mantissa, a ratio of two polynomials of order N has much less error than a single polynomial with 2N coefficients, and evaluating the two in parallel gives you some instruction-level parallelism within a single loop body instead of one massively long dependency chain, making things a LOT easier for out-of-order execution.

In this case, we don't bottleneck on divide latency because out-of-order execution can keep multiple iterations of the loop over the arrays in flight.


We don't bottleneck on divide throughput as long as our polynomials are big enough that we only have one divide for every 10 FMA instructions or so. (And in a real log() use case, there's a bunch of work extracting the exponent / mantissa and combining things back together again, so there's even more work to do between divides.)



When you do need to divide, it's usually best to just divide instead of using rcpps

x86 has an approximate-reciprocal instruction (rcpps), which only gives you 12 bits of precision. (AVX512F has 14 bits, and AVX512ER has 28 bits.)


You can use this to do x / y = x * approx_recip(y) without using an actual divide instruction. (rcpps itself is fairly fast; usually a bit slower than multiplication. It uses a lookup from a table internal to the CPU. The divider hardware may use the same table as a starting point.)

For most purposes, x * rcpps(y) is too inaccurate, and a Newton-Raphson iteration to double the precision is required. But that costs you 2 multiplies and 2 FMAs, and has latency about as high as an actual divide instruction. If all you're doing is division, then it can be a throughput win. (But you should avoid that kind of loop in the first place if you can, maybe by doing the division as part of another loop that does other work.)

But if you're using division as part of a more complex function, the rcpps itself plus the extra mul + FMA usually makes it faster to just divide with a divps instruction, except on CPUs with very low divps throughput.

(For example Knight's Landing, see below. KNL supports AVX512ER, so for float vectors the VRCP28PS result is already accurate enough to just multiply without a Newton-Raphson iteration; float mantissa size is only 24 bits.)



Specific numbers from Agner Fog's tables:


Unlike every other ALU operation, division latency/throughput is data-dependent on some CPUs. Again, this is because it's so slow and not fully pipelined. Out-of-order scheduling is easier with fixed latencies, because it avoids write-back conflicts (when the same execution port tries to produce 2 results in the same cycle, e.g. from running a 3-cycle instruction and then two 1-cycle operations).

Generally, the fastest cases are when the divisor is a "round" number like 2.0 or 0.5 (i.e. when the base-2 float representation has lots of trailing zeros in the mantissa).

float latency (cycles) / throughput (cycles per instruction, running just that back-to-back with independent inputs):

                   scalar & 128b vector        256b AVX vector
                   divss      |  mulss
                   divps xmm  |  mulps           vdivps ymm | vmulps ymm

Nehalem          7-14 /  7-14 | 5 / 1           (No AVX)
Sandybridge     10-14 / 10-14 | 5 / 1        21-29 / 20-28 (3 uops) | 5 / 1
Haswell         10-13 / 7     | 5 / 0.5       18-21 /   14 (3 uops) | 5 / 0.5
Skylake            11 / 3     | 4 / 0.5          11 /    5 (1 uop)  | 4 / 0.5

Piledriver       9-24 / 5-10  | 5-6 / 0.5      9-24 / 9-20 (2 uops) | 5-6 / 1 (2 uops)
Ryzen              10 / 3     | 3 / 0.5         10  /    6 (2 uops) | 3 / 1 (2 uops)

 Low-power CPUs:
Jaguar(scalar)     14 / 14    | 2 / 1
Jaguar             19 / 19    | 2 / 1            38 /   38 (2 uops) | 2 / 2 (2 uops)

Silvermont(scalar)    19 / 17    | 4 / 1
Silvermont      39 / 39 (6 uops) | 5 / 2            (No AVX)

KNL(scalar)     27 / 17 (3 uops) | 6 / 0.5
KNL             32 / 20 (18uops) | 6 / 0.5        32 / 32 (18 uops) | 6 / 0.5  (AVX and AVX512)

double latency (cycles) / throughput (cycles per instruction):

                   scalar & 128b vector        256b AVX vector
                   divsd      |  mulsd
                   divpd xmm  |  mulpd           vdivpd ymm | vmulpd ymm

Nehalem         7-22 /  7-22 | 5 / 1        (No AVX)
Sandybridge    10-22 / 10-22 | 5 / 1        21-45 / 20-44 (3 uops) | 5 / 1
Haswell        10-20 /  8-14 | 5 / 0.5      19-35 / 16-28 (3 uops) | 5 / 0.5
Skylake        13-14 /     4 | 4 / 0.5      13-14 /     8 (1 uop)  | 4 / 0.5

Piledriver      9-27 /  5-10 | 5-6 / 1       9-27 / 9-18 (2 uops)  | 5-6 / 1 (2 uops)
Ryzen           8-13 /  4-5  | 4 / 0.5       8-13 /  8-9 (2 uops)  | 4 / 1 (2 uops)

  Low power CPUs:
Jaguar            19 /   19  | 4 / 2            38 /  38 (2 uops)  | 4 / 2 (2 uops)

Silvermont(scalar) 34 / 32    | 5 / 2
Silvermont         69 / 69 (6 uops) | 5 / 2           (No AVX)

KNL(scalar)      42 / 42 (3 uops) | 6 / 0.5   (Yes, Agner really lists scalar as slower than packed, but fewer uops)
KNL              32 / 20 (18uops) | 6 / 0.5        32 / 32 (18 uops) | 6 / 0.5  (AVX and AVX512)

Ivybridge and Broadwell are different too, but I wanted to keep the table small. (Core2 (before Nehalem) has better divider performance, but its max clock speeds were lower.)


Atom, Silvermont, and even Knight's Landing (Xeon Phi based on Silvermont) have exceptionally low divide performance, where even a 128b vector is slower than scalar. AMD's low-power Jaguar CPU (used in some consoles) is similar. A high-performance divider takes a lot of die area. Xeon Phi has low power per core, and packing lots of cores on a die gives it tighter die-area constraints than Skylake-AVX512. It seems that AVX512ER rcp28ps / pd is what you're "supposed" to use on KNL.

(See this InstLatx64 result for Skylake-AVX512, aka Skylake-X. Numbers for vdivps zmm: 18c / 10c, so half the throughput of ymm.)



Long latency chains become a problem when they're loop-carried, or when they're so long that they stop out-of-order execution from finding parallelism with other independent work.




Footnote 1: how I made up those div vs. mul performance ratios:


FP divide vs. multiply performance ratios are even worse than that in low-power CPUs like Silvermont and Jaguar, and even in Xeon Phi (KNL, where you should use AVX512ER).

Actual divide/multiply throughput ratios for scalar (non-vectorized) double: 8 on Ryzen and Skylake with their beefed-up dividers, but 16-28 on Haswell (data-dependent, and more likely toward the 28-cycle end unless your divisors are round numbers). These modern CPUs have very powerful dividers, but their 2-per-clock multiply throughput blows them away. (Even more so when your code can auto-vectorize with 256b AVX vectors.) Also note that with the right compiler options, those multiply throughputs also apply to FMA.

Numbers from the http://agner.org/optimize/ instruction tables for Intel Haswell/Skylake and AMD Ryzen, for SSE scalar (not including x87 fmul / fdiv) and for 256b AVX SIMD vectors of float or double. See also the x86 tag wiki.

Answered by Michael Borgwardt

Division is inherently a much slower operation than multiplication.


And this may in fact be something that the compiler cannot (and you may not want to) optimize in many cases, due to floating-point inaccuracies. These two statements:

double d1 = 7 / 10.;
double d2 = 7 * 0.1;

are not semantically identical: 0.1 cannot be exactly represented as a double, so a slightly different value will end up being used. Substituting the multiplication for the division in this case would yield a different result!

Answered by T.E.D.

Yes. Every FPU I am aware of performs multiplications much faster than divisions.


However, modern PCs are very fast. They also contain pipelined architectures that can make the difference negligible under many circumstances. To top it off, any decent compiler will perform the division operation you showed at compile time with optimizations turned on. For your updated example, any decent compiler would perform that transformation itself.

So generally you should worry about making your code readable, and let the compiler worry about making it fast. Only if you have a measured speed issue with that line should you worry about perverting your code for the sake of speed. Compilers are well aware of what is faster than what on their CPUs, and are generally much better optimizers than you can ever hope to be.

Answered by Nathan Whitehead

Think about what is required for multiplication of two n bit numbers. With the simplest method, you take one number x and repeatedly shift and conditionally add it to an accumulator (based on a bit in the other number y). After n additions you are done. Your result fits in 2n bits.


For division, you start with x of 2n bits and y of n bits, you want to compute x / y. The simplest method is long division, but in binary. At each stage you do a comparison and a subtraction to get one more bit of the quotient. This takes you n steps.


Some differences: each step of the multiplication only needs to look at 1 bit; each stage of the division needs to look at n bits during the comparison. Each stage of the multiplication is independent of all other stages (doesn't matter the order you add the partial products); for division each step depends on the previous step. This is a big deal in hardware. If things can be done independently then they can happen at the same time within a clock cycle.


Answered by ollj

Newton-Raphson solves integer division in O(M(n)) complexity via iterative approximation, faster than the otherwise O(n*n) complexity.

In code, the method contains 10 multiplies, 9 adds, and 2 bitwise shifts.

This explains why a division takes roughly 12x as many CPU ticks as a multiplication.

Answered by BЈовић

The answer depends on the platform for which you are programming.


For example, doing lots of multiplications on an array on x86 should be much faster than doing divisions, because the compiler should create assembly code that uses SIMD instructions. Since SIMD divide instructions have much lower throughput than SIMD multiplies, you would see great improvements using multiplication instead of division.