C++: Is integer multiplication really done at the same speed as addition on a modern CPU?

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/21819682/

Date: 2020-08-27 23:50:04  Source: igfitidea

Is integer multiplication really done at the same speed as addition on a modern CPU?

Tags: c++, performance, cpu, multiplication, addition

Question by exebook

I hear this statement quite often, that multiplication on modern hardware is so optimized that it actually is at the same speed as addition. Is that true?

I can never get any authoritative confirmation. My own research only adds questions. The speed tests usually show data that confuses me. Here is an example:

#include <stdio.h>
#include <sys/time.h>

unsigned int time1000() {
    timeval val;
    gettimeofday(&val, 0);
    val.tv_sec &= 0xffff;  // keep seconds small so the conversion below fits in 32 bits
    return val.tv_sec * 1000 + val.tv_usec / 1000;  // wall-clock time in milliseconds
}

int main() {
    unsigned int sum = 1, T = time1000();
    for (int i = 1; i < 100000000; i++) {
        sum += i + (i+1); sum++;
    }
    printf("%u %u\n", time1000() - T, sum);
    sum = 1;
    T = time1000();
    for (int i = 1; i < 100000000; i++) {
        sum += i * (i+1); sum++;
    }
    printf("%u %u\n", time1000() - T, sum);
}

The code above can show that multiplication is faster:

clang++ benchmark.cpp -o benchmark
./benchmark
746 1974919423
708 3830355456

But with other compilers, other compiler arguments, or differently written inner loops, the results can vary, and I cannot even get an approximation.

Accepted answer by user541686

Multiplication of two n-bit numbers can in fact be done in O(log n) circuit depth, just like addition.

Addition in O(log n) is done by splitting the number in half and (recursively) adding the two parts in parallel, where the upper half is solved for both the "0-carry" and "1-carry" case. Once the lower half is added, the carry is examined, and its value is used to choose between the 0-carry and 1-carry case.

Multiplication in O(log n) depth is also done through parallelization, where every sum of 3 numbers is reduced to a sum of just 2 numbers in parallel, and the sums are done in some manner like the above.
I won't explain it here, but you can find reading material on fast addition and multiplication by looking up "carry-lookahead" and "carry-save" addition.

So from a theoretical standpoint, since circuits are obviously inherently parallel (unlike software), the only reason multiplication would be asymptotically slower is the constant factor in the front, not the asymptotic complexity.

Answered by Cory Nelson

No, they're not the same speed. Who told you that?

Agner Fog's instruction tables show that when using 32-bit integer registers, Haswell's ADD/SUB take 0.25–1 cycles (depending on how well pipelined your instructions are) while MUL takes 2–4 cycles. Floating-point is the other way around: ADDSS/SUBSS take 1–3 cycles while MULSS takes 0.5–5 cycles.

Answered by trumpetlicks

This is an even more complex answer than simply multiplication versus addition. In reality the answer will most likely NEVER be yes. Multiplication, electronically, is a much more complicated circuit. Much of the reason why is that multiplication is the act of a multiplication step followed by an addition step; remember what it was like to multiply decimal numbers by hand, before using a calculator.

The other thing to remember is that multiplication will take longer or shorter depending on the architecture of the processor you are running it on. This may or may not be simply company specific. While an AMD will most likely be different than an Intel, even an Intel i7 may be different from a core 2 (within the same generation), and certainly different between generations (especially the farther back you go).

In all TECHNICALITY, if multiplies were the only thing you were doing (without looping, counting, etc.), multiplies would be 2 to (as I've seen on PPC architectures) 35 times slower. This is more an exercise in understanding your architecture and electronics.

In addition: it should be noted that a processor COULD be built in which ALL operations, including a multiply, take a single clock. What this processor would have to do is get rid of all pipelining and slow the clock so that the HW latency of any op's circuit is less than or equal to the latency PROVIDED by the clock timing.

To do this would get rid of the inherent performance gains we are able to get when adding pipelining into a processor. Pipelining is the idea of taking a task and breaking it down into smaller sub-tasks that can be performed much quicker. By storing and forwarding the results of each sub-task between sub-tasks, we can now run a faster clock rate that only needs to allow for the longest latency of the sub-tasks, and not from the overarching task as a whole.

Picture of time through a multiply:

|--------------------------------------------------| Non-Pipelined

|--Step 1--|--Step 2--|--Step 3--|--Step 4--|--Step 5--| Pipelined

In the above diagram, the non-pipelined circuit takes 50 units of time. In the pipelined version, we have split the 50 units into 5 steps, each taking 10 units of time, with a store step in between. It is EXTREMELY important to note that in the pipelined example, each of the steps can work completely on its own, in parallel. For an operation to be completed it must move through all 5 steps in order, but other instances of the same operation (with different operands) can occupy steps 1, 3, 4, and 5 while one is in step 2.

With all of this being said, the pipelined approach allows us to fill the operator continuously and get a result out on each clock cycle, IF we are able to order our operations so that we perform all of one operation before switching to another. The only timing hit we take is the number of clocks originally needed to get the FIRST operation out of the pipeline.

Mystical brings up another good point. It is also important to look at the architecture from a more systems perspective. It is true that the newer Haswell architecture was built to improve floating-point multiply performance within the processor. For this reason, at the system level, it was architected to allow multiple multiplies to occur simultaneously, versus an add, which can only happen once per system clock.

All of this can be summed up as follows:

  1. Each architecture is different from a lower level HW perspective as well as from a system perspective
  2. FUNCTIONALLY, a multiply will always take more time than an add because it combines a true multiply along with a true addition step.
  3. Understand the architecture you are trying to run your code on, and find the right balance between readability and getting truly the best performance from that architecture.

Answered by Peter Cordes

Intel since Haswell has

  • add performance of 4/clock throughput, 1 cycle latency. (Any operand-size)
  • imul performance of 1/clock throughput, 3 cycle latency. (Any operand-size)

Ryzen is similar. Bulldozer-family has much lower integer throughput and a not-fully-pipelined multiply, including extra-slow 64-bit operand-size multiply. See https://agner.org/optimize/ and other links in https://stackoverflow.com/tags/x86/info

But a good compiler could auto-vectorize your loops. (SIMD-integer multiply throughput and latency are both worse than SIMD-integer add.) Or simply constant-propagate through them to just print out the answer! Clang really does know the closed-form Gauss's formula for sum(i=0..n) and can recognize some loops that do that.



You forgot to enable optimization, so both loops bottleneck on the ALU + store/reload latency of keeping sum in memory between each of sum += independent stuff and sum++. See Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? for more about just how bad the resulting asm is, and why that's the case. clang++ defaults to -O0 (debug mode: keep variables in memory where a debugger can modify them between any C++ statements).

Store-forwarding latency on a modern x86 like Sandybridge-family (including Haswell and Skylake) is about 3 to 5 cycles, depending on timing of the reload. So with a 1-cycle latency ALU add in there, too, you're looking at about two 6-cycle latency steps in the critical path for this loop. (Plenty to hide all the store/reload and calculation based on i, and the loop-counter update.)

See also Adding a redundant assignment speeds up code when compiled without optimization for another no-optimization benchmark. In that one, store-forwarding latency is actually reduced by having more independent work in the loop, delaying the reload attempt.



Modern x86 CPUs have 1/clock multiply throughput so even with optimization you wouldn't see a throughput bottleneck from it. Or on Bulldozer-family, not fully pipelined with 1 per 2-clock throughput.

More likely you'd bottleneck on the front-end work of getting all the work issued every cycle.

Although lea does allow very efficient copy-and-add, doing i + i + 1 with a single instruction. Although really a good compiler would see that the loop only uses 2*i and optimize to increment by 2, i.e. a strength-reduction to do repeated addition by 2 instead of having to shift inside the loop.

And of course with optimization the extra sum++ can just fold into sum += stuff, where stuff already includes a constant. Not so with the multiply.

Answered by user3684405

A multiplication requires a final step of an addition of, at minimum, the same size as the number; so it will take longer than an addition. In decimal:

    123
    112
   ----
   +246  ----
   123      | matrix generation  
  123    ----
  -----
  13776 <---------------- Addition

Same applies in binary, with a more elaborate reduction of the matrix.

That said, reasons why they may take the same amount of time:

  1. To simplify the pipelined architecture, all regular instructions can be designed to take the same amount of cycles (exceptions are memory moves for instance, that depend on how long it takes to talk to external memory).
  2. Since the adder for the final step of the multiplier is just like the adder for an add instruction... why not use the same adder by skipping the matrix generation and reduction? If they use the same adder, then obviously they will take the same amount of time.

Of course, there are more complex architectures where this is not the case, and you might obtain completely different values. You also have architectures that take several instructions in parallel when they don't depend on each other, and then you are a bit at the mercy of your compiler... and of the operating system.

The only way to run this test rigorously would be to run it in assembly and without an operating system - otherwise there are too many variables.

Answered by Kenneth C Richardson

I came to this thread to get an idea of what the modern processors are doing in regard to integer math and the number of cycles required to do them. I worked on this problem of speeding up 32-bit integer multiplies and divides on the 65c816 processor in the 1990's. Using the method below, I was able to triple the speed of the standard math libraries available in the ORCA/M compilers at the time.

So the idea that multiplies are faster than adds is simply not the case (except rarely), but, as people have said, it depends upon how the architecture is implemented. If enough steps can be performed between clock cycles, yes, a multiply could effectively be the same speed as an add based on the clock, but there would be a lot of wasted time. In that case it would be nice to have an instruction that performs multiple (dependent) adds/subtracts given one instruction and multiple values. One can dream.

On the 65c816 processor, there were no multiply or divide instructions. Mult and Div were done with shifts and adds.
To perform a 16 bit add, you would do the following:

LDA $0000 - loaded a value into the Accumulator (5 cycles)
ADC $0002 - add with carry (5 cycles)
STA $0004 - store the value in the Accumulator back to memory (5 cycles)
15 cycles total for an add

multiply by 2  (0000 0010)

LDA $0000 - loaded a value into the Accumulator (5 cycles)
ASL - shift left (2 cycles)
STA $0004 - store the value in the Accumulator back to memory (5 cycles)
12 cycles plus call overhead

multiply by (0101 1010)

LDA $0000 - loaded a value into the Accumulator (5 cycles)
ASL - shift left (2 cycles)
ASL - shift left (2 cycles)
ADC $0000 - add with carry for next bit (5 cycles)
ASL - shift left (2 cycles)
ADC $0000 - add with carry for next bit (5 cycles)
ASL - shift left (2 cycles)
ASL - shift left (2 cycles)
ADC $0000 - add with carry for next bit (5 cycles)
ASL - shift left (2 cycles)
STA $0004 - store the value in the Accumulator back to memory (5 cycles)
37 cycles plus call overhead

If dealing with a call like from C, you would have the additional overhead of pushing and pulling values off the stack. Creating routines that would do two multiplies at once would save overhead, for example.

The traditional way of doing the multiply is to shift and add through the entire value of one number. Each time the carry became a one as it was shifted left meant you needed to add the value again. This required a test of each bit and a shift of the result.

I replaced that with a lookup table of 256 items so that the carry bits would not need to be checked. It was also possible to determine overflow before doing the multiply, to not waste time. (On a modern processor this could be done in parallel, but I don't know if they do this in the hardware.) Given two 32-bit numbers and prescreened overflow, one of the multipliers is always 16 bits or less, thus one would only need to run through 8-bit multiplies once or twice to perform the entire 32-bit multiply. The result of this was multiplies that were 3 times as fast.

The speed of the 16-bit multiplies ranged from 12 cycles to about 37 cycles.

Since the data bus of the Apple IIgs for which this was written was only 8 bits wide, loading a 16-bit value required 5 cycles to load from memory: one extra for the pointer, and one extra cycle for the second byte.

LDA instruction (1 cycle because it is an 8-bit value)
$0000 (a 16-bit value requires two cycles to load)
memory location (requires two cycles to load because of the 8-bit data bus)

Modern processors would be able to do this faster because they have, at worst, a 32-bit data bus. In the processor logic itself, the system of gates would have no additional delay at all compared to the data bus delay, since the whole value would get loaded at once.

To do the complete 32 bit multiply, you would need to do the above twice and add the results together to get the final answer. The modern processors should be able to do the two in parallel and add the results for the answer. Combined with the overflow precheck done in parallel, it would minimize the time required to do the multiply.

Anyway, it is readily apparent that multiplies require significantly more effort than an add. How many steps are needed to process the operation between CPU clock cycles determines how many cycles of the clock are required. If the clock is slow enough, then the adds would appear to be the same speed as a multiply.

Regards, Ken

Answered by mathreadler

Even if it were, that mostly tells us what restriction the clock puts on our hardware. We can't clock higher because of heat(?), but the number of ADD-instruction gate delays a signal could pass through during one clock could be very many, while a single ADD instruction would only utilize one of them. So while it may at some point take equally many clock cycles, not all of the propagation time for the signals is utilized.

If we could clock higher, we could definitely make ADD faster, probably by several orders of magnitude.

Answered by cmaster - reinstate monica

This really depends on your machine. Of course, integer multiplication is quite complex compared to addition, but quite a few AMD CPUs can execute a multiplication in a single cycle. That is just as fast as addition.

Other CPUs take three or four cycles to do a multiplication, which is a bit slower than addition. But it's nowhere near the performance penalty you had to suffer ten years ago (back then a 32-Bit multiplication could take thirty-something cycles on some CPUs).

So, yes, multiplication is in the same speed class nowadays, but no, it's still not exactly as fast as addition on all CPUs.

Answered by user541686

No it's not, and in fact it's noticeably slower (which translated into a 15% performance hit for the particular real-world program I was running).

I realized this myself when asking this question here just a few days ago.

Answered by Brian Johnson

Since the other answers deal with real, present-day devices -- which are bound to change and improve as time passes -- I thought we could look at the question from the theoretical side.

Proposition: When implemented in logic gates, using the usual algorithms, an integer multiplication circuit is O(log N) times slower than an addition circuit, where N is the number of bits in a word.

Proof: The time for a combinational circuit to stabilise is proportional to the depth of the longest sequence of logic gates from any input to any output. So we must show that a grade-school multiply circuit is O(log N) times deeper than an addition circuit.

Addition is normally implemented as a half adder followed by N-1 full adders, with the carry bits chained from one adder to the next. This circuit clearly has depth O(N). (This circuit can be optimized in many ways, but the worst case performance will always be O(N) unless absurdly large lookup tables are used.)

To multiply A by B, we first need to multiply each bit of A with each bit of B. Each bitwise multiply is simply an AND gate. There are N^2 bitwise multiplications to perform, hence N^2 AND gates -- but all of them can execute in parallel, for a circuit depth of 1. This solves the multiplication phase of the gradeschool algorithm, leaving just the addition phase.

In the addition phase, we can combine the partial products using an inverted binary tree-shaped circuit to do many of the additions in parallel. The tree will be (log N) nodes deep, and at each node, we will be adding together two numbers with O(N) bits. This means each node can be implemented with an adder of depth O(N), giving a total circuit depth of O(N log N). QED.
