在 C 中使用移位运算符的乘法和除法实际上更快吗?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/6357038/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow


Is multiplication and division using shift operators in C actually faster?

c++, c, division, multiplication, bit-shift

提问by eku

Multiplication and division can be achieved using bit operators, for example

乘法和除法可以使用位运算符来实现,例如

i*2 = i<<1
i*3 = (i<<1) + i;
i*10 = (i<<3) + (i<<1)

and so on.

等等。

Is it actually faster to use, say, (i<<3) + (i<<1) to multiply by 10 than using i*10 directly? Is there any sort of input that can't be multiplied or divided in this way?

比如说,用 (i<<3) + (i<<1) 来乘以 10,真的比直接写 i*10 更快吗?是否存在无法用这种方式进行乘除的输入?

回答by Drew Hall

Short answer: Not likely.

简短回答:不太可能。

Long answer: Your compiler has an optimizer in it that knows how to multiply as quickly as your target processor architecture is capable. Your best bet is to tell the compiler your intent clearly (i.e. i*2 rather than i << 1) and let it decide what the fastest assembly/machine code sequence is. It's even possible that the processor itself has implemented the multiply instruction as a sequence of shifts & adds in microcode.

长答案:你的编译器带有优化器,它知道如何在你的目标处理器架构上以尽可能快的方式完成乘法。最好的做法是向编译器清楚表达你的意图(即写 i*2 而不是 i << 1),让它来决定最快的汇编/机器码序列。甚至处理器本身也可能在微码中把乘法指令实现为一系列移位与加法。

Bottom line--don't spend a lot of time worrying about this. If you mean to shift, shift. If you mean to multiply, multiply. Do what is semantically clearest--your coworkers will thank you later. Or, more likely, curse you later if you do otherwise.

底线:不要在这件事上花太多心思。想移位就写移位,想乘法就写乘法。写语义上最清晰的代码,你的同事以后会感谢你;如果你反其道而行,他们更可能在以后诅咒你。
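
As a quick illustration (a minimal sketch of mine, not part of the original answer; it assumes a typical x86 compiler such as GCC or Clang at -O2):

作为一个快速示意(这是我补充的最小示例,并非原回答内容;假设使用典型的 x86 编译器,如 GCC 或 Clang,并开启 -O2):

/* Sketch: compile with e.g. "gcc -O2 -S mul10.c" and compare the assembly
 * generated for the two functions.                                        */

int mul10_plain(int i)
{
    return i * 10;               /* states the intent directly             */
}

int mul10_shifts(int i)
{
    return (i << 3) + (i << 1);  /* the "hand-optimised" shift version     */
}

/* Both typically end up as the same cheap sequence (something like
 * "lea eax,[rdi+rdi*4]; add eax,eax" on x86-64), i.e. the optimiser
 * already picks the best way to multiply by 10.                           */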

回答by James Kanze

Just a concrete point of measure: many years back, I benchmarked two versions of my hashing algorithm:

只是一个具体的衡量点:多年前,我对散列算法的两个版本进行了基准测试:

unsigned
hash( char const* s )
{
    unsigned h = 0;
    while ( *s != '\0' ) {
        h = 127 * h + (unsigned char)*s;
        ++ s;
    }
    return h;
}

and

unsigned
hash( char const* s )
{
    unsigned h = 0;
    while ( *s != '\0' ) {
        h = (h << 7) - h + (unsigned char)*s;
        ++ s;
    }
    return h;
}

On every machine I benchmarked it on, the first was at least as fast as the second. Somewhat surprisingly, it was sometimes faster (e.g. on a Sun Sparc). When the hardware didn't support fast multiplication (and most didn't back then), the compiler would convert the multiplication into the appropriate combinations of shifts and add/sub. And because it knew the final goal, it could sometimes do so in fewer instructions than when you explicitly wrote the shifts and the add/subs.

在我测试过的每一台机器上,第一个版本都至少和第二个版本一样快。有点出人意料的是,它有时还更快(例如在 Sun Sparc 上)。当硬件不支持快速乘法时(当年大多数硬件都不支持),编译器会把乘法转换成合适的移位与加/减组合。而且因为它知道最终目标是什么,它有时能用比你显式写出移位和加/减更少的指令来完成。

Note that this was something like 15 years ago. Hopefully, compilers have only gotten better since then, so you can pretty much count on the compiler doing the right thing, probably better than you could. (Also, the reason the code looks so C'ish is because it was over 15 years ago. I'd obviously use std::string and iterators today.)

请注意,这大约是 15 年前的事了。希望从那时起编译器只会变得更好,所以你基本可以放心地指望编译器做正确的事,而且可能做得比你更好。(另外,这段代码看起来如此 C 风格,是因为它写于 15 年多以前;放在今天,我显然会使用 std::string 和迭代器。)

回答by Eric Lippert

In addition to all the other good answers here, let me point out another reason to not use shift when you mean divide or multiply. I have never once seen someone introduce a bug by forgetting the relative precedence of multiplication and addition. I have seen bugs introduced when maintenance programmers forgot that "multiplying" via a shift is logically a multiplication but not syntactically of the same precedence as multiplication. x * 2 + z and x << 1 + z are very different!

除了这里其他优秀回答之外,让我再指出一个当你想表达除法或乘法时不要用移位的理由。我从未见过有人因为忘记乘法和加法的相对优先级而引入 bug;但我见过这样的 bug:维护代码的程序员忘了用移位实现的"乘法"在逻辑上是乘法,在语法上的优先级却和乘法不同。x * 2 + z 和 x << 1 + z 是完全不同的!

If you're working on numbers then use arithmetic operators like + - * / %. If you're working on arrays of bits, use bit twiddling operators like & ^ | >>. Don't mix them; an expression that has both bit twiddling and arithmetic is a bug waiting to happen.

如果你处理的是数字,就使用 + - * / % 这类算术运算符;如果你处理的是位的数组,就使用 & ^ | >> 这类位运算符。不要把它们混在一起:一个既有位运算又有算术运算的表达式,就是一个随时会出问题的隐患。
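
A minimal sketch (not part of the original answer) showing the precedence trap described above; the shift binds more loosely than +, so the parenthesised form is what was actually meant:

下面是一个最小示例(并非原回答内容),演示上面提到的优先级陷阱:<< 的优先级比 + 低,所以带括号的写法才是本来想表达的意思:

#include <stdio.h>

int main(void)
{
    int x = 3, z = 1;

    /* << binds more loosely than +, so x << 1 + z parses as x << (1 + z). */
    printf("x * 2 + z    = %d\n", x * 2 + z);      /* 7                     */
    printf("x << 1 + z   = %d\n", x << 1 + z);     /* 12, i.e. x << 2       */
    printf("(x << 1) + z = %d\n", (x << 1) + z);   /* 7, what was intended  */
    return 0;
}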

回答by Jens

This depends on the processor and the compiler. Some compilers already optimize code this way, others don't. So you need to check each time your code needs to be optimized this way.

这取决于处理器和编译器。一些编译器已经以这种方式优化代码,而另一些则没有。因此,每次需要以这种方式优化代码时,您都需要检查。

Unless you desperately need to optimize, I would not scramble my source code just to save an assembly instruction or processor cycle.

除非您迫切需要优化,否则我不会为了节省汇编指令或处理器周期而打乱我的源代码。

回答by Tony Delroy

Is it actually faster to use say (i<<3)+(i<<1) to multiply with 10 than using i*10 directly?

比如说,用 (i<<3)+(i<<1) 来乘以 10,真的比直接写 i*10 更快吗?

It might or might not be on your machine - if you care, measure in your real-world usage.

在你的机器上可能更快,也可能不是。如果你在意,就在真实的使用场景中测一测。

A case study - from 486 to core i7

案例研究:从 486 到 Core i7

Benchmarking is very difficult to do meaningfully, but we can look at a few facts. From http://www.penguin.cz/~literakl/intel/s.html#SAL and http://www.penguin.cz/~literakl/intel/i.html#IMUL we get an idea of x86 clock cycles needed for arithmetic shift and multiplication. Say we stick to "486" (the newest one listed), 32 bit registers and immediates, IMUL takes 13-42 cycles and IDIV 44. Each SAL takes 2, and adding 1, so even with a few of those together shifting superficially looks like a winner.

基准测试很难做得有意义,但我们可以先看几个事实。从 http://www.penguin.cz/~literakl/intel/s.html#SAL 和 http://www.penguin.cz/~literakl/intel/i.html#IMUL 可以了解算术移位和乘法所需的 x86 时钟周期。假设我们以"486"(列表中最新的型号)、32 位寄存器和立即数为准:IMUL 需要 13-42 个周期,IDIV 需要 44 个;每条 SAL 只要 2 个周期,加法要 1 个。所以即使把几次移位和加法放在一起,表面上看移位似乎还是赢家。
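
As a rough back-of-the-envelope count based on those 486 numbers (my own sketch; real cycle counts depend on operands, addressing modes and surrounding code):

根据上面这些 486 的数字做一个粗略估算(我自己加的示意;真实周期数取决于操作数、寻址方式和上下文代码):

/* i * 10 on a 486, very roughly:
 *
 *   one IMUL              : 13..42 cycles
 *   (i << 3) + (i << 1)   : SAL(2) + SAL(2) + ADD(1) = ~5 cycles,
 *                           ignoring the extra register moves
 *
 * which is exactly why this shift/add folklore took hold in the first place. */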

These days, with the core i7:

如今,在 Core i7 上:

(from http://software.intel.com/en-us/forums/showthread.php?t=61481)

(来自http://software.intel.com/en-us/forums/showthread.php?t=61481

The latency is 1 cycle for an integer addition and 3 cycles for an integer multiplication. You can find the latencies and throughput in Appendix C of the "Intel® 64 and IA-32 Architectures Optimization Reference Manual", which is located on http://www.intel.com/products/processor/manuals/.

整数加法的延迟为 1 个周期,整数乘法的延迟为 3 个周期。您可以在"Intel® 64 and IA-32 Architectures Optimization Reference Manual"的附录 C 中找到延迟和吞吐量数据,该手册位于 http://www.intel.com/products/processor/manuals/。

(from some Intel blurb)

(来自一些英特尔简介)

Using SSE, the Core i7 can issue simultaneous add and multiply instructions, resulting in a peak rate of 8 floating-point operations (FLOP) per clock cycle

使用 SSE,Core i7 可以同时发出加法和乘法指令,从而达到每个时钟周期 8 次浮点运算 (FLOP) 的峰值速率

That gives you an idea of how far things have come. The optimisation trivia - like bit shifting versus * - that was taken seriously even into the 90s is just obsolete now. Bit-shifting is still faster, but for non-power-of-two mul/div by the time you do all your shifts and add the results it's slower again. Then, more instructions means more cache faults, more potential issues in pipelining, more use of temporary registers may mean more saving and restoring of register content from the stack... it quickly gets too complicated to quantify all the impacts definitively but they're predominantly negative.

这能让你了解事情已经发展到什么程度。像"移位还是乘法(*)"这种直到 90 年代还被认真对待的优化细节,如今已经过时了。移位本身仍然更快,但对于非 2 的幂的乘除,等你做完所有移位再把结果加起来,反而又更慢了。而且,更多的指令意味着更多的缓存缺失、更多潜在的流水线问题;更多地使用临时寄存器可能意味着更多地向栈上保存和从栈上恢复寄存器内容……这很快就复杂到无法明确量化所有影响,但总体上它们是负面的。

functionality in source code vs implementation

源代码与实现中的功能

More generally, your question is tagged C and C++. As 3rd generation languages, they're specifically designed to hide the details of the underlying CPU instruction set. To satisfy their language Standards, they must support multiplication and shifting operations (and many others) even if the underlying hardware doesn't. In such cases, they must synthesize the required result using many other instructions. Similarly, they must provide software support for floating point operations if the CPU lacks it and there's no FPU. Modern CPUs all support * and <<, so this might seem absurdly theoretical and historical, but the significant thing is that the freedom to choose implementation goes both ways: even if the CPU has an instruction that implements the operation requested in the source code in the general case, the compiler's free to choose something else that it prefers because it's better for the specific case the compiler's faced with.

更一般地说,你的问题同时打了 C 和 C++ 标签。作为第三代语言,它们本来就是设计来隐藏底层 CPU 指令集细节的。为了满足各自的语言标准,即使底层硬件不支持,它们也必须支持乘法和移位等操作(以及许多其他操作);在这种情况下,它们必须用许多其他指令来合成所需的结果。同样,如果 CPU 缺少浮点运算能力且没有 FPU,它们必须用软件来支持浮点运算。现代 CPU 都支持 * 和 <<,所以这看起来可能纯属理论和历史,但重要的是:选择实现方式的自由是双向的。即使 CPU 有一条指令能在一般情况下实现源代码所请求的操作,编译器也可以自由选择它更喜欢的其他方式,因为那对编译器当下面对的特定情况更有利。

Examples (with a hypothetical assembly language)

示例(使用假设的汇编语言)

source           literal approach         optimised approach
#define N 0
int x;           .word x                xor registerA, registerA
x *= N;          move x -> registerA
                 move x -> registerB
                 A = B * immediate(0)
                 store registerA -> x
  ...............do something more with x...............

Instructions like exclusive or (xor) have no relationship to the source code, but xor-ing anything with itself clears all the bits, so it can be used to set something to 0. Source code that implies memory addresses may not entail any being used.

像异或(xor)这样的指令和源代码没有直接对应关系,但任何值与它自身做异或都会把所有位清零,因此它可以用来把某个值设为 0。暗示要用到内存地址的源代码,最终也可能根本不涉及任何内存地址。

These kind of hacks have been used for as long as computers have been around. In the early days of 3GLs, to secure developer uptake the compiler output had to satisfy the existing hardcore hand-optimising assembly-language dev. community that the produced code wasn't slower, more verbose or otherwise worse. Compilers quickly adopted lots of great optimisations - they became a better centralised store of it than any individual assembly language programmer could possibly be, though there's always the chance that they miss a specific optimisation that happens to be crucial in a specific case - humans can sometimes nut it out and grope for something better while compilers just do as they've been told until someone feeds that experience back into them.

这类技巧从计算机诞生起就一直在用。在第三代语言(3GL)的早期,为了让开发者愿意采用,编译器的输出必须让当时那批硬核的、手工优化汇编的开发者社区信服:生成的代码不会更慢、更冗长,或在其他方面更糟。编译器很快吸收了大量出色的优化,成为比任何一个汇编程序员个人都更好的优化集中库,尽管它们总有可能漏掉某个在特定场景下恰好至关重要的优化。人类有时能琢磨出更好的办法,而编译器只会按既有规则行事,直到有人把这种经验反馈回编译器里。

So, even if shifting and adding is still faster on some particular hardware, then the compiler writer's likely to have worked out exactly when it's both safe and beneficial.

因此,即使在某些特定硬件上移位加加法仍然更快,编译器的作者也很可能已经准确弄清了什么时候这样做既安全又有益。

Maintainability

可维护性

If your hardware changes you can recompile and it'll look at the target CPU and make another best choice, whereas you're unlikely to ever want to revisit your "optimisations" or list which compilation environments should use multiplication and which should shift. Think of all the non-power-of-two bit-shifted "optimisations" written 10+ years ago that are now slowing down the code they're in as it runs on modern processors...!

如果你的硬件变了,你只需重新编译,编译器会针对目标 CPU 重新做出最佳选择;而你大概不会愿意再回头检查你的那些"优化",或者维护一份清单,标明哪些编译环境该用乘法、哪些该用移位。想想那些 10 多年前写下的非 2 的幂的位移"优化",如今在现代处理器上运行时,反而拖慢了它们所在的代码……!

Thankfully, good compilers like GCC can typically replace a series of bitshifts and arithmetic with a direct multiplication when any optimisation is enabled (i.e. ...main(...) { return (argc << 4) + (argc << 2) + argc; } -> imull $21, 8(%ebp), %eax) so a recompilation may help even without fixing the code, but that's not guaranteed.

值得庆幸的是,只要启用了任意级别的优化,像 GCC 这样的优秀编译器通常可以用一次直接乘法替换一系列位移和算术运算(即 ...main(...) { return (argc << 4) + (argc << 2) + argc; } -> imull $21, 8(%ebp), %eax),因此即使不修改代码,重新编译也可能有所帮助,但这并不能保证。
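
For reference, here is a complete translation unit you could feed to such a compiler to reproduce that observation (my own sketch; the exact instruction chosen will vary with compiler version and target):

作为参考,下面是一个可以直接交给这类编译器来重现上述现象的完整编译单元(我自己加的示意;具体选出的指令会随编译器版本和目标平台而变化):

/* Sketch: compile with something like "gcc -O2 -S mul21.c" and look in the
 * generated assembly for a single multiply (e.g. "imull $21, ...") in place
 * of the three shifts and adds.                                             */

int main(int argc, char *argv[])
{
    (void)argv;                              /* unused                       */
    return (argc << 4) + (argc << 2) + argc; /* 16*argc + 4*argc + argc      */
}                                            /* == argc * 21                 */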

Strange bitshifting code implementing multiplication or division is far less expressive of what you were conceptually trying to achieve, so other developers will be confused by that, and a confused programmer's more likely to introduce bugs or remove something essential in an effort to restore seeming sanity. If you only do non-obvious things when they're really tangibly beneficial, and then document them well (but don't document other stuff that's intuitive anyway), everyone will be happier.

用奇怪的位移代码去实现乘法或除法,远不能表达你在概念上想要做的事情,其他开发者会因此感到困惑;而一个被搞糊涂的程序员更容易引入 bug,或者为了让代码恢复表面上的"正常"而删掉某些关键的东西。只有当某件不显然的事情确实带来切实好处时才去做它,并把它记录清楚(但不要去记录那些本来就直观的东西),这样大家都会更开心。

General solutions versus partial solutions

一般解决方案与部分解决方案

If you have some extra knowledge, such as that your int will really only be storing values x, y and z, then you may be able to work out some instructions that work for those values and get you your result more quickly than when the compiler doesn't have that insight and needs an implementation that works for all int values. For example, consider your question:

如果你掌握一些额外信息,例如你的 int 实际上只会存储 x、y、z 这几个值,那么你也许能想出一组只对这些值有效的指令,从而比编译器更快地得到结果;编译器没有这种洞察力,必须给出对所有 int 值都正确的实现。例如,考虑你的问题:

Multiplication and division can be achieved using bit operators...

乘法和除法可以使用位运算符来实现...

You illustrate multiplication, but how about division?

你举例说明了乘法,但除法呢?

int x;
x >> 1;   // divide by 2?

According to the C++ Standard 5.8:

根据 C++ 标准 5.8:

-3- The value of E1 >> E2 is E1 right-shifted E2 bit positions. If E1 has an unsigned type or if E1 has a signed type and a nonnegative value, the value of the result is the integral part of the quotient of E1 divided by the quantity 2 raised to the power E2. If E1 has a signed type and a negative value, the resulting value is implementation-defined.

-3- E1 >> E2 的值是将 E1 右移 E2 个比特位。如果 E1 是无符号类型,或者 E1 是有符号类型且值非负,则结果的值是 E1 除以 2 的 E2 次幂所得商的整数部分。如果 E1 是有符号类型且值为负,则结果值由实现定义。

So, your bit shift has an implementation defined result when x is negative: it may not work the same way on different machines. But, / works far more predictably. (It may not be perfectly consistent either, as different machines may have different representations of negative numbers, and hence different ranges even when there are the same number of bits making up the representation.)

因此,当 x 为负时,你的位移具有实现定义的结果:它在不同机器上未必以同样的方式工作。而 / 的行为则可预测得多。(它也未必完全一致,因为不同机器对负数可能有不同的表示方式,因此即便表示所用的位数相同,取值范围也可能不同。)

You may say "I don't care... that int is storing the age of the employee, it can never be negative". If you have that kind of special insight, then yes - your >> safe optimisation might be passed over by the compiler unless you explicitly do it in your code. But, it's risky and rarely useful as much of the time you won't have this kind of insight, and other programmers working on the same code won't know that you've bet the house on some unusual expectations of the data you'll be handling... what seems a totally safe change to them might backfire because of your "optimisation".

你可能会说"我不在乎……这个 int 存的是员工年龄,永远不可能是负数"。如果你有这种特殊的洞察力,那么是的:除非你在代码里显式写出来,否则编译器可能会放过这种用 >> 的"安全"优化。但这样做有风险,而且很少有用,因为大多数时候你并没有这种洞察力;而且维护同一份代码的其他程序员也不会知道,你已经把宝押在了对所处理数据的某些不寻常假设上……在他们看来完全安全的改动,可能会因为你的"优化"而适得其反。

Is there any sort of input that can't be multiplied or divided in this way?

是否有任何类型的输入不能以这种方式相乘或相除?

Yes... as mentioned above, negative numbers have implementation defined behaviour when "divided" by bit-shifting.

是的......如上所述,负数在被位移“除”时具有实现定义的行为。

回答by user703016

Just tried on my machine compiling this :

刚在我的机器上试过编译这个:

int a = ...;
int b = a * 10;

When disassembled, it produces this output:

反汇编后得到如下输出:

MOV EAX,DWORD PTR SS:[ESP+1C] ; Move a into EAX
LEA EAX,DWORD PTR DS:[EAX+EAX*4] ; Multiply by 5 without shift !
SHL EAX, 1 ; Multiply by 2 using shift

This version is faster than your hand-optimized code with pure shifting and addition.

此版本比您手动优化的纯移位和加法代码更快。

You really never know what the compiler is going to come up with, so it's better to simply write a normal multiplication and let him optimize the way he wants to, except in very precise cases where you know the compiler cannot optimize.

你真的永远不知道编译器会生成什么,所以最好就老老实实写一个普通的乘法,让编译器按它想要的方式去优化;除非是在极少数你确定编译器无法优化的情况下。

回答by Mike Kwan

Shifting is generally a lot faster than multiplying at an instruction level but you may well be wasting your time doing premature optimisations. The compiler may well perform these optimisations at compile time. Doing it yourself will affect readability and possibly have no effect on performance. It's probably only worth it to do things like this if you have profiled and found this to be a bottleneck.

在指令级别,移位通常比乘法快得多,但您很可能会浪费时间进行过早的优化。编译器很可能在编译时执行这些优化。自己做会影响可读性,可能对性能没有影响。如果您已经分析并发现这是一个瓶颈,那么做这样的事情可能是值得的。

Actually the division trick, known as 'magic division', can yield huge payoffs. Again you should profile first to see if it's needed. But if you do use it there are useful programs around to help you figure out what instructions are needed for the same division semantics. Here is an example: http://www.masm32.com/board/index.php?topic=12421.0

实际上,被称为"魔法除法"(magic division)的除法技巧确实能带来巨大的收益。同样,你应该先做性能分析,看看是否真的需要它。但如果确实要用,有一些现成的工具可以帮你算出实现同样除法语义所需的指令。这里有一个例子:http://www.masm32.com/board/index.php?topic=12421.0

An example which I have lifted from the OP's thread on MASM32:

下面是我从 MASM32 上 OP 的帖子里摘录的一个例子:

include ConstDiv.inc
...
mov eax,9999999
; divide eax by 100000
cdiv 100000
; edx = quotient

Would generate:

会产生:

mov eax,9999999
mov edx,0A7C5AC47h
add eax,1
.if !CARRY?
    mul edx
.endif
shr edx,16
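
For the curious, the same reciprocal-multiplication idea can be sketched in portable C (my own illustration, not taken from the MASM32 thread; the constant below is the well-known pair magic = 0xCCCCCCCD, shift = 35 for unsigned 32-bit division by 10):

顺便一提,同样的"乘以倒数"思路也可以用可移植的 C 大致写出来(这是我补充的示意,并非出自那个 MASM32 帖子;下面的常数是对无符号 32 位数除以 10 的著名组合:magic = 0xCCCCCCCD,右移 35 位):

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Sketch of "magic" division by a constant: divide an unsigned 32-bit value
 * by 10 with one multiply and one shift instead of a DIV instruction.
 * 0xCCCCCCCD is ceil(2^35 / 10); taking the 64-bit product and shifting it
 * right by 35 yields the exact quotient for every uint32_t input.           */
static uint32_t div10(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35);
}

int main(void)
{
    uint32_t tests[] = { 0u, 9u, 10u, 99u, 12345u, 4294967295u };
    for (size_t i = 0; i < sizeof tests / sizeof tests[0]; ++i)
        printf("%u / 10 = %u (magic: %u)\n",
               (unsigned)tests[i], (unsigned)(tests[i] / 10u),
               (unsigned)div10(tests[i]));
    return 0;
}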

回答by Paul R

Shift and integer multiply instructions have similar performance on most modern CPUs - integer multiply instructions were relatively slow back in the 1980s but in general this is no longer true. Integer multiply instructions may have higher latency, so there may still be cases where a shift is preferable. Ditto for cases where you can keep more execution units busy (although this can cut both ways).

在大多数现代 CPU 上,移位指令和整数乘法指令的性能相近。整数乘法指令在 20 世纪 80 年代相对较慢,但如今一般已不再如此。整数乘法指令的延迟可能更高,所以仍然可能存在移位更可取的场合;在你能让更多执行单元保持忙碌的情况下也是如此(不过这也可能起反作用)。

Integer division is still relatively slow though, so using a shift instead of division by a power of 2 is still a win, and most compilers will implement this as an optimisation. Note however that for this optimisation to be valid the dividend needs to be either unsigned or must be known to be positive. For a negative dividend the shift and divide are not equivalent!

不过,整数除法仍然相对较慢,所以用移位代替除以 2 的幂仍然划算,而且大多数编译器会把它作为一种优化来实现。但请注意,要让这种优化成立,被除数要么必须是无符号的,要么必须已知为非负。对于负的被除数,移位和除法并不等价!

#include <stdio.h>

int main(void)
{
    int i;

    for (i = 5; i >= -5; --i)
    {
        printf("%d / 2 = %d, %d >> 1 = %d\n", i, i / 2, i, i >> 1);
    }
    return 0;
}

Output:

输出:

5 / 2 = 2, 5 >> 1 = 2
4 / 2 = 2, 4 >> 1 = 2
3 / 2 = 1, 3 >> 1 = 1
2 / 2 = 1, 2 >> 1 = 1
1 / 2 = 0, 1 >> 1 = 0
0 / 2 = 0, 0 >> 1 = 0
-1 / 2 = 0, -1 >> 1 = -1
-2 / 2 = -1, -2 >> 1 = -1
-3 / 2 = -1, -3 >> 1 = -2
-4 / 2 = -2, -4 >> 1 = -2
-5 / 2 = -2, -5 >> 1 = -3

So if you want to help the compiler then make sure the variable or expression in the dividend is explicitly unsigned.

因此,如果您想帮助编译器,请确保被除数中的变量或表达式是显式无符号的。
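
A small sketch of what that advice typically buys you (my own example; the exact instructions depend on compiler and target, and the comments describe common x86-64 output at -O2):

下面用一个小例子示意这条建议通常能带来什么(这是我加的示例;具体指令取决于编译器和目标平台,注释描述的是 x86-64 上 -O2 的常见输出):

int half_signed(int x)
{
    return x / 2;        /* usually needs a sign fix-up (shift plus adjust),
                            because C requires truncation toward zero        */
}

unsigned half_unsigned(unsigned x)
{
    return x / 2;        /* usually a single logical shift right - here the
                            shift and the division really are the same thing */
}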

回答by Brady Moritz

It completely depends on target device, language, purpose, etc.

它完全取决于目标设备、语言、用途等。

Pixel crunching in a video card driver? Very likely, yes!

显卡驱动里的像素处理?很有可能,值得!

.NET business application for your department? Absolutely no reason to even look into it.

你部门用的 .NET 业务应用?完全没有必要去考虑这个。

For a high performance game for a mobile device it might be worth looking into, but only after easier optimizations have been performed.

对于移动设备的高性能游戏,它可能值得研究,但只有在执行更简单的优化之后。

回答by Kromster

Don't do it unless you absolutely need to and your code intent requires shifting rather than multiplication/division.

除非您绝对需要并且您的代码意图需要移位而不是乘法/除法,否则不要这样做。

In a typical day-to-day job you could potentially save a few machine cycles (or lose some, since the compiler knows better what to optimize), but the cost isn't worth it - you spend time on minor details rather than actual work, maintaining the code becomes harder, and your co-workers will curse you.

在日常工作中,你也许能省下几个机器周期(也可能反而变慢,因为编译器更清楚该优化什么),但这个代价不值得:你把时间花在细枝末节而不是真正的工作上,代码变得更难维护,你的同事也会诅咒你。

You might need to do it for high-load computations, where each saved cycle means minutes of runtime. But, you should optimize one place at a time and do performance tests each time to see if you really made it faster or broke compilers logic.

您可能需要为高负载计算执行此操作,其中每个保存的周期都意味着几分钟的运行时间。但是,你应该一次优化一个地方,每次都做性能测试,看看你是否真的让它更快或破坏了编译器逻辑。