C++: Re-implement modulo using bit shifts?

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/11076216/

Date: 2020-08-27 14:49:14  Source: igfitidea

Tags: c++, optimization, bit-manipulation, modulo, bit-shift

Asked by PgrAm

I'm writing some code for a very limited system where the mod operator is very slow. In my code a modulo needs to be used about 180 times per second, and I figured that removing it as much as possible would significantly increase the speed of my code; as of now, one cycle of my main loop does not run in 1/60 of a second as it should. I was wondering if it is possible to re-implement the modulo using only bit shifts, as is possible with multiplication and division. Here is my code so far in C++ (if I can perform a modulo using assembly, even better). How can I remove the modulo without using division or multiplication?

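    while (input > 0)
    {
        out = (out << 3) + (out << 1);   // out *= 10, computed as 8*out + 2*out
        out += input % 10;               // the slow modulo that should be replaced

        // Presumably meant to shrink input toward input / 10;
        // as written this is roughly input * 0.504, not a division by 10.
        input = (input >> 8) + (input >> 1);
    }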

EDIT: Actually, I realized that I need to do it far more than 180 times per second, since the value of input can be a very large number of up to 40 digits.

Answered by zxcdw

What you can do with simple bitwise operations is take the modulo of a value (the dividend) by a power-of-two divisor, by AND'ing the value with divisor-1. A few examples:

unsigned int val = 123; // initial value
unsigned int rem;

rem = val & 0x3; // remainder after value is divided by 4. 
                 // Equivalent to 'val % 4'
rem = val % 5;   // remainder after value is divided by 5.
                 // Because 5 isn't power of two, we can't simply AND it with 5-1(=4). 

Why does it work? Consider the bit pattern of the value 123, which is 01111011, and the divisor 4, which has the bit pattern 00000100. As we know by now, the divisor has to be a power of two (as 4 is), and we need to decrement it by one (from 4 to 3 in decimal), which yields the bit pattern 00000011. After we bitwise-AND the original 123 with 3, the resulting bit pattern is 00000011, which is 3 in decimal. The reason we need a power-of-two divisor is that once we decrement it by one, all the less significant bits are set to 1 and the rest are 0. The bitwise AND then 'cancels out' the more significant bits of the original value and leaves us with just the remainder of the original value divided by the divisor.
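
As a minimal sketch of the masking trick described above (the helper name `mod_pow2` is made up for illustration and is not part of the original answer), assuming unsigned 32-bit values and a divisor that really is a nonzero power of two:

```cpp
#include <cstdint>

// Hypothetical helper: remainder of val divided by a power-of-two divisor.
// Assumption: divisor is a nonzero power of two (1, 2, 4, 8, ...).
// For any other divisor the result is NOT val % divisor.
constexpr std::uint32_t mod_pow2(std::uint32_t val, std::uint32_t divisor) {
    // divisor - 1 has all bits below the divisor's single set bit turned on,
    // so the AND keeps exactly the bits that make up the remainder.
    return val & (divisor - 1u);
}
```

For example, `mod_pow2(123, 4)` masks 01111011 with 00000011 and yields 3, the same as `123 % 4`.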

However, applying something this specific to arbitrary divisors is not going to work unless you know your divisors beforehand (at compile time, and even then it requires divisor-specific code paths). Resolving it at run time is not feasible, especially in your case where performance matters.

Also, there's a previous question related to the subject which probably has interesting information on the matter from different points of view.

Answered by Voo

Actually, division by constants is a well-known compiler optimization, and in fact gcc is already doing it.

This simple code snippet:

int mod(int val) {
   return val % 10;
}

Generates the following code on my rather old gcc with -O3:

_mod:
        push    ebp
        mov     edx, 1717986919
        mov     ebp, esp
        mov     ecx, DWORD PTR [ebp+8]
        pop     ebp
        mov     eax, ecx
        imul    edx
        mov     eax, ecx
        sar     eax, 31
        sar     edx, 2
        sub     edx, eax
        lea     eax, [edx+edx*4]
        mov     edx, ecx
        add     eax, eax
        sub     edx, eax
        mov     eax, edx
        ret

If you disregard the function prologue/epilogue, this is basically two muls (indeed, on x86 we're lucky and can use lea for one) plus some shifts and adds/subs. I know that I already explained the theory behind this optimization somewhere, so I'll see if I can find that post before explaining it yet again.
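
For readers who prefer C++ to assembly, the listing above can be transcribed roughly as follows. This is only a sketch: the function name `mod10_magic` is made up for illustration, it assumes a 32-bit `int` as in the listing, and it assumes that `>>` on a negative signed value is an arithmetic shift (guaranteed since C++20, and the usual behaviour of compilers before that).

```cpp
#include <cstdint>

// Hypothetical transcription of the gcc output for "val % 10".
// Assumption: >> on negative signed integers is an arithmetic shift.
std::int32_t mod10_magic(std::int32_t val) {
    // imul edx: 64-bit product of val and the magic constant 1717986919
    const std::int64_t product = static_cast<std::int64_t>(val) * 1717986919;
    // edx holds the high 32 bits of the product
    std::int32_t quotient = static_cast<std::int32_t>(product >> 32);
    quotient >>= 2;            // sar edx, 2
    quotient -= (val >> 31);   // sub edx, eax: adds 1 when val is negative
    // lea/add/sub: remainder = val - quotient * 10
    return val - quotient * 10;
}
```

The only multiplication that cannot be replaced by shifts and adds here is the one by the magic constant, which is why the answer's point about needing an efficient high-half multiply matters.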

Now, on modern CPUs that's certainly faster than accessing memory (even if you hit the cache), but whether it's faster on your obviously somewhat more ancient CPU is a question that can only be answered by benchmarking (also make sure your compiler is doing that optimization; otherwise you can always just "steal" the gcc version here ;) ). This is especially true because the approach depends on an efficient mulhs (i.e. an instruction that yields the higher bits of a multiplication). Note that this code is not size independent: to be exact, the magic number changes (and maybe also parts of the adds/shifts), but that can be adapted.

Answered by Charlie Martin

Doing modulo 10 with bit shifts is going to be hard and ugly, since bit shifts are inherently binary (on any machine you're going to be running on today). If you think about it, bit shifts are simply multiply or divide by 2.

But there's an obvious space-time trade you could make here: set up a table of values for out and out % 10 and look it up. Then the line becomes

  out += tab[out]

and with any luck at all, that will turn out to be one 16-bit add and a store operation.
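
The answer does not say how the table is built or how large it is, so the following is only one possible reading: a 256-entry table of remainders modulo 10, filled once without using `%`. The name `make_mod10_table` and the table size are assumptions made for illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: tab[i] holds i % 10 for i in 0..255.
// The table is filled with a wrapping counter so that building it
// does not itself need the slow modulo operator.
std::array<std::uint8_t, 256> make_mod10_table() {
    std::array<std::uint8_t, 256> tab{};
    std::uint8_t remainder = 0;
    for (std::size_t i = 0; i < tab.size(); ++i) {
        tab[i] = remainder;
        // count 0, 1, ..., 9, 0, 1, ... instead of computing i % 10
        remainder = (remainder == 9) ? std::uint8_t{0}
                                     : static_cast<std::uint8_t>(remainder + 1);
    }
    return tab;
}
```

A table like this only covers values that fit its index range, so a 40-digit input would still have to be reduced to a small value first.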

Answered by Rafał Rawicki

If you want to do modulo 10 and shifts, maybe you can adapt the double dabble algorithm to your needs?

This algorithm is used to convert binary numbers to decimal without using modulo or division.
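
A minimal sketch of double dabble for a 16-bit input is shown here, assuming the result is wanted as five packed BCD digits in a 32-bit word. The function name `double_dabble16` and the packed output format are choices made for this illustration, not something the answer specifies.

```cpp
#include <cstdint>

// Hypothetical helper: converts a 16-bit binary value to packed BCD
// (one decimal digit per nybble) using only shifts, compares and adds.
std::uint32_t double_dabble16(std::uint16_t value) {
    std::uint32_t bcd = 0;
    for (int bit = 15; bit >= 0; --bit) {
        // "Dabble": add 3 to every BCD digit that is 5 or more, so that
        // the following doubling carries correctly into the next digit.
        for (int digit = 0; digit < 5; ++digit) {
            const std::uint32_t d = (bcd >> (4 * digit)) & 0xFu;
            if (d >= 5u) {
                bcd += 3u << (4 * digit);
            }
        }
        // "Double": shift left and bring in the next input bit.
        bcd = (bcd << 1) | ((static_cast<std::uint32_t>(value) >> bit) & 1u);
    }
    return bcd;
}
```

With this layout the lowest nybble of the result is the last decimal digit, so `double_dabble16(x) & 0xF` plays the role of `x % 10`.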

Answered by Potatoswatter

Every positive power of 16 ends in 6. If you represent the number as a sum of powers of 16 (i.e. break it into nybbles), then each term contributes to the last digit in the same way, except the one's place.

0x481A % 10 = ( 0x4 * 6 + 0x8 * 6 + 0x1 * 6 + 0xA ) % 10

Note that 6 = 5 + 1, and the 5's will cancel out if there are an even number of them. So just sum the nybbles (except the last one) and add 5 if the result is odd.

0x481A % 10 = ( 0x4 + 0x8 + 0x1 /* sum = 13 */
                + 5 /* so add 5 */ + 0xA /* and the one's place */ ) % 10
            = 28 % 10

This reduces the 16-bit, 4-nybble modulo to a number of at most 0xF * 4 + 5 = 65. In binary, that is annoyingly still two nybbles (65 = 0x41), so you would need to repeat the algorithm (although one of them doesn't really count).
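
The reduction above can be sketched in C++ as follows, assuming a 16-bit unsigned input and repeating the step until a single nybble is left. The function name `mod10_nybbles` and the final conditional subtraction are additions for this illustration.

```cpp
#include <cstdint>

// Hypothetical helper: x % 10 using only shifts, masks, adds and compares.
std::uint16_t mod10_nybbles(std::uint16_t x) {
    unsigned v = x;
    while (v > 0xFu) {                    // more than one nybble left
        const unsigned ones = v & 0xFu;   // the one's-place nybble
        unsigned sum = 0;                 // sum of all higher nybbles
        for (unsigned rest = v >> 4; rest != 0; rest >>= 4) {
            sum += rest & 0xFu;
        }
        // Each higher nybble counts 6 times: 6*sum = 5*sum + sum, and
        // 5*sum is 5 modulo 10 when sum is odd, 0 when it is even.
        v = sum + ((sum & 1u) ? 5u : 0u) + ones;
    }
    // v is now in 0..15; fold 10..15 down to 0..5.
    return static_cast<std::uint16_t>(v >= 10u ? v - 10u : v);
}
```

For 0x481A the first pass gives 13 + 5 + 10 = 28, the second gives 1 + 5 + 12 = 18, the third gives 1 + 5 + 2 = 8, which is the last decimal digit of 18458.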

But the 286 should have reasonably efficient BCD addition that you can use to perform the sum and obtain the result in one pass. (That requires converting each nybble to BCD manually; I don't know enough about the platform to say how to optimize that or whether it's problematic.)
