C语言 尽可能避免使用 mod 运算符更好吗?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/15596318/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Is it better to avoid using the mod operator when possible?
提问by limp_chimp
I assume that calculating the modulus of a number is a somewhat expensive operation, at least compared to simple arithmetic tests (such as seeing if a number exceeds the length of an array). If this is indeed the case, is it more efficient to replace, for example, the following code:
我认为计算一个数字的模数是一个有点昂贵的操作,至少与简单的算术测试相比(例如查看一个数字是否超过数组的长度)。如果确实是这种情况,是否更有效地替换,例如,以下代码:
res = array[(i + 1) % len];
with the following? :
与以下?:
res = array[(i + 1 == len) ? 0 : i + 1];
The first one is easier on the eyes, but I wonder if the second might be more efficient. If so, might I expect an optimizing compiler to replace the first snippet with the second, when a compiled language is used?
第一个看起来更容易,但我想知道第二个是否更有效。如果是这样,当使用编译语言时,我是否希望优化编译器将第一个片段替换为第二个片段?
Of course, this "optimization" (if it is indeed an optimization) doesn't work in all cases (in this case, it only works if i+1is never more than len).
当然,这种“优化”(如果它确实是优化)并非在所有情况下都有效(在这种情况下,它仅在i+1不超过时才有效len)。
回答by NPE
My general advice is as follows. Use whichever version you think is easier on the eye, and then profile your entire system. Only optimize those parts of the code that the profiler flags up as bottlenecks. I'll bet my bottom dollar that the modulo operator isn't going to be among them.
我的一般建议如下。使用您认为更容易理解的任何版本,然后分析您的整个系统。仅优化分析器标记为瓶颈的代码部分。我敢打赌,模运算符不会在其中。
As far as the specific example goes, only benchmarking can tell which is faster on your specific architecture using your specific compiler. You are potentially replacing modulo with branching, and it's anything but obvious which would be faster.
就具体示例而言,只有基准测试才能判断使用特定编译器在特定架构上哪个更快。您可能会用branching替换 modulo ,而且显然哪个会更快。
回答by dpi
Some simple measurement:
一些简单的测量:
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[])
{
int test = atoi(argv[1]);
int divisor = atoi(argv[2]);
int iterations = atoi(argv[3]);
int a = 0;
if (test == 0) {
for (int i = 0; i < iterations; i++)
a = (a + 1) % divisor;
} else if (test == 1) {
for (int i = 0; i < iterations; i++)
a = a + 1 == divisor ? 0 : a + 1;
}
printf("%d\n", a);
}
Compiling with either gcc or clang with -O3, and running time ./a.out 0 42 1000000000(modulo version) or time ./a.out 1 42 1000000000(comparison version) results in
使用 gcc 或 clang 进行编译-O3,并运行time ./a.out 0 42 1000000000(modulo version) 或time ./a.out 1 42 1000000000(comparison version) 结果
- 6.25 secondsuser runtime for the modulo version,
- 1.03 secondsfor the comparison version.
- 模数版本的用户运行时间为6.25 秒,
- 比较版本为1.03 秒。
(using gcc 5.2.1 or clang 3.6.2; Intel Core i5-4690K @ 3.50GHz; 64-bit Linux)
(使用 gcc 5.2.1 或 clang 3.6.2;英特尔酷睿 i5-4690K @ 3.50GHz;64 位 Linux)
This means that it is probably a good idea to use the comparison version.
这意味着使用比较版本可能是个好主意。
回答by Michel Billaud
Well, have a look at 2 ways to get the next value of a "modulo 3" cyclic counter.
好吧,看一下获取“模 3”循环计数器的下一个值的 2 种方法。
int next1(int n) {
return (n + 1) % 3;
}
int next2(int n) {
return n == 2 ? 0 : n + 1;
}
I've compiled it with gcc -O3 option (for the common x64 architecture), and -s to get the assembly code.
我已经用 gcc -O3 选项(对于常见的 x64 架构)和 -s 编译它以获取汇编代码。
The code for the first function does some unexplainable magic (*) to avoid a division, using a multiplication anyway:
第一个函数的代码做了一些无法解释的魔法 (*) 来避免除法,无论如何都使用乘法:
addl , %edi
movl 31655766, %edx
movl %edi, %eax
imull %edx
movl %edi, %eax
sarl , %eax
subl %eax, %edx
leal (%rdx,%rdx,2), %eax
subl %eax, %edi
movl %edi, %eax
ret
And is much longer (and I bet slower) than the second function:
并且比第二个函数长得多(我敢打赌更慢):
leal 1(%rdi), %eax
cmpl , %edi
movl addl , %edi
movl %edi, %edx
sarl , %edx
shrl , %edx
leal (%rdi,%rdx), %eax
andl , %eax
subl %edx, %eax
ret
, %edx
cmove %edx, %eax
ret
So it is not always true that "the (modern) compiler does a better job than you anyway".
因此,“(现代)编译器无论如何都比您做得更好”并不总是正确的。
Interestingly, the same experiment with 4 instead of 3 leads to a and-masking for the first function
有趣的是,同样的实验用 4 而不是 3 导致第一个函数的 and-masking
int next3(int n) {
return (n + 1) & 3;;
}
but it is still, and by large, inferior to the second version.
但总的来说,它仍然不如第二个版本。
Being more explicit about proper ways to do the things
更明确地说明做事的正确方法
leal 1(%rdi), %eax
andl , %eax
ret
yields much better results :
产生更好的结果:
#include <stdio.h>
#include <stdlib.h>
#define modulo
int main()
{
int iterations = 1000000000;
int size = 16;
int a[size];
unsigned long long res = 0;
int i, j;
for (i=0;i<size;i++)
a[i] = i;
for (i=0,j=0;i<iterations;i++)
{
j++;
#ifdef modulo
j %= size;
#else
if (j >= size)
j = 0;
#endif
res += a[j];
}
printf("%llu\n", res);
}
(*) well, not that complicated. Multiplication by reciprocical. Compute the integer constant K = (2^N)/3, for some large enough value of N. Now, when you want the value of X/3, instead of a division by 3, compute X*K, and shift it N positions to the right.
(*) 好吧,没那么复杂。乘以倒数。计算整数常数 K = (2^N)/3,对于一些足够大的 N 值。 现在,当你想要 X/3 的值时,而不是除以 3,计算 X*K,并将其移位 N位置在右边。
回答by J. Doe
If 'len' in your code is big enough, then the conditional will be faster, as the branch predictors will nearly always guess correctly.
如果代码中的“len”足够大,那么条件会更快,因为分支预测器几乎总是能正确猜测。
If not, then I believe this is closely connected to circular queues, where it is often the case that the length is a power of 2. This will enable the compiler to substitute modulo with a simple AND.
如果不是,那么我相信这与循环队列密切相关,在这种情况下,长度通常是 2 的幂。这将使编译器能够用简单的 AND 代替模数。
The code is the following:
代码如下:
##代码##size=15:
大小=15:
- modulo: 4,868s
- cond: 1,291s
- 模数:4,868s
- 条件:1,291 秒
size=16:
大小=16:
- modulo: 1,067s
- cond: 1,599s
- 模数:1,067s
- 条件:1,599 秒
Compiled in gcc 7.3.0 , with -O3 optimization. The machine is an i7 920.
在 gcc 7.3.0 中编译,使用 -O3 优化。机器是i7 920。
回答by seand
Modulo can be done with a single processor instruction on most architectures (ex. DIV on x86). However it's likely a premature optimization for what you need.
在大多数体系结构(例如 x86 上的 DIV)上,模数可以通过单个处理器指令完成。但是,这可能是您需要的过早优化。

