Java 为什么 if (variable1 % variable2 == 0) 效率低下?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/54405842/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverFlow
Why is if (variable1 % variable2 == 0) inefficient?
提问 by Robert Cotterman
I am new to Java, and was running some code last night, and this really bothered me. I was building a simple program to display every X outputs in a for loop, and I noticed a MASSIVE decrease in performance when I used modulus as variable % variable vs variable % 5000 or whatnot. Can someone explain to me why this is and what is causing it? So I can be better...
我是 Java 新手，昨晚运行了一些代码，有个现象让我很困扰。我在写一个简单的程序，在 for 循环中每隔 X 次输出一次，结果发现把取模写成 variable % variable 而不是 variable % 5000 之类的形式时，性能大幅下降。有人能解释一下为什么会这样、原因是什么吗？这样我才能进步……
Here is the "efficient" code (sorry if I get a bit of syntax wrong I'm not on the computer with the code right now)
这是“高效”的代码(对不起,如果我的语法有点错误,我现在不在带有代码的计算机上)
long startNum = 0;
long stopNum = 1000000000L;

for (long i = startNum; i <= stopNum; i++) {
    if (i % 50000 == 0) {
        System.out.println(i);
    }
}
Here is the "inefficient code"
这是“低效代码”
long startNum = 0;
long stopNum = 1000000000L;
long progressCheck = 50000;

for (long i = startNum; i <= stopNum; i++) {
    if (i % progressCheck == 0) {
        System.out.println(i);
    }
}
Mind you I had a date variable to measure the differences, and once it became long enough, the first one took 50ms while the other took 12 seconds or something like that. You may have to increase the stopNum or decrease the progressCheck if your PC is more efficient than mine or what not.
请注意，我用一个日期变量来测量时间差，一旦循环跑得足够久，第一个只需要 50 毫秒，而另一个需要 12 秒左右。如果您的 PC 比我的更快，您可能需要增大 stopNum 或减小 progressCheck。
I looked for this question around the web, but I can't find an answer, maybe I'm just not asking it right.
我在网上寻找这个问题,但找不到答案,也许我只是问得不对。
EDIT: I did not expect my question to be so popular, I appreciate all the answers. I did perform a benchmark on each half in time taken, and the inefficient code took considerably longer, 1/4 of a second vs 10 seconds give or take. Granted they are using println, but they are both doing the same amount, so I wouldn't imagine that would skew it much, especially since the discrepancy is repeatable. As for the answers, since I am new to Java, I will let votes decide for now which answer is best. I will try to pick one by Wednesday.
编辑：我没想到这个问题会这么受欢迎，感谢所有的回答。我确实分别对两段代码的耗时做了基准测试，低效代码耗时要长得多，大约是 1/4 秒对 10 秒。诚然它们都调用了 println，但两者调用的次数相同，所以我认为这不会造成太大偏差，尤其是这种差异是可重复的。至于答案，由于我是 Java 新手，我暂时让投票来决定哪个答案最好。我会尽量在星期三之前选一个。
EDIT2: I am going to make another test tonight where, instead of modulus, it just increments a variable, and when it reaches progressCheck it will perform one check and then reset that variable to 0, as a 3rd option.
EDIT2：今晚我将进行另一个测试：不用取模，而是只递增一个变量，当它达到 progressCheck 时执行一次检查，然后把该变量重置为 0，作为第三种方案。
EDIT3.5:
EDIT3.5:
I used this code, and below I'll show my results. Thank you ALL for the wonderful help! I also tried comparing the short value of the long to 0, so all my new checks happen every "65536" times, making it equal in repeats.
我使用了下面这段代码，随后给出我的结果。谢谢大家的大力帮助！我还尝试把 long 的 short 值与 0 比较，这样我所有新的检查都是每 "65536" 次发生一次，使重复次数保持一致。
public class Main {
    public static void main(String[] args) {
        long startNum = 0;
        long stopNum = 1000000000L;
        long progressCheck = 65536;
        final long finalProgressCheck = 50000;
        long date;

        // using a fixed value
        date = System.currentTimeMillis();
        for (long i = startNum; i <= stopNum; i++) {
            if (i % 65536 == 0) {
                System.out.println(i);
            }
        }
        long final1 = System.currentTimeMillis() - date;
        date = System.currentTimeMillis();

        // using a variable
        for (long i = startNum; i <= stopNum; i++) {
            if (i % progressCheck == 0) {
                System.out.println(i);
            }
        }
        long final2 = System.currentTimeMillis() - date;
        date = System.currentTimeMillis();

        // using a final declared variable
        for (long i = startNum; i <= stopNum; i++) {
            if (i % finalProgressCheck == 0) {
                System.out.println(i);
            }
        }
        long final3 = System.currentTimeMillis() - date;
        date = System.currentTimeMillis();

        // using increments to determine progressCheck
        int increment = 0;
        for (long i = startNum; i <= stopNum; i++) {
            if (increment == 65536) {
                System.out.println(i);
                increment = 0;
            }
            increment++;
        }
        long final4 = System.currentTimeMillis() - date;
        date = System.currentTimeMillis();

        // using a short conversion
        for (long i = startNum; i <= stopNum; i++) {
            if ((short) i == 0) {
                System.out.println(i);
            }
        }
        long final5 = System.currentTimeMillis() - date;

        System.out.println(
                "\nfixed = " + final1 + " ms "
                + "\nvariable = " + final2 + " ms "
                + "\nfinal variable = " + final3 + " ms "
                + "\nincrement = " + final4 + " ms"
                + "\nShort Conversion = " + final5 + " ms");
    }
}
Results:
结果:
- fixed = 874 ms (normally around 1000ms, but faster due to it being a power of 2)
- variable = 8590 ms
- final variable = 1944 ms (Was ~1000ms when using 50000)
- increment = 1904 ms
- Short Conversion = 679 ms
- 固定 = 874 毫秒(通常约为 1000 毫秒,但由于它是 2 的幂而更快)
- 变量 = 8590 毫秒
- 最终变量 = 1944 毫秒(使用 50000 时约为 1000 毫秒)
- 增量 = 1904 毫秒
- 短转换 = 679 毫秒
Not surprisingly, due to the lack of division, the Short Conversion was 23% faster than the "fast" way. This is interesting to note. If you need to show or compare something every 256 times (or thereabouts), you can do this and use
不足为奇，由于没有除法，短整型转换比"快速"方式还快 23%。这一点很值得注意。如果您需要每 256 次（或大约如此）显示或比较一次，就可以这样做，使用
if ((byte)integer == 0) {'Perform progress check code here'}
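(A minimal sketch of that cast-based check in loop form, added here for illustration and not part of the original post: for non-negative i, (byte) i == 0 exactly when i is a multiple of 256, so the branch fires every 256 iterations with no division at all.)
（下面是把这种强制转换检查放进循环里的最小示意，仅作说明，并非原帖内容：对非负的 i，(byte) i == 0 当且仅当 i 是 256 的倍数，因此这个分支每 256 次迭代触发一次，完全不需要除法。）
long startNum = 0;
long stopNum = 1000000000L;

for (long i = startNum; i <= stopNum; i++) {
    // (byte) i keeps only the low 8 bits, so for non-negative i this is
    // equivalent to i % 256 == 0, without any division.
    if ((byte) i == 0) {
        System.out.println(i);
    }
}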
ONE FINAL INTERESTING NOTE: using modulus on the "final declared variable" with 65536 (not a pretty number) was half the speed of (slower than) the fixed value, where before it was benchmarking at nearly the same speed.
最后一个有趣的发现：对"final 声明的变量"用 65536（不是一个漂亮的数字）取模时，速度只有固定值的一半（更慢），而之前两者的基准速度几乎相同。
采纳答案 by apangin
You are measuring the OSR (on-stack replacement) stub.
您正在测量的是 OSR（on-stack replacement，栈上替换）存根。
An OSR stub is a special version of a compiled method, intended specifically for transferring execution from interpreted mode to compiled code while the method is running.
OSR 存根是编译后方法的一个特殊版本，专门用于在方法运行期间把执行从解释模式转移到编译后的代码。
OSR stubs are not as optimized as regular methods, because they need a frame layout compatible with the interpreted frame. I showed this already in the following answers: 1, 2, 3.
OSR 存根不像常规方法那样优化,因为它们需要与解释帧兼容的帧布局。我已经在以下答案中展示了这一点:1, 2, 3。
A similar thing happens here, too. While the "inefficient code" is running a long loop, the method is compiled specially for the on-stack replacement right inside the loop. The state is transferred from the interpreted frame to the OSR-compiled method, and this state includes the progressCheck local variable. At this point the JIT cannot replace the variable with the constant, and thus cannot apply certain optimizations like strength reduction.
类似的事情也发生在这里。当"低效代码"运行一个长循环时，该方法被专门编译成可在循环内部进行栈上替换的版本。状态从解释执行的栈帧转移到 OSR 编译的方法中，这个状态包括局部变量 progressCheck。此时 JIT 无法把这个变量替换为常量，因此无法应用强度削减（strength reduction）之类的优化。
In particular this means the JIT does not replace integer division with multiplication. (See Why does GCC use multiplication by a strange number in implementing integer division? for the asm trick from an ahead-of-time compiler, when the value is a compile-time constant after inlining / constant-propagation, if those optimizations are enabled. An integer literal right in the % expression also gets optimized by gcc -O0, similar to here where it's optimized by the JITer even in an OSR stub.)
特别是，这意味着 JIT 不会把整数除法替换为乘法。（关于提前编译器在值经过内联/常量传播后成为编译期常量时所用的汇编技巧，请参阅"为什么 GCC 在实现整数除法时用一个奇怪的数做乘法？"，前提是启用了这些优化。直接写在 % 表达式里的整数字面量即使在 gcc -O0 下也会被优化，类似于这里即使在 OSR 存根中也会被 JIT 优化的情况。）
However, if you run the same method several times, the second and the subsequent runs will execute the regular (non-OSR) code, which is fully optimized. Here is a benchmark to prove the theory (benchmarked using JMH):
但是,如果多次运行相同的方法,第二次和后续运行将执行完全优化的常规(非 OSR)代码。这是证明理论的基准(使用 JMH 进行基准测试):
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Benchmark)
public class Div {

    @Benchmark
    public void divConst(Blackhole blackhole) {
        long startNum = 0;
        long stopNum = 100000000L;

        for (long i = startNum; i <= stopNum; i++) {
            if (i % 50000 == 0) {
                blackhole.consume(i);
            }
        }
    }

    @Benchmark
    public void divVar(Blackhole blackhole) {
        long startNum = 0;
        long stopNum = 100000000L;
        long progressCheck = 50000;

        for (long i = startNum; i <= stopNum; i++) {
            if (i % progressCheck == 0) {
                blackhole.consume(i);
            }
        }
    }
}
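(For reproduction, a minimal launcher sketch (my addition, not part of the original answer), assuming the standard JMH Runner API and the settings visible in the output below: one fork, three warmup and five measurement iterations. Running it through the usual JMH Maven archetype works just as well.)
（为便于复现，下面是一个最小的启动器示意代码（编者补充，并非原回答内容），假定使用标准的 JMH Runner API，并采用下方输出中可见的设置：1 个 fork、3 次预热、5 次测量迭代。通过常规的 JMH Maven archetype 运行也完全可以。）
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class BenchRunner {
    public static void main(String[] args) throws Exception {
        // Match the run shown below: 1 fork, 3 warmup and 5 measurement
        // iterations of the Div benchmarks.
        Options opt = new OptionsBuilder()
                .include(Div.class.getSimpleName())
                .forks(1)
                .warmupIterations(3)
                .measurementIterations(5)
                .build();
        new Runner(opt).run();
    }
}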
And the results:
结果:
# Benchmark: bench.Div.divConst
# Run progress: 0,00% complete, ETA 00:00:16
# Fork: 1 of 1
# Warmup Iteration 1: 126,967 ms/op
# Warmup Iteration 2: 105,660 ms/op
# Warmup Iteration 3: 106,205 ms/op
Iteration 1: 105,620 ms/op
Iteration 2: 105,789 ms/op
Iteration 3: 105,915 ms/op
Iteration 4: 105,629 ms/op
Iteration 5: 105,632 ms/op
# Benchmark: bench.Div.divVar
# Run progress: 50,00% complete, ETA 00:00:09
# Fork: 1 of 1
# Warmup Iteration 1: 844,708 ms/op <-- much slower!
# Warmup Iteration 2: 105,893 ms/op <-- as fast as divConst
# Warmup Iteration 3: 105,601 ms/op
Iteration 1: 105,570 ms/op
Iteration 2: 105,475 ms/op
Iteration 3: 105,702 ms/op
Iteration 4: 105,535 ms/op
Iteration 5: 105,766 ms/op
The very first iteration of divVar is indeed much slower, because of the inefficiently compiled OSR stub. But as soon as the method reruns from the beginning, the new unconstrained version is executed, which leverages all the available compiler optimizations.
divVar 的第一次迭代确实慢得多，因为执行的是低效编译的 OSR 存根。但是一旦该方法从头重新运行，就会执行新的不受约束的版本，它可以利用所有可用的编译器优化。
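(To see the same effect without JMH, here is a minimal sketch (my addition, not part of the original answer), assuming default HotSpot JIT behavior; exact timings will vary by machine. The loop is extracted into a method so that later calls can use the regular, fully optimized compilation instead of the OSR stub.)
（如果不想用 JMH，也可以用下面这个最小示意观察同样的效果（编者补充，并非原回答内容），假定 HotSpot JIT 采用默认行为；具体耗时因机器而异。把循环提取到一个方法中，这样后续调用就能使用常规的、完全优化的编译版本，而不是 OSR 存根。）
public class OsrDemo {
    // The same loop as in the question, extracted into a method so that a
    // complete (non-OSR) compilation can be used on later calls.
    static long divVar(long stopNum, long progressCheck) {
        long hits = 0;
        for (long i = 0; i <= stopNum; i++) {
            if (i % progressCheck == 0) {
                hits++; // stand-in for System.out.println(i)
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        long sink = 0;
        for (int run = 1; run <= 3; run++) {
            long start = System.currentTimeMillis();
            sink += divVar(1_000_000_000L, 50_000);
            System.out.println("run " + run + ": "
                    + (System.currentTimeMillis() - start) + " ms");
        }
        // Keep the result alive so the loop is not optimized away entirely.
        System.out.println("(checksum: " + sink + ")");
        // Expected pattern on a typical HotSpot JVM (numbers vary): the first
        // run is much slower (OSR stub), later runs use the fully optimized
        // compilation of divVar.
    }
}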
回答 by Bishal Dubey
I was also surprised to see the performance of the above code. It all comes down to the time the compiler takes to execute the program, depending on the declared variable. In the second (inefficient) example:
看到上述代码的性能，我也感到惊讶。这完全取决于编译器根据所声明的变量执行程序所花费的时间。在第二个（低效的）示例中：
for (long i = startNum; i <= stopNum; i++) {
    if (i % progressCheck == 0) {
        System.out.println(i);
    }
}
You are performing the modulus operation between two variables. Here, the compiler has to check the values of stopNum and progressCheck and go to the specific memory blocks allocated for these variables after every iteration, because they are variables and their values might change.
您是在两个变量之间执行取模运算。这里，编译器在每次迭代之后都必须检查 stopNum 和 progressCheck 的值，访问为这些变量分配的特定内存块，因为它们是变量，其值可能会改变。
That's why after each iteration the compiler went to the memory location to check the latest value of the variables, and therefore at compile time the compiler was not able to create efficient bytecode.
这就是为什么每次迭代后编译器都要到内存位置检查变量的最新值，因此在编译时编译器无法生成高效的字节码。
In the first code example, you are performing the modulus operation between a variable and a constant numeric value which is not going to change during execution, so the compiler does not need to check the value of that number in a memory location. That's why the compiler was able to create efficient bytecode. If you declare progressCheck as a final or as a final static variable, then at compile/run time the compiler knows that it is a final variable and its value is not going to change, so it replaces progressCheck with 50000 in the code:
在第一个代码示例中，您是对一个变量和一个常量数值执行取模运算，这个数值在执行期间不会改变，编译器不需要到内存位置去检查它的值，因此编译器能够生成高效的字节码。如果把 progressCheck 声明为 final 或 final static 变量，那么在编译/运行时编译器就知道它是一个 final 变量、值不会改变，于是会在代码中把 progressCheck 替换为 50000：
for (long i = startNum; i <= stopNum; i++) {
    if (i % 50000 == 0) {
        System.out.println(i);
    }
}
Now you can see that this code looks just like the first (efficient) code example. As mentioned above, both versions will then work efficiently, and there will not be much difference in the execution time of either code example.
现在可以看到，这段代码和第一个（高效的）代码示例是一样的。如上所述，这时两段代码都会高效运行，两个代码示例的执行时间不会有太大差异。
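(A minimal sketch of the final-variable variant described above, added for illustration and not part of the original answer. The key point is that a final local variable initialized with a constant expression is a constant variable, so javac can fold the literal 50000 into the bytecode, assuming nothing else prevents that.)
（下面是上文所述 final 变量写法的最小示意，仅作说明，并非原回答内容。关键在于：用常量表达式初始化的 final 局部变量是"常量变量"，因此 javac 可以把字面量 50000 直接折叠进字节码，前提是没有其他因素阻止这一点。）
long startNum = 0;
long stopNum = 1000000000L;
// A final local initialized with a constant expression is a "constant
// variable", so its uses below compile as the literal 50000.
final long progressCheck = 50000;

for (long i = startNum; i <= stopNum; i++) {
    if (i % progressCheck == 0) {   // effectively i % 50000
        System.out.println(i);
    }
}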
回答 by Oleksandr Pyrohov
In follow-up to @phuclv's comment, I checked the code generated by the JIT 1, and the results are as follows:
作为对 @phuclv 评论的跟进，我查看了 JIT 生成的代码 1，结果如下：
for variable % 50000 (division by a constant):
对于 variable % 50000（除以常量）：
mov rax,29f16b11c6d1e109h
imul rbx
mov r10,rbx
sar r10,3fh
sar rdx,0dh
sub rdx,r10
imul r10,rdx,0c350h ; <-- imul
mov r11,rbx
sub r11,r10
test r11,r11
jne 1d707ad14a0h
for variable % variable:
对于 variable % variable：
mov rax,r14
mov rdx,8000000000000000h
cmp rax,rdx
jne 22ccce218edh
xor edx,edx
cmp rbx,0ffffffffffffffffh
je 22ccce218f2h
cqo
idiv rax,rbx ; <-- idiv
test rdx,rdx
jne 22ccce218c0h
Because division always takes longer than multiplication, the last code snippet is less performant.
因为除法总是比乘法花费更长的时间,所以最后一个代码片段的性能较低。
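(To make the first listing a bit more concrete, a small verification sketch, my addition rather than part of the original answer: the constant 29f16b11c6d1e109h appears to be ceil(2^77 / 50000), so the quotient is obtained with a 128-bit multiply plus shifts (the imul/sar sequence), and the remainder as i - 50000 * quotient, which is what the imul by 0c350h computes.)
（为了让第一段汇编更直观，下面是一个小的验证示意，由编者补充，并非原回答内容：常量 29f16b11c6d1e109h 看起来就是 ceil(2^77 / 50000)，因此商通过一次 128 位乘法加移位（imul/sar 序列）得到，余数则是 i - 50000 * 商，也就是乘以 0c350h 的那条 imul 所计算的内容。）
import java.math.BigInteger;

public class MagicConstant {
    public static void main(String[] args) {
        BigInteger c = BigInteger.valueOf(50000);
        // ceil(2^77 / 50000)
        BigInteger m = BigInteger.ONE.shiftLeft(77)
                .add(c).subtract(BigInteger.ONE)
                .divide(c);
        System.out.println(m.toString(16)); // prints 29f16b11c6d1e109

        // Reciprocal multiplication: floor(i * m / 2^77) == i / 50000
        // for non-negative long i.
        long i = 1234567891L;
        long q = m.multiply(BigInteger.valueOf(i)).shiftRight(77).longValueExact();
        System.out.println(q == i / 50000); // true
    }
}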
Java version:
Java 版本：
java version "11" 2018-09-25
Java(TM) SE Runtime Environment 18.9 (build 11+28)
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11+28, mixed mode)
1 - VM options used: -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,src/java/Main.main
1 - 使用的 VM 选项: -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,src/java/Main.main
回答 by JimmyB
As others have noted, the general modulus operation requires a division to be done. In some cases, the division can be replaced (by the compiler) by a multiplication. But both can be slow compared to addition/subtraction. Hence, the best performance can be expected by something along these lines:
正如其他人所指出的,一般模数运算需要进行除法运算。在某些情况下,除法可以(由编译器)替换为乘法。但与加法/减法相比,两者都可能很慢。因此,可以通过以下方式获得最佳性能:
long progressCheck = 50000;
long counter = progressCheck;

for (long i = startNum; i <= stopNum; i++) {
    if (--counter == 0) {
        System.out.println(i);
        counter = progressCheck;
    }
}
(As a minor optimization attempt we use a pre-decrement down-counter here, because on many architectures comparing to 0 immediately after an arithmetic operation costs exactly 0 instructions/CPU cycles, since the ALU's flags are already set appropriately by the preceding operation. A decent optimizing compiler will, however, do that optimization automatically even if you write if (counter++ == 50000) { ... counter = 0; }.)
（作为一个小的优化尝试，我们在这里使用先递减的倒数计数器，因为在许多体系结构上，算术运算之后立即与 0 比较恰好花费 0 条指令/CPU 周期，ALU 的标志位已经被前一条运算正确设置。不过，一个像样的优化编译器即使在您写成 if (counter++ == 50000) { ... counter = 0; } 时也会自动做这个优化。）
Notice that often you don't really want/need modulus, because you know that your loop counter (i) or whatever is only ever incremented by 1, and you really don't care about the actual remainder the modulus will give you; you just want to see if the incrementing-by-one counter hits some value.
请注意，通常您并不真正想要/需要取模，因为您知道循环计数器（i）或别的什么每次只递增 1，而且您其实并不关心取模给出的实际余数，只是想看看这个逐一递增的计数器是否达到了某个值。
Another 'trick' is to use power-of-two values/limits, e.g. progressCheck = 1024;. Modulus by a power of two can be quickly calculated via bitwise and, i.e. if ( (i & (1024-1)) == 0 ) {...}. This should be pretty fast too, and may on some architectures outperform the explicit counter above.
另一个"技巧"是使用 2 的幂作为值/上限，例如 progressCheck = 1024;。对 2 的幂取模可以用按位与快速计算，即 if ( (i & (1024-1)) == 0 ) {...}。这也应该相当快，而且在某些体系结构上可能优于上面的显式 counter 方案。
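(A minimal sketch of that power-of-two variant, added for illustration and not part of the original answer; it assumes progressCheck is a power of two and i is non-negative, in which case (i & (progressCheck - 1)) == 0 is equivalent to i % progressCheck == 0.)
（下面是这种 2 的幂写法的最小示意，仅作说明，并非原回答内容；它假定 progressCheck 是 2 的幂且 i 非负，此时 (i & (progressCheck - 1)) == 0 等价于 i % progressCheck == 0。）
long startNum = 0;
long stopNum = 1000000000L;
long progressCheck = 1024;      // must be a power of two
long mask = progressCheck - 1;  // 0x3FF: the low 10 bits

for (long i = startNum; i <= stopNum; i++) {
    if ((i & mask) == 0) {      // same as i % 1024 == 0, but only a bitwise AND
        System.out.println(i);
    }
}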