Why does a Java loop of 4 billion iterations take only 2 ms?

Question

Asked by twimo

I'm running the following Java code on a laptop with 2.7 GHz Intel Core i7. I intended to let it measure how long it takes to finish a loop with 2^32 iterations, which I expected to be roughly 1.48 seconds (4/2.7 = 1.48).

But actually it only takes 2 milliseconds, instead of 1.48 s. I'm wondering if this is a result of any JVM optimization underneath?

public static void main(String[] args)
{
    long start = System.nanoTime();

    for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE; i++){
    }
    long finish = System.nanoTime();
    long d = (finish - start) / 1000000;

    System.out.println("Used " + d);
}
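
The estimate in the question can be checked with a quick calculation. The sketch below assumes one loop iteration per clock cycle, which is the simplification the question makes; the class and method names are made up for illustration. The loop runs 2^32 - 1 times, so the estimate comes out slightly higher than the rounded 4/2.7.

```java
public class LoopEstimate {
    // Number of iterations of:
    // for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE; i++)
    static long iterations() {
        return (long) Integer.MAX_VALUE - (long) Integer.MIN_VALUE; // 2^32 - 1
    }

    // Seconds needed if every iteration cost exactly one cycle (an assumption).
    static double estimatedSeconds(double clockHz) {
        return iterations() / clockHz;
    }

    public static void main(String[] args) {
        System.out.println("Iterations: " + iterations());
        System.out.printf("Estimated seconds at 2.7 GHz: %.2f%n", estimatedSeconds(2.7e9));
    }
}
```

Either way, the expected time is on the order of seconds, roughly a thousand times the 2 ms that was measured.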

Answer 1

Answered by van dench

One of two things is going on here:

The compiler realized that the loop is redundant and doing nothing so it optimized it away.
The JIT (just-in-time compiler) realized that the loop is redundant and doing nothing, so it optimized it away.

Modern compilers are very intelligent; they can see when code is useless. Try putting an empty loop into GodBolt and look at the output, then turn on -O2 optimizations; you will see that the output is something along the lines of

main():
    xor eax, eax
    ret

To clarify: in Java, most of the optimizations are done by the JIT. In some other languages (like C/C++), most of the optimizations are done by the first (ahead-of-time) compiler.

Answer 2

Answered by Akavall

It looks like it was optimized away by the JIT compiler. When I turn it off (-Djava.compiler=NONE), the code runs much slower:

$ javac MyClass.java
$ java MyClass
Used 4
$ java -Djava.compiler=NONE MyClass
Used 40409

I put OP's code inside of class MyClass.
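
For reference, a file like MyClass.java might look as follows. This is only a sketch: the class name MyClass comes from this answer, while the helper method timeEmptyLoop and its parameters are made up here so that the loop bounds can be varied. Pulling the loop into a method may change when the JIT compiles it, so timings need not match the numbers above exactly.

```java
public class MyClass {
    // Times an empty int loop over [from, to) and returns the elapsed milliseconds.
    // (Hypothetical helper; the question has this code inline in main.)
    static long timeEmptyLoop(int from, int to) {
        long start = System.nanoTime();
        for (int i = from; i < to; i++) {
        }
        long finish = System.nanoTime();
        return (finish - start) / 1_000_000; // nanoseconds -> milliseconds
    }

    public static void main(String[] args) {
        long d = timeEmptyLoop(Integer.MIN_VALUE, Integer.MAX_VALUE);
        System.out.println("Used " + d);
    }
}
```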

Answer 3

Answered by Eugene

I will just state the obvious: this is a JVM optimization, and the loop is simply removed altogether. Here is a small test that shows what a huge difference the JIT makes when it is fully enabled, enabled only for the C1 compiler, and disabled entirely.

Disclaimer: don't write tests like this - this is just to prove that the actual loop "removal" happens in the C2 Compiler:

@Benchmark
@Fork(1)
public void full() {
    long result = 0;
    for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE; i++) {
        ++result;
    }
}

@Benchmark
@Fork(1)
public void minusOne() {
    long result = 0;
    for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE - 1; i++) {
        ++result;
    }
}

@Benchmark
@Fork(value = 1, jvmArgsAppend = { "-XX:TieredStopAtLevel=1" })
public void withoutC2() {
    long result = 0;
    for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE - 1; i++) {
        ++result;
    }
}

@Benchmark
@Fork(value = 1, jvmArgsAppend = { "-Xint" })
public void withoutAll() {
    long result = 0;
    for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE - 1; i++) {
        ++result;
    }
}

The results show that, depending on which part of the JIT is enabled, the method gets faster (so much faster that it looks like it is doing "nothing": the loop is removed, which seems to happen in the C2 compiler, the highest tier):

 Benchmark        Mode  Cnt      Score   Error  Units
 Loop.full        avgt    2     ≈ 10??          ms/op
 Loop.minusOne    avgt    2     ≈ 10??          ms/op
 Loop.withoutAll  avgt    2  51782.751          ms/op
 Loop.withoutC2   avgt    2   1699.137          ms/op
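
To put the two measurable scores in perspective, they can be converted into a rough cost per iteration. The sketch below is only arithmetic on the numbers in the table; the class and method names are made up, and the iteration count is taken from the loop bounds of withoutC2 and withoutAll.

```java
public class PerIteration {
    // Iterations of: for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE - 1; i++)
    static final long ITERATIONS = (long) Integer.MAX_VALUE - 1 - Integer.MIN_VALUE; // 2^32 - 2

    // Converts a score in ms/op into nanoseconds per loop iteration.
    static double nanosPerIteration(double millisPerOp) {
        return millisPerOp * 1_000_000.0 / ITERATIONS;
    }

    public static void main(String[] args) {
        System.out.printf("withoutC2  (C1 only):     %.3f ns/iteration%n", nanosPerIteration(1699.137));
        System.out.printf("withoutAll (interpreter): %.3f ns/iteration%n", nanosPerIteration(51782.751));
    }
}
```

With C1 only, an iteration costs a fraction of a nanosecond, so the loop is really being executed; the interpreter needs on the order of ten nanoseconds per iteration.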

Answer 4

Answered by Oleksandr Pyrohov

As already pointed out, the JIT (just-in-time) compiler can optimize an empty loop in order to remove unnecessary iterations. But how?

Actually, there are two JIT compilers: C1 and C2. First, the code is compiled with C1. C1 collects profiling statistics and helps the JVM discover that in 100% of cases our empty loop doesn't change anything and is useless. At this point C2 enters the stage: when the code is called very often, it can be optimized and recompiled with C2 using the collected statistics.

As an example, I will test the following code snippet (my JDK is a slowdebug build of 9-internal):

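public class Demo {
    // Entry point added so the snippet can be run as-is.
    public static void main(String[] args) {
        run();
    }

    private static void run() {
        // Empty loop over (almost) the whole int range.
        for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE; i++) {
        }
        System.out.println("Done!");
    }
}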

With the following command line options:

-XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,*Demo.run

This prints the different versions of my run method, compiled with C1 and C2 respectively. For me, the final variant (C2) looks something like this:

...

; B1: # B3 B2 <- BLOCK HEAD IS JUNK  Freq: 1
0x00000000125461b0: mov   dword ptr [rsp+0ffffffffffff7000h], eax
0x00000000125461b7: push  rbp
0x00000000125461b8: sub   rsp, 40h
0x00000000125461bc: mov   ebp, dword ptr [rdx]
0x00000000125461be: mov   rcx, rdx
0x00000000125461c1: mov   r10, 57fbc220h
0x00000000125461cb: call  indirect r10    ; *iload_1

0x00000000125461ce: cmp   ebp, 7fffffffh  ; 7fffffff => 2147483647
0x00000000125461d4: jnl   125461dbh       ; jump if not less

; B2: # B3 <- B1  Freq: 0.999999
0x00000000125461d6: mov   ebp, 7fffffffh  ; *if_icmpge

; B3: # N44 <- B1 B2  Freq: 1       
0x00000000125461db: mov   edx, 0ffffff5dh
0x0000000012837d60: nop
0x0000000012837d61: nop
0x0000000012837d62: nop
0x0000000012837d63: call  0ae86fa0h

...

It is a little bit messy, but if you look closely, you may notice that there is no long-running loop here. There are 3 blocks: B1, B2 and B3, and the execution path can be either B1 -> B2 -> B3 or B1 -> B3. Freq: 1 is the normalized estimated frequency of a block's execution.
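
Read at the Java level, blocks B1 and B2 amount to replacing the loop with its final counter value. The sketch below is one interpretation of the listing, not output from the JVM; the class and method names are invented for illustration.

```java
public class LoopCollapse {
    // The original shape: count i upward until it reaches Integer.MAX_VALUE.
    static int looped(int i) {
        for (; i < Integer.MAX_VALUE; i++) {
        }
        return i;
    }

    // What B1/B2 appear to compute: no loop, just the final value of i.
    static int collapsed(int i) {
        if (i < Integer.MAX_VALUE) {   // B1: cmp ebp, 7fffffffh / jnl
            i = Integer.MAX_VALUE;     // B2: mov ebp, 7fffffffh
        }
        return i;                      // B3 continues from here
    }

    public static void main(String[] args) {
        int start = Integer.MAX_VALUE - 1000;
        System.out.println(looped(start) == collapsed(start));
    }
}
```

Both methods return the same value for every starting point, which is why the compiler is free to drop the iterations.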

Answer 5

Answered by Peter Lawrey

You are measuring the time it takes to detect that the loop doesn't do anything, compile the code in a background thread, and eliminate the code.

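public class A {
    public static void main(String[] args) throws InterruptedException {
        // Repeat the measurement so the effect of background compilation shows up.
        for (int t = 0; t < 5; t++) {
            long start = System.nanoTime();
            for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE; i++) {
            }
            long time = System.nanoTime() - start;

            String s = String.format("%d: Took %.6f ms", t, time / 1e6);
            // Sleep around the print so compilation log lines don't interleave with it.
            Thread.sleep(50);
            System.out.println(s);
            Thread.sleep(50);
        }
    }
}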

If you run this with -XX:+PrintCompilation, you can see that the code is compiled in the background at level 3 (the C1 compiler) and, after a few loops, at level 4 (the C2 compiler).

    129   34 %     3       A::main @ 15 (93 bytes)
    130   35       3       A::main (93 bytes)
    130   36 %     4       A::main @ 15 (93 bytes)
    131   34 %     3       A::main @ -2 (93 bytes)   made not entrant
    131   36 %     4       A::main @ -2 (93 bytes)   made not entrant
0: Took 2.510408 ms
    268   75 %     3       A::main @ 15 (93 bytes)
    271   76 %     4       A::main @ 15 (93 bytes)
    274   75 %     3       A::main @ -2 (93 bytes)   made not entrant
1: Took 5.629456 ms
2: Took 0.000000 ms
3: Took 0.000364 ms
4: Took 0.000365 ms

If you change the loop to use a long, it doesn't get optimised as much.

    for (long i = Integer.MIN_VALUE; i < Integer.MAX_VALUE; i++) {
    }

Instead, you get

0: Took 1579.267321 ms
1: Took 1674.148662 ms
2: Took 1885.692166 ms
3: Took 1709.870567 ms
4: Took 1754.005112 ms
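
The two variants can be compared side by side with a sketch like the one below. The class and method names are made up, and moving the loops into methods may change when they get compiled. A plausible explanation for the gap (an assumption, not something stated in this answer) is that the JIT's counted-loop optimizations apply to int counters but not to long ones; other JVM versions may behave differently.

```java
public class IntVsLongLoop {
    // Empty loop with an int counter; returns the elapsed nanoseconds.
    static long timeIntLoop(int from, int to) {
        long start = System.nanoTime();
        for (int i = from; i < to; i++) {
        }
        return System.nanoTime() - start;
    }

    // Same loop with a long counter; returns the elapsed nanoseconds.
    static long timeLongLoop(long from, long to) {
        long start = System.nanoTime();
        for (long i = from; i < to; i++) {
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        for (int t = 0; t < 5; t++) {
            long intNanos = timeIntLoop(Integer.MIN_VALUE, Integer.MAX_VALUE);
            long longNanos = timeLongLoop(Integer.MIN_VALUE, Integer.MAX_VALUE);
            System.out.printf("%d: int %.6f ms, long %.6f ms%n", t, intNanos / 1e6, longNanos / 1e6);
        }
    }
}
```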

Answer 6

Answered by DHARMENDRA SINGH

You measure the start and finish times in nanoseconds and divide by 10^6 to calculate the latency:

long d = (finish - start) / 1000000

It should be 10^9 if the result is meant to be in seconds, because 1 second = 10^9 nanoseconds; dividing by 10^6 gives milliseconds.
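
One way to avoid mixing up the divisors is to let java.util.concurrent.TimeUnit do the conversion. This is a minimal sketch; the class and method names are made up for illustration.

```java
import java.util.concurrent.TimeUnit;

public class ElapsedUnits {
    // Whole milliseconds in the given number of nanoseconds (truncating).
    static long toMillis(long nanos) {
        return TimeUnit.NANOSECONDS.toMillis(nanos);
    }

    // Fractional seconds in the given number of nanoseconds.
    static double toSeconds(long nanos) {
        return nanos / 1e9;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        long elapsed = System.nanoTime() - start;
        System.out.println("Used " + toMillis(elapsed) + " ms (" + toSeconds(elapsed) + " s)");
    }
}
```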
