Source: Stack Overflow, http://stackoverflow.com/questions/19844048/. This content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors.

Why is long slower than int in x64 Java?

Tags: java, performance, 32bit-64bit, long-integer

Asked by Techrocket9

I'm running Windows 8.1 x64 with Java 7 update 45 x64 (no 32 bit Java installed) on a Surface Pro 2 tablet.

The code below takes 1688ms when the type of i is a long and 109ms when i is an int. Why is long (a 64 bit type) an order of magnitude slower than int on a 64 bit platform with a 64 bit JVM?

My only speculation is that the CPU takes longer to add a 64 bit integer than a 32 bit one, but that seems unlikely. I suspect Haswell doesn't use ripple-carry adders.

I'm running this in Eclipse Kepler SR1, btw.

public class Main {

    private static long i = Integer.MAX_VALUE;

    public static void main(String[] args) {    
        System.out.println("Starting the loop");
        long startTime = System.currentTimeMillis();
        while(!decrementAndCheck()){
        }
        long endTime = System.currentTimeMillis();
        System.out.println("Finished the loop in " + (endTime - startTime) + "ms");
    }

    private static boolean decrementAndCheck() {
        return --i < 0;
    }

}

Edit: Here are the results from equivalent C++ code compiled by VS 2013 (below), same system. long: 72265ms, int: 74656ms. Those results were in debug 32-bit mode.

In 64-bit release mode: long: 875ms, long long: 906ms, int: 1047ms.

This suggests that the result I observed is JVM optimization weirdness rather than CPU limitations.

#include "stdafx.h"
#include <iostream>
#include <windows.h>
#include <limits.h>

using namespace std;

long long i = INT_MAX;

// Decrement the global counter and report whether it has gone negative.
bool decrementAndCheck() {
    return --i < 0;
}

int _tmain(int argc, _TCHAR* argv[])
{
    cout << "Starting the loop" << endl;

    // GetTickCount64 returns a 64-bit ULONGLONG; unsigned long is only 32 bits on Windows.
    ULONGLONG startTime = GetTickCount64();
    while (!decrementAndCheck()) {
    }
    ULONGLONG endTime = GetTickCount64();

    cout << "Finished the loop in " << (endTime - startTime) << "ms" << endl;

    return 0;
}

Edit: Just tried this again in Java 8 RTM, no significant change.

Accepted answer by tmyklebu

My JVM does this pretty straightforward thing to the inner loop when you use longs:

0x00007fdd859dbb80: test   %eax,0x5f7847a(%rip)  /* fun JVM hack */
0x00007fdd859dbb86: dec    %r11                  /* i-- */
0x00007fdd859dbb89: mov    %r11,0x258(%r10)      /* store i to memory */
0x00007fdd859dbb90: test   %r11,%r11             /* unnecessary test */
0x00007fdd859dbb93: jge    0x00007fdd859dbb80    /* go back to the loop top */

It cheats, hard, when you use ints; first there's some screwiness that I don't claim to understand but looks like setup for an unrolled loop:

0x00007f3dc290b5a1: mov    %r11d,%r9d
0x00007f3dc290b5a4: dec    %r9d
0x00007f3dc290b5a7: mov    %r9d,0x258(%r10)
0x00007f3dc290b5ae: test   %r9d,%r9d
0x00007f3dc290b5b1: jl     0x00007f3dc290b662
0x00007f3dc290b5b7: add    $0xfffffffffffffffe,%r11d
0x00007f3dc290b5bb: mov    %r9d,%ecx
0x00007f3dc290b5be: dec    %ecx
0x00007f3dc290b5c0: mov    %ecx,0x258(%r10)
0x00007f3dc290b5c7: cmp    %r11d,%ecx
0x00007f3dc290b5ca: jle    0x00007f3dc290b5d1
0x00007f3dc290b5cc: mov    %ecx,%r9d
0x00007f3dc290b5cf: jmp    0x00007f3dc290b5bb
0x00007f3dc290b5d1: and    $0xfffffffffffffffe,%r9d
0x00007f3dc290b5d5: mov    %r9d,%r8d
0x00007f3dc290b5d8: neg    %r8d
0x00007f3dc290b5db: sar    $0x1f,%r8d
0x00007f3dc290b5df: shr    $0x1f,%r8d
0x00007f3dc290b5e3: sub    %r9d,%r8d
0x00007f3dc290b5e6: sar    %r8d
0x00007f3dc290b5e9: neg    %r8d
0x00007f3dc290b5ec: and    $0xfffffffffffffffe,%r8d
0x00007f3dc290b5f0: shl    %r8d
0x00007f3dc290b5f3: mov    %r8d,%r11d
0x00007f3dc290b5f6: neg    %r11d
0x00007f3dc290b5f9: sar    $0x1f,%r11d
0x00007f3dc290b5fd: shr    $0x1e,%r11d
0x00007f3dc290b601: sub    %r8d,%r11d
0x00007f3dc290b604: sar    $0x2,%r11d
0x00007f3dc290b608: neg    %r11d
0x00007f3dc290b60b: and    $0xfffffffffffffffe,%r11d
0x00007f3dc290b60f: shl    $0x2,%r11d
0x00007f3dc290b613: mov    %r11d,%r9d
0x00007f3dc290b616: neg    %r9d
0x00007f3dc290b619: sar    $0x1f,%r9d
0x00007f3dc290b61d: shr    $0x1d,%r9d
0x00007f3dc290b621: sub    %r11d,%r9d
0x00007f3dc290b624: sar    $0x3,%r9d
0x00007f3dc290b628: neg    %r9d
0x00007f3dc290b62b: and    $0xfffffffffffffffe,%r9d
0x00007f3dc290b62f: shl    $0x3,%r9d
0x00007f3dc290b633: mov    %ecx,%r11d
0x00007f3dc290b636: sub    %r9d,%r11d
0x00007f3dc290b639: cmp    %r11d,%ecx
0x00007f3dc290b63c: jle    0x00007f3dc290b64f
0x00007f3dc290b63e: xchg   %ax,%ax               /* OK, fine; I know what a nop looks like */

then the unrolled loop itself:

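0x00007f3dc290b640: add    $0xfffffffffffffff0,%ecx
0x00007f3dc290b643: mov    %ecx,0x258(%r10)
0x00007f3dc290b64a: cmp    %r11d,%ecx
0x00007f3dc290b64d: jg     0x00007f3dc290b640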

then the teardown code for the unrolled loop, itself a test and a straight loop:

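0x00007f3dc290b64f: cmp    $0xffffffffffffffff,%ecx
0x00007f3dc290b652: jle    0x00007f3dc290b662
0x00007f3dc290b654: dec    %ecx
0x00007f3dc290b656: mov    %ecx,0x258(%r10)
0x00007f3dc290b65d: cmp    $0xffffffffffffffff,%ecx
0x00007f3dc290b660: jg     0x00007f3dc290b654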

So it goes 16 times faster for ints because the JIT unrolled the int loop 16 times, but didn't unroll the long loop at all.
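
To make the effect of unrolling concrete, here is a minimal Java sketch of the two loop shapes. It is a hand-written illustration, not what HotSpot actually emits: the class and method names are made up, the counter is a local rather than a static field, and the main-loop condition is simplified. The stride of 16 is taken from the add $0xfffffffffffffff0,%ecx instruction in the unrolled loop.

```java
final class UnrollSketch {
    // Shape of the long loop: one decrement, one test and one branch per step.
    // Returns {final counter value, number of loop branches taken}.
    static long[] straightLoop(int start) {
        int i = start;
        long branches = 0;
        while (true) {
            branches++;
            if (--i < 0) {
                break;
            }
        }
        return new long[] { i, branches };
    }

    // Shape of the int loop: a main loop that retires 16 decrements per branch,
    // followed by a straight cleanup loop for the remainder.
    // The "i >= 16" guard is a simplification of the limit the JIT precomputes.
    static long[] unrolledLoop(int start) {
        int i = start;
        long branches = 0;
        while (i >= 16) {
            i -= 16;
            branches++;
        }
        while (true) {
            branches++;
            if (--i < 0) {
                break;
            }
        }
        return new long[] { i, branches };
    }

    public static void main(String[] args) {
        int start = 1000000; // hypothetical start value, small enough to run instantly
        System.out.println("straight branches: " + straightLoop(start)[1]);
        System.out.println("unrolled branches: " + unrolledLoop(start)[1]);
    }
}
```

Both methods finish with the counter at -1, but the unrolled shape takes roughly a sixteenth as many branches, which is in line with the timing ratio seen in the question.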

For completeness, here is the code I actually tried:

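public class foo136 {
  private static int i = Integer.MAX_VALUE;
  public static void main(String[] args) {
    System.out.println("Starting the loop");
    for (int foo = 0; foo < 100; foo++)
      doit();
  }

  static void doit() {
    i = Integer.MAX_VALUE;
    long startTime = System.currentTimeMillis();
    while(!decrementAndCheck()){
    }
    long endTime = System.currentTimeMillis();
    System.out.println("Finished the loop in " + (endTime - startTime) + "ms");
  }

  private static boolean decrementAndCheck() {
    return --i < 0;
  }
}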

The assembly dumps were generated using the options -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly. Note that you need to mess around with your JVM installation to have this work for you as well; you need to put some random shared library in exactly the right place or it will fail.

Answer by Vaibhav Raj

The basic unit of data in a Java Virtual Machine is the word. The choice of word size is left to the JVM implementation. A JVM implementation should choose a minimum word size of 32 bits. It can choose a larger word size to gain efficiency. Nor is there any restriction that a 64-bit JVM must use 64-bit words only.

The underlying architecture doesn't dictate that the word size should be the same either. The JVM reads/writes data word by word. This is the reason why it might be taking longer for a long than an int.

Here you can find more on the same topic.

Answer by chrylis -cautiouslyoptimistic-

The JVM stack is defined in terms of words, whose size is an implementation detail but must be at least 32 bits wide. The JVM implementer may use 64-bit words, but the bytecode can't rely on this, and so operations with long or double values have to be handled with extra care. In particular, the JVM integer branch instructions are defined on exactly the type int.

In the case of your code, disassembly is instructive. Here's the bytecode for the int version as compiled by the Oracle JDK 7:

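private static boolean decrementAndCheck();
  Code:
     0: getstatic     #14  // Field i:I
     3: iconst_1
     4: isub
     5: dup
     6: putstatic     #14  // Field i:I
     9: ifge          16
    12: iconst_1
    13: goto          17
    16: iconst_0
    17: ireturn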

Note that the JVM will load the value of your static i (0), subtract one (3-4), duplicate the value on the stack (5), and push it back into the variable (6). It then does a compare-with-zero branch and returns.

The version with the long is a bit more complicated:

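private static boolean decrementAndCheck();
  Code:
     0: getstatic     #14  // Field i:J
     3: lconst_1
     4: lsub
     5: dup2
     6: putstatic     #14  // Field i:J
     9: lconst_0
    10: lcmp
    11: ifge          18
    14: iconst_1
    15: goto          19
    18: iconst_0
    19: ireturn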

First, when the JVM duplicates the new value on the stack (5), it has to duplicate two stack words. In your case, it's quite possible that this is no more expensive than duplicating one, since the JVM is free to use a 64-bit word if convenient. However, you'll notice that the branch logic is longer here. The JVM doesn't have an instruction to compare a long with zero, so it has to push a constant 0L onto the stack (9), do a general long comparison (10), and then branch on the value of that calculation.
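
The difference between the two branch sequences can be modelled in plain Java. This is only a sketch of the bytecode semantics described here: the lcmp method is a hand-written stand-in for the JVM instruction of the same name, and the class and method names are invented for illustration. What the JIT makes of either sequence is a separate question.

```java
final class LongCompareSketch {
    // Stand-in for the lcmp instruction: takes two longs and yields -1, 0 or 1 as an int.
    static int lcmp(long a, long b) {
        return a < b ? -1 : (a == b ? 0 : 1);
    }

    // int version: ifge tests the int on the stack against zero directly.
    static boolean isNegativeInt(int value) {
        return !(value >= 0);
    }

    // long version: push 0L (lconst_0), compare (lcmp), then branch on the int result (ifge).
    static boolean isNegativeLong(long value) {
        int comparison = lcmp(value, 0L);
        return !(comparison >= 0);
    }

    public static void main(String[] args) {
        System.out.println(isNegativeInt(-1) + " " + isNegativeLong(-1L)); // prints "true true"
        System.out.println(isNegativeInt(0) + " " + isNegativeLong(0L));   // prints "false false"
    }
}
```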

Here are two plausible scenarios:

  • The JVM is following the bytecode path exactly. In this case, it's doing more work in the long version, pushing and popping several extra values, and these are on the virtual managed stack, not the real hardware-assisted CPU stack. If this is the case, you'll still see a significant performance difference after warmup.
  • The JVM realizes that it can optimize this code. In this case, it's taking extra time to optimize away some of the practically unnecessary push/compare logic. If this is the case, you'll see very little performance difference after warmup.

I recommend you write a correct microbenchmark to eliminate the effect of having the JIT kick in, and also try this with a final condition that isn't zero, to force the JVM to do the same comparison on the int that it does with the long.
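
One way to follow this advice without extra libraries is sketched below. It's a rough harness, not a substitute for a real benchmarking framework: the class name, the LIMIT end condition, and the round count are all assumptions chosen for illustration. Each loop is run several times so that later rounds measure JIT-compiled code, and both loops stop at the same non-zero limit so the int version can't use the compare-with-zero branch.

```java
final class WarmupBench {
    static final int LIMIT = 7; // hypothetical non-zero end condition
    static int intCounter;
    static long longCounter;

    static boolean decrementAndCheckInt() {
        return --intCounter < LIMIT;
    }

    static boolean decrementAndCheckLong() {
        return --longCounter < LIMIT;
    }

    // Runs the int loop once from 'start' and returns the elapsed nanoseconds.
    static long timeInt(int start) {
        intCounter = start;
        long begin = System.nanoTime();
        while (!decrementAndCheckInt()) {
        }
        return System.nanoTime() - begin;
    }

    // Runs the long loop once from 'start' and returns the elapsed nanoseconds.
    static long timeLong(int start) {
        longCounter = start;
        long begin = System.nanoTime();
        while (!decrementAndCheckLong()) {
        }
        return System.nanoTime() - begin;
    }

    public static void main(String[] args) {
        int rounds = 5; // early rounds act as warmup; compare the later ones
        for (int round = 0; round < rounds; round++) {
            long intMs = timeInt(Integer.MAX_VALUE) / 1000000;
            long longMs = timeLong(Integer.MAX_VALUE) / 1000000;
            System.out.println("round " + round + ": int " + intMs + " ms, long " + longMs + " ms");
        }
    }
}
```

If the gap between the two columns persists in the later rounds, warmup alone doesn't explain it.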

Answer by Durandal

I don't have a 64 bit machine to test with, but the rather large difference suggests that there is more than the slightly longer bytecode at work.

I see very close times for long/int (4400 vs 4800ms) on my 32-bit 1.7.0_45.

This is only a guess, but I strongly suspect that it is the effect of a memory misalignment penalty. To confirm/deny the suspicion, try adding a public static int dummy = 0; before the declaration of i. That will push i down by 4 bytes in memory layout and may make it properly aligned for better performance. Confirmed to be not causing the issue.

EDIT: The reasoning behind this is that the VM may not reorder fields at its leisure, adding padding for optimal alignment, since that may interfere with JNI (not the case).

Answer by Hot Licks

For the record, this version does a crude "warmup":

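public class LongSpeed {

    private static long i = Integer.MAX_VALUE;
    private static int j = Integer.MAX_VALUE;

    public static void main(String[] args) {

        for (int x = 0; x < 10; x++) {
            runLong();
            runWord();
        }
    }

    private static void runLong() {
        System.out.println("Starting the long loop");
        i = Integer.MAX_VALUE;
        long startTime = System.currentTimeMillis();
        while(!decrementAndCheckI()){

        }
        long endTime = System.currentTimeMillis();

        System.out.println("Finished the long loop in " + (endTime - startTime) + "ms");
    }

    private static void runWord() {
        System.out.println("Starting the word loop");
        j = Integer.MAX_VALUE;
        long startTime = System.currentTimeMillis();
        while(!decrementAndCheckJ()){

        }
        long endTime = System.currentTimeMillis();

        System.out.println("Finished the word loop in " + (endTime - startTime) + "ms");
    }

    private static boolean decrementAndCheckI() {
        return --i < 0;
    }

    private static boolean decrementAndCheckJ() {
        return --j < 0;
    }

}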

The overall times improve by about 30%, but the ratio between the two remains roughly the same.

Answer by R.Moeller

For the record, if I use

boolean decrementAndCheckLong() {
    lo = lo - 1l;
    return lo < -1l;
}

(changed "l--" to "l = l - 1l"), long performance improves by ~50%.
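
For anyone wanting to reproduce this, a self-contained version of that variant might look like the following. The class name, the static lo field, the round count, and the timing code are assumptions borrowed from the shape of the question's program; only the body of decrementAndCheckLong comes from the snippet itself.

```java
final class ExplicitSubtract {
    static long lo;

    // Explicit subtraction instead of --lo.
    // Note that this compares against -1, so it runs one step further than "--i < 0".
    static boolean decrementAndCheckLong() {
        lo = lo - 1L;
        return lo < -1L;
    }

    // Counts down from 'start' and returns the elapsed time in milliseconds.
    static long run(long start) {
        lo = start;
        long startTime = System.currentTimeMillis();
        while (!decrementAndCheckLong()) {
        }
        return System.currentTimeMillis() - startTime;
    }

    public static void main(String[] args) {
        for (int round = 0; round < 5; round++) {
            System.out.println("Finished the loop in " + run(Integer.MAX_VALUE) + "ms");
        }
    }
}
```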

Answer by tucuxi

I have just written a benchmark using caliper.

The results are quite consistent with the original code: a ~12x speedup for using int over long. It certainly seems that the loop unrolling reported by tmyklebu, or something very similar, is going on.

timeIntDecrements         195,266,845.000
timeLongDecrements      2,321,447,978.000

This is my code; note that it uses a freshly-built snapshot of caliper, since I could not figure out how to code against their existing beta release.

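package test;

import com.google.caliper.Benchmark;
import com.google.caliper.Param;

public final class App {

    @Param({""+1}) int number;

    private static class IntTest {
        public static int v;
        public static void reset() {
            v = Integer.MAX_VALUE;
        }
        public static boolean decrementAndCheck() {
            return --v < 0;
        }
    }

    private static class LongTest {
        public static long v;
        public static void reset() {
            v = Integer.MAX_VALUE;
        }
        public static boolean decrementAndCheck() {
            return --v < 0;
        }
    }

    @Benchmark
    int timeLongDecrements(int reps) {
        int k=0;
        for (int i=0; i<reps; i++) {
            LongTest.reset();
            while (!LongTest.decrementAndCheck()) { k++; }
        }
        return (int)LongTest.v | k;
    }

    @Benchmark
    int timeIntDecrements(int reps) {
        int k=0;
        for (int i=0; i<reps; i++) {
            IntTest.reset();
            while (!IntTest.decrementAndCheck()) { k++; }
        }
        return IntTest.v | k;
    }
}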