为什么在 Java 中 (a*b != 0) 比 (a != 0 && b != 0) 快?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/35531369/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-11 16:47:09  来源:igfitidea点击:

Why is (a*b != 0) faster than (a != 0 && b != 0) in Java?

javaperformanceprocessing-efficiencymicrobenchmarkbranch-prediction

提问by Maljam

I'm writing some code in Java where, at some point, the flow of the program is determined by whether two int variables, "a" and "b", are non-zero (note: a and b are never negative, and never within integer overflow range).

我正在用 Java 编写一些代码,在某些时候,程序的流程取决于两个 int 变量“a”和“b”是否为非零(注意:a 和 b 永远不会为负,并且永远不会在整数溢出范围内)。

I can evaluate it with

我可以评估它

if (a != 0 && b != 0) { /* Some code */ }

Or alternatively

或者替代地

if (a*b != 0) { /* Some code */ }

Because I expect that piece of code to run millions of times per run, I was wondering which one would be faster. I did the experiment by comparing them on a huge randomly generated array, and I was also curious to see how the sparsity of the array (fraction of data = 0) would affect the results:

因为我希望这段代码每次运行数百万次,所以我想知道哪一个会更快。我通过在一个巨大的随机生成的数组上比较它们来进行实验,我也很好奇数组的稀疏性(数据的分数 = 0)会如何影响结果:

long time;
final int len = 50000000;
int arbitrary = 0;
int[][] nums = new int[2][len];

for (double fraction = 0 ; fraction <= 0.9 ; fraction += 0.0078125) {
    for(int i = 0 ; i < 2 ; i++) {
        for(int j = 0 ; j < len ; j++) {
            double random = Math.random();

            if(random < fraction) nums[i][j] = 0;
            else nums[i][j] = (int) (random*15 + 1);
        }
    }

    time = System.currentTimeMillis();

    for(int i = 0 ; i < len ; i++) {
        if( /*insert nums[0][i]*nums[1][i]!=0 or nums[0][i]!=0 && nums[1][i]!=0*/ ) arbitrary++;
    }
    System.out.println(System.currentTimeMillis() - time);
}

And the results show that if you expect "a" or "b" to be equal to 0 more than ~3% of the time, a*b != 0is faster than a!=0 && b!=0:

结果表明,如果您期望“a”或“b”在超过 3% 的时间内等于 0,a*b != 0则比 快a!=0 && b!=0

Graphical graph of the results of a AND b non-zero

a AND b 非零结果的图形图

I'm curious to know why. Could anyone shed some light? Is it the compiler or is it at the hardware level?

我很想知道为什么。有人能解释一下吗?是编译器还是硬件级别?

Edit:Out of curiosity...now that I learned about branch prediction, I was wondering what the analog comparison would show for a ORb is non-zero:

编辑:出于好奇......现在我了解了分支预测,我想知道模拟比较会显示一个ORb 是非零:

Graph of a or b non-zero

a 或 b 非零的图

We do see the same effect of branch prediction as expected, interestingly the graph is somewhat flipped along the X-axis.

我们确实看到了与预期相同的分支预测效果,有趣的是,该图沿 X 轴有些翻转。

Update

更新

1- I added !(a==0 || b==0)to the analysis to see what happens.

1- 我添加!(a==0 || b==0)到分析中以查看会发生什么。

2- I also included a != 0 || b != 0, (a+b) != 0and (a|b) != 0out of curiosity, after learning about branch prediction. But they are not logically equivalent to the other expressions, because only a ORb needs to be non-zero to return true, so they are not meant to be compared for processing efficiency.

2 -我也包括在内a != 0 || b != 0(a+b) != 0(a|b) != 0出于好奇,了解分支预测之后。但它们在逻辑上并不等价于其他表达式,因为只有 a ORb 需要非零才能返回 true,因此它们并不意味着为了处理效率而进行比较。

3- I also added the actual benchmark that I used for the analysis, which is just iterating an arbitrary int variable.

3- 我还添加了用于分析的实际基准测试,它只是迭代任意 int 变量。

4- Some people were suggesting to include a != 0 & b != 0as opposed to a != 0 && b != 0, with the prediction that it would behave more closely to a*b != 0because we would remove the branch prediction effect. I didn't know that &could be used with boolean variables, I thought it was only used for binary operations with integers.

4- 有些人建议包括a != 0 & b != 0而不是a != 0 && b != 0,预测它会表现得更接近,a*b != 0因为我们将消除分支预测效果。我不知道它&可以与布尔变量一起使用,我认为它仅用于整数的二元运算。

Note: In the context that I was considering all this, int overflow is not an issue, but that's definitely an important consideration in general contexts.

注意:在我考虑所有这些的情况下,int 溢出不是问题,但这在一般情况下绝对是一个重要的考虑因素。

CPU: Intel Core i7-3610QM @ 2.3GHz

CPU:英特尔酷睿 i7-3610QM @ 2.3GHz

Java version: 1.8.0_45
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)

Java 版本:1.8.0_45
Java(TM) SE 运行时环境(构建 1.8.0_45-b14)
Java HotSpot(TM) 64 位服务器 VM(构建 25.45-b02,混合模式)

采纳答案by Stephen C

I'm ignoring the issue that your benchmarking mightbe flawed, and taking the result at face value.

我忽略了您的基准测试可能存在缺陷的问题,并将结果视为表面价值。

Is it the compiler or is it at the hardware level?

是编译器还是硬件级别?

That latter, I think:

后者,我认为:

  if (a != 0 && b != 0)

will compile to 2 memory loads and two conditional branches

将编译为 2 个内存负载和两个条件分支

  if (a * b != 0)

will compile to 2 memory loads, a multiply and one conditional branch.

将编译为 2 个内存负载,一个乘法和一个条件分支。

The multiply is likely to be faster than the second conditional branch if the hardware-level branch prediction is ineffective. As you increase the ratio ... the branch prediction is becoming less effective.

如果硬件级分支预测无效,则乘法可能比第二个条件分支更快。随着比率的增加……分支预测变得越来越不有效。

The reason that conditional branches are slower is that they cause the instruction execution pipeline to stall. Branch prediction is about avoiding the stall by predicting which way the branch is going to go and speculatively choosing the next instruction based on that. If the prediction fails, there is a delay while the instruction for the other direction is loaded.

条件分支较慢的原因是它们会导致指令执行流水线停顿。分支预测是通过预测分支将走向何方并据此推测性地选择下一条指令来避免停顿。如果预测失败,则在加载另一个方向的指令时会出现延迟。

(Note: the above explanation is oversimplified. For a more accurate explanation, you need to look at the literature provided by the CPU manufacturer for assembly language coders and compiler writers. The Wikipedia page on Branch Predictorsis good background.)

(注意:上面的解释过于简单了。要获得更准确的解释,您需要查看 CPU 制造商为汇编语言编码人员和编译器编写人员提供的文献。Branch Predictors上的维基百科页面是很好的背景。)



However, there is one thing that you need to be careful about with this optimization. Are there any values where a * b != 0will give the wrong answer? Consider cases where computing the product results in integer overflow.

但是,您需要注意此优化的一件事。是否有任何值a * b != 0会给出错误答案?考虑计算乘积导致整数溢出的情况。



UPDATE

更新

Your graphs tend to confirm what I said.

你的图表往往证实了我所说的。

  • There is also a "branch prediction" effect in the conditional branch a * b != 0case, and this comes out in the graphs.

  • If you project the curves beyond 0.9 on the X-axis, it looks like 1) they will meet at about 1.0 and 2) the meeting point will be at roughly the same Y value as for X = 0.0.

  • 在条件分支a * b != 0情况下也有“分支预测”效应,这在图表中表现出来。

  • 如果您在 X 轴上投影超过 0.9 的曲线,则看起来像 1) 它们将在大约 1.0 处相交,2) 相交点将与 X = 0.0 的 Y 值大致相同。



UPDATE 2

更新 2

I don't understand why the curves are different for the a + b != 0and the a | b != 0cases. There could besomething clever in the branch predictors logic. Or it could indicate something else.

我不明白为什么曲线a + b != 0a | b != 0案例不同。有可能是一些在分支预测逻辑聪明。或者它可以表示其他东西。

(Note that this kind of thing can be specific to a particular chip model number or even version. The results of your benchmarks could be different on other systems.)

(请注意,这种事情可能特定于特定的芯片型号甚至版本。您的基准测试结果在其他系统上可能会有所不同。)

However, they both have the advantage of working for all non-negative values of aand b.

然而,它们都具有对所有非负值工作的优势ab

回答by Sanket Gupte

When we take the multiplication, even if one number is 0, then the product is 0. While writing

当我们做乘法时,即使一个数是 0,那么乘积也是 0。在写的时候

    (a*b != 0)

It evaluates the result of the product thereby eliminating the first few occurrences of the iteration starting from 0. As a result the comparisons are less than that when the condition is

它评估乘积的结果,从而消除从 0 开始的迭代的前几次出现。因此,比较小于条件为时的比较

   (a != 0 && b != 0)

Where every element is compared with 0 and evaluated. Hence the time required is less. But I believe that the second condition might give you more accurate solution.

其中每个元素都与 0 进行比较并进行评估。因此所需的时间更少。但我相信第二个条件可能会给你更准确的解决方案。

回答by Boann

I think your benchmark has some flaws and might not be useful for inferring about real programs. Here are my thoughts:

我认为您的基准测试存在一些缺陷,对于推断真实程序可能没有用。以下是我的想法:

  • (a|b)!=0and (a+b)!=0test if eithervalue is non-zero, whereas a != 0 && b != 0and (a*b)!=0test if bothare non-zero. So you are not comparing the timing of just the arithmetic: if the condition is true more often, it causes more executions of the ifbody, which takes more time too.

  • (a+b)!=0will do the wrong thing for positive and negative values that sum to zero, so you can't use it in the general case, even if it works here.

  • Similarly, (a*b)!=0will do the wrong thing for values that overflow. (Random example: 196608 * 327680 is 0 because the true result happens to be divisible by 232, so its low 32 bits are 0, and those bits are all you get if it's an intoperation.)

  • The VM will optimize the expression during the first few runs of the outer (fraction) loop, when fractionis 0, when the branches are almost never taken. The optimizer may do different things if you start fractionat 0.5.

  • Unless the VM is able to eliminate some of the array bounds checks here, there are four other branches in the expression just due to the bounds checks, and that's a complicating factor when trying to figure out what's happening at a low level. You might get different results if you split the two-dimensional array into two flat arrays, changing nums[0][i]and nums[1][i]to nums0[i]and nums1[i].

  • CPU branch predictors detect short patterns in the data, or runs of all branches being taken or not taken. Your randomly generated benchmark data is the worst-case scenario for a branch predictor. If real-world data has a predictable pattern, or it has long runs of all-zero and all-non-zero values, the branches could cost muchless.

  • The particular code that is executed after the condition is met can affect the performance of evaluating the condition itself, because it affects things like whether or not the loop can be unrolled, which CPU registers are available, and if any of the fetched numsvalues need to be reused after evaluating the condition. Merely incrementing a counter in the benchmark is not a perfect placeholder for what real code would do.

  • System.currentTimeMillis()is on most systems not more accurate than +/- 10 ms. System.nanoTime()is usually more accurate.

  • (a|b)!=0(a+b)!=0测试其中一个值是否为非零,而a != 0 && b != 0(a*b)!=0测试两者是否为非零。所以你不是在比较算术的时间:如果条件更频繁地为真,它会导致更多的if主体执行,这也需要更多的时间。

  • (a+b)!=0会对总和为零的正负值做错误的事情,所以你不能在一般情况下使用它,即使它在这里工作。

  • 同样,(a*b)!=0会对溢出的值做错误的事情。(随机示例: 196608 * 327680 是 0 因为真正的结果恰好可以被 2 32整除,所以它的低 32 位是 0,如果它是一个int操作,这些位就是你得到的全部。)

  • VM 将在外 ( fraction) 循环的前几次运行期间优化表达式,whenfraction为 0,此时分支几乎从未被采用。如果您从fraction0.5开始,优化器可能会做不同的事情。

  • 除非 VM 能够消除这里的一些数组边界检查,否则表达式中还有四个其他分支仅由于边界检查,当试图找出低级别发生的事情时,这是一个复杂的因素。如果将二维数组拆分为两个平面数组,将nums[0][i]and更改nums[1][i]nums0[i]and ,可能会得到不同的结果nums1[i]

  • CPU 分支预测器检测数据中的短模式,或所有分支的运行被采用或未采用。您随机生成的基准数据是分支预测器最坏情况。如果现实世界的数据具有可预测的模式,或者它具有全零和全非零值的长期运行,则分支的成本可能会低得多

  • 在满足条件后执行的特定代码会影响评估条件本身的性能,因为它会影响循环是否可以展开、哪些 CPU 寄存器可用以及是否nums需要获取任何值评估条件后重新使用。仅仅在基准测试中增加一个计数器并不是真正代码会做什么的完美占位符。

  • System.currentTimeMillis()在大多数系统上,精度不超过 +/- 10 毫秒。System.nanoTime()通常更准确。

There are lots of uncertainties, and it's always hard to say anything definite with these sorts of micro-optimizations because a trick that is faster on one VM or CPU can be slower on another. If running the 32-bit HotSpot JVM, rather than the 64-bit version, be aware that it comes in two flavors: with the "Client" VM having different (weaker) optimizations compared to the "Server" VM.

有很多不确定性,对于这些类型的微优化总是很难说任何确定的,因为在一个 VM 或 CPU 上更快的技巧在另一个 VM 或 CPU 上可能会更慢。如果运行 32 位 HotSpot JVM,而不是 64 位版本,请注意它有两种风格:与“服务器”VM 相比,“客户端”VM 具有不同(较弱)的优化。

If you can disassemble the machine code generated by the VM, do that rather than trying to guess what it does!

如果您可以反汇编 VM 生成的机器代码,那就去做,而不是试图猜测它的作用!

回答by Pagefault

The answers here are good, though I had an idea that might improve things.

这里的答案很好,尽管我有一个可能会改进的想法。

Since the two branches and associated branch prediction are the likely culprit, we may be able to reduce the branching to a single branch without changing the logic at all.

由于这两个分支和关联的分支预测可能是罪魁祸首,我们可以将分支减少到单个分支,而根本不改变逻辑。

bool aNotZero = (nums[0][i] != 0);
bool bNotZero = (nums[1][i] != 0);
if (aNotZero && bNotZero) { /* Some code */ }

It may also work to do

它也可能工作

int a = nums[0][i];
int b = nums[1][i];
if (a != 0 && b != 0) { /* Some code */ }

The reason being, by the rules of short circuiting, if the first boolean is false, the second should not be evaluated. It has to perform an extra branch to avoid evaluating nums[1][i]if nums[0][i]was false. Now, you may not care that nums[1][i]gets evaluated, but the compiler can't be certain that it won't throw an out of range or null ref when you do. By reducing the if block to simple bools, the compiler may be smart enough to realize that evaluating the second boolean unnecessarily won't have negative side effects.

原因是,根据短路规则,如果第一个布尔值为 false,则不应评估第二个布尔值。它必须执行额外的分支以避免评估nums[1][i]是否nums[0][i]为假。现在,您可能不在乎nums[1][i]被评估,但编译器无法确定它不会在您执行此操作时抛出超出范围或 null ref。通过将 if 块简化为简单的布尔值,编译器可能足够聪明,可以意识到不必要地评估第二个布尔值不会产生负面影响。

回答by StackedCrooked

You are using randomized input data which makes the branches unpredictable. In practice branches are often (~90%) predictable so in real code the branchful code is likely to be faster.

您正在使用随机输入数据,这使得分支不可预测。在实践中,分支通常 (~90%) 是可预测的,因此在实际代码中,分支代码可能会更快。

That said. I don't see how a*b != 0can be faster than (a|b) != 0. Generally integer multiplication is more expensive than a bitwise OR. But things like this occasionally get weird. See for example the "Example 7: Hardware complexities" example from Gallery of Processor Cache Effects.

那说。我不明白怎么a*b != 0能比(a|b) != 0. 通常整数乘法比按位 OR 更昂贵。但这样的事情偶尔会变得奇怪。例如,请参阅处理器缓存效果库中的“示例 7:硬件复杂性”示例。