Java: fastest way to check whether a byte array is all zeros

Question

Asked by user2555349

I have a byte[4096] and was wondering what the fastest way is to check whether all values are zero.

Is there any way faster than doing:

byte[] b = new byte[4096];
b[4095] = 1;
for(int i=0;i<b.length;i++)
    if(b[i] != 0)
        return false; // Not Empty

Answer 1

Accepted answer by skiwi

I have rewritten this answer: I was originally summing all the bytes, which is incorrect because Java bytes are signed (values can cancel out), so I need to OR them instead. I have also corrected the JVM warmup.
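
To illustrate the signed-byte problem mentioned above, here is a minimal sketch. The class and method names (SumVersusOr, isZeroBySum, isZeroByOr) are made up for this illustration and are not part of the original answer.

```java
// Sketch: why summing signed bytes can miss non-zero data, while OR cannot.
final class SumVersusOr {
    // Flawed check: Java bytes are signed, so positive and negative values can cancel out.
    static boolean isZeroBySum(byte[] data) {
        int sum = 0;
        for (byte b : data) {
            sum += b;
        }
        return sum == 0;
    }

    // OR accumulates every set bit, so the result is 0 only if every byte is 0.
    static boolean isZeroByOr(byte[] data) {
        int acc = 0;
        for (byte b : data) {
            acc |= b;
        }
        return acc == 0;
    }

    public static void main(String[] args) {
        byte[] data = {1, -1}; // non-zero content whose sum happens to be 0
        System.out.println(isZeroBySum(data)); // prints "true" (a false positive)
        System.out.println(isZeroByOr(data));  // prints "false"
    }
}
```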

Your best bet really is to simply loop over all values.

I suppose you have three major options available:

1. OR all elements together and check the result.
2. Do branchless comparisons.
3. Do comparisons with a branch.

I don't know how well ORing bytes performs at a low level in Java, but I do know that the CPU's branch predictor comes into play when you use branched comparisons.

Therefore, with the loop below, I expect the following to happen:

byte[] array = new byte[4096];
for (byte b : array) {
    if (b != 0) {
        return false;
    }
}

Relatively slow comparisons in the first few iterations, while the branch predictor is still warming up.
Very fast branched comparisons after that, thanks to branch prediction, as every value should be zero anyway.

If the loop does hit a non-zero value, the branch predictor will mispredict and that comparison will be slow, but you are then at the end of the computation anyway, since you want to return false. I think the cost of one branch misprediction is an order of magnitude smaller than the cost of continuing to iterate over the array.

I furthermore believe that for (byte b : array) should be fine, as it should be compiled directly into indexed array iteration. As far as I know there is no such thing as a PrimitiveArrayIterator, which would cause extra method calls (as when iterating over a list) until the code gets inlined.
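
As a rough illustration of that point, the two methods below should behave identically. The second one spells out the indexed form that the Java Language Specification defines the enhanced for loop over an array to mean. The class and method names are hypothetical, and whether the JIT produces identical machine code for both is an assumption that is not verified here.

```java
// Sketch: enhanced for over an array versus its indexed equivalent.
final class ArrayLoopForms {
    // Enhanced for loop, as used in the answer.
    static boolean allZeroForEach(byte[] array) {
        for (byte b : array) {
            if (b != 0) {
                return false;
            }
        }
        return true;
    }

    // Indexed form: no iterator object is involved for arrays.
    static boolean allZeroIndexed(byte[] array) {
        byte[] a = array;
        for (int i = 0; i < a.length; i++) {
            byte b = a[i];
            if (b != 0) {
                return false;
            }
        }
        return true;
    }
}
```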

Update

I wrote my own benchmarks which give some interesting results... Unfortunately I couldn't use any of the existing benchmark tools as they are pretty hard to get installed correctly.

I also decided to group options 1 and 2 together, as I think they are actually the same: with a branchless approach you usually OR everything together (minus the condition) and then check the final result. The condition here is x > 0, and hence an OR with zero is presumably a no-op.

The code:

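import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Consumer;
import java.util.stream.IntStream;

public class Benchmark {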
    private void start() {
        //setup byte arrays
        List<byte[]> arrays = createByteArrays(700_000);

        //warmup and benchmark repeated
        arrays.forEach(this::byteArrayCheck12);
        benchmark(arrays, this::byteArrayCheck12, "byteArrayCheck12");

        arrays.forEach(this::byteArrayCheck3);
        benchmark(arrays, this::byteArrayCheck3, "byteArrayCheck3");

        arrays.forEach(this::byteArrayCheck4);
        benchmark(arrays, this::byteArrayCheck4, "byteArrayCheck4");

        arrays.forEach(this::byteArrayCheck5);
        benchmark(arrays, this::byteArrayCheck5, "byteArrayCheck5");
    }

    private void benchmark(final List<byte[]> arrays, final Consumer<byte[]> method, final String name) {
        long start = System.nanoTime();
        arrays.forEach(method);
        long end = System.nanoTime();
        double nanosecondsPerIteration = (end - start) * 1d / arrays.size();
        System.out.println("Benchmark: " + name + " / iterations: " + arrays.size() + " / time per iteration: " + nanosecondsPerIteration + "ns");
    }

    private List<byte[]> createByteArrays(final int amount) {
        Random random = new Random();
        List<byte[]> resultList = new ArrayList<>();
        for (int i = 0; i < amount; i++) {
            byte[] byteArray = new byte[4096];
            byteArray[random.nextInt(4096)] = 1;
            resultList.add(byteArray);
        }
        return resultList;
    }

    private boolean byteArrayCheck12(final byte[] array) {
        int sum = 0;
        for (byte b : array) {
            sum |= b;
        }
        return (sum == 0);
    }

    private boolean byteArrayCheck3(final byte[] array) {
        for (byte b : array) {
            if (b != 0) {
                return false;
            }
        }
        return true;
    }

    private boolean byteArrayCheck4(final byte[] array) {
        // true only if OR-ing every element yields zero (same meaning as the other checks)
        return (IntStream.range(0, array.length).map(i -> array[i]).reduce(0, (a, b) -> a | b) == 0);
    }

    private boolean byteArrayCheck5(final byte[] array) {
        // noneMatch short-circuits on the first non-zero element
        return IntStream.range(0, array.length).map(i -> array[i]).noneMatch(i -> i != 0);
    }

    public static void main(String[] args) {
        new Benchmark().start();
    }
}

The surprising results:

Benchmark: byteArrayCheck12 / iterations: 700000 / time per iteration: 50.18817142857143ns
Benchmark: byteArrayCheck3 / iterations: 700000 / time per iteration: 767.7371985714286ns
Benchmark: byteArrayCheck4 / iterations: 700000 / time per iteration: 21145.03219857143ns
Benchmark: byteArrayCheck5 / iterations: 700000 / time per iteration: 10376.119144285714ns

This suggests that ORing is a whole lot faster than relying on the branch predictor, which is rather surprising, so I assume some low-level optimizations are being done.

As an extra, I've included the stream variants, which I did not expect to be fast anyhow.

Ran on a stock-clocked Intel i7-3770, 16GB 1600MHz RAM.

So I think the final answer is: it depends. It depends on how many times you are going to check the array consecutively. The "byteArrayCheck3" solution is consistently at 700~800ns.

Follow up update

Things actually take another interesting turn: it turns out the JIT was optimizing almost all of the calculations away, because the resulting values were not being used at all.

Thus I have the following new benchmark method:

private void benchmark(final List<byte[]> arrays, final Predicate<byte[]> method, final String name) {
    long start = System.nanoTime();
    boolean someUnrelatedResult = false;
    for (byte[] array : arrays) {
        someUnrelatedResult |= method.test(array);
    }
    long end = System.nanoTime();
    double nanosecondsPerIteration = (end - start) * 1d / arrays.size();
    System.out.println("Result: " + someUnrelatedResult);
    System.out.println("Benchmark: " + name + " / iterations: " + arrays.size() + " / time per iteration: " + nanosecondsPerIteration + "ns");
}

This ensures that the results of the benchmarked methods cannot be optimized away. The major issue was that the return value of byteArrayCheck12 was discarded: the JIT noticed that (sum == 0) was never used, and hence optimized away the entire method.

Thus we have the following new result (omitted the result prints for clarity):

Benchmark: byteArrayCheck12 / iterations: 700000 / time per iteration: 1370.6987942857143ns
Benchmark: byteArrayCheck3 / iterations: 700000 / time per iteration: 736.1096242857143ns
Benchmark: byteArrayCheck4 / iterations: 700000 / time per iteration: 20671.230327142857ns
Benchmark: byteArrayCheck5 / iterations: 700000 / time per iteration: 9845.388841428572ns

Hence we might think that we can finally conclude that branch prediction wins. It could, however, also be due to the early returns, as on average the offending byte will be in the middle of the byte array. Hence it is time for another method that does not return early:

private boolean byteArrayCheck3b(final byte[] array) {
    int hits = 0;
    for (byte b : array) {
        if (b != 0) {
            hits++;
        }
    }
    return (hits == 0);
}

In this way we still benefit from the branch prediction, however we make sure that we cannot return early.

Which in turn gives us more interesting results again!

Benchmark: byteArrayCheck12 / iterations: 700000 / time per iteration: 1327.2817714285713ns
Benchmark: byteArrayCheck3 / iterations: 700000 / time per iteration: 753.31376ns
Benchmark: byteArrayCheck3b / iterations: 700000 / time per iteration: 1506.6772842857142ns
Benchmark: byteArrayCheck4 / iterations: 700000 / time per iteration: 21655.950115714284ns
Benchmark: byteArrayCheck5 / iterations: 700000 / time per iteration: 10608.70917857143ns

I think we can finally conclude that the fastest way is to use both early return and branch prediction, followed by ORing, followed by purely branch prediction. I suspect that all of those operations are highly optimized in native code.

Update, some additional benchmarking using long and int arrays.

After seeing suggestions to use long[] and int[], I decided it was worth investigating. These attempts may no longer be fully in line with the original answer, but they may still be interesting.

Firstly, I changed the benchmark method to use generics:

private <T> void benchmark(final List<T> arrays, final Predicate<T> method, final String name) {
    long start = System.nanoTime();
    boolean someUnrelatedResult = false;
    for (T array : arrays) {
        someUnrelatedResult |= method.test(array);
    }
    long end = System.nanoTime();
    double nanosecondsPerIteration = (end - start) * 1d / arrays.size();
    System.out.println("Result: " + someUnrelatedResult);
    System.out.println("Benchmark: " + name + " / iterations: " + arrays.size() + " / time per iteration: " + nanosecondsPerIteration + "ns");
}

Then I performed conversions from byte[] to long[] and int[] respectively before the benchmarks. It was also necessary to set the maximum heap size to 10 GB.

List<long[]> longArrays = arrays.stream().map(byteArray -> {
    long[] longArray = new long[4096 / 8];
    ByteBuffer.wrap(byteArray).asLongBuffer().get(longArray);
    return longArray;
}).collect(Collectors.toList());
longArrays.forEach(this::byteArrayCheck8);
benchmark(longArrays, this::byteArrayCheck8, "byteArrayCheck8");

List<int[]> intArrays = arrays.stream().map(byteArray -> {
    int[] intArray = new int[4096 / 4];
    ByteBuffer.wrap(byteArray).asIntBuffer().get(intArray);
    return intArray;
}).collect(Collectors.toList());
intArrays.forEach(this::byteArrayCheck9);
benchmark(intArrays, this::byteArrayCheck9, "byteArrayCheck9");

private boolean byteArrayCheck8(final long[] array) {
    for (long l : array) {
        if (l != 0) {
            return false;
        }
    }
    return true;
}

private boolean byteArrayCheck9(final int[] array) {
    for (int i : array) {
        if (i != 0) {
            return false;
        }
    }
    return true;
}

Which gave the following results:

Benchmark: byteArrayCheck8 / iterations: 700000 / time per iteration: 259.8157614285714ns
Benchmark: byteArrayCheck9 / iterations: 700000 / time per iteration: 266.38013714285717ns

This path may be worth exploring if it is possible to get the bytes in such a format. However, when doing the conversions inside the benchmarked method, the times were around 2000 nanoseconds per iteration, so it is not worth it when you need to do the conversions yourself.

Answer 2

Answer by Christophe

I think that theoretically your way is the fastest way. In practice you might be able to make use of larger comparisons, as suggested by one of the commenters (a 1-byte comparison takes 1 instruction, but so does an 8-byte comparison on a 64-bit system).

Also, in languages closer to the hardware (C and variants) you can make use of something called vectorization, where you perform a number of the comparisons/additions simultaneously. It looks like Java still doesn't have native support for it, but based on this answer you might be able to get some use out of it.

Also, in line with the other comments, I would say that with a 4k buffer it's probably not worth the time to try to optimize it (unless it is being called very often).
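
One possible way to try the "larger comparisons" idea directly on a byte[] is sketched below. It assumes Java 9 or later, where MethodHandles.byteArrayViewVarHandle is available, and the class name WideZeroCheck is made up for this illustration. It has not been benchmarked here, so whether it beats the plain byte loop is an open question.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Sketch: read 8 bytes at a time from a byte[] via a VarHandle view (Java 9+ assumed).
final class WideZeroCheck {
    // Views the byte[] as longs; byte order is irrelevant when only comparing against zero.
    private static final VarHandle LONG_VIEW =
            MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.nativeOrder());

    static boolean isAllZero(byte[] data) {
        int i = 0;
        // Last index at which a full 8-byte read still fits, plus one.
        int limit = data.length - (Long.BYTES - 1);
        for (; i < limit; i += Long.BYTES) {
            if ((long) LONG_VIEW.get(data, i) != 0L) {
                return false;
            }
        }
        // Check any trailing bytes that do not fill a whole long.
        for (; i < data.length; i++) {
            if (data[i] != 0) {
                return false;
            }
        }
        return true;
    }
}
```

For the 4096-byte array from the question the tail loop never runs, since the length is a multiple of 8.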

Answer 3

Answer by VGR

Someone suggested checking 4 or 8 bytes at a time. You actually can do this in Java:

LongBuffer longBuffer = ByteBuffer.wrap(b).asLongBuffer();
while (longBuffer.hasRemaining()) {
    if (longBuffer.get() != 0) {
        return false;
    }
}
return true;

Whether this is faster than checking individual byte values is uncertain, since there is so much potential for optimization.
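
One caveat with the snippet above: asLongBuffer() only covers whole 8-byte groups, so trailing bytes are skipped when the array length is not a multiple of 8. That is not an issue for the 4096-byte array in the question. A minimal sketch that also checks the remainder might look like this (the class name BufferZeroCheck is hypothetical):

```java
import java.nio.ByteBuffer;

// Sketch: check 8 bytes at a time with ByteBuffer, then the leftover bytes one by one.
final class BufferZeroCheck {
    static boolean isAllZero(byte[] b) {
        ByteBuffer buffer = ByteBuffer.wrap(b);
        // Consume full 8-byte groups while at least one remains.
        while (buffer.remaining() >= Long.BYTES) {
            if (buffer.getLong() != 0L) {
                return false;
            }
        }
        // Consume the 0 to 7 bytes left over at the end.
        while (buffer.hasRemaining()) {
            if (buffer.get() != 0) {
                return false;
            }
        }
        return true;
    }
}
```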

Answer 4

Answer by Mallox

This may not be the fastest or most memory-efficient solution, but it's a one-liner:

byte[] arr = randomByteArray();
assert Arrays.equals(arr, new byte[arr.length]);
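
If the allocation in the one-liner is a concern, one option is to compare against a shared all-zero array instead. The sketch below assumes the fixed 4096-byte size from the question. The names ZeroCompare, BLOCK_SIZE and ZEROS are made up for this illustration, and nothing must ever write to the shared array.

```java
import java.util.Arrays;

// Sketch: reuse one all-zero reference array instead of allocating on every call.
final class ZeroCompare {
    private static final int BLOCK_SIZE = 4096;               // assumed size, taken from the question
    private static final byte[] ZEROS = new byte[BLOCK_SIZE]; // must never be modified

    static boolean isAllZero(byte[] arr) {
        if (arr.length == BLOCK_SIZE) {
            return Arrays.equals(arr, ZEROS); // no allocation on the common path
        }
        // Fallback for other sizes: same as the one-liner, with an allocation.
        return Arrays.equals(arr, new byte[arr.length]);
    }
}
```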

Answer 5

Answer by Chalk

For Java 8, you can simply use this:

public static boolean isEmpty(final byte[] data){
    return IntStream.range(0, data.length).parallel().allMatch(i -> data[i] == 0);
}
