Java: removing continuous segments of zeros from a byte array

Question

Asked by Mike

For example, let's say I want to delete from the array all continuous segments of 0's longer than 3 bytes

byte a[] = {1,2,3,0,1,2,3,0,0,0,0,4};
byte r[] = magic(a);
System.out.println(r);

result

{1,2,3,0,1,2,3,4}

I want to do something like a regular expression in Java, but on a byte array instead of a String.

Is there something built in that can help me (or a good third-party tool), or do I need to write this from scratch?

Strings are UTF-16, so converting back and forth isn't a good idea? At least it's a lot of wasted overhead ... right?

Answer 1

Accepted answer by objects

Regex is not the right tool for the job; you will need to implement this from scratch.

Answer 2

Answered by Alan Moore

byte[] a = {1,2,3,0,1,2,3,0,0,0,0,4};
String s0 = new String(a, "ISO-8859-1");
String s1 = s0.replaceAll("\\x00{4,}", "");
byte[] r = s1.getBytes("ISO-8859-1");

System.out.println(Arrays.toString(r)); // [1, 2, 3, 0, 1, 2, 3, 4]

I used ISO-8859-1 (latin1) because, unlike any other encoding,

every byte in the range 0x00..0xFF maps to a valid character, and
each of those characters has the same numeric value as its latin1 encoding.

That means the string is the same length as the original byte array, you can match any byte by its numeric value with the \xFF construct, and you can convert the resulting string back to a byte array without losing information.
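
To make the round-trip claim concrete, here is a small sketch that checks it for all 256 byte values and then applies the same regex to an array containing a high byte. The class and method names (Latin1RoundTrip, stripZeroRuns, roundTripsAllBytes) are made up for illustration, and StandardCharsets is used instead of the charset name string so no checked exception is involved.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1RoundTrip {

    // Hypothetical helper: removes runs of four or more zero bytes via a Latin-1 string.
    static byte[] stripZeroRuns(byte[] in) {
        String s = new String(in, StandardCharsets.ISO_8859_1);
        return s.replaceAll("\\x00{4,}", "").getBytes(StandardCharsets.ISO_8859_1);
    }

    // Checks that every byte value 0x00..0xFF survives bytes -> String -> bytes.
    static boolean roundTripsAllBytes() {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        String s = new String(all, StandardCharsets.ISO_8859_1);
        return s.length() == 256
                && Arrays.equals(all, s.getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        System.out.println(roundTripsAllBytes()); // true
        byte[] a = {1, 2, 3, 0, 1, 2, 3, 0, 0, 0, 0, 4, (byte) 0xFF};
        System.out.println(Arrays.toString(stripZeroRuns(a))); // [1, 2, 3, 0, 1, 2, 3, 4, -1]
    }
}
```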

I wouldn't try to display the data while it's in string form--although all the characters are valid, many of them are not printable. Also, avoid manipulating the data while it's in string form; you might accidentally do some escape-sequence substitutions or another encoding conversion without realizing it. In fact, I wouldn't recommend doing this kind of thing at all, but that isn't what you asked. :)

Also, be aware that this technique won't necessarily work in other programming languages or regex flavors. You would have to test each one individually.

Answer 3

Answered by Lawrence Dol

Though I question whether reg-ex is the right tool for the job, if you do want to use one I'd suggest you just implement a CharSequence wrapper on a byte array. Something like this (I just wrote this directly in, not compiled... but you get the idea).

import java.nio.charset.StandardCharsets;

public class ByteChars implements CharSequence {

    private final byte[] bytes;
    private final int strOfs;
    private final int endOfs;

    ByteChars(byte[] arr) {
        this(arr, 0, arr.length);
    }

    ByteChars(byte[] arr, int str, int end) {
        // check str and end are within range here
        bytes = arr;
        strOfs = str;
        endOfs = end;
    }

    @Override
    public char charAt(int idx) {
        // check idx is within range here
        // mask so bytes 0x80..0xFF become chars 0x80..0xFF rather than sign-extended values
        return (char) (bytes[strOfs + idx] & 0xFF);
    }

    @Override
    public int length() {
        return endOfs - strOfs;
    }

    @Override
    public CharSequence subSequence(int str, int end) {
        // check str and end are within range here
        return new ByteChars(bytes, strOfs + str, strOfs + end);
    }

    @Override
    public String toString() {
        return new String(bytes, strOfs, endOfs - strOfs, StandardCharsets.ISO_8859_1);
    }
}
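
For what it's worth, here is one way such a wrapper might be used, sketched under a few assumptions: the nested Bytes class is a compact stand-in for the wrapper above so the example compiles on its own, and ByteRegexDemo and stripZeroRuns are hypothetical names. The matcher only locates the zero runs; the bytes outside them are copied straight from the original array, so no intermediate String is built.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ByteRegexDemo {

    // Compact stand-in for a CharSequence view over a byte array (illustrative only).
    static final class Bytes implements CharSequence {
        private final byte[] b;
        private final int from;
        private final int to;

        Bytes(byte[] b, int from, int to) {
            this.b = b;
            this.from = from;
            this.to = to;
        }

        @Override public char charAt(int i) { return (char) (b[from + i] & 0xFF); }
        @Override public int length() { return to - from; }
        @Override public CharSequence subSequence(int s, int e) { return new Bytes(b, from + s, from + e); }
        @Override public String toString() { return new String(b, from, to - from, StandardCharsets.ISO_8859_1); }
    }

    // Hypothetical helper: drops every run of four or more zero bytes.
    static byte[] stripZeroRuns(byte[] in) {
        Matcher m = Pattern.compile("\\x00{4,}").matcher(new Bytes(in, 0, in.length));
        ByteArrayOutputStream out = new ByteArrayOutputStream(in.length);
        int last = 0;
        while (m.find()) {
            out.write(in, last, m.start() - last); // bytes before this zero run
            last = m.end();                        // skip the run itself
        }
        out.write(in, last, in.length - last);     // bytes after the final run
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3, 0, 1, 2, 3, 0, 0, 0, 0, 4};
        System.out.println(Arrays.toString(stripZeroRuns(a))); // [1, 2, 3, 0, 1, 2, 3, 4]
    }
}
```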

Answer 4

Answered by João Silva

I don't see how regex would be useful to do what you want. One thing you can do is use Run Length Encoding to encode that byte array, replace every occurrence of "30" (read three 0's) with the empty string, and decode the final string. Wikipedia has a simple Java implementation of it.
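
A rough sketch of the run-length idea, with two deliberate departures from the description above that should be read as assumptions: runs are kept as (value, count) pairs in a list rather than as a text string such as "30", and the threshold is passed in as a parameter (4 here, to match the question). The names RleZeroFilter, encode and decodeWithoutLongZeroRuns are invented for the example.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RleZeroFilter {

    // Run-length encode the input as {value, runLength} pairs.
    static List<int[]> encode(byte[] in) {
        List<int[]> runs = new ArrayList<>();
        int i = 0;
        while (i < in.length) {
            int j = i;
            while (j < in.length && in[j] == in[i]) {
                j++;
            }
            runs.add(new int[] {in[i], j - i});
            i = j;
        }
        return runs;
    }

    // Decode the runs again, skipping zero runs of at least minRun bytes.
    static byte[] decodeWithoutLongZeroRuns(List<int[]> runs, int minRun) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int[] run : runs) {
            if (run[0] == 0 && run[1] >= minRun) {
                continue; // drop the whole run
            }
            for (int k = 0; k < run[1]; k++) {
                out.write(run[0]); // write(int) keeps only the low 8 bits
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3, 0, 1, 2, 3, 0, 0, 0, 0, 4};
        System.out.println(Arrays.toString(decodeWithoutLongZeroRuns(encode(a), 4)));
        // [1, 2, 3, 0, 1, 2, 3, 4]
    }
}
```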

Answer 5

Answered by Jonathan Graehl

Although there's a reasonable ByteString library floating around, nobody that I've seen has implemented a general regexp library on them.

I recommend solving your problem directly rather than implementing a regexp library :)

If you do convert to string and back, you probably won't find any existing encoding that gives you a round trip for your 0 bytes. If that's the case, you'd have to write your own byte array <-> string converters; not worth the trouble.

Answer 6

Answered by try-catch-finally

The implementation utilizing a Regular Expression, proposed by other answers, is up to 8 times slower than a naive implementation using a loop that copies bytes from the input array to an output array.

The implementation copies an input array byte by byte. If a zero sequence was detected, the output array index is reduced (rewound). After processing the input array, the output array is even copied once more to trim its length to the actual number of bytes since the intermediate output array is initialized with the length of the input array.

/**
 * Remove runs of four or more consecutive zero bytes from the input array.
 *
 * @param inBytes the input array
 * @return a new array with runs of four or more zero bytes removed from the input array
 */
private static byte[] removeDuplicates(byte[] inBytes) {
    int size = inBytes.length;
    // Use an array with the same size in the first place
    byte[] newBytes = new byte[size];
    byte value;
    int newIdx = 0;
    int zeroCounter = 0;

    for (int i = 0; i < size; i++) {
        value = inBytes[i];

        if (value == 0) {
            zeroCounter++;
        } else {
            if (zeroCounter >= 4) {
                // Rewind output buffer index
                newIdx -= zeroCounter;
            }

            zeroCounter = 0;
        }

        newBytes[newIdx] = value;
        newIdx++;
    }

    if (zeroCounter >= 4) {
        // Rewind output buffer index for four zero bytes at the end too
        newIdx -= zeroCounter;
    }

    // Copy data into an array that has the correct length
    byte[] finalOut = new byte[newIdx];
    System.arraycopy(newBytes, 0, finalOut, 0, newIdx);

    return finalOut;
}

Interestingly, a second approach, which avoids unnecessary copies by rewinding to the first zero byte (of three or fewer) and copying those elements, was slightly slower than the first approach.
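
The second approach is only described in prose here, so the following is one possible reading of it and not the code from the test rig: scan for zero runs and copy the spans between long runs in bulk with System.arraycopy, so that discarded zeros are never written to the output. The class name ZeroRunBlockCopy and the minRun parameter are assumptions made for this sketch.

```java
import java.util.Arrays;

public class ZeroRunBlockCopy {

    // Removes zero runs of at least minRun bytes, copying kept spans in blocks.
    static byte[] stripZeroRuns(byte[] in, int minRun) {
        byte[] out = new byte[in.length];
        int outLen = 0;
        int keepFrom = 0; // start of the span that still has to be copied
        int i = 0;
        while (i < in.length) {
            if (in[i] != 0) {
                i++;
                continue;
            }
            int runStart = i;
            while (i < in.length && in[i] == 0) {
                i++;
            }
            if (i - runStart >= minRun) {
                // Copy everything kept so far, then resume after the long run.
                int len = runStart - keepFrom;
                System.arraycopy(in, keepFrom, out, outLen, len);
                outLen += len;
                keepFrom = i;
            }
            // Short runs stay inside the pending span and are copied with it.
        }
        int tail = in.length - keepFrom;
        System.arraycopy(in, keepFrom, out, outLen, tail);
        outLen += tail;
        return Arrays.copyOf(out, outLen);
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3, 0, 1, 2, 3, 0, 0, 0, 0, 4};
        System.out.println(Arrays.toString(stripZeroRuns(a, 4))); // [1, 2, 3, 0, 1, 2, 3, 4]
    }
}
```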

All three implementations were tested on a Pentium N3700 processor with 1,000 iterations over an 8 x 32KB input array containing zero sequences of varying number and length. Even in the worst case, the improvement over the Regular Expression approach was a 1.5x speedup.

The full test rig can be found here: https://pastebin.com/83q9EzDc

Answer 7

Answered by brianegge

I'd suggest converting the byte array into a String, performing the regex, and then converting it back. Here's a working example:

public void testRegex() throws Exception {
    byte a[] = { 1, 2, 3, 0, 1, 2, 3, 0, 0, 0, 0, 4 };
    String s = btoa(a);
    String t = s.replaceAll("\u0000{4,}", "");
    byte b[] = atob(t);
    System.out.println(Arrays.toString(b));
}

private byte[] atob(String t) {
    char[] array = t.toCharArray();
    byte[] b = new byte[array.length];
    for (int i = 0; i < array.length; i++) {
        // each char holds a value in 0..255; the cast keeps the low 8 bits
        b[i] = (byte) array[i];
    }
    return b;
}

private String btoa(byte[] a) {
    StringBuilder sb = new StringBuilder(a.length);
    for (byte b : a) {
        // mask to 0..255 so negative bytes do not become invalid code points
        sb.append((char) (b & 0xFF));
    }
    return sb.toString();
}

For more complicated transformations, I'd suggest using a Lexer. Both JavaCC and ANTLR have support for parsing/transforming binary files.

Answer 8

Answered by Amber

Java Regex operates on CharSequences - you could use CharBuffer to wrap your existing byte array (you might need to cast it to char[]?) and interpret it as such, and then perform regex on that?
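
A byte[] cannot be cast to char[] in Java, and CharBuffer.wrap only accepts char[] or a CharSequence, so in practice the bytes have to be decoded first. A minimal sketch of that, assuming ISO-8859-1 as the charset so each byte maps to one char (the class and method names are hypothetical):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.regex.Pattern;

public class CharBufferRegex {

    // Hypothetical helper: decode to a CharBuffer, run the regex, encode back.
    static byte[] stripZeroRuns(byte[] in) {
        // decode() copies the bytes into a new CharBuffer; it is not a view of the array.
        CharBuffer chars = StandardCharsets.ISO_8859_1.decode(ByteBuffer.wrap(in));
        String cleaned = Pattern.compile("\\x00{4,}").matcher(chars).replaceAll("");
        return cleaned.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3, 0, 1, 2, 3, 0, 0, 0, 0, 4};
        System.out.println(Arrays.toString(stripZeroRuns(a))); // [1, 2, 3, 0, 1, 2, 3, 4]
    }
}
```

Since the decode step already copies the data, this saves little over going through a String directly.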
