How do I truncate a Java String to fit in a given number of bytes once it is UTF-8 encoded?

Question

Asked by Johan Lübcke

How do I truncate a Java String so that I know it will fit in a given number of bytes of storage once it is UTF-8 encoded?

Answer 1

Accepted answer by Matt Quail

Here is a simple loop that counts how big the UTF-8 representation is going to be, and truncates when it is exceeded:

public static String truncateWhenUTF8(String s, int maxBytes) {
    int b = 0;
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);

        // ranges from http://en.wikipedia.org/wiki/UTF-8
        int skip = 0;
        int more;
        if (c <= 0x007f) {
            more = 1;
        }
        else if (c <= 0x07FF) {
            more = 2;
        } else if (c <= 0xd7ff) {
            more = 3;
        } else if (c <= 0xDFFF) {
            // surrogate area, consume next char as well
            more = 4;
            skip = 1;
        } else {
            more = 3;
        }

        if (b + more > maxBytes) {
            return s.substring(0, i);
        }
        b += more;
        i += skip;
    }
    return s;
}

This does handle surrogate pairs that appear in the input string. Java's UTF-8 encoder (correctly) outputs a surrogate pair as a single 4-byte sequence instead of two 3-byte sequences, so truncateWhenUTF8() will return the longest truncated string it can. If you ignore surrogate pairs in the implementation, the truncated strings may be shorter than they need to be.

I haven't done a lot of testing on that code, but here are some preliminary tests:

import java.nio.charset.Charset;

private static void test(String s, int maxBytes, int expectedBytes) {
    String result = truncateWhenUTF8(s, maxBytes);
    byte[] utf8 = result.getBytes(Charset.forName("UTF-8"));
    if (utf8.length > maxBytes) {
        System.out.println("BAD: our truncation of " + s + " was too big");
    }
    if (utf8.length != expectedBytes) {
        System.out.println("BAD: expected " + expectedBytes + " got " + utf8.length);
    }
    System.out.println(s + " truncated to " + result);
}

public static void main(String[] args) {
    test("abcd", 0, 0);
    test("abcd", 1, 1);
    test("abcd", 2, 2);
    test("abcd", 3, 3);
    test("abcd", 4, 4);
    test("abcd", 5, 4);

    test("a\u0080b", 0, 0);
    test("a\u0080b", 1, 1);
    test("a\u0080b", 2, 1);
    test("a\u0080b", 3, 3);
    test("a\u0080b", 4, 4);
    test("a\u0080b", 5, 4);

    test("a\u0800b", 0, 0);
    test("a\u0800b", 1, 1);
    test("a\u0800b", 2, 1);
    test("a\u0800b", 3, 1);
    test("a\u0800b", 4, 4);
    test("a\u0800b", 5, 5);
    test("a\u0800b", 6, 5);

    // surrogate pairs
    test("\uD834\uDD1E", 0, 0);
    test("\uD834\uDD1E", 1, 0);
    test("\uD834\uDD1E", 2, 0);
    test("\uD834\uDD1E", 3, 0);
    test("\uD834\uDD1E", 4, 4);
    test("\uD834\uDD1E", 5, 4);

}

Updated: Modified the code example; it now handles surrogate pairs.

Answer 2

Answer by billjamesdev

UTF-8 encoding has a neat trait that allows you to see where in a byte-set you are.

Check the byte in the stream at the length limit you want.

If its high bit is 0, it's a single-byte char, just replace it with 0 and you're fine.
If its high bit is 1 and so is the next bit, then you're at the start of a multi-byte char, so just set that byte to 0 and you're good.
If the high bit is 1 but the next bit is 0, then you're in the middle of a character, travel back along the buffer until you hit a byte that has 2 or more 1s in the high bits, and replace that byte with 0.

Example: If your stream is 31 33 31 C2 A3 32 33 00, you can make your string 1, 2, 3, 5, 6, or 7 bytes long, but not 4, as that would put the 0 after C2, which is the start of a multi-byte char.
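
The backtracking rule above can be sketched in Java roughly as follows. This is only an illustration: the class and method names (Utf8ByteTruncation, safeCut, truncate) are made up here, it assumes Java 7+ for StandardCharsets, and it assumes the bytes are well-formed UTF-8 as produced by String.getBytes. Instead of writing a 0 terminator into the buffer, it returns the cut position, which is the more natural shape in Java.

```java
import java.nio.charset.StandardCharsets;

final class Utf8ByteTruncation {

    /**
     * Returns the largest cut position that is at most maxBytes and does not
     * split a multi-byte sequence. Assumes utf8 holds well-formed UTF-8.
     */
    static int safeCut(byte[] utf8, int maxBytes) {
        if (maxBytes >= utf8.length) {
            return utf8.length;          // everything fits
        }
        if (maxBytes <= 0) {
            return 0;
        }
        int cut = maxBytes;
        // utf8[cut] is the first byte that would be dropped. If it is a
        // continuation byte (bit pattern 10xxxxxx) we are in the middle of a
        // character, so walk back until we reach its lead byte.
        while (cut > 0 && (utf8[cut] & 0xC0) == 0x80) {
            cut--;
        }
        return cut;
    }

    /** Hypothetical convenience wrapper: encode, cut, decode the kept bytes. */
    static String truncate(String s, int maxBytes) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        int cut = safeCut(utf8, maxBytes);
        return new String(utf8, 0, cut, StandardCharsets.UTF_8);
    }
}
```

For the string "131\u00a323" (bytes 31 33 31 C2 A3 32 33), a limit of 4 lands on the continuation byte A3, so the cut moves back to 3 and the result is "131"; a limit of 5 keeps the whole two-byte character.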

Answer 3

Answer by mitchnull

You should use CharsetEncoder; the simple approach of getBytes() plus copying as many bytes as fit can cut UTF-8 characters in half.

Something like this:

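import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

// Encodes as many whole characters of input as fit into output and
// returns the number of bytes written.
public static int truncateUtf8(String input, byte[] output) {

    ByteBuffer outBuf = ByteBuffer.wrap(output);
    CharBuffer inBuf = CharBuffer.wrap(input.toCharArray());

    Charset utf8 = Charset.forName("UTF-8");
    // The encoder stops before any character that does not fit entirely.
    utf8.newEncoder().encode(inBuf, outBuf, true);
    System.out.println("encoded " + inBuf.position() + " chars of " + input.length() + ", result: " + outBuf.position() + " bytes");
    return outBuf.position();
}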
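
If you want the truncated String back rather than the byte count, one possible variation on the same idea is to read how far the encoder got in the input buffer. This is a sketch, not part of the original answer: the class name EncoderTruncation is made up, it assumes Java 7+ for StandardCharsets, and it sets the malformed-input action to REPLACE (my choice) so that an unpaired surrogate does not stop the encoder early.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class EncoderTruncation {

    /** Returns the longest prefix of input whose UTF-8 form fits in maxBytes. */
    static String truncate(String input, int maxBytes) {
        // Scratch output buffer; its capacity is what enforces the byte limit.
        ByteBuffer outBuf = ByteBuffer.allocate(Math.max(maxBytes, 0));
        CharBuffer inBuf = CharBuffer.wrap(input);

        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                // Assumption: replace unpaired surrogates instead of
                // stopping at them with a malformed-input result.
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);

        // Encoding stops with an overflow result as soon as the next
        // character (or surrogate pair) no longer fits in outBuf.
        encoder.encode(inBuf, outBuf, true);

        // inBuf.position() is the number of chars that were fully encoded.
        return input.substring(0, inBuf.position());
    }
}
```

The scratch buffer costs maxBytes of temporary memory, but the character-boundary logic stays entirely inside the JDK's encoder.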

Answer 4

Answer by user19050

You can calculate the number of bytes without doing any conversion.

foreach character in the Java string
  if 0 <= character <= 0x7f
     count += 1
  else if 0x80 <= character <= 0x7ff
     count += 2
  else if 0x800 <= character <= 0xd7ff // excluding the surrogate area
     count += 3
  else if 0xdc00 <= character <= 0xffff
     count += 3
  else { // surrogate, a bit more complicated
     count += 4
     skip one extra character in the input stream
  }

You would have to detect surrogate pairs (U+D800–U+DBFF followed by U+DC00–U+DFFF) and count 4 bytes for each valid surrogate pair. If you get the first value in the first range and the second in the second range, it's all OK: skip them and add 4. But if not, then it is an invalid surrogate pair. I am not sure how Java deals with that, but your algorithm will have to count correctly in that (unlikely) case.
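
A Java rendering of the counting loop, including the invalid-pair case, might look like the sketch below. The class and method names are hypothetical, and Character.isSurrogate needs Java 7+. For unpaired surrogates it counts 1 byte, on the assumption that the string will later be encoded with String.getBytes, which is documented to replace malformed input with the charset's replacement bytes (a single '?' for the JDK's UTF-8 encoder). A different encoder configuration would need a different count there.

```java
final class Utf8Length {

    /** Counts the UTF-8 bytes of s without encoding it. */
    static int utf8Length(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c <= 0x7F) {
                count += 1;
            } else if (c <= 0x7FF) {
                count += 2;
            } else if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                count += 4;  // valid surrogate pair: one 4-byte sequence
                i++;         // skip the low surrogate
            } else if (Character.isSurrogate(c)) {
                // Unpaired surrogate. Assumption: String.getBytes will
                // replace it with a single '?' byte.
                count += 1;
            } else {
                count += 3;  // rest of the Basic Multilingual Plane
            }
        }
        return count;
    }
}
```

A quick way to gain confidence in a counter like this is to compare it against s.getBytes(StandardCharsets.UTF_8).length for a handful of strings, including ones with a lone high or low surrogate.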

Answer 5

Answer by sigget

Here's what I came up with. It uses standard Java APIs, so it should be safe and compatible with all the Unicode weirdness, surrogate pairs, etc. The solution is taken from http://www.jroller.com/holy/entry/truncating_utf_string_to_the with checks added for null and for avoiding decoding when the string has fewer bytes than maxBytes.

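import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

/**
 * Truncates a string to the number of characters that fit in maxBytes bytes, without cutting a multi-byte
 * character in half at the cut-off point. Also handles surrogate pairs, where two chars in the string
 * represent a single character.
 *
 * Based on: http://www.jroller.com/holy/entry/truncating_utf_string_to_the
 */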
public static String truncateToFitUtf8ByteLength(String s, int maxBytes) {
    if (s == null) {
        return null;
    }
    Charset charset = Charset.forName("UTF-8");
    CharsetDecoder decoder = charset.newDecoder();
    byte[] sba = s.getBytes(charset);
    if (sba.length <= maxBytes) {
        return s;
    }
    // Ensure truncation by having byte buffer = maxBytes
    ByteBuffer bb = ByteBuffer.wrap(sba, 0, maxBytes);
    CharBuffer cb = CharBuffer.allocate(maxBytes);
    // Ignore an incomplete character
    decoder.onMalformedInput(CodingErrorAction.IGNORE);
    decoder.decode(bb, cb, true);
    decoder.flush(cb);
    return new String(cb.array(), 0, cb.position());
}

Answer 6

Answer by Suresh Gupta

You can use: new String(data.getBytes("UTF-8"), 0, maxLen, "UTF-8");
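
One caveat worth checking before relying on this one-liner: it cuts at a raw byte offset, so it can land in the middle of a multi-byte character. The sketch below wraps the same call in a hypothetical helper (NaiveTruncationDemo.naiveTruncate, with the length clamped so it cannot run past the array, and StandardCharsets from Java 7+ in place of the charset name) to make the behaviour easy to try.

```java
import java.nio.charset.StandardCharsets;

final class NaiveTruncationDemo {

    /** The one-liner above, with the length clamped to the array bounds. */
    static String naiveTruncate(String data, int maxLen) {
        byte[] utf8 = data.getBytes(StandardCharsets.UTF_8);
        int len = Math.min(Math.max(maxLen, 0), utf8.length);
        // A sequence cut off at len is malformed input, which this
        // constructor replaces with U+FFFD instead of dropping it.
        return new String(utf8, 0, len, StandardCharsets.UTF_8);
    }

    /** Helper for checking the result against the byte limit. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

For example, naiveTruncate("a\u00e9", 2) cuts the two-byte é in half. The decoder substitutes U+FFFD for the dangling byte, and the returned string re-encodes to 4 bytes, which is over the 2-byte limit. For pure ASCII input the call behaves as expected.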
