How do I truncate a Java String so that it fits in a given number of bytes once UTF-8 encoded?

Date: 2020-03-06 14:34:55  Source: igfitidea

How can I truncate a Java String so that I know it will fit in a given number of bytes of storage once it is UTF-8 encoded?

Solutions

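UTF-8 has a neat property: a single byte is enough to tell where we are within a character.

Look at the byte at the byte limit we want, that is, the byte that would come right after the last byte we keep.

If its high bit is 0, it is a single-byte character; replace it with 0 and we are done.
If its high bit is 1 and so is the next bit, we are at the start of a multi-byte character, so set that byte to 0 and we are done.
If its high bit is 1 but the next bit is 0, we are in the middle of a character; walk back through the buffer until we reach a byte whose top two or more bits are 1, and replace that byte with 0.

Example: if the stream is 31 33 31 C2 A3 32 33 00, the string can be made 1, 2, 3, 5, 6, or 7 bytes long, but not 4, because that would put the 0 right after C2, which is the start of a multi-byte character.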
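The back-off rule above can be sketched in Java as follows. This is a minimal sketch, not a definitive implementation: the class name `Utf8ByteCut` and the method `truncateByBytes` are made up for illustration, and instead of writing a 0 terminator it simply decodes the bytes that are kept. It assumes the input contains no unpaired surrogates, since `getBytes()` would replace those with `?`.

```java
import java.nio.charset.StandardCharsets;

class Utf8ByteCut {
    /** Longest prefix of s whose UTF-8 encoding fits in maxBytes (hypothetical helper). */
    public static String truncateByBytes(String s, int maxBytes) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= maxBytes) {
            return s; // already fits
        }
        int cut = Math.max(maxBytes, 0);
        // A continuation byte has the form 10xxxxxx. Step back until the byte
        // at the cut position starts a character (0xxxxxxx or 11xxxxxx).
        while (cut > 0 && (bytes[cut] & 0xC0) == 0x80) {
            cut--;
        }
        return new String(bytes, 0, cut, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "131\u00A323"; // UTF-8 bytes: 31 33 31 C2 A3 32 33
        for (int limit = 0; limit <= 8; limit++) {
            System.out.println(limit + " -> " + truncateByBytes(s, limit));
        }
    }
}
```

With a limit of 4 the cut lands on the continuation byte A3, so the loop steps back to 3 and the pound sign is dropped whole.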

We should use CharsetEncoder; the simple approach of calling getBytes() and copying as many bytes as fit can cut a UTF-8 character in half.

Something like this:

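import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Encodes as many whole characters of input as fit into output
// and returns the number of bytes written.
public static int truncateUtf8(String input, byte[] output) {

    ByteBuffer outBuf = ByteBuffer.wrap(output);
    CharBuffer inBuf = CharBuffer.wrap(input);

    CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
    // Stops with an OVERFLOW result (no exception) once the next character does not fit.
    encoder.encode(inBuf, outBuf, true);
    System.out.println("encoded " + inBuf.position() + " chars of " + input.length() + ", result: " + outBuf.position() + " bytes");
    return outBuf.position();
}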
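If what we want back is the truncated String rather than the bytes, the position of the input buffer after encoding tells us how many chars were consumed. The following is a small sketch of that idea; `EncoderTruncate` and `truncate` are hypothetical names, and it assumes well-formed input (an unpaired surrogate makes the encoder stop at that point with a malformed-input result, so everything after it would be dropped).

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

class EncoderTruncate {
    /** Longest prefix of input whose UTF-8 form fits in maxBytes (hypothetical helper). */
    public static String truncate(String input, int maxBytes) {
        ByteBuffer out = ByteBuffer.allocate(Math.max(maxBytes, 0));
        CharBuffer in = CharBuffer.wrap(input);
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        // encode() stops as soon as the next character no longer fits in 'out';
        // in.position() is then the number of chars that were fully encoded.
        encoder.encode(in, out, true);
        return input.substring(0, in.position());
    }

    public static void main(String[] args) {
        String s = "a\u0800b"; // 1 + 3 + 1 bytes in UTF-8
        for (int limit = 0; limit <= 6; limit++) {
            System.out.println(limit + " -> " + truncate(s, limit).length() + " chars kept");
        }
    }
}
```

The scratch buffer is only used to let the encoder decide where to stop; its contents are discarded.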

Here is a simple loop that counts the size of the UTF-8 representation and truncates as soon as the limit would be exceeded:

public static String truncateWhenUTF8(String s, int maxBytes) {
    int b = 0;
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);

        // ranges from http://en.wikipedia.org/wiki/UTF-8
        int skip = 0;
        int more;
        if (c <= 0x007f) {
            more = 1;
        }
        else if (c <= 0x07FF) {
            more = 2;
        } else if (c <= 0xd7ff) {
            more = 3;
        } else if (c <= 0xDFFF) {
            // surrogate area, consume next char as well
            more = 4;
            skip = 1;
        } else {
            more = 3;
        }

        if (b + more > maxBytes) {
            return s.substring(0, i);
        }
        b += more;
        i += skip;
    }
    return s;
}

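This does handle surrogate pairs that appear in the input string. Java's UTF-8 encoder (correctly) outputs a surrogate pair as a single 4-byte sequence instead of two 3-byte sequences, so truncateWhenUTF8() returns the longest truncated string it can. If the implementation ignored surrogate pairs, the truncated string could be shorter than it needs to be.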
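The same idea can also be written in terms of code points, which lets `String.codePointAt` do the pairing of surrogates. This is only a sketch under that assumption; `CodePointTruncate`, `utf8Length`, and `truncateByCodePoints` are illustrative names, not part of the answer above. An unpaired surrogate is counted as 3 bytes here, which over-estimates what Java's encoder emits for it, so the result never exceeds the limit.

```java
class CodePointTruncate {
    /** Number of UTF-8 bytes needed for one code point. */
    public static int utf8Length(int codePoint) {
        if (codePoint <= 0x7F) return 1;
        if (codePoint <= 0x7FF) return 2;
        if (codePoint <= 0xFFFF) return 3;
        return 4; // supplementary code point, stored as a surrogate pair in a String
    }

    /** Longest prefix of s, cut on a code point boundary, that fits in maxBytes. */
    public static String truncateByCodePoints(String s, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i); // combines a valid surrogate pair into one value
            int more = utf8Length(cp);
            if (bytes + more > maxBytes) {
                break;
            }
            bytes += more;
            i += Character.charCount(cp); // 2 chars for a pair, otherwise 1
        }
        return s.substring(0, i);
    }

    public static void main(String[] args) {
        String clef = "\uD834\uDD1E"; // U+1D11E, 4 bytes in UTF-8
        System.out.println(truncateByCodePoints(clef, 3).length()); // pair does not fit
        System.out.println(truncateByCodePoints(clef, 4).length()); // pair fits
    }
}
```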

I have not tested this code much, but here are some preliminary tests:

private static void test(String s, int maxBytes, int expectedBytes) {
    String result = truncateWhenUTF8(s, maxBytes);
    byte[] utf8 = result.getBytes(Charset.forName("UTF-8"));
    if (utf8.length > maxBytes) {
        System.out.println("BAD: our truncation of " + s + " was too big");
    }
    if (utf8.length != expectedBytes) {
        System.out.println("BAD: expected " + expectedBytes + " got " + utf8.length);
    }
    System.out.println(s + " truncated to " + result);
}

public static void main(String[] args) {
    test("abcd", 0, 0);
    test("abcd", 1, 1);
    test("abcd", 2, 2);
    test("abcd", 3, 3);
    test("abcd", 4, 4);
    test("abcd", 5, 4);

    test("a\u0080b", 0, 0);
    test("a\u0080b", 1, 1);
    test("a\u0080b", 2, 1);
    test("a\u0080b", 3, 3);
    test("a\u0080b", 4, 4);
    test("a\u0080b", 5, 4);

    test("a\u0800b", 0, 0);
    test("a\u0800b", 1, 1);
    test("a\u0800b", 2, 1);
    test("a\u0800b", 3, 1);
    test("a\u0800b", 4, 4);
    test("a\u0800b", 5, 5);
    test("a\u0800b", 6, 5);

    // surrogate pairs
    test("\uD834\uDD1E", 0, 0);
    test("\uD834\uDD1E", 1, 0);
    test("\uD834\uDD1E", 2, 0);
    test("\uD834\uDD1E", 3, 0);
    test("\uD834\uDD1E", 4, 4);
    test("\uD834\uDD1E", 5, 4);

}

Update: the code example has been modified and now handles surrogate pairs.

We can count the bytes without performing any conversion.

count = 0
foreach character in the Java string
  if 0 <= character <= 0x7f
     count += 1
  else if 0x80 <= character <= 0x7ff
     count += 2
  else if 0x800 <= character <= 0xd7ff // below the surrogate area
     count += 3
  else if 0xe000 <= character <= 0xffff // above the surrogate area
     count += 3
  else { // 0xd800..0xdfff: surrogate, a bit more complicated
     count += 4
     skip one extra character in the input stream
  }

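We have to detect surrogate pairs (U+D800 to U+DBFF followed by U+DC00 to U+DFFF) and count 4 bytes for each valid pair. If the first value is in the first range and the second value is in the second range, all is well: skip both and add 4.
If not, it is an invalid surrogate pair. I am not sure how Java handles that, but the algorithm has to count correctly in that case too.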
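A possible Java rendering of this counting scheme, with explicit pair validation, is sketched below. The names `Utf8Count` and `utf8Length` are hypothetical. The sketch relies on the documented behavior that `String.getBytes(Charset)` replaces malformed input with the charset's replacement bytes, which for UTF-8 is a single `?`, so an unpaired surrogate is counted as 1 byte; code that encodes some other way may need a different rule.

```java
import java.nio.charset.StandardCharsets;

class Utf8Count {
    /** Counts the bytes that s.getBytes(UTF_8) would produce, without encoding. */
    public static int utf8Length(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c <= 0x7F) {
                count += 1;
            } else if (c <= 0x7FF) {
                count += 2;
            } else if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                count += 4; // valid pair: one 4-byte sequence
                i++;        // skip the low surrogate
            } else if (Character.isSurrogate(c)) {
                count += 1; // unpaired surrogate: getBytes() substitutes '?'
            } else {
                count += 3;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String[] samples = {"abcd", "a\u0080b", "a\u0800b", "\uD834\uDD1E", "x\uD834y"};
        for (String s : samples) {
            int actual = s.getBytes(StandardCharsets.UTF_8).length;
            System.out.println(utf8Length(s) + " counted, " + actual + " encoded");
        }
    }
}
```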
