Note: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/6622226/

Check if a String is valid UTF-8 encoded in Java

Tags: java, encoding, utf-8

Asked by Michael Bavin

How can I check if a string is in valid UTF-8 format?

Accepted answer by DArkO

Only byte data can be checked. Once you have constructed a String, it is already UTF-16 internally.

Also, only byte arrays can be UTF-8 encoded.
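
Since the check has to happen on bytes, one way to do it with the standard library is a strict `CharsetDecoder`. The following is a minimal sketch, not part of the original answer: the class name `Utf8Validator` and the method `isValidUtf8` are hypothetical, and the sample byte arrays are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the name is illustrative, not from the answer.
final class Utf8Validator {

    // Returns true if the whole array decodes as UTF-8 without errors.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] valid = "A\u00ea\u00f1\u00fcC".getBytes(StandardCharsets.UTF_8);
        // 0xea opens a three-byte sequence, but 0x43 is not a continuation byte.
        byte[] invalid = {0x41, (byte) 0xea, 0x43};

        System.out.println(isValidUtf8(valid));
        System.out.println(isValidUtf8(invalid));

        // The String constructor hides the problem: malformed input
        // is silently replaced with U+FFFD instead of being reported.
        String lossy = new String(invalid, StandardCharsets.UTF_8);
        System.out.println(lossy.indexOf('\uFFFD') >= 0);
    }
}
```

The last lines of `main` illustrate why a String cannot be checked after the fact: by the time it exists, any malformed bytes have already been replaced.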

Here is a common case of UTF-8 conversions.

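import java.io.UnsupportedEncodingException;

public class Utf8Bytes {
    public static void main(String[] args) {
        // The escapes spell "Hello"
        String myString = "\u0048\u0065\u006C\u006C\u006F World";
        System.out.println(myString);
        byte[] myBytes = null;

        try {
            // Encode the UTF-16 String into a UTF-8 byte array
            myBytes = myString.getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
            System.exit(-1);
        }

        // Print each encoded byte as a signed decimal value
        for (int i = 0; i < myBytes.length; i++) {
            System.out.println(myBytes[i]);
        }
    }
}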

If you don't know the encoding of your byte array, juniversalchardet is a library that can help you detect it.

Answer by Bhanu PS Kushwah

The following post is taken from the official Java tutorials available at: https://docs.oracle.com/javase/tutorial/i18n/text/string.html.

The StringConverter program starts by creating a String containing Unicode characters:

String original = new String("A" + "\u00ea" + "\u00f1" + "\u00fc" + "C");

When printed, the String named original appears as:

AêñüC

To convert the String object to UTF-8, invoke the getBytes method and specify the appropriate encoding identifier as a parameter. The getBytes method returns an array of bytes in UTF-8 format. To create a String object from an array of non-Unicode bytes, invoke the String constructor with the encoding parameter. The code that makes these calls is enclosed in a try block, in case the specified encoding is unsupported:

try {
    byte[] utf8Bytes = original.getBytes("UTF8");
    byte[] defaultBytes = original.getBytes();

    String roundTrip = new String(utf8Bytes, "UTF8");
    System.out.println("roundTrip = " + roundTrip);
    System.out.println();
    printBytes(utf8Bytes, "utf8Bytes");
    System.out.println();
    printBytes(defaultBytes, "defaultBytes");
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}

The StringConverter program prints out the values in the utf8Bytes and defaultBytes arrays to demonstrate an important point: The length of the converted text might not be the same as the length of the source text. Some Unicode characters translate into single bytes, others into pairs or triplets of bytes. The printBytes method displays the byte arrays by invoking the byteToHex method, which is defined in the source file, UnicodeFormatter.java. Here is the printBytes method:

public static void printBytes(byte[] array, String name) {
    for (int k = 0; k < array.length; k++) {
        System.out.println(name + "[" + k + "] = " + "0x" +
            UnicodeFormatter.byteToHex(array[k]));
    }
}
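
The printBytes method relies on UnicodeFormatter.byteToHex, which is not reproduced in this post. As a rough stand-in (an assumption, not the tutorial's actual implementation), a byte can be formatted as two hex digits with `String.format`. The class name `ByteHex` below is hypothetical.

```java
// Hypothetical stand-in for the tutorial's UnicodeFormatter class.
final class ByteHex {

    // Formats one byte as two lowercase hex digits.
    // Masking with 0xff keeps negative bytes from being sign-extended.
    static String byteToHex(byte b) {
        return String.format("%02x", b & 0xff);
    }

    // Same shape as the tutorial's printBytes, but using the stand-in above.
    static void printBytes(byte[] array, String name) {
        for (int k = 0; k < array.length; k++) {
            System.out.println(name + "[" + k + "] = 0x" + byteToHex(array[k]));
        }
    }

    public static void main(String[] args) {
        // 0x41 is 'A'; 0xc3 0xaa is the UTF-8 encoding of U+00EA
        printBytes(new byte[] {0x41, (byte) 0xc3, (byte) 0xaa}, "utf8Bytes");
    }
}
```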

The output of the printBytes method follows. Note that only the first and last bytes, the A and C characters, are the same in both arrays:

utf8Bytes[0] = 0x41
utf8Bytes[1] = 0xc3
utf8Bytes[2] = 0xaa
utf8Bytes[3] = 0xc3
utf8Bytes[4] = 0xb1
utf8Bytes[5] = 0xc3
utf8Bytes[6] = 0xbc
utf8Bytes[7] = 0x43
defaultBytes[0] = 0x41
defaultBytes[1] = 0xea
defaultBytes[2] = 0xf1
defaultBytes[3] = 0xfc
defaultBytes[4] = 0x43
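
The difference in array lengths above can be checked directly. The sketch below is an illustration under stated assumptions: it uses ISO-8859-1 explicitly in place of the platform default encoding (which varies between systems), and the class `Utf8Lengths` and method `utf8Length` are hypothetical names. It also shows that UTF-8 needs up to four bytes for code points outside the Basic Multilingual Plane.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names are illustrative only.
final class Utf8Lengths {

    // Number of bytes UTF-8 needs to encode a single code point.
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String original = "A\u00ea\u00f1\u00fcC";
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);
        // Assumption: ISO-8859-1 stands in for the platform default encoding.
        byte[] latin1Bytes = original.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(original.length() + " chars, "
                + utf8Bytes.length + " UTF-8 bytes, "
                + latin1Bytes.length + " ISO-8859-1 bytes");

        // 'A', e with circumflex, euro sign, and an emoji outside the BMP
        int[] samples = {0x41, 0xea, 0x20ac, 0x1f600};
        for (int cp : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n", cp, utf8Length(cp));
        }
    }
}
```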
