Notice: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/20966802/

Time: 2020-08-13 05:37:08  Source: igfitidea

UTF-16 Character Encoding in Java

Tags: java, character-encoding

Asked by priyaranjan

I am trying to understand character encoding in Java. Characters in Java are stored in 16 bits using UTF-16 encoding. So when I convert a string containing 6 characters to bytes, I get 6 bytes, as shown below, where I expected 12. Is there a concept I am missing?

package learn.java;

public class CharacterTest {

    public static void main(String[] args) {
        String str = "Hadoop";
        byte bt[] = str.getBytes();
        System.out.println("the length of character array is " + bt.length);
    } 
}

Output: the length of character array is 6

As per @Darshan's suggestion, I tried getting the bytes with the UTF-16 encoding, but the result is also not what I expected.

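package learn.java;

import java.io.UnsupportedEncodingException;

public class CharacterTest {

    public static void main(String[] args) {
        String str = "Hadoop";
        try {
            byte[] bt = str.getBytes("UTF-16");
            System.out.println("the length of character array is " + bt.length);
        } catch (UnsupportedEncodingException e) {
            // "UTF-16" is a standard charset, so this should not happen;
            // report it rather than silently swallowing it.
            e.printStackTrace();
        }
    }
}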

Output: the length of character array is 14

Accepted answer by tucuxi

In the UTF-16 version, you get 14 bytes because a 2-byte byte-order mark (BOM) is inserted at the start to distinguish between big-endian (the default) and little-endian. If you specify UTF-16LE you will get 12 bytes (little-endian, no byte-order mark added).

See http://www.unicode.org/faq/utf_bom.html#gen7
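
To make the byte counts above concrete, here is a minimal sketch that encodes the question's string with the three UTF-16 variants. The class name Utf16ByteCounts and the helper encodedLength are illustrative names, not part of the original answer; the constants come from java.nio.charset.StandardCharsets (Java 7 or later).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf16ByteCounts {

    // Number of bytes produced when encoding s with the given charset.
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String str = "Hadoop";

        // Plain "UTF-16" writes a 2-byte byte-order mark before the text.
        System.out.println("UTF-16:   " + encodedLength(str, StandardCharsets.UTF_16));   // 14
        // The LE/BE variants fix the byte order, so no mark is written.
        System.out.println("UTF-16LE: " + encodedLength(str, StandardCharsets.UTF_16LE)); // 12
        System.out.println("UTF-16BE: " + encodedLength(str, StandardCharsets.UTF_16BE)); // 12

        // The first two bytes of the plain UTF-16 encoding are the mark FE FF.
        byte[] withBom = str.getBytes(StandardCharsets.UTF_16);
        System.out.printf("BOM: %02X %02X%n", withBom[0] & 0xff, withBom[1] & 0xff);     // FE FF
    }
}
```

Passing a Charset constant instead of a charset name also avoids the checked UnsupportedEncodingException that getBytes(String) declares.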

EDIT - Use this program to look into the actual bytes generated by different encodings:

public class Test {
    public static void main(String args[]) throws Exception {
        // bytes in the first argument, encoded using second argument
        byte[] bs = args[0].getBytes(args[1]);
        System.err.println(bs.length + " bytes:");

        // print hex values of bytes and (if printable), the char itself
        char[] hex = "0123456789ABCDEF".toCharArray();
        for (int i=0; i<bs.length; i++) {
            int b = (bs[i] < 0) ? bs[i] + 256 : bs[i];
            System.err.print(hex[b>>4] + "" + hex[b&0xf] 
                + ( ! Character.isISOControl((char)b) ? ""+(char)b : ".")
                + ( (i%4 == 3) ? "\n" : " "));
        }
        System.err.println();   
    }
}

For example, when running in a UTF-8 environment (under other JVM default encodings, the characters for FE and FF would show up differently), the output is:

$ javac Test.java  && java -cp . Test hello UTF-16
12 bytes:
FEþ FFÿ 00. 68h
00. 65e 00. 6Cl
00. 6Cl 00. 6Fo

And

$ javac Test.java  && java -cp . Test hello UTF-16LE
10 bytes:
68h 00. 65e 00.
6Cl 00. 6Cl 00.
6Fo 00. 

And

$ javac Test.java  && java -cp . Test hello UTF-16BE
10 bytes:
00. 68h 00. 65e
00. 6Cl 00. 6Cl
00. 6Fo

Answer by Evgeniy Dorofeev

String.getBytes() uses the platform's default encoding. Try this:

byte bt[] = str.getBytes("UTF-16");

Answer by Oleg Sklyar

I think this will help: "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" by Joel Spolsky

And this will help as well: "UTF-16 (16-bit Unicode Transformation Format) is a character encoding [...] The encoding is a variable-length encoding as code points are encoded with one or two 16-bit code units." (from Wikipedia)
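
The "one or two 16-bit code units" part of that quote can be seen directly in Java. The following is a small sketch, with the class name CodeUnitsDemo and its helpers chosen only for illustration. It compares a plain ASCII string with U+1D11E (musical symbol G clef), a code point outside the Basic Multilingual Plane that needs a surrogate pair.

```java
import java.nio.charset.StandardCharsets;

public class CodeUnitsDemo {

    // Number of 16-bit UTF-16 code units (what String.length() reports).
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String ascii = "Hadoop";
        // U+1D11E written as its surrogate pair so the source file stays ASCII.
        String clef = "\uD834\uDD1E";

        System.out.println(codeUnits(ascii) + " units, " + codePoints(ascii) + " code points"); // 6 units, 6 code points
        System.out.println(codeUnits(clef) + " units, " + codePoints(clef) + " code points");   // 2 units, 1 code points

        // Two code units at two bytes each: one code point takes 4 bytes in UTF-16BE.
        System.out.println(clef.getBytes(StandardCharsets.UTF_16BE).length);                    // 4
    }
}
```

So "two bytes per character" only holds when every character in the string fits in a single code unit, as is the case for "Hadoop".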

Answer by Seelenvirtuose

As per the String.getBytes() method's documentation, the string is encoded into a sequence of bytes using the platform's default charset.

I assume your platform default charset is ISO-8859-1 (or a similar one-byte-per-char charset). These charsets encode each character into one byte.

If you want to specify the encoding, use the method String.getBytes(Charset) or String.getBytes(String).
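
As a rough illustration of the points above, this sketch prints the platform default charset and then encodes with explicitly chosen charsets. The class name DefaultCharsetDemo and the helper lengthIn are made up for this example. Note that UTF-8, a common default, also encodes the ASCII letters of "Hadoop" as one byte each, so a result of 6 does not by itself prove the default is ISO-8859-1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetDemo {

    // Encode s with an explicitly chosen charset and return the byte count.
    static int lengthIn(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // The charset that the no-argument getBytes() uses; it varies by platform and JVM settings.
        System.out.println("default charset: " + Charset.defaultCharset());

        String str = "Hadoop";
        System.out.println(lengthIn(str, StandardCharsets.ISO_8859_1)); // 6
        System.out.println(lengthIn(str, StandardCharsets.UTF_8));      // 6
        System.out.println(lengthIn(str, StandardCharsets.UTF_16BE));   // 12

        // A non-ASCII character (e with acute accent, U+00E9) shows where the charsets diverge.
        String accented = "\u00e9";
        System.out.println(lengthIn(accented, StandardCharsets.ISO_8859_1)); // 1
        System.out.println(lengthIn(accented, StandardCharsets.UTF_8));      // 2
    }
}
```

Naming the charset explicitly keeps the result the same across machines, which the no-argument getBytes() does not guarantee.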

About the 16-bit storage: this is how Java internally stores characters, and therefore strings too. It is based on the original Unicode specification.
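
The 16-bit size of a Java char is visible through constants on java.lang.Character. The sketch below (class and method names are illustrative) shows that the 12 bytes the question expected correspond to the number of chars times two, which is a different quantity from whatever getBytes() produces in a given charset. How a JVM lays out a String in memory is an implementation detail and may differ from this API-level view.

```java
import java.nio.charset.StandardCharsets;

public class CharSizeDemo {

    // Bytes needed for the string's UTF-16 code units, without any byte-order mark.
    static int utf16PayloadBytes(String s) {
        return s.length() * Character.BYTES;
    }

    public static void main(String[] args) {
        System.out.println(Character.SIZE);  // 16 (bits per char)
        System.out.println(Character.BYTES); // 2  (bytes per char, available since Java 8)

        String str = "Hadoop";
        System.out.println(utf16PayloadBytes(str));                         // 12
        // Matches an explicit UTF-16 encoding that writes no byte-order mark.
        System.out.println(str.getBytes(StandardCharsets.UTF_16BE).length); // 12
    }
}
```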

Answer by Darshan Patel

For UTF-16 encoding, use str.getBytes("UTF-16");

but it gives a byte[] of length 14. Please refer to http://rosettacode.org/wiki/String_length for more details.
