UTF-16 character encoding in Java

Question

Asked by priyaranjan

I was trying to understand character encoding in Java. Characters in Java are stored in 16 bits using UTF-16 encoding. So when I convert a string containing 6 characters to bytes, I get 6 bytes, as shown below, but I was expecting 12. Is there a concept I am missing?

package learn.java;

public class CharacterTest {

    public static void main(String[] args) {
        String str = "Hadoop";
        byte bt[] = str.getBytes();
        System.out.println("the length of character array is " + bt.length);
    } 
}

O/p: the length of character array is 6

As @Darshan suggested, I tried getting the bytes with UTF-16 encoding, but the result is also not what I expected.

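package learn.java;

import java.io.UnsupportedEncodingException;

public class CharacterTest {

    public static void main(String[] args) {
        String str = "Hadoop";
        try {
            byte bt[] = str.getBytes("UTF-16");
            System.out.println("the length of character array is " + bt.length);
        } catch (UnsupportedEncodingException e) {
            // report the failure instead of silently swallowing it
            e.printStackTrace();
        }
    }
}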

o/p: the length of character array is 14

Answer 1

Accepted answer by tucuxi

In the UTF-16 version, you get 14 bytes because a 2-byte byte-order mark (BOM) is inserted at the start to distinguish big-endian (the default) from little-endian. If you specify UTF-16LE, you get 12 bytes (little-endian, no byte-order mark added).

See http://www.unicode.org/faq/utf_bom.html#gen7
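
To see the 14 versus 12 directly for the string from the question, here is a minimal sketch. The class name Utf16Lengths and the helper byteLength are made up for illustration; StandardCharsets is the standard JDK class, so no checked exception is involved.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf16Lengths {
    // Hypothetical helper: number of bytes produced when encoding s with cs.
    static int byteLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String str = "Hadoop";
        // "UTF-16" writes a 2-byte byte-order mark first, then 2 bytes per char
        System.out.println("UTF-16:   " + byteLength(str, StandardCharsets.UTF_16));   // 14
        // The LE/BE variants fix the byte order, so no mark is written
        System.out.println("UTF-16LE: " + byteLength(str, StandardCharsets.UTF_16LE)); // 12
        System.out.println("UTF-16BE: " + byteLength(str, StandardCharsets.UTF_16BE)); // 12
    }
}
```

So the extra 2 bytes are the byte-order mark, not part of the 6 characters.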

EDIT - Use this program to look at the actual bytes generated by different encodings:

public class Test {
    public static void main(String args[]) throws Exception {
        // bytes in the first argument, encoded using second argument
        byte[] bs = args[0].getBytes(args[1]);
        System.err.println(bs.length + " bytes:");

        // print hex values of bytes and (if printable), the char itself
        char[] hex = "0123456789ABCDEF".toCharArray();
        for (int i=0; i<bs.length; i++) {
            int b = (bs[i] < 0) ? bs[i] + 256 : bs[i];
            System.err.print(hex[b>>4] + "" + hex[b&0xf] 
                + ( ! Character.isISOControl((char)b) ? ""+(char)b : ".")
                + ( (i%4 == 3) ? "\n" : " "));
        }
        System.err.println();   
    }
}

For example, when running under UTF-8 (under other JVM default encodings, the characters for FE and FF would show up differently), the output is:

$ javac Test.java  && java -cp . Test hello UTF-16
12 bytes:
FEþ FFÿ 00. 68h
00. 65e 00. 6Cl
00. 6Cl 00. 6Fo

And

$ javac Test.java  && java -cp . Test hello UTF-16LE
10 bytes:
68h 00. 65e 00.
6Cl 00. 6Cl 00.
6Fo 00.

And

$ javac Test.java  && java -cp . Test hello UTF-16BE
10 bytes:
00. 68h 00. 65e
00. 6Cl 00. 6Cl
00. 6Fo

Answer 2

Answered by Evgeniy Dorofeev

String.getBytes() uses the platform's default encoding. Try this:

byte bt[] = str.getBytes("UTF-16");

Answer 3

Answered by Oleg Sklyar

I think this will help: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

And this will help as well: "UTF-16 (16-bit Unicode Transformation Format) is a character encoding [...] The encoding is a variable-length encoding as code points are encoded with one or two 16-bit code units." (from Wikipedia)
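
The "one or two 16-bit code units" part can be checked with a small sketch. The class name SurrogatePairDemo, the helper fromCodePoint, and the choice of U+1F600 as a sample code point outside the Basic Multilingual Plane are just for illustration.

```java
import java.nio.charset.StandardCharsets;

public class SurrogatePairDemo {
    // Hypothetical helper: builds a string holding exactly one code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String bmp = fromCodePoint(0x0048);   // 'H', fits in one 16-bit code unit
        String supp = fromCodePoint(0x1F600); // needs a surrogate pair (two code units)
        for (String s : new String[] { bmp, supp }) {
            System.out.println(
                "code points=" + s.codePointCount(0, s.length())
                + " chars=" + s.length()
                + " UTF-16BE bytes=" + s.getBytes(StandardCharsets.UTF_16BE).length);
        }
    }
}
```

Both strings hold a single code point, but the second one reports 2 chars and 4 bytes, so "2 bytes per character" only holds for code points that fit in one code unit.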

Answer 4

Answered by Seelenvirtuose

As per the String.getBytes() method's documentation, the string is encoded into a sequence of bytes using the platform's default charset.

I assume your platform's default charset is ISO-8859-1 (or a similar one-byte-per-char charset). These charsets encode each character into one byte.

If you want to specify the encoding, use the method String.getBytes(Charset) or String.getBytes(String).
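
A rough sketch of the difference between the default and an explicit charset. The class name DefaultCharsetDemo and the helper encodedLength are made up for illustration; what Charset.defaultCharset() reports depends on your JVM and platform, so that line will vary.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetDemo {
    // Hypothetical helper: byte count when encoding s with an explicit charset.
    static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String str = "Hadoop";
        // Charset used by the no-argument getBytes(); platform dependent
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("getBytes():  " + str.getBytes().length);
        // Explicit charsets give the same result on every platform
        System.out.println("ISO-8859-1:  " + encodedLength(str, StandardCharsets.ISO_8859_1)); // 6
        System.out.println("UTF-8:       " + encodedLength(str, StandardCharsets.UTF_8));      // 6
        System.out.println("UTF-16BE:    " + encodedLength(str, StandardCharsets.UTF_16BE));   // 12
    }
}
```

Note that UTF-8 also gives 6 here, because every character in "Hadoop" is ASCII, so a default of UTF-8 would explain the 6 bytes just as well as ISO-8859-1.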

About the 16-bit storage: this is how Java internally stores characters, and therefore strings. It is based on the original Unicode specification.
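
The 16-bit size of a char is visible through constants on java.lang.Character. A small sketch, with the class name CharSizeDemo and the helper utf16CodeUnitBytes made up for illustration (Character.BYTES assumes Java 8 or later):

```java
public class CharSizeDemo {
    // Hypothetical helper: bytes needed for the string's chars at 2 bytes per char.
    static int utf16CodeUnitBytes(String s) {
        return s.length() * Character.BYTES;
    }

    public static void main(String[] args) {
        System.out.println("bits per char: " + Character.SIZE);                      // 16
        System.out.println("chars in Hadoop: " + "Hadoop".length());                 // 6
        System.out.println("bytes as 16-bit units: " + utf16CodeUnitBytes("Hadoop")); // 12
    }
}
```

This is the 12 the question expected; it describes the char values, while getBytes() reports the size after encoding with some charset.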

Answer 5

Answered by Darshan Patel

For UTF-16 encoding, use str.getBytes("UTF-16");

but it gives a length of 14 for the byte[]. Please refer to http://rosettacode.org/wiki/String_length for more details.
