UTF-8 and UTF-16 in Java

Disclaimer: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/12946388/


UTF-8 and UTF-16 in Java

Tags: java, string, encoding, utf-8

Asked by GMsoF

I really expected the byte data below to look different, but in fact it is the same. According to the Wikipedia examples at http://en.wikipedia.org/wiki/UTF-8#Examples, the two encodings produce different bytes, so why does Java print them out as the same?

    String a = "€";
    byte[] utf16 = a.getBytes(); //Java default UTF-16
    byte[] utf8 = null;

    try {
        utf8 = a.getBytes("UTF-8");
    } catch (UnsupportedEncodingException e) {
        throw new RuntimeException(e);
    }

    for (int i = 0 ; i < utf16.length ; i ++){
        System.out.println("utf16 = " + utf16[i]);
    }

    for (int i = 0 ; i < utf8.length ; i ++){
        System.out.println("utf8 = " + utf8[i]);
    }

Answered by Adrian Pronk

Although Java holds characters internally as UTF-16, when you convert to bytes using String.getBytes(), each character is converted using the default platform encoding which will likely be something like windows-1252. The results I'm getting are:

utf16 = -30
utf16 = -126
utf16 = -84
utf8 = -30
utf8 = -126
utf8 = -84

This indicates that the default encoding is "UTF-8" on my system.

Also note that the documentation for String.getBytes() has this comment: The behavior of this method when this string cannot be encoded in the default charset is unspecified.

Generally, though, you'll avoid confusion if you always specify an encoding explicitly, as you do with a.getBytes("UTF-8").

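On Java 7 and later there is a cleaner way to do this; the following is a small sketch of my own (not from the original answer) using the StandardCharsets constants, which also avoids the checked UnsupportedEncodingException:

    import java.nio.charset.StandardCharsets;

    public class ExplicitEncoding {
        public static void main(String[] args) {
            String a = "€";
            // Explicit charsets: no dependence on the platform default, no checked exception.
            byte[] utf8 = a.getBytes(StandardCharsets.UTF_8);      // 3 bytes: -30, -126, -84
            byte[] utf16 = a.getBytes(StandardCharsets.UTF_16BE);  // 2 bytes: 32, -84
            System.out.println("UTF-8 length:  " + utf8.length);
            System.out.println("UTF-16 length: " + utf16.length);
        }
    }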

Also, another thing that can cause confusion is including Unicode characters directly in your source file: String a = "€";. That euro symbol has to be encoded to be stored as one or more bytes in a file. When Java compiles your program, it sees those bytes and decodes them back into the euro symbol. You hope. You have to be sure that the software that saves the euro symbol into the file (Notepad, Eclipse, etc.) encodes it the same way as Java expects when it reads it back in. UTF-8 is becoming more popular, but it is not universal, and many editors will not write files in UTF-8.

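One way to sidestep that problem entirely (my own illustration, not part of the answer) is to use a Unicode escape in the source code; the escape is written in plain ASCII, so it survives any reasonable source-file encoding:

    // \u20AC is the Unicode escape for the euro sign. Because the literal is
    // plain ASCII in the source file, the source-file encoding no longer matters here.
    String a = "\u20AC";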

Answered by Stephen C

Out of curiosity, I wonder how the JVM knows the original default charset ...

The mechanism that the JVM uses to determine the initial default charset is platform specific. On UNIX / UNIX-like systems, it is determined by the LANG and LC_* environment variables; see man locale.

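As a quick check (a sketch of my own, not part of the answer), you can compare the locale-related environment variable with what the JVM actually settled on:

    import java.nio.charset.Charset;

    public class ShowDefaultCharset {
        public static void main(String[] args) {
            // On UNIX-like systems this typically comes from LANG / LC_* (may be null elsewhere).
            System.out.println("LANG            = " + System.getenv("LANG"));
            // What the JVM derived from the platform at startup.
            System.out.println("Default charset = " + Charset.defaultCharset());
        }
    }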



Ermmm.. this command is used to check what the default charset is on a specific OS?

That is correct. But I told you about it because the manual entry describes how the default encoding is determined by the environment variables.

In retrospect, this may not have been what you meant by your original comment, but this IS how the platform default encoding is specified. (And the concept of a "default character set" for an individual file is meaningless; see below.)

What if, let's say, I have 10 Java source files, half of them saved as UTF-8 and the rest saved as UTF-16. After compiling, I move them (the class files) onto another OS platform; now how does the JVM know their default encoding? Will the default charset information be included in the Java class file?

That is a rather confused set of questions:

  1. A text file doesn't have a default character set. It has a character set / encoding.

  2. A non-text file doesn't have a character encoding at all. The concept is meaningless.

  3. There's no 100% reliable way to determine what a text file's character encoding is.

  4. If you don't tell the java compiler what the file's encoding is, it will assume that it is the platform's default encoding. The compiler doesn't try to second guess you. If you get the encoding incorrect, the compiler may or may not even notice your mistake.

  5. Bytecode (".class") files are binary files (see 2).

  6. When Character and String literals are compiled into a ".class" file, they are NOW represented in a way that is not affected by the platform default encoding, or anything else that you can influence.

  7. If you made a mistake with the source file encoding when compiling, you can't fix it at the ".class" file level. Your only option is to go back and recompile the classes, telling the Java compiler the correct source file encoding (see the sketch after this list).

  8. "What if let say I have 10 Java source file, half of them save as UTF-8 and the rest save as UTF-16".
    Just don't do it!

    • Don't save your source files in a mixture of encodings. You will drive yourself nuts.
    • I can't think of a good reason to store files in UTF-16 at all ...

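As a concrete illustration of point 7 above (my own sketch, not part of the answer; Main.java is just a placeholder file name), the -encoding option is how you tell javac which charset the source files were saved in:

    javac -encoding UTF-8 Main.java

If the option is omitted, javac falls back to the platform default charset, which is exactly where the mistakes described above come from.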

So, I am confused: when people say "platform dependent", is it related to the source file?

Platform dependent means that it potentially depends on the operating system, the JVM vendor and version, the hardware, and so on.

It is not necessarily related to the source file. (The encoding of any given source file could be different to the default character encoding.)

If it is not, how do I explain the phenomenon above? Anyway, the confusion above extends my question into: "So, what happens after I compile the source file into a class file? Since the class file might not contain the encoding information, does the result now really depend on the 'platform' and not on the source file anymore?"

The platform specific mechanism (e.g. the environment variables) determine what the java compiler sees as the default character set. Unless you override this (e.g. by providing options to the java compiler on the command line), that is what the Java compiler will use as the source file character set. However, this may not be the correct character encoding for the source files; e.g. if you created them on a different machine with a different default character set. And if the java compiler uses the wrong character set to decode your source files, it is liable to put incorrect character codes into the ".class" files.

The ".class" files are no platform dependent. But if they are created incorrectly because you didn't tell the Java compiler the correct encoding for the source files, the ".class" files will contain the wrong characters.

“.class”文件与平台无关。但是,如果由于您没有告诉 Java 编译器源文件的正确编码而错误地创建它们,“.class”文件将包含错误的字符。



What do you mean by "the concept of a 'default character set' for an individual file is meaningless"?

I say it because it is true!

The default character set MEANS the character set that is used when you don't specify one.

But we can control how we want a text file to be stored, right? Even using Notepad, there is an option to choose between encodings.

That is correct. And that is you TELLING Notepad what character set to use for the file. If you don't TELL it, Notepad will use the default character set to write the file.

There is a little bit of black magic in Notepad to GUESS what the character encoding is when it reads a text file. Basically, it looks at the first few bytes of the file to see if it starts with a UTF-16 byte-order mark. If it sees one, it can heuristically distinguish between UTF-16, UTF-8 (generated by a Microsoft product), and "other". But it cannot distinguish between the different "other" character encodings, and it doesn't recognize as UTF-8 a file that doesn't start with a BOM marker. (The BOM on a UTF-8 file is a Microsoft-specific convention ... and causes problems if a Java application reads the file and doesn't know to skip the BOM character.)

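For that last point, here is a minimal sketch (my own, not from the answer; "input.txt" is a placeholder path) of reading a UTF-8 text file in Java and skipping a leading BOM if one is present:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class SkipUtf8Bom {
        public static void main(String[] args) throws IOException {
            try (BufferedReader reader = Files.newBufferedReader(
                    Paths.get("input.txt"), StandardCharsets.UTF_8)) {
                reader.mark(1);
                int first = reader.read();
                if (first != 0xFEFF) {  // U+FEFF is the BOM character after decoding
                    reader.reset();     // no BOM, so put the first character back
                }
                System.out.println(reader.readLine());
            }
        }
    }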

Anyway, the problems are not in writing the source file. They happen when the Java compiler reads the source file with the incorrect character encoding.

Answered by gawi

You are working with a bad hypothesis. The getBytes() method does not use the UTF-16 encoding. It uses the platform default encoding.

You can query it with the java.nio.charset.Charset.defaultCharset() method. In my case, it's UTF-8 and it should be the same for you too.

Answered by Amit Deshpande

The default is either UTF-8 or ISO-8859-1 if a platform-specific encoding is not found. It is not UTF-16. So eventually you are doing the byte conversion in UTF-8 only. That is why your byte[] arrays match. You can find the default encoding using:

    System.out.println(Charset.defaultCharset().name());