Java String character encoding - French - Dutch locale

Question

Asked by Anand Sunderraman

I have the following piece of code

import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

public static void main(String[] args) throws UnsupportedEncodingException {
    // Print the platform default charset
    System.out.println(Charset.defaultCharset().toString());

    String accentedE = "é";

    String utf8 = new String(accentedE.getBytes("utf-8"), Charset.forName("UTF-8"));
    System.out.println(utf8);
    utf8 = new String(accentedE.getBytes(), Charset.forName("UTF-8"));
    System.out.println(utf8);
    utf8 = new String(accentedE.getBytes("utf-8"));
    System.out.println(utf8);
    utf8 = new String(accentedE.getBytes());
    System.out.println(utf8);
}

The output of the above is as follows

windows-1252
é
?
??
é

Can someone help me understand what this does? Why do I get this output?

Answer 1

Answered by Esailija

If you already have a String, there is no need to encode it and decode it right back; the string is already the result of someone having decoded raw bytes.

In the case of a string literal, that someone is the compiler, which reads your source file as raw bytes and decodes it using the encoding you have specified. If you have physically saved your source file in Windows-1252 encoding, and the compiler decodes it as Windows-1252, all is well. If not, you need to fix this by declaring the correct encoding for the compiler to use when compiling your source...
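One way to take the source-file encoding out of the picture, offered here only as a sketch, is to write the literal as a Unicode escape, which is plain ASCII in the file. The class name LiteralEscape and the constant ACCENTED_E are made up for illustration.

```java
// Hypothetical illustration: the escape below is pure ASCII in the source
// file, so the compiler builds the same string whatever encoding it assumes.
class LiteralEscape {
    // U+00E9 is LATIN SMALL LETTER E WITH ACUTE
    static final String ACCENTED_E = "\u00e9";

    public static void main(String[] args) {
        System.out.println(ACCENTED_E.length());                             // 1
        System.out.println(Integer.toHexString(ACCENTED_E.codePointAt(0)));  // e9
    }
}
```

If you prefer to keep the accented character in the source, the alternative is to tell the compiler how the file was saved, for example with javac's -encoding option.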

The line

String utf8 = new String(accentedE.getBytes("utf-8"), Charset.forName("UTF-8"));

Does absolutely nothing. (Encode as UTF-8, Decode as UTF-8 == no-op)

The line

utf8 = new String(accentedE.getBytes(), Charset.forName("UTF-8"));

Encodes the string as Windows-1252, and then decodes it as UTF-8. The resulting bytes must only be decoded as Windows-1252 (because they are encoded in Windows-1252, duh); otherwise you will get strange results.

The line

utf8 = new String(accentedE.getBytes("utf-8"));

Encodes the string as UTF-8, and then decodes it as Windows-1252. The same principle applies as in the previous case.

The line

utf8 = new String(accentedE.getBytes());

Does absolutely nothing. (Encode as Windows-1252, Decode as Windows-1252 == no-op)

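The four cases can be reproduced on any machine by naming both charsets explicitly instead of relying on the platform default. This is only a sketch: the class RoundTrips and its helpers are made up for illustration, and windows-1252 stands in for the asker's default charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

// Illustrative sketch; class and method names are invented for this example.
class RoundTrips {
    static final Charset UTF8 = StandardCharsets.UTF_8;
    // Named explicitly so the result does not depend on the machine's default.
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Encode with one charset, then decode the resulting bytes with another.
    static String recode(String s, Charset encodeWith, Charset decodeWith) {
        return new String(s.getBytes(encodeWith), decodeWith);
    }

    // Show each char as U+XXXX so the output does not depend on the console.
    static String codeUnits(String s) {
        return s.chars()
                .mapToObj(c -> String.format("U+%04X", c))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String e = "\u00e9"; // U+00E9, written as an escape to keep the source ASCII
        System.out.println(codeUnits(recode(e, UTF8, UTF8)));     // U+00E9
        System.out.println(codeUnits(recode(e, CP1252, UTF8)));   // U+FFFD
        System.out.println(codeUnits(recode(e, UTF8, CP1252)));   // U+00C3 U+00A9
        System.out.println(codeUnits(recode(e, CP1252, CP1252))); // U+00E9
    }
}
```

Printing code units rather than the characters themselves keeps the console encoding from adding a second layer of confusion.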

An analogy with integers might be easier to understand:

int a = 555;
// Encoding as X and decoding right back as X
a = Integer.parseInt(String.valueOf(a), 10);
// a is still 555

int b = 555;
// Encoding as X and decoding right back as Y
b = Integer.parseInt(String.valueOf(b), 15);
// b is now 1205, i.e. a strange result

Both of these are useless because we already had what we needed before running any of this code: the integer 555.

You need to encode your string into raw bytes when it leaves your system, and you need to decode raw bytes into a string when they come into your system. There is no need to encode and decode right back within the system.
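As a rough sketch of what encoding at the boundary can look like, the example below writes a string through an OutputStreamWriter and reads it back through an InputStreamReader, naming the charset on both sides. The in-memory streams stand in for a real file or socket, and the class BoundaryIo with its send and receive methods is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical example: byte arrays stand in for a file or network connection.
class BoundaryIo {
    // Leaving the system: characters become bytes, charset stated explicitly.
    static byte[] send(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            w.write(text);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return out.toByteArray();
    }

    // Entering the system: bytes become characters, using the same charset.
    static String receive(byte[] wire) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(wire), StandardCharsets.UTF_8)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] wire = send("\u00e9");                        // U+00E9
        System.out.println(wire.length);                     // 2
        System.out.println(receive(wire).equals("\u00e9"));  // true
    }
}
```

Between send and receive the program only ever handles String values; the charset matters at the two edges and nowhere else.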

Answer 2

Answered by Stephen C

Line #1 - the default character set on your system is windows-1252.

Line #2 - you created a String by encoding a String literal to UTF-8 bytes, and then decoding it using the UTF-8 scheme. The result is a correctly formed String, which can be output correctly using windows-1252 encoding.

Line #3 - you created a String by encoding a string literal as windows-1252, and then decoding it using UTF-8. The UTF-8 decoder detected a sequence that cannot possibly be UTF-8 and replaced the offending byte with the replacement character U+FFFD, which is printed as a question mark "?" because windows-1252 has no way to represent it. (The UTF-8 format says that any byte that has the top bit set to 1 is one byte of a multi-byte character. But the windows-1252 encoding of "é" is just one byte long, 0xE9 ... ergo, this is bad UTF-8.)
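The String constructor replaces malformed input silently. If you would rather find out that the bytes are not valid UTF-8, one possible approach is a CharsetDecoder configured to report errors. The sketch below assumes that approach; the class StrictDecode and the method decodeUtf8OrNull are invented names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Illustrative sketch; the class and method names are made up.
class StrictDecode {
    // Returns the decoded text, or null if the bytes are not well-formed UTF-8.
    static String decodeUtf8OrNull(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException ex) {
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] cp1252Bytes = {(byte) 0xE9};            // the accented e in windows-1252
        byte[] utf8Bytes = {(byte) 0xC3, (byte) 0xA9}; // the accented e in UTF-8

        System.out.println(decodeUtf8OrNull(cp1252Bytes));                // null
        System.out.println("\u00e9".equals(decodeUtf8OrNull(utf8Bytes))); // true

        // The lenient constructor substitutes U+FFFD instead of failing.
        String lenient = new String(cp1252Bytes, StandardCharsets.UTF_8);
        System.out.println(lenient.equals("\uFFFD"));                     // true

        // U+FFFD has no windows-1252 encoding, so it is written out as '?'.
        byte[] printed = lenient.getBytes(Charset.forName("windows-1252"));
        System.out.println((char) printed[0]);                            // ?
    }
}
```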

Line #4 - you created a String by encoding in UTF-8 and then decoding in windows-1252. In this case the decoding has not "failed", but it has produced garbage (aka mojibake). The reason you got 2 characters of output is that the UTF-8 encoding of "é" is a 2 byte sequence.

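Because every byte in this particular case still maps to some windows-1252 character, the mojibake can be undone by running the mistake backwards. The following is a sketch under that assumption, with made-up names (MojibakeRepair, misdecode, repair); it will not recover text in general, for example when a byte has no windows-1252 mapping.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper names; shown only to illustrate the idea.
class MojibakeRepair {
    static final Charset UTF8 = StandardCharsets.UTF_8;
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Reproduce the mistake: UTF-8 bytes read as if they were windows-1252.
    static String misdecode(String s) {
        return new String(s.getBytes(UTF8), CP1252);
    }

    // Undo it: recover the original bytes, then decode them as UTF-8.
    static String repair(String garbled) {
        return new String(garbled.getBytes(CP1252), UTF8);
    }

    public static void main(String[] args) {
        String garbled = misdecode("\u00e9");                   // U+00E9
        System.out.println(garbled.length());                   // 2
        System.out.println(garbled.equals("\u00C3\u00A9"));     // true
        System.out.println(repair(garbled).equals("\u00e9"));   // true
    }
}
```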

Line #5 - you created a String by encoding as windows-1252 and decoding as windows-1252. This produces the correct output.

And the overall lesson is that if you encode characters to bytes with one character encoding, and then decode with a different character encoding, you are liable to get mangling of one form or another.

Answer 3

Answered by linski

When you call the String getBytes() method, it:

Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.

So whenever you do:

accentedE.getBytes()

it returns the contents of the accentedE String as bytes encoded in the default OS code page, in your case cp-1252.
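A quick way to check which charset the no-argument getBytes() will use on a given machine is sketched below. The class DefaultCharsetDemo and the method usesDefault are illustrative names, and the printed charset depends on the JVM and operating system.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Illustrative sketch; names are made up for this example.
class DefaultCharsetDemo {
    // True when getBytes() yields the same bytes as naming the default charset.
    static boolean usesDefault(String s) {
        return Arrays.equals(s.getBytes(), s.getBytes(Charset.defaultCharset()));
    }

    public static void main(String[] args) {
        // Machine dependent: windows-1252 for the asker, often UTF-8 elsewhere.
        System.out.println(Charset.defaultCharset());
        System.out.println(usesDefault("\u00e9")); // true
    }
}
```

Passing a Charset explicitly, as in getBytes(StandardCharsets.UTF_8), avoids this dependence on the machine.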

This line:

new String(accentedE.getBytes(), Charset.forName("UTF-8"))

takes the accentedE bytes (encoded in cp1252) and tries to decode them as UTF-8, hence the error. The same situation arises from the other side with:

new String(accentedE.getBytes("utf-8"))

The getBytes method encodes accentedE as UTF-8, but then the String constructor decodes those bytes with the default OS code page, which is cp-1252. The Javadoc for that constructor says:

Constructs a new String by decoding the specified array of bytes using the platform's default charset. The length of the new String is a function of the charset, and hence may not be equal to the length of the byte array.

I strongly recommend reading this excellent article:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

UPDATE:

In short, every character is stored as a number. To know which number stands for which character, the OS uses code pages. Consider the following snippet:

String accentedE = "é";

System.out.println(String.format("%02X ", accentedE.getBytes("UTF-8")[0]));
System.out.println(String.format("%02X ", accentedE.getBytes("UTF-8")[1]));
System.out.println(String.format("%02X ", accentedE.getBytes("windows-1252")[0]));

which outputs:

C3 
A9 
E9

That is because the small accented e is stored as the two bytes C3 A9 in UTF-8, while in cp-1252 it is stored as the single byte E9. For a detailed explanation, read the linked article.
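To look at a whole byte sequence at once rather than one index at a time, a small helper along these lines may be handy. It is only a sketch; HexDump and hex are made-up names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative helper; not part of any library.
class HexDump {
    static final Charset UTF8 = StandardCharsets.UTF_8;
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Encode the string with the given charset and render the bytes as hex.
    static String hex(String s, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF)); // mask to an unsigned value
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String e = "\u00e9"; // U+00E9
        System.out.println(hex(e, UTF8));                        // C3 A9
        System.out.println(hex(e, CP1252));                      // E9
        System.out.println(hex(e, StandardCharsets.UTF_16BE));   // 00 E9
    }
}
```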
