Java: ISO-8859-1 encoding and binary data preservation

Question

Asked by Mr_and_Mrs_D

I read in a comment to an answer by @Esailija to a question of mine that:

ISO-8859-1 is the only encoding to fully retain the original binary data, with exact byte<->codepoint matches

I also read in this answer by @AaronDigulla that:

In Java, ISO-8859-1 (a.k.a ISO-Latin1) is a 1:1 mapping

I need some insight on this. This will fail (as illustrated here):

// \u00F6 is ö
System.out.println(Arrays.toString("\u00F6".getBytes("utf-8")));
// prints [-61, -74]
System.out.println(Arrays.toString("\u00F6".getBytes("ISO-8859-1")));
// prints [-10]

Questions

I admit I do not quite get it: why does it not get the bytes in the code above?
Most importantly, where is this (the byte-preserving behavior of ISO-8859-1) specified? Links to the source or the JLS would be nice. Is it the only encoding with this property?
Is it related to ISO-8859-1 being the default default?

See also this question for nice counterexamples from other charsets.

Answer 1

Answered by JB Nizet

"\u00F6" is not a byte array. It's a string containing a single char. Execute the following test instead:

public static void main(String[] args) throws Exception {
    byte[] b = new byte[] {(byte) 0x00, (byte) 0xf6};
    String s = new String(b, "ISO-8859-1"); // decoding
    byte[] b2 = s.getBytes("ISO-8859-1"); // encoding
    System.out.println("Are the bytes equal : " + Arrays.equals(b, b2)); // true
}

To check that this is true for any byte, just improve the code and loop through all the bytes:

public static void main(String[] args) throws Exception {
    byte[] b = new byte[256];
    for (int i = 0; i < b.length; i++) {
        b[i] = (byte) i;
    }
    String s = new String(b, "ISO-8859-1");
    byte[] b2 = s.getBytes("ISO-8859-1");
    System.out.println("Are the bytes equal : " + Arrays.equals(b, b2));
}

ISO-8859-1 is a standard encoding. So the language used (Java, C# or whatever) doesn't matter.

Here's a Wikipedia reference that claims that every byte is covered:

In 1992, the IANA registered the character map ISO_8859-1:1987, more commonly known by its preferred MIME name of ISO-8859-1 (note the extra hyphen over ISO 8859-1), a superset of ISO 8859-1, for use on the Internet. This map assigns the C0 and C1 control characters to the unassigned code values thus provides for 256 characters via every possible 8-bit value.

(emphasis mine)

Answer 2

Answered by Esailija

For an encoding to retain original binary data, it needs to map every unique byte sequence to a unique character sequence.

This rules out all multi-byte encodings (UTF-8/16/32, Shift-JIS, Big5, etc.) because not every byte sequence is valid in them, so an invalid sequence would decode to some replacement character (usually � or ?). Once the data has been decoded, there is no way to tell from the string which bytes caused the replacement character.
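
As a minimal sketch of that loss, the following decodes two different invalid byte sequences as UTF-8 using Java's default replacement behavior. The class name `Utf8LossDemo` and the helper `roundTrip` are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8LossDemo {
    // Hypothetical helper: decode the bytes as UTF-8, then encode the string back to UTF-8.
    static byte[] roundTrip(byte[] original) {
        String decoded = new String(original, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] a = {(byte) 0xFF}; // 0xFF never occurs in valid UTF-8
        byte[] b = {(byte) 0xFE}; // neither does 0xFE

        String sa = new String(a, StandardCharsets.UTF_8);
        String sb = new String(b, StandardCharsets.UTF_8);

        // Both inputs decode to the same string: a single U+FFFD replacement character.
        System.out.println(sa.equals(sb)); // prints true

        // Encoding back yields the UTF-8 form of U+FFFD, not the original byte.
        System.out.println(Arrays.equals(a, roundTrip(a))); // prints false
        System.out.println(Arrays.toString(roundTrip(a)));  // prints [-17, -65, -67]
    }
}
```

Since `sa` and `sb` are equal, nothing in the decoded string records whether the input was 0xFF or 0xFE.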

Another option is to ignore the invalid bytes, but this means that infinitely many different byte sequences decode to the same string. You could instead replace invalid bytes with their hex encoding in the string, like "0xFF". But there is no way to tell whether the original bytes legitimately decoded to "0xFF", so that doesn't work either.
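
The "ignore" option can be sketched with a `CharsetDecoder` configured with `CodingErrorAction.IGNORE`. This is only an illustration; the class name `IgnoreInvalidDemo` and the helper `decodeIgnoringErrors` are assumptions of this example, not part of the original answer.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class IgnoreInvalidDemo {
    // Hypothetical helper: decode UTF-8 while silently dropping malformed input.
    static String decodeIgnoringErrors(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.IGNORE)
                .onUnmappableCharacter(CodingErrorAction.IGNORE)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        byte[] clean = {0x41};                                          // "A"
        byte[] dirty1 = {0x41, (byte) 0xFF};                            // "A" plus one invalid byte
        byte[] dirty2 = {(byte) 0xFE, 0x41, (byte) 0xFF, (byte) 0xFF};  // invalid bytes around "A"

        // All three inputs collapse to the same string, so the bytes cannot be recovered.
        System.out.println(decodeIgnoringErrors(clean));  // prints A
        System.out.println(decodeIgnoringErrors(dirty1)); // prints A
        System.out.println(decodeIgnoringErrors(dirty2)); // prints A
    }
}
```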

This leaves 8-bit encodings, where every sequence is just a single byte. A single byte is valid if the encoding defines a mapping for it. But many 8-bit encodings have holes (byte values with no assigned character) and don't encode 256 different characters.
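
One rough way to look for such holes in Java is to decode each byte value on its own and check for the replacement character. The sketch below assumes that an unmapped byte shows up as U+FFFD, which is how `new String(byte[], Charset)` behaves by default; the class name `CharsetHolesDemo` and the helper `holes` are hypothetical. US-ASCII (a 7-bit charset) is used as the guaranteed-available example, and windows-1252 is only probed if the runtime supports it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class CharsetHolesDemo {
    // Hypothetical helper: returns the byte values (0..255) that the charset
    // cannot decode on their own, detected via the replacement character U+FFFD.
    static List<Integer> holes(Charset charset) {
        List<Integer> result = new ArrayList<>();
        for (int value = 0; value < 256; value++) {
            String decoded = new String(new byte[] {(byte) value}, charset);
            if (decoded.indexOf('\uFFFD') >= 0) {
                result.add(value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // ISO-8859-1 maps every byte value, so no holes are found.
        System.out.println("ISO-8859-1 holes: " + holes(StandardCharsets.ISO_8859_1).size()); // prints 0
        // US-ASCII only defines 0x00..0x7F, so the upper 128 values are holes.
        System.out.println("US-ASCII holes: " + holes(StandardCharsets.US_ASCII).size());     // prints 128
        // windows-1252 is not guaranteed to exist on every Java runtime.
        if (Charset.isSupported("windows-1252")) {
            System.out.println("windows-1252 holes: " + holes(Charset.forName("windows-1252")));
        }
    }
}
```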

To retain original binary data, you need an 8-bit encoding that encodes 256 different characters. ISO-8859-1 is not unique in this. What it is unique in is that the decoded code point's value is also the value of the byte it was decoded from.
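
To illustrate the difference, here is a hedged sketch using ISO-8859-15 as a candidate for "another 8-bit encoding with 256 characters". That choice is my assumption, not something the answer names, and the charset is not guaranteed to be available on every Java runtime, so the code checks `Charset.isSupported` first. The class name `Latin9RoundTripDemo` and its helpers are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Latin9RoundTripDemo {
    // Hypothetical helper: true if all 256 byte values survive decode + encode.
    static boolean roundTrips(Charset charset) {
        byte[] original = new byte[256];
        for (int i = 0; i < original.length; i++) {
            original[i] = (byte) i;
        }
        String decoded = new String(original, charset);
        return Arrays.equals(original, decoded.getBytes(charset));
    }

    // Hypothetical helper: the code point a single byte value decodes to.
    static int codePointOf(int byteValue, Charset charset) {
        return new String(new byte[] {(byte) byteValue}, charset).codePointAt(0);
    }

    public static void main(String[] args) {
        System.out.println("ISO-8859-1 round-trips: " + roundTrips(StandardCharsets.ISO_8859_1));
        System.out.printf("ISO-8859-1: 0xA4 -> U+%04X%n", codePointOf(0xA4, StandardCharsets.ISO_8859_1));

        if (!Charset.isSupported("ISO-8859-15")) {
            System.out.println("ISO-8859-15 is not available on this runtime");
            return;
        }
        Charset latin9 = Charset.forName("ISO-8859-15");
        System.out.println("ISO-8859-15 round-trips: " + roundTrips(latin9));
        // In ISO-8859-15 the byte 0xA4 is the euro sign, so code point and byte value differ.
        System.out.printf("ISO-8859-15: 0xA4 -> U+%04X%n", codePointOf(0xA4, latin9));
    }
}
```

With ISO-8859-1 the byte 0xA4 decodes to U+00A4, while with ISO-8859-15 it decodes to U+20AC, which is the property the answer singles out.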

So if you have the decoded string and the encoded bytes, it is always true that

(byte)str.charAt(i) == bytes[i]

for arbitrary binary data, where str is new String(bytes, "ISO-8859-1") and bytes is a byte[].
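
A small check of that identity over all 256 byte values might look like the following. The class name `Latin1IdentityDemo` and the helper `identityHolds` are made up for this sketch.

```java
import java.nio.charset.StandardCharsets;

class Latin1IdentityDemo {
    // Hypothetical helper: verifies (byte) str.charAt(i) == bytes[i] for every index,
    // and that the char value equals the unsigned byte value.
    static boolean identityHolds(byte[] bytes) {
        String str = new String(bytes, StandardCharsets.ISO_8859_1);
        if (str.length() != bytes.length) {
            return false;
        }
        for (int i = 0; i < bytes.length; i++) {
            if ((byte) str.charAt(i) != bytes[i]) {
                return false;
            }
            // Java bytes are signed, so mask to compare against the code point.
            if (str.charAt(i) != (bytes[i] & 0xFF)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        System.out.println(identityHolds(all)); // prints true
    }
}
```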

It also has nothing to do with Java. I have no idea what his comment means; these are properties of character encodings, not programming languages.
