Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original question on StackOverflow: http://stackoverflow.com/questions/6366912/

Reading File from Windows and Linux yields different results (character encoding?)

Tags: java, windows, linux, character-encoding, png

Asked by Maurice

Currently I'm trying to read a file in MIME format which contains the binary data of a PNG as a string.

In Windows, reading the file gives me the proper binary string, meaning I just copy the string over and change the extension to png and I see the picture.

An example after reading the file in Windows is below:

    --fh-mms-multipart-next-part-1308191573195-0-53229
     Content-Type: image/png;name=app_icon.png
     Content-ID: "<app_icon>"
     content-location: app_icon.png

    ‰PNG

etc...etc...

An example after reading the file in Linux is below:

    --fh-mms-multipart-next-part-1308191573195-0-53229
     Content-Type: image/png;name=app_icon.png
     Content-ID: "<app_icon>"
     content-location: app_icon.png

    ï¿½PNG

etc...etc...

I am not able to convert the Linux version into a picture, as it all becomes funky symbols with a lot of upside-down "?" (¿) and "1/2" (½) symbols.

Can anyone enlighten me on what is going on and maybe provide a solution? Been playing with the code for a week and more now.

Accepted answer by Vineet Reynolds

ï¿½ is a sequence of three characters, 0xEF 0xBF 0xBD, and is the UTF-8 representation of the Unicode code point 0xFFFD. The code point itself is the replacement character for illegal UTF-8 sequences.

Apparently, for some reason, the set of routines involved in your source code (on Linux) is handling the PNG header inaccurately. The PNG header starts with the byte 0x89 (followed by 0x50, 0x4E, 0x47), which is correctly handled on Windows (which might be treating the file as a sequence of CP1252 bytes). In CP1252, the 0x89 character is displayed as ‰.

On Linux, however, this byte is being decoded by a UTF-8 routine (or a library that thought it was good to process the file as a UTF-8 sequence). Since 0x89 on its own is not in the 7-bit ASCII range (ref: the UTF-8 encoding scheme), it cannot be mapped to a code point in the single-byte 0x00-0x7F range. Nor can it start a multi-byte UTF-8 sequence, for all multi-byte sequences start with a byte whose top two bits are set to 1 (11......), and since this is the start of the file, it cannot be a continuation byte either. As a result, the UTF-8 decoder replaces 0x89 with the replacement character, whose UTF-8 encoding is 0xEF 0xBF 0xBD (how silly, considering that the file is not UTF-8 to begin with), and these bytes are displayed in ISO-8859-1 as ï¿½.
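
The chain described above can be reproduced in a few lines. This is only a minimal sketch of the suspected failure mode, not the asker's actual code: the class and method names are made up for illustration, and it assumes the bytes were decoded as UTF-8, written back out as UTF-8, and then viewed as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

final class ReplacementCharDemo {
    /** Decodes raw bytes as UTF-8, re-encodes them, and views the result as ISO-8859-1. */
    static String mangle(byte[] raw) {
        // A lone 0x89 is malformed UTF-8, so the decoder substitutes U+FFFD.
        String decoded = new String(raw, StandardCharsets.UTF_8);
        // U+FFFD is encoded in UTF-8 as the three bytes EF BF BD.
        byte[] reencoded = decoded.getBytes(StandardCharsets.UTF_8);
        // Viewing those bytes one character per byte yields three characters.
        return new String(reencoded, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] pngStart = {(byte) 0x89, 'P', 'N', 'G'};
        System.out.println(mangle(pngStart));
    }
}
```

Once the replacement has happened, the original 0x89 byte is gone for good, which is why the damaged output cannot be turned back into a picture.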

If you need to resolve this problem, you'll need to ensure the following in Linux:

  • Read the bytes in the PNG file using a suitable encoding for the file (i.e. not UTF-8); this is apparently necessary if you are reading the file as a sequence of characters*, and not necessary if you are reading bytes alone. You might be doing this correctly already, so it would be worthwhile to verify the subsequent step(s) as well.
  • When you are viewing the contents of the file, use a suitable editor/viewer that does not decode the file internally as a sequence of UTF-8 bytes. Using a suitable font will also help, for you might want to avoid the unexpected scenario where the glyph (for 0xFFFD it is actually the diamond character �) cannot be represented, which might result in further changes (unlikely, but you never know how the editor/viewer has been written).
  • It is also a good idea to write the files out (if you are doing so) in a suitable encoding - ISO-8859-1 perhaps, instead of UTF-8. If you are processing and storing the file contents in memory as bytes instead of characters, then writing these to an output stream (without the involvement of any String or character references) is sufficient.

*Apparently, the Java Runtime will perform decoding of the byte sequence to UTF-16 codepoints, if you convert a sequence of bytes to a character or a String object.
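
As a rough illustration of the "bytes alone" route from the list above, the sketch below copies a file without ever creating a String, so no charset is involved on any platform. The class name and helper methods are hypothetical; only the java.nio.file calls are standard.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class RawFileCopy {
    // First four bytes of every PNG file: 0x89 'P' 'N' 'G'.
    private static final byte[] PNG_MAGIC = {(byte) 0x89, 'P', 'N', 'G'};

    /** Returns true if the data begins with the PNG magic bytes. */
    static boolean startsWithPngMagic(byte[] data) {
        if (data.length < PNG_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PNG_MAGIC.length; i++) {
            if (data[i] != PNG_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Copies a file byte for byte; no decoding or encoding takes place. */
    static void copy(Path source, Path target) throws IOException {
        byte[] data = Files.readAllBytes(source);
        Files.write(target, data);
    }
}
```

Checking the magic bytes after each processing step is a cheap way to find out where in a pipeline the data first gets decoded as text.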

Answer by ninjalj

In Java, String ≠ byte[].

  • byte[] represents raw binary data.
  • String represents text, which has an associated charset/encoding to be able to tell which characters it represents.

Binary Data ≠ Text.

Text data inside a String has Unicode/UTF-16 as its charset/encoding (or Unicode/mUTF-8 when serialized). Whenever you convert from something that is not a String to a String or vice versa, you need to specify a charset/encoding for the non-String text data (even if you do it implicitly, using the platform's default charset).
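
To make the role of the charset concrete, here is a small sketch (the class and method names are invented for this example) that passes the charset explicitly rather than relying on the platform default. It shows that the same text encodes to different byte sequences, and that arbitrary bytes do not necessarily survive a trip through a String.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class CharsetRoundTrip {
    /** Returns true if decoding as UTF-8 and encoding again reproduces the input bytes. */
    static boolean survivesUtf8(byte[] raw) {
        String text = new String(raw, StandardCharsets.UTF_8);
        return Arrays.equals(raw, text.getBytes(StandardCharsets.UTF_8));
    }

    /** Number of bytes the text occupies in the given charset. */
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        byte[] pngStart = {(byte) 0x89, 'P', 'N', 'G'};
        System.out.println(survivesUtf8(pngStart));
        System.out.println(encodedLength("\u00E9", StandardCharsets.UTF_8));
        System.out.println(encodedLength("\u00E9", StandardCharsets.ISO_8859_1));
    }
}
```

The no-argument forms new String(bytes) and getBytes() behave the same way but pick the platform default charset, which is one plausible reason the same program behaves differently on Windows and Linux.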

A PNG file contains raw binary data that represents an image (and associated metadata), not text. Therefore, you should not treat it as text.

\x89PNG is not text, it's just a "magic" header for identifying PNG files. 0x89 isn't even a character, it's just an arbitrary byte value, and its only sane representations for display are things like \x89, 0x89, ... Likewise, PNG there is in reality binary data; it could as well have been 0xdeadbeef and it would have changed nothing. The fact that PNG happens to be human-readable is just a convenience.
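
If you do need to look at such data, one option is to print it in escaped form instead of decoding it. The helper below is a hypothetical sketch: printable ASCII bytes are shown as-is and everything else as a \xNN escape.

```java
final class ByteEscaper {
    /** Renders bytes for display: printable ASCII as-is, everything else as \xNN. */
    static String escape(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            int value = b & 0xFF; // treat the byte as unsigned
            if (value >= 0x20 && value < 0x7F && value != '\\') {
                sb.append((char) value);
            } else {
                sb.append(String.format("\\x%02x", value));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] pngStart = {(byte) 0x89, 'P', 'N', 'G', '\r', '\n'};
        System.out.println(escape(pngStart));
    }
}
```

Because the output contains only ASCII, it looks the same in any terminal or editor regardless of the configured encoding.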

Your problem comes from the fact that your protocol mixes text and binary data, while Java (unlike some other languages, like C) treats binary data differently than text.

Java provides *InputStream for reading binary data, and *Reader for reading text. I see two ways to deal with input:

  • Treat everything as binary data. When you have read a whole text line, convert it into a String, using the appropriate charset/encoding.
  • Layer an InputStreamReader on top of an InputStream; access the InputStream directly when you want binary data, and the InputStreamReader when you want text.

You may want buffering; the correct place to put it in the second case is below the *Reader. If you used a BufferedReader, the BufferedReader would probably consume more input from the InputStream than it should. So, you would have something like:

 ┌───────────────────┐
 │ InputStreamReader │
 └───────────────────┘
          ↓
┌─────────────────────┐
│ BufferedInputStream │
└─────────────────────┘
          ↓
   ┌─────────────┐
   │ InputStream │
   └─────────────┘

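You would use the InputStreamReader to read text, then you would use the BufferedInputStream to read an appropriate amount of binary data from the same stream. Note that the InputStreamReader documentation warns that it may itself read ahead more bytes than the current read needs, so this layering requires care.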

A problematic case is recognizing both "\r" (old MacOS) and "\r\n" (DOS/Windows) as line terminators. In that case, you may end up reading one character too many. You could take the approach that the deprecated DataInputStream.readLine() method took: transparently wrap the internal InputStream in a PushbackInputStream and unread that character.
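
The push-back idea can be sketched as follows. This is an illustrative helper, not the DataInputStream implementation: the class name is invented, and ISO-8859-1 is chosen here only because it maps every byte to exactly one character, so header lines decode without loss.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

final class ByteLineReader {
    /**
     * Reads one line from the stream as bytes and decodes it afterwards.
     * Accepts "\n", "\r" and "\r\n" as terminators; returns null at end of stream.
     */
    static String readLine(PushbackInputStream in) throws IOException {
        int b = in.read();
        if (b == -1) {
            return null;
        }
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        while (b != -1 && b != '\n') {
            if (b == '\r') {
                int next = in.read();
                // A lone "\r" ended the line: give the extra byte back to the stream.
                if (next != '\n' && next != -1) {
                    in.unread(next);
                }
                break;
            }
            line.write(b);
            b = in.read();
        }
        return new String(line.toByteArray(), StandardCharsets.ISO_8859_1);
    }
}
```

After the header lines have been consumed this way, the same PushbackInputStream is positioned exactly at the first byte of the binary body, because nothing has been read ahead beyond the single byte that was pushed back.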

However, since you don't appear to have a Content-Length, I would recommend the first way: treat everything as binary, and convert to String only after reading a whole line. In this case, I would treat the MIME delimiter as binary data.
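
A minimal sketch of that first approach might look like the following. It makes several assumptions that are not stated in the question: the whole message fits in memory, headers end with an empty line ("\r\n\r\n"), and the body ends at "\r\n--" followed by the boundary string. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class PartBodyExtractor {
    /** Returns the index of the first occurrence of needle at or after from, or -1. */
    static int indexOf(byte[] haystack, byte[] needle, int from) {
        outer:
        for (int i = from; i <= haystack.length - needle.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    /** Extracts the raw body of the first part, comparing the delimiter as bytes. */
    static byte[] extractBody(byte[] message, String boundary) {
        byte[] headerEnd = "\r\n\r\n".getBytes(StandardCharsets.US_ASCII);
        byte[] delimiter = ("\r\n--" + boundary).getBytes(StandardCharsets.US_ASCII);
        int headersStop = indexOf(message, headerEnd, 0);
        if (headersStop < 0) {
            throw new IllegalArgumentException("no empty line after the headers");
        }
        int bodyStart = headersStop + headerEnd.length;
        int bodyEnd = indexOf(message, delimiter, bodyStart);
        if (bodyEnd < 0) {
            bodyEnd = message.length; // assumption: a missing delimiter means the body runs to the end
        }
        return Arrays.copyOfRange(message, bodyStart, bodyEnd);
    }
}
```

The returned array can be written straight to a .png file; at no point is the image data converted to a String.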

Output:

Since you are dealing with binary data, you cannot just println() it. PrintStream has write() methods that can deal with binary data (e.g. for outputting to a binary file).
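
The difference can be seen by sending the same bytes through both paths. In this sketch (names invented, and the PrintStream is deliberately configured as UTF-8 to mimic a typical Linux default), print() goes through the character encoder while write() passes the bytes through untouched.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

final class RawOutputDemo {
    /** Sends the data through print(), i.e. through a String and the stream's encoder. */
    static byte[] viaPrint(byte[] data) throws UnsupportedEncodingException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(sink, true, "UTF-8");
        out.print(new String(data, StandardCharsets.ISO_8859_1));
        out.flush();
        return sink.toByteArray();
    }

    /** Sends the data through write(), which bypasses character encoding entirely. */
    static byte[] viaWrite(byte[] data) throws UnsupportedEncodingException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(sink, true, "UTF-8");
        out.write(data, 0, data.length);
        out.flush();
        return sink.toByteArray();
    }
}
```

With the four bytes 0x89 'P' 'N' 'G' as input, viaWrite returns them unchanged, whereas viaPrint returns five bytes because the character U+0089 is encoded as 0xC2 0x89.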

Or maybe your data has to be transported on a channel that treats it as text. Base64 is designed for that exact situation (transporting binary data as ASCII text). The Base64 encoded form uses only US-ASCII characters, so you should be able to use it with any charset/encoding that is a superset of US-ASCII (ISO-8859-*, UTF-8, CP-1252, ...). Since you are converting binary data to/from text, the only sane API for Base64 would be something like:

    String Base64Encode(byte[] data);
    byte[] Base64Decode(String encodedData);

which is basically what the internal java.util.prefs.Base64 uses.
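
On Java 8 and later, the public java.util.Base64 class offers the same shape of API, so the internal class is not needed. The wrapper below is only a sketch; the class and method names are chosen for this example and are not part of the original answer.

```java
import java.util.Base64;

final class Base64Text {
    /** Binary data in, ASCII-only text out. */
    static String base64Encode(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    /** ASCII text in, the original binary data out. */
    static byte[] base64Decode(String encodedData) {
        return Base64.getDecoder().decode(encodedData);
    }

    public static void main(String[] args) {
        byte[] pngStart = {(byte) 0x89, 'P', 'N', 'G'};
        String text = base64Encode(pngStart);
        System.out.println(text);
        System.out.println(base64Decode(text).length);
    }
}
```

The encoded form can safely pass through println(), a Reader, or any text channel, and decoding it on the other side restores the exact bytes.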

Conclusion:

In Java, String ≠ byte[].

Binary Data ≠ Text.
