Java charset and Windows

Notice: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/457655/

Time: 2020-09-15 11:50:59  Source: igfitidea

java windows

Asked by Mike

I have a Java program that runs msinfo32.exe (System Information) in an external process and then reads the file that msinfo32.exe produces. When the program loads the file content into a String, the characters are unreadable. To get a readable String I have to create it with String(byte[] bytes, String charsetName) and set charsetName to UTF-16. However, on one instance of Windows 2003, only UTF-16LE (little-endian) produces a printable string.

How can I know ahead of time which character encoding to use?

Also, any background information on this topic would be appreciated.

Answered by McDowell

Some Microsoft applications use a byte-order mark to indicate Unicode files and their endianness. On my Windows XP machine, the exported .NFO file starts with the bytes FF FE (the byte-order mark U+FEFF stored low byte first), so it is little-endian.

FF FE 3C 00 3F 00 78 00 6D 00 6C 00 20 00 76 00         __<_?_x_m_l_ _v_
65 00 72 00 73 00 69 00 6F 00 6E 00 3D 00 22 00         e_r_s_i_o_n_=_"_
31 00 2E 00 30 00 22 00 3F 00 3E 00 0D 00 0A 00         1_._0_"_?_>_____
3C 00 4D 00 73 00 49 00 6E 00 66 00 6F 00 3E 00         <_M_s_I_n_f_o_>_
0D 00 0A 00 3C 00 4D 00 65 00 74 00 61 00 64 00         ____<_M_e_t_a_d_
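
The check described above can be sketched in a few lines of Java. This is only an illustration: the class name BomSniffer and the method detectUtf16Bom are made up for this example, and it looks only at the two UTF-16 byte-order marks (a UTF-32LE file also begins with FF FE, which this sketch does not try to distinguish).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original answer.
class BomSniffer {
    /**
     * Returns the charset implied by a UTF-16 byte-order mark at the start
     * of the data, or null when the data does not start with one.
     */
    static Charset detectUtf16Bom(byte[] head) {
        if (head.length < 2) {
            return null;
        }
        int b0 = head[0] & 0xFF; // mask to avoid sign extension
        int b1 = head[1] & 0xFF;
        if (b0 == 0xFF && b1 == 0xFE) {
            return StandardCharsets.UTF_16LE; // FF FE: little-endian
        }
        if (b0 == 0xFE && b1 == 0xFF) {
            return StandardCharsets.UTF_16BE; // FE FF: big-endian
        }
        return null; // no UTF-16 BOM found
    }

    public static void main(String[] args) {
        // First bytes of the .NFO dump shown above: FF FE 3C 00 3F 00
        byte[] nfoHead = {(byte) 0xFF, (byte) 0xFE, 0x3C, 0x00, 0x3F, 0x00};
        System.out.println(detectUtf16Bom(nfoHead)); // prints UTF-16LE
    }
}
```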

Also, I recommend you switch to using Reader implementations rather than the String constructor for decoding files; this helps avoid problems where you read half a character because it was truncated at the end of a byte array.
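
To illustrate the half-character problem, here is a minimal sketch that decodes the same bytes twice: once in two chunks with the String constructor, and once through an InputStreamReader. The class name ReaderDecoding, the method readAll, and the sample text are assumptions for this example; in a real program the stream would come from the file msinfo32.exe wrote.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original answer.
class ReaderDecoding {
    /** Decodes the whole stream through a Reader, which buffers partial characters. */
    static String readAll(InputStream in, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        byte[] data = "<MsInfo>".getBytes(StandardCharsets.UTF_16LE);

        // Chunked decoding: the first chunk ends in the middle of a 2-byte code unit,
        // so both chunks are decoded out of alignment.
        String chunked = new String(data, 0, 3, StandardCharsets.UTF_16LE)
                + new String(data, 3, data.length - 3, StandardCharsets.UTF_16LE);

        String viaReader = readAll(new ByteArrayInputStream(data), StandardCharsets.UTF_16LE);

        System.out.println(chunked.equals("<MsInfo>"));   // false
        System.out.println(viaReader.equals("<MsInfo>")); // true
    }
}
```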

Answered by Fabian Steeg

You could try using a library to guess the encoding; for instance, I once used this solution.

Answered by Bombe

You can't really know which character encoding was used (unless you created the tool that produced the output you're processing). You can try decoding with each of a list of predefined encodings and pick one that does not result in any decoding errors, but depending on the input, a lot of different encodings might match.
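
A rough sketch of that trial-decoding idea, using a strict CharsetDecoder so that malformed input raises an error instead of being silently replaced. The class name CandidateCharsets, its methods, and the candidate list are assumptions for this example. It also shows the weakness mentioned above: for the sample bytes, all four candidates decode without error.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the original answer.
class CandidateCharsets {
    /** Returns true when the bytes decode in the given charset without any error. */
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Keeps only the candidates that decode the data without errors. */
    static List<Charset> plausible(byte[] data, List<Charset> candidates) {
        List<Charset> result = new ArrayList<>();
        for (Charset candidate : candidates) {
            if (decodesCleanly(data, candidate)) {
                result.add(candidate);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] data = "<MsInfo>".getBytes(StandardCharsets.UTF_16LE);
        List<Charset> candidates = Arrays.asList(
                StandardCharsets.UTF_8,
                StandardCharsets.UTF_16LE,
                StandardCharsets.UTF_16BE,
                StandardCharsets.ISO_8859_1);
        // Every candidate survives here, so a clean decode alone proves little.
        System.out.println(plausible(data, candidates));
    }
}
```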

Answered by kgiannakakis

If you don't know the character encoding beforehand and it differs between platforms, you need to analyze the byte array somehow to guess it. Some detection algorithms are available, but they may be overkill for your application.

Can you tweak your application to produce a known output? It doesn't need to be a full line; the first few characters will do. If so, you could compare the produced byte array with the expected bytes in various encodings to detect which one was used. The byte arrays for UTF-8, UTF-16 big-endian, and UTF-16 little-endian differ even for simple strings.
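
One way to sketch this comparison: encode the known prefix in each candidate encoding and check which byte sequence the output starts with. The class name KnownPrefixDetector, its methods, and the choice of "<?xml" as the known prefix are assumptions for this example. It also assumes any byte-order mark has already been skipped, since getBytes with UTF_16LE or UTF_16BE does not write one.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the original answer.
class KnownPrefixDetector {
    /** Returns the first candidate whose encoding of the prefix matches the data, or null. */
    static Charset detect(byte[] head, String expectedPrefix, List<Charset> candidates) {
        for (Charset candidate : candidates) {
            if (startsWith(head, expectedPrefix.getBytes(candidate))) {
                return candidate;
            }
        }
        return null;
    }

    /** Byte-wise prefix comparison. */
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // "<?xml" as little-endian UTF-16, without a byte-order mark
        byte[] head = {0x3C, 0x00, 0x3F, 0x00, 0x78, 0x00, 0x6D, 0x00, 0x6C, 0x00};
        List<Charset> candidates = Arrays.asList(
                StandardCharsets.UTF_8,
                StandardCharsets.UTF_16LE,
                StandardCharsets.UTF_16BE);
        System.out.println(detect(head, "<?xml", candidates)); // prints UTF-16LE
    }
}
```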

Answered by Alan Moore

The way it's supposed to work is, if someone gives you a file and says it's UTF-16, they expect you to examine the first two bytes (the BOM) to find out whether it's big-endian or little-endian. But if they tell you the encoding is UTF-16LE, it means there's no BOM; you don't need it because they've already told you the byte order is little-endian. Java follows these rules precisely, which is a real pisser because nobody else does.
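
A small sketch of how Java applies these rules, decoding the same BOM-prefixed bytes with both charsets. The class name Utf16BomBehaviour and the method decode are made up for this example. As far as I know, Java's UTF-16 decoder also falls back to big-endian when no BOM is present, which would explain why a little-endian file without a BOM is readable only as UTF-16LE.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original answer.
class Utf16BomBehaviour {
    static String decode(byte[] data, Charset charset) {
        return new String(data, charset);
    }

    public static void main(String[] args) {
        // FF FE (little-endian BOM) followed by '<' as 3C 00
        byte[] data = {(byte) 0xFF, (byte) 0xFE, 0x3C, 0x00};

        // "UTF-16" reads the BOM, picks the byte order from it, and drops it.
        String viaUtf16 = decode(data, StandardCharsets.UTF_16);

        // "UTF-16LE" expects no BOM, so the BOM is decoded as the character U+FEFF.
        String viaUtf16le = decode(data, StandardCharsets.UTF_16LE);

        System.out.println(viaUtf16.length());                // 1
        System.out.println(viaUtf16le.length());              // 2
        System.out.println(viaUtf16le.charAt(0) == '\uFEFF'); // true
    }
}
```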

The native character encoding of modern Windows operating systems is UTF-16, little-endian. Unfortunately, individual programs don't seem to be consistent when it comes to byte-order marks. And you can't just use UTF-16LE all the time because, if the BOM is there, it will be passed through as a junk character. The only way to know ahead of time whether to use UTF-16 or UTF-16LE is to examine the first two bytes, as McDowell described.
