Source: http://stackoverflow.com/questions/4927575/
This page reproduces a Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
Java compiler platform file encoding problem
Asked by Richard Brewster
Recently I encountered a file character encoding issue that I cannot remember ever having faced before. It's quite common to have to be aware of the character encoding of text files and to write code that handles encoding correctly when run on different platforms. But the problem I found was caused by compilation on a different platform from the execution platform. That was entirely unexpected, because in my experience, when javac creates a class file, the important parameters are the java source and target params and the version of the JDK doing the compile. In my case, classes compiled with JDK 1.6.0_22 on Mac OS X behaved differently than classes compiled with 1.6.0_23-b05 on Linux, when run on Mac OS X. The specified source and target were 1.4.
A String that was encoded as ISO-8859-1 in memory was written to disk using a PrintStream println method. Depending on which platform the Java code was COMPILED on, the string was written differently. This led to a bug. The fix for the bug was to specify the file encoding explicitly when writing and reading the file.
What surprised me was that the behavior differed depending on where the classes were compiled, not on which platform the class was run. I'm quite familiar with Java code behaving differently when run on different platforms. But it is a bit scary when the same code, compiled on different platforms, runs differently on the same platform.
Has anyone encountered this specific problem? It would seem to bode ill for any Java code that reads and writes strings to file without explicitly specifying the character encoding. And how often is that done?
Answered by Paŭlo Ebermann
There is no such thing as a String that was encoded as ISO-8859-1 in memory. Java Strings in memory are always Unicode strings. (They are encoded in UTF-16 as of 2011; I think this changed with later Java versions, but you don't really need to know this.)
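To make this concrete, here is a minimal sketch. The class and method names are made up for illustration, and it assumes Java 7 or later for StandardCharsets, which is newer than the 1.4 target mentioned in the question. The same String produces different byte sequences depending on the charset chosen when it is encoded, which suggests that the encoding is a property of the bytes, not of the String.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not taken from the original answer.
final class StringHasNoEncoding {

    // Number of bytes produced when the String is encoded as ISO-8859-1.
    static int latin1Length(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    // Number of bytes produced when the same String is encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // The e-acute is written as an escape so this source file stays pure ASCII.
        String s = "caf\u00e9";
        System.out.println(s.length());      // 4 chars in memory
        System.out.println(latin1Length(s)); // 4 bytes
        System.out.println(utf8Length(s));   // 5 bytes
    }
}
```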
The encoding only comes into play when you input or output the string. Then, if no explicit encoding is given, the system default is used (which on some systems depends on user settings).
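The fix described in the question, naming the encoding explicitly on both the write and the read, might look roughly like the sketch below. The class name, the method names, and the choice of UTF-8 are assumptions made for illustration; the question does not say which encoding was actually used. The file helpers assume Java 7 or later.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class; UTF-8 is an assumed choice of encoding.
final class ExplicitEncodingRoundTrip {

    // Writes one line using an explicitly named encoding instead of the platform default.
    static void writeLine(Path file, String text) throws IOException {
        try (PrintStream out = new PrintStream(Files.newOutputStream(file), false, "UTF-8")) {
            out.println(text);
        }
    }

    // Reads the first line back, decoding with the same explicitly named encoding.
    static String readLine(Path file) throws IOException {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(Files.newInputStream(file), StandardCharsets.UTF_8))) {
            return in.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("encoding-demo", ".txt");
        String text = "caf\u00e9";
        writeLine(file, text);
        System.out.println(text.equals(readLine(file))); // true
        Files.delete(file);
    }
}
```

Because neither side relies on the default charset, the bytes on disk should not depend on the platform or user settings of the machine that runs the code.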
As McDowell said, the encoding your compiler assumes for your source file must match the file's actual encoding; otherwise you get problems like the ones you observed. You can achieve this by several means:
- Use the -encoding option of the compiler, giving the encoding of your source file. (With ant, you set the encoding= parameter.)
- Use your editor or any other tool (like recode) to change the encoding of your file to the compiler default.
- Use native2ascii (with the right -encoding option) to translate your source file to ASCII with \uXXXX escapes.
In the last case, the file can later be compiled anywhere, under any default encoding, so this may be the way to go if you hand the source code to people who are unaware of encodings and will compile it somewhere else.
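As a rough illustration of what the escaping step does, here is a sketch that rewrites every non-ASCII char as a \uXXXX escape. It is not the native2ascii implementation; the class and method names are hypothetical, and it works on a String that has already been decoded correctly rather than on a file.

```java
// Hypothetical helper; a sketch of the idea, not the native2ascii tool itself.
final class UnicodeEscaper {

    // Replaces every char outside 7-bit ASCII with a backslash, the letter u, and four hex digits.
    static String escapeNonAscii(String source) {
        StringBuilder out = new StringBuilder(source.length());
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c < 128) {
                out.append(c);
            } else {
                // Works per UTF-16 code unit, so a surrogate pair becomes two escapes.
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The Greek sigma is written as an escape so this file itself stays ASCII.
        System.out.println(escapeNonAscii("double \u03c3 = 1.0;"));
    }
}
```

The result contains only ASCII characters, so it should decode identically under any ASCII-compatible default encoding.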
If you have a bigger project consisting of more than one file, they should all have the same encoding, since the compiler has only one such switch, not several.
In all my projects of the last few years, I have encoded all my files in UTF-8 and set the encoding="utf-8" parameter on the javac task in my ant build file. (My editor is smart enough to recognize the encoding automatically, but I set the default to UTF-8.)
The encoding also matters to other source-code handling tools, such as javadoc. (There you should additionally set the -charset and -docencoding options for the output; they should match each other, but can differ from the source -encoding.)
Answered by McDowell
I'd hazard a guess that there is a transcoding issue during the compilation stage and that the compiler lacks direction as to the encoding of the source file (e.g. see the javac -encoding switch).
If you don't specify an encoding, compilers generally use the system default, which can lead to string and char literals being corrupted (internally, Java bytecode uses a modified UTF-8 form, so the binaries themselves are portable). This is the only way I can imagine problems being introduced at compile time.
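The kind of corruption described here can be simulated without a compiler. The sketch below is an illustration only: it does not call javac, and the class and method names are hypothetical. It encodes a literal the way an editor saving UTF-8 would, then decodes those bytes under a different assumed charset, which is roughly what happens when the compiler's assumption does not match the file.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical simulation; it mimics a charset mismatch and does not invoke javac.
final class WrongSourceEncodingDemo {

    // Encodes the text as the file was really saved, then decodes it as the reader assumes.
    static String decodeAs(String literal, Charset actual, Charset assumed) {
        byte[] bytesOnDisk = literal.getBytes(actual);
        return new String(bytesOnDisk, assumed);
    }

    public static void main(String[] args) {
        String literal = "caf\u00e9";
        String seen = decodeAs(literal, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(seen.length());        // 5, not 4
        System.out.println(seen.equals(literal)); // false
    }
}
```

The two UTF-8 bytes of the accented letter are read as two separate ISO-8859-1 characters, so the literal that would end up in the class file differs from what the author typed.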
I've written a bit about this here.
Answered by KitsuneYMG
I've had similar issues when using variable names that aren't ASCII (Σ, σ, Δ, etc.) in math formulas. On Linux, the compiler used UTF-8 encoding when interpreting the source. On Windows, it complained about invalid names, because Windows uses ISO-LATIN-1. The solution was to specify the encoding in the ant script I used to compile these files.
Answered by jtahlborn
Always use escape codes (e.g. \uxxxx) in your source files and this will not be a problem. @Paulo mentioned this, but I wanted to call it out explicitly.