Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/30727515/

Date: 2020-08-11 10:04:38. Source: igfitidea.

Why is executing Java code in comments with certain Unicode characters allowed?

Tags: java, unicode, comments

Asked by Reg

The following code produces the output "Hello World!" (no really, try it).

public static void main(String... args) {

   // The comment below is not a typo.
   // \u000d System.out.println("Hello World!");
}

The reason is that the Java compiler treats the Unicode escape \u000d as a line terminator, so the code is transformed into:

public static void main(String... args) {

   // The comment below is not a typo.
   //
   System.out.println("Hello World!");
}

This results in a comment being "executed".

Since this can be used to "hide" malicious code or whatever an evil programmer can conceive, why is it allowed in comments?

Why is this allowed by the Java specification?

Accepted answer by aioobe

Unicode decoding takes place before any other lexical translation. The key benefit of this is that it makes it trivial to go back and forth between ASCII and any other encoding. You don't even need to figure out where comments begin and end!

As stated in JLS Section 3.3, this allows any ASCII-based tool to process the source files:

[...] The Java programming language specifies a standard way of transforming a program written in Unicode into ASCII that changes a program into a form that can be processed by ASCII-based tools. [...]

This gives a fundamental guarantee for platform independence (independence of supported character sets) which has always been a key goal for the Java platform.

Being able to write any Unicode character anywhere in the file is a neat feature, and especially important in comments when documenting code in non-Latin languages. The fact that it can interfere with the semantics in such subtle ways is just an (unfortunate) side effect.

There are many gotchas on this theme, and Java Puzzlers by Joshua Bloch and Neal Gafter included the following variant:

Is this a legal Java program? If so, what does it print?

\u0070\u0075\u0062\u006c\u0069\u0063\u0020\u0020\u0020\u0020
\u0063\u006c\u0061\u0073\u0073\u0020\u0055\u0067\u006c\u0079
\u007b\u0070\u0075\u0062\u006c\u0069\u0063\u0020\u0020\u0020
\u0020\u0020\u0020\u0020\u0073\u0074\u0061\u0074\u0069\u0063
\u0076\u006f\u0069\u0064\u0020\u006d\u0061\u0069\u006e\u0028
\u0053\u0074\u0072\u0069\u006e\u0067\u005b\u005d\u0020\u0020
\u0020\u0020\u0020\u0020\u0061\u0072\u0067\u0073\u0029\u007b
\u0053\u0079\u0073\u0074\u0065\u006d\u002e\u006f\u0075\u0074
\u002e\u0070\u0072\u0069\u006e\u0074\u006c\u006e\u0028\u0020
\u0022\u0048\u0065\u006c\u006c\u006f\u0020\u0077\u0022\u002b
\u0022\u006f\u0072\u006c\u0064\u0022\u0029\u003b\u007d\u007d

(This program turns out to be a plain "Hello World" program.)

In the solution to the puzzler, they point out the following:

More seriously, this puzzle serves to reinforce the lessons of the previous three: Unicode escapes are essential when you need to insert characters that can't be represented in any other way into your program. Avoid them in all other cases.

Source: Java: Executing code in comments?!

Answer by zwol

The \u000d escape terminates a comment because \u escapes are uniformly converted to the corresponding Unicode characters before the program is tokenized. You could equally use \u002f\u002f instead of // to begin a comment.
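
To make this concrete, here is a minimal sketch (the class name EscapedCommentStart is made up for illustration). U+002F is the slash character, so two escaped slashes start a line comment just as // would:

```java
class EscapedCommentStart {
    static int value() {
        int x = 1;
        \u002f\u002f x = 2; the escaped slashes turn this whole line into a comment
        return x;
    }

    public static void main(String[] args) {
        System.out.println(value());
    }
}
```

Because the assignment on the escaped line is commented out, value() should return 1.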

This is a bug in your IDE, which should syntax-highlight the line to make it clear that the \u000d ends the comment.

This is also a design error in the language. It can't be corrected now, because that would break programs that depend on it. \u escapes should either have been converted to the corresponding Unicode character by the compiler only in contexts where that "makes sense" (string literals and identifiers, and probably nowhere else), or they should have been forbidden from generating characters in the U+0000–007F range, or both. Either of those semantics would have prevented the comment from being terminated by the \u000d escape, without interfering with the cases where \u escapes are useful. Note that this includes the use of \u escapes inside comments as a way to encode comments in a non-Latin script, because the text editor could take a broader view of where \u escapes are significant than the compiler does. (I am not aware of any editor or IDE that will display \u escapes as the corresponding characters in any context, though.)

There is a similar design error in the C family,¹ where backslash-newline is processed before comment boundaries are determined, so e.g.

// this is a comment \
   this is still in the comment!

I bring this up to illustrate that this particular design error is easy to make, and easy to miss until it is too late to correct, if you are used to thinking about tokenization and parsing the way compiler programmers do. Basically, if you have already defined your formal grammar and then someone comes up with a syntactic special case that needs to be wedged in (trigraphs, backslash-newline, encoding arbitrary Unicode characters in source files limited to ASCII, whatever), it's easier to add a transformation pass before the tokenizer than it is to redefine the tokenizer to pay attention to where it makes sense to use that special case.

¹ For pedants: I am aware that this aspect of C was 100% intentional, with the rationale (I am not making this up) that it would allow you to mechanically force-fit code with arbitrarily long lines onto punched cards. It was still an incorrect design decision.

Answer by ZhongYu

I agree with @zwol that this is a design mistake; but I'm even more critical of it.

The \u escape is useful in string and char literals, and that's the only place where it should exist. It should be handled the same way as other escapes like \n, and "\u000A" should mean exactly "\n".

There is absolutely no point in having \uxxxx in comments; nobody can read that.

Similarly, there's no point in using \uxxxx in other parts of the program. The only exception is probably in public APIs that are coerced to contain some non-ASCII chars; when was the last time we saw that?

The designers had their reasons in 1995, but 20 years later, this appears to be a wrong choice.

(question to readers - why does this question keep getting new votes? is this question linked from somewhere popular?)

Answer by Holger

Since this hasn't been addressed yet, here is an explanation of why the translation of Unicode escapes happens before any other source code processing:

The idea behind it was to allow lossless translation of Java source code between different character encodings. Today, there is widespread Unicode support, and this doesn't look like a problem, but back then it wasn't easy for a developer from a western country to receive some source code from his Asian colleague containing Asian characters, make some changes (including compiling and testing it), and send the result back, all without damaging anything.

So, Java source code can be written in any encoding and allows a wide range of characters within identifiers, character and String literals, and comments. Then, in order to transfer it losslessly, all characters not supported by the target encoding are replaced by their Unicode escapes.

This is a reversible process, and the interesting point is that the translation can be done by a tool which doesn't need to know anything about Java source code syntax, as the translation rule does not depend on it. This works because the translation back to the actual Unicode characters inside the compiler also happens independently of Java source code syntax. It implies that you can perform an arbitrary number of translation steps in both directions without ever changing the meaning of the source code.

This is the reason for another weird feature which hasn't even been mentioned, the \uuuuuuxxxx syntax:

When a translation tool is escaping characters and encounters a sequence that is already an escape sequence, it should insert an additional u into the sequence, converting \ucafe to \uucafe. The meaning doesn't change, but when converting in the other direction, the tool should just remove one u and replace only the sequences containing a single u with their Unicode characters. That way, even Unicode escapes are retained in their original form when converting back and forth. I guess no one ever used that feature…
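
The round trip described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code of any real tool: the class and method names (AsciiRoundTrip, toAscii, fromAscii, startsEscape) are made up, the input is assumed to contain only well-formed escapes, and a backslash is treated as starting an escape only when an even number of backslashes precedes it, which is how the JLS defines eligibility.

```java
class AsciiRoundTrip {

    // True if the backslash at index i starts a Unicode escape: it must be
    // preceded by an even number of backslashes and followed by 'u'.
    static boolean startsEscape(String s, int i, int precedingBackslashes) {
        return s.charAt(i) == '\\'
                && precedingBackslashes % 2 == 0
                && i + 1 < s.length()
                && s.charAt(i + 1) == 'u';
    }

    // Unicode -> ASCII: existing escapes gain one 'u'; each non-ASCII char
    // becomes an escape with a single 'u'.
    static String toAscii(String src) {
        StringBuilder out = new StringBuilder();
        int backslashes = 0; // length of the backslash run directly before index i
        for (int i = 0; i < src.length(); i++) {
            char c = src.charAt(i);
            if (startsEscape(src, i, backslashes)) {
                out.append("\\u"); // backslash plus the extra 'u'; the rest is copied as-is
                backslashes = 0;
            } else if (c > 0x7f) {
                out.append(String.format("\\u%04x", (int) c));
                backslashes = 0;
            } else {
                out.append(c);
                backslashes = (c == '\\') ? backslashes + 1 : 0;
            }
        }
        return out.toString();
    }

    // ASCII -> Unicode: escapes with several 'u' lose one of them; escapes
    // with a single 'u' are decoded. Assumes four hex digits follow the 'u's.
    static String fromAscii(String src) {
        StringBuilder out = new StringBuilder();
        int backslashes = 0;
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (startsEscape(src, i, backslashes)) {
                int j = i + 1;
                while (j < src.length() && src.charAt(j) == 'u') {
                    j++; // j ends up at the first hex digit
                }
                if (j - i > 2) {
                    out.append('\\').append(src, i + 2, j); // drop one 'u'
                    i = j; // the hex digits are copied verbatim by later iterations
                } else {
                    out.append((char) Integer.parseInt(src.substring(j, j + 4), 16));
                    i = j + 4;
                }
                backslashes = 0;
            } else {
                out.append(c);
                backslashes = (c == '\\') ? backslashes + 1 : 0;
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The first escape is resolved by the compiler (an e with an acute accent);
        // the second one, written with a doubled backslash, stays as literal text.
        String original = "// caf\u00e9, \\ucafe";
        String ascii = toAscii(original);
        System.out.println(ascii);
        System.out.println(fromAscii(ascii).equals(original));
    }
}
```

Neither direction needs to know where comments or string literals begin and end, which is the point made above: the tool only looks at backslashes and the letter u.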

Answer by Jonathan Gibbons

This was an intentional design choice that goes all the way back to the original design of Java.

To those folks who ask "who wants Unicode escapes in comments?", I presume they are folks whose native language uses the Latin character set. In other words, it is inherent in the original design of Java that folks could use arbitrary Unicode characters wherever legal in a Java program, most typically in comments and strings.

It is arguably a shortcoming in programs (like IDEs) used to view the source text that such programs cannot interpret the Unicode escapes and display the corresponding glyph.

Answer by Pepijn Schmitz

I'm going to completely ineffectually add the point, just because I can't help myself and I haven't seen it made yet, that the question is invalid since it contains a hidden premise which is wrong, namely that the code is in a comment!

In Java source code, \u000d is equivalent in every way to an ASCII CR character. It is a line ending, plain and simple, wherever it occurs. The formatting in the question is misleading; what that sequence of characters actually corresponds to syntactically is:

public static void main(String... args) {
   // The comment below is no typo. 
   // 
 System.out.println("Hello World!");
}

IMHO the most correct answer is therefore: the code executes because it isn't in a comment; it's on the next line. "Executing code in comments" is not allowed in Java, just like you would expect.

Much of the confusion stems from the fact that syntax highlighters and IDEs aren't sophisticated enough to take this situation into account. They either don't process the Unicode escapes at all, or they do it after parsing the code instead of before, as javac does.

Answer by Martijn

The only people who can answer why Unicode escapes were implemented as they were are the people who wrote the specification.

A plausible reason for this is that there was the desire to allow the entire BMP as possible characters of Java source code. This presents a problem though:

  • You want to be able to use any BMP character.
  • You want to be able to input any BMP character reasonably easily. A way to do this is with Unicode escapes.
  • You want to keep the lexical specification easy for humans to read and write, and reasonably easy to implement as well.

This is incredibly difficult when Unicode escapes enter the fray: it creates a whole load of new lexer rules.

The easy way out is to do lexing in two steps: first search for all Unicode escapes and replace each with the character it represents, and then parse the resulting document as if Unicode escapes didn't exist.
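
A toy version of those two steps might look like the sketch below. Everything in it is an assumption made for illustration: TwoStepLexing, decodeEscapes and stripLineComments are invented names, the "lexer" only understands line comments, and the input is assumed to contain well-formed escapes. It is not how javac is actually implemented.

```java
class TwoStepLexing {

    // Step 1: replace each Unicode escape (backslash, one or more 'u',
    // four hex digits) with the character it denotes. A backslash only
    // starts an escape if an even number of backslashes precedes it.
    static String decodeEscapes(String src) {
        StringBuilder out = new StringBuilder();
        int backslashes = 0; // length of the backslash run directly before index i
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            boolean escape = c == '\\' && backslashes % 2 == 0
                    && i + 1 < src.length() && src.charAt(i + 1) == 'u';
            if (escape) {
                int j = i + 1;
                while (j < src.length() && src.charAt(j) == 'u') {
                    j++; // skip all the 'u' characters
                }
                out.append((char) Integer.parseInt(src.substring(j, j + 4), 16));
                i = j + 4;
                backslashes = 0;
            } else {
                out.append(c);
                backslashes = (c == '\\') ? backslashes + 1 : 0;
                i++;
            }
        }
        return out.toString();
    }

    // Step 2, a toy stand-in for the real lexer: drop line comments, which
    // run from two slashes up to the next CR or LF. It knows nothing about
    // escapes (and, being a toy, nothing about string literals either).
    static String stripLineComments(String src) {
        return src.replaceAll("//[^\\r\\n]*", "");
    }

    public static void main(String[] args) {
        String line = "// \\u000d System.out.println(\"Hello World!\");";
        // Without step 1, the whole line is a comment.
        System.out.println("[" + stripLineComments(line) + "]");
        // With step 1 first, the comment ends at the decoded CR.
        System.out.println("[" + stripLineComments(decodeEscapes(line)).trim() + "]");
    }
}
```

Run on the line from the question, the first call leaves nothing behind, while the second leaves the println statement, because by the time the comment rule is applied the escape has already become a carriage return.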

The upside to this is that it's easy to specify, so it makes the specification simpler, and it's easy to implement.

The downside is, well, your example.
