Why does Java permit escaped Unicode characters in the source code?

Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/4448180/

Date: 2020-10-30 06:25:37. Source: igfitidea.

Tags: java, unicode, language-features

Asked by Zaven Nahapetyan

I recently learned that Unicode is permitted within Java source code not only as Unicode characters (e.g. double π = Math.PI;) but also as escaped sequences (e.g. double \u03C0 = Math.PI;).

The first variant makes sense to me - it allows programmers to name variables and methods in an international language of their choice. However, I don't see any practical application of the second approach.

Here are a few pieces of code to illustrate usage, tested with Java SE 6 and NetBeans 6.9.1:

This code will print out 3.141592653589793

public static void main(String[] args) {
    double π = Math.PI;
    System.out.println(\u03C0);
}

Explanation: π and \u03C0 are the same Unicode character

This code will not print out anything

public static void main(String[] args) {
    double π = Math.PI; /\u002A
    System.out.println(π);

    /* a comment */
}

Explanation: The code above actually encodes:

public static void main(String[] args) {
    double π = Math.PI; /*
    System.out.println(π);

    /* a comment */
}

Which comments out the print statement.
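
The early translation described above can be seen in a smaller case. The sketch below is not part of the original question, and the class and method names are made up for illustration. It assumes only that the compiler translates Unicode escapes before it splits the source into tokens, so each \u0022 (a double quote) really does end or start a string literal.

```java
// Illustrative sketch; EarlyTranslation and puzzle() are hypothetical names.
class EarlyTranslation {
    static int puzzle() {
        // Once the two escapes become double quotes, the expression is
        //   "a".length() + "b".length()
        // and not the length of one long string literal.
        return "a\u0022.length() + \u0022b".length();
    }

    public static void main(String[] args) {
        System.out.println(puzzle()); // prints 2
    }
}
```

The result is 2 rather than the length of the text between the outer quotes, which is the same mechanism that lets /\u002A open a comment.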

Just from my examples, I notice a number of potential problems with this language feature.

First, a bad programmer could use it to secretly comment out bits of code, or create multiple ways of identifying the same variable. Perhaps there are other horrible things that can be done that I haven't thought of.

Second, there seems to be a lack of support among IDEs. Neither NetBeans nor Eclipse provided the correct code highlighting for the examples. In fact, NetBeans even marked a syntax error (though compilation was not a problem).

Finally, this feature is poorly documented and not commonly accepted. Why would a programmer use something in his code that other programmers will not be able to recognize and understand? In fact, I couldn't even find anything about this on the Hidden Java Features question.

My question is this:

Why does Java allow escaped Unicode sequences to be used within syntax? What are some "pros" of this feature that have allowed it to stay a part of Java, despite its many "cons"?

Accepted answer by Michael Borgwardt

Unicode escape sequences allow you to store and transmit your source code in pure ASCII and still use the entire range of Unicode characters. This has two advantages:

  • No risk of non-ASCII characters getting broken by tools that can't handle them. This was a real concern back in the early 1990s when Java was designed. Sending an email containing non-ASCII characters and having it arrive unmangled was the exception rather than the norm.

  • No need to tell the compiler and editor/IDE which encoding to use for interpreting the source code. This is still a very valid concern. Of course, a much better solution would have been to have the encoding as metadata in a file header (as in XML), but this hadn't yet emerged as a best practice back then.
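
As a rough illustration of storing source in pure ASCII, the sketch below rewrites every non-ASCII character of a line of source as an escape, similar in spirit to the native2ascii tool that shipped with older JDKs. The class and method names are hypothetical, and the helper deliberately ignores corner cases such as backslashes that already precede a u in the input.

```java
// Illustrative sketch; AsciiEscaper and toAscii() are hypothetical names.
class AsciiEscaper {
    // Replaces each non-ASCII UTF-16 code unit with a six-character escape.
    static String toAscii(String source) {
        StringBuilder out = new StringBuilder();
        for (char c : source.toCharArray()) {
            if (c < 128) {
                out.append(c); // plain ASCII passes through unchanged
            } else {
                out.append(String.format("\\u%04X", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The argument contains the single letter pi, written here as an escape
        // so that this file itself stays pure ASCII.
        System.out.println(toAscii("double \u03C0 = Math.PI;"));
    }
}
```

The printed line contains only ASCII characters, so it survives any tool or transport that is limited to 7 bits, yet it compiles to the same program.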

From the question: "The first variant makes sense to me - it allows programmers to name variables and methods in an international language of their choice. However, I don't see any practical application of the second approach."

Both will result in exactly the same byte code and have the same power as a language feature. The only difference is in the source code.
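
One way to check this claim is to look at the compiled class through reflection. The sketch below is only an illustration with made-up names: it declares a field using the escaped form and then looks the field up by a name built at run time from the code point, without any escape in the lookup string.

```java
// Illustrative sketch; SameIdentifier and hasPiField() are hypothetical names.
class SameIdentifier {
    // Declared with an escape; the compiler sees the single letter U+03C0.
    static double \u03C0 = Math.PI;

    static boolean hasPiField() {
        // Build the one-character name at run time from its code point.
        String name = String.valueOf((char) 0x03C0);
        try {
            SameIdentifier.class.getDeclaredField(name);
            return true;
        } catch (NoSuchFieldException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(hasPiField()); // prints true
    }
}
```

The class file records the field under the one-character name, so nothing about the escape survives compilation.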

From the question: "First, a bad programmer could use it to secretly comment out bits of code, or create multiple ways of identifying the same variable."

If you're concerned about a programmer deliberately sabotaging your code's readability, this language feature is the least of your problems.

From the question: "Second, there seems to be a lack of support among IDEs."

That's hardly the fault of the feature or its designers. But then, I don't think it was ever intended to be used "manually". Ideally, the IDE would have an option to have you enter the characters normally and have them displayed normally, but automatically save them as Unicode escape sequences. There may even already be plugins or configuration options that make the IDEs behave that way.

But in general, this feature seems to be very rarely used and probably therefore badly supported. But how could the people who designed Java around 1993 have known that?

Answer by Steven Schlansker

The nice thing about the \u03C0 encoding is that it is much less likely to be munged by a text editor with the wrong encoding settings. For example, a bug in my software was caused by the accidental transformation from a UTF-8 é into a MacRoman é by a misconfigured text editor. By specifying the Unicode code point, it's completely unambiguous what you mean.
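
The kind of damage described here is easy to reproduce. The sketch below is an illustration with hypothetical names, and it swaps MacRoman for ISO-8859-1, because ISO-8859-1 is guaranteed to be available on every Java platform while MacRoman may not be. The effect is the same in kind: the two UTF-8 bytes of é are read back as two unrelated characters.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch; Mojibake and misread() are hypothetical names.
class Mojibake {
    // Encodes text as UTF-8, then decodes the bytes with the wrong charset.
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9"; // written as an escape, so no editor can mangle it
        String broken = misread(eAcute);
        System.out.println(eAcute.length()); // prints 1
        System.out.println(broken.length()); // prints 2
    }
}
```

Because the literal in this file is an escape, the file stays pure ASCII and reads the same under any ASCII-compatible encoding.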

Answer by Thorbjørn Ravn Andersen

The \uXXXX syntax allows Unicode characters to be represented unambiguously in a file whose encoding cannot express them directly, or when you want a representation guaranteed to be usable even in the lowest common denominator, namely a 7-bit ASCII encoding.
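
Whether a given encoding can express a character directly can be checked from Java itself. The following sketch, with hypothetical class and method names, asks a few of the standard charsets whether they can encode the letter π.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative sketch; CanEncode and canEncode() are hypothetical names.
class CanEncode {
    // True if the charset has a byte representation for the character.
    static boolean canEncode(Charset charset, char c) {
        return charset.newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        char pi = '\u03C0';
        System.out.println(canEncode(StandardCharsets.US_ASCII, 'A'));   // prints true
        System.out.println(canEncode(StandardCharsets.US_ASCII, pi));    // prints false
        System.out.println(canEncode(StandardCharsets.ISO_8859_1, pi));  // prints false
        System.out.println(canEncode(StandardCharsets.UTF_8, pi));       // prints true
    }
}
```

A source file saved in either of the first two charsets can still mention π, but only through its escape.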

You could represent all your characters with \uXXXX, even spaces and letters, but there is rarely a need to.
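
To make that concrete, here is a small sketch (class and method names are hypothetical) in which a whole statement, including its keyword, spaces, and punctuation, is written as escapes. Only the two digits are left as ordinary characters.

```java
// Illustrative sketch; AllEscaped and answer() are hypothetical names.
class AllEscaped {
    static int answer() {
        // The next line spells "int x = 42;" with every character except
        // the two digits written as an escape.
        \u0069\u006E\u0074\u0020\u0078\u0020\u003D\u002042\u003B
        return x;
    }

    public static void main(String[] args) {
        System.out.println(answer()); // prints 42
    }
}
```

It compiles and runs like the readable version, which also shows why nobody writes code this way by hand.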

Answer by AlexR

First, thank you for the question. I think it is very interesting. Second, the reason is that a Java source file is text that can itself be stored in various charsets. For example, the default charset in Eclipse can be Cp1255, depending on the platform. This encoding does not support characters like π. I think the designers had in mind programmers who have to work on systems that do not support Unicode, and wanted to let them create Unicode-enabled software. This was the reason to support the \u notation.