Notice: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/1060570/

Time: 2020-08-11 22:57:42  Source: igfitidea

Why is non-breaking space not a whitespace character in Java?

Tags: java, unicode

Asked by Palimondo

While searching for a proper way to trim non-breaking spaces from parsed HTML, I first stumbled on Java's spartan definition of String.trim(), which is at least properly documented. I wanted to avoid explicitly listing the characters eligible for trimming, so I assumed that the Unicode-backed methods on the Character class would do the job for me.

That's when I discovered that Character.isWhitespace(char) explicitly excludes non-breaking spaces:

It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F').
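
The behaviour described in the quoted Javadoc can be checked directly. The following is a minimal sketch; the class name NbspWhitespaceDemo, the helper method, and the sample string are illustrative assumptions, not part of the original question.

```java
public class NbspWhitespaceDemo {
    // The three no-break characters that the quoted Javadoc excludes
    // (U+00A0, U+2007, U+202F).
    static final char[] NO_BREAK_SPACES = {'\u00A0', '\u2007', '\u202F'};

    /** Returns true if Character.isWhitespace accepts any of the no-break spaces. */
    static boolean anyNoBreakSpaceIsWhitespace() {
        for (char c : NO_BREAK_SPACES) {
            if (Character.isWhitespace(c)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        for (char c : NO_BREAK_SPACES) {
            System.out.printf("U+%04X isWhitespace=%b%n", (int) c, Character.isWhitespace(c));
        }
        // String.trim() only strips characters at or below U+0020,
        // so a leading or trailing no-break space survives.
        String padded = "\u00A0text\u00A0";
        System.out.println(padded.trim().length()); // prints 6
    }
}
```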

Why is that?

The implementation of the corresponding .NET equivalent is less discriminating.

Accepted answer by Steve McLeod

Character.isWhitespace(char) is old. Really old. Many things done in the early days of Java followed conventions and implementations from C.

Now, more than a decade later, these things seem erroneous. Consider it evidence of how far things have come, even between the first days of Java and the first days of .NET.

Java strives to be 100% backward compatible. So even if the Java team thought it would be good to fix their initial mistake and add non-breaking spaces to the set of characters for which Character.isWhitespace(char) returns true, they can't, because there almost certainly exists software that relies on the current implementation working exactly the way it does.

Answer by Jason S

It looks like the method name (isWhitespace) is inconsistent with its function (to detect separators). The "separator" functionality is fairly clear if you look at the full list of characters from the Javadoc page you quoted:

* It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F').
* It is '\u0009', HORIZONTAL TABULATION.
* It is '\u000A', LINE FEED.
* It is '\u000B', VERTICAL TABULATION.
* It is '\u000C', FORM FEED.
* It is '\u000D', CARRIAGE RETURN.
* It is '\u001C', FILE SEPARATOR.
* It is '\u001D', GROUP SEPARATOR.
* It is '\u001E', RECORD SEPARATOR.
* It is '\u001F', UNIT SEPARATOR. 
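
The list above can be verified with a short loop. This is only a sketch: the class name SeparatorListDemo and the helper are assumptions made for illustration, and the control characters are written as numeric code points to avoid placing line terminators inside character literals.

```java
public class SeparatorListDemo {
    // Control characters from the Javadoc list: U+0009 to U+000D and U+001C to U+001F.
    static final int[] LISTED_CONTROLS = {0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x1C, 0x1D, 0x1E, 0x1F};

    /** Returns true if Character.isWhitespace accepts every listed control character. */
    static boolean allListedAreWhitespace() {
        for (int cp : LISTED_CONTROLS) {
            if (!Character.isWhitespace(cp)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        for (int cp : LISTED_CONTROLS) {
            // isSpaceChar is printed for contrast: it only looks at the Unicode
            // separator categories, which control characters do not belong to.
            System.out.printf("U+%04X isWhitespace=%b isSpaceChar=%b%n",
                    cp, Character.isWhitespace(cp), Character.isSpaceChar(cp));
        }
    }
}
```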

A non-breaking space is supposed to provide visual space between words while preventing line-breaking and hyphenation algorithms from separating them.

Answer by Matt Poush

I would argue that Java's implementation is more correct than .NET's. The non-breaking space is essentially a non-whitespace character that looks like one. That is, if you have the strings "foo" and "bar", and put any traditional whitespace character in between them, you would get a word break. A non-breaking space, however, does not break the two up.

Answer by richardtallent

The only time a non-breaking space should be treated specially is with code designed to perform word-wrapping of text.

For all other purposes, including word counts, trimming, and general-purpose splitting along word boundaries, a non-breaking space is still whitespace.
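
To make the word-count case concrete, here is a rough sketch comparing the default regex \s class with one that also accepts Unicode separators. The class name NbspSplitDemo, the countWords helper, and the choice of \p{Z} are assumptions for illustration, not something the answer prescribes.

```java
import java.util.regex.Pattern;

public class NbspSplitDemo {
    // By default \s covers only [ \t\n\x0B\f\r], so a no-break space is not a delimiter.
    static final Pattern ASCII_WS = Pattern.compile("\\s+");
    // Adding \p{Z} (Unicode separator categories Zs, Zl, Zp) also matches no-break spaces.
    static final Pattern UNICODE_WS = Pattern.compile("[\\s\\p{Z}]+");

    /** Counts the non-empty tokens produced by splitting text on the given delimiter. */
    static int countWords(String text, Pattern delimiter) {
        int count = 0;
        for (String token : delimiter.split(text)) {
            if (!token.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "foo\u00A0bar baz";
        System.out.println(countWords(text, ASCII_WS));   // prints 2
        System.out.println(countWords(text, UNICODE_WS)); // prints 3
    }
}
```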

Any argument that a non-breaking space just "looks like" a space but isn't one conflicts with the whole point of Unicode, which represents characters based on their meaning, not how they are displayed.

Thus, IMHO, the Java implementation of String.trim() is not performing as expected, and the underlying Character.isWhitespace() function is at fault.

My guess is that the Java implementors wrote isWhitespace() based on the need to perform text-wrapping within controls. They should have named this function isWordWrappingBoundary() or something more clear, and used a less-restrictive whitespace test for trim().

Answer by Jesper

Since Java 5 there is also an isSpaceChar(int) method. Does that not do what you want?

Determines if the specified character (Unicode code point) is a Unicode space character. A character is considered to be a space character if and only if it is specified to be a space character by the Unicode standard. This method returns true if the character's general category type is any of the following: ...
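
A small sketch of how the two predicates differ may help here. The class name SpaceCharDemo and the combined helper isAnySpace are hypothetical and only illustrate one way to cover both sets of characters.

```java
public class SpaceCharDemo {
    /** Hypothetical helper: true if either JDK predicate accepts the code point. */
    static boolean isAnySpace(int codePoint) {
        return Character.isWhitespace(codePoint) || Character.isSpaceChar(codePoint);
    }

    public static void main(String[] args) {
        // Plain space, no-break space, and horizontal tab.
        int[] samples = {0x20, 0xA0, 0x09};
        for (int cp : samples) {
            System.out.printf("U+%04X isWhitespace=%b isSpaceChar=%b isAnySpace=%b%n",
                    cp, Character.isWhitespace(cp), Character.isSpaceChar(cp), isAnySpace(cp));
        }
    }
}
```

Note that isSpaceChar alone rejects the tab and other control characters, so code that wants both families needs to combine the two checks.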

Answer by Grégory Joseph

As posted above, isSpaceChar(int) points the OP toward the answer. It is not prominently documented, but this method can actually be used in regexes. So:

    "X\u00A0X X".replaceAll("\\p{javaSpaceChar}", "_");

will produce a "X_X_X" string. It is left as an exercise for the reader to come up with the regex to trim a string. (Pattern with some flags should do the trick.)
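
One possible solution to that exercise is sketched below. It is an assumption about what the answer had in mind rather than the author's own code: the class name UnicodeTrimDemo and the method trimUnicode are made up for illustration, and the pattern uses the \A and \z anchors instead of flags.

```java
import java.util.regex.Pattern;

public class UnicodeTrimDemo {
    // One or more characters accepted by Character.isWhitespace or Character.isSpaceChar.
    private static final String WS = "[\\p{javaWhitespace}\\p{javaSpaceChar}]+";
    // Match a run of such characters at the very start or the very end of the input.
    private static final Pattern EDGES = Pattern.compile("\\A" + WS + "|" + WS + "\\z");

    /** Removes leading and trailing whitespace, including no-break spaces. */
    static String trimUnicode(String s) {
        return EDGES.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String raw = "\u00A0 foo\u00A0bar\t\u00A0";
        // The inner no-break space is kept; only the edges are stripped.
        System.out.println(trimUnicode(raw).length()); // prints 7
    }
}
```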

Answer by Maze

Also be cautious when using the Apache Commons function StringUtils.isBlank() (and related functions), which has the same strange isWhitespace behavior, i.e. a non-breaking space is considered to be non-blank.
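
If non-breaking spaces should count as blank, one option is a small helper built only on the JDK. The sketch below is hypothetical: isBlankUnicode is not part of Apache Commons, and treating null as blank is an assumption made for this example.

```java
public class UnicodeBlankDemo {
    /**
     * Hypothetical helper, not an Apache Commons API: returns true if the
     * sequence is null, empty, or made up only of characters accepted by
     * Character.isWhitespace or Character.isSpaceChar.
     */
    static boolean isBlankUnicode(CharSequence cs) {
        if (cs == null) {
            return true;
        }
        // allMatch on an empty stream is true, so "" counts as blank.
        return cs.codePoints()
                .allMatch(cp -> Character.isWhitespace(cp) || Character.isSpaceChar(cp));
    }

    public static void main(String[] args) {
        System.out.println(isBlankUnicode("\u00A0\t "));  // prints true
        System.out.println(isBlankUnicode("\u00A0x"));    // prints false
    }
}
```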