Java regular expression to match _all_ whitespace characters
Source: http://stackoverflow.com/questions/1822772/
Note: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Asked by Carsten
I'm looking for a regular expression in Java which matches all whitespace characters in a String. "\s" matches only some of them; it does not match &nbsp; and similar non-ASCII whitespace. I'm looking for a regular expression which matches all (common) whitespace characters which can occur in a Java String.
[Edit]
To clarify: I do not mean the string sequence "&nbsp;". I mean the single Unicode character U+00A0 that is often represented by "&nbsp;", e.g. in HTML, and all other Unicode characters with a similar whitespace meaning, e.g. "NARROW NO-BREAK SPACE" (U+202F), the word joiner (encoded in Unicode 3.2 and above as U+2060), "ZERO WIDTH NO-BREAK SPACE" (U+FEFF), and any other character that can be regarded as whitespace.
[Answer]
For my purpose, i.e. catching all whitespace characters, Unicode plus traditional, the following expression does the job:
[\p{Z}\s]
The answer is in the comments below but since it is a bit hidden I repeat it here.
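As a rough illustration of how the expression [\p{Z}\s] might be used, here is a minimal sketch that collapses every run of matched characters into a single ASCII space. The class name `AllWhitespace` and the helper `normalize` are made up for this example and are not part of the original answer.

```java
import java.util.regex.Pattern;

class AllWhitespace {
    // \p{Z} covers the Unicode separator categories (Zs, Zl, Zp);
    // \s adds the ASCII controls such as tab and newline.
    private static final Pattern ANY_SPACE = Pattern.compile("[\\p{Z}\\s]+");

    // Hypothetical helper: replace each run of whitespace with one space.
    static String normalize(String input) {
        return ANY_SPACE.matcher(input).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Contains U+00A0 (no-break space), U+202F (narrow no-break space), tab, newline.
        String text = "a\u00A0b\u202Fc\t\nd";
        System.out.println(normalize(text)); // prints "a b c d"
    }
}
```

Note that the character class is written with doubled backslashes because it sits inside a Java string literal.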
Accepted answer by Andomar
The &nbsp; is only whitespace in HTML. Use an HTML parser to extract the plain text, and \s should work just fine.
Answer by Vinko Vrsalovic
&nbsp; is not a whitespace character, as far as regexps are concerned. You need to either modify the regexp to include those strings in addition to \s, like /(\s|&nbsp;|%20)/, or parse the string contents beforehand to get the ASCII or Unicode representation of the data.
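The first option, extending the pattern with the literal strings, could look roughly like the sketch below. It treats the text "&nbsp;" and "%20" as separators alongside \s; the class name `EntityAwareSplit` and the method `tokens` are illustrative assumptions, not something the answer prescribes.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class EntityAwareSplit {
    // Matches a regex whitespace character, the literal text "&nbsp;",
    // or the literal text "%20"; the + groups consecutive separators.
    private static final Pattern SEPARATOR = Pattern.compile("(?:\\s|&nbsp;|%20)+");

    // Hypothetical helper: split raw (still encoded) text on any separator.
    static String[] tokens(String raw) {
        return SEPARATOR.split(raw);
    }

    public static void main(String[] args) {
        String raw = "one&nbsp;two%20three four";
        System.out.println(Arrays.toString(tokens(raw))); // prints [one, two, three, four]
    }
}
```

This only recognizes the exact encoded sequences listed in the pattern; decoding the input first, as the answer also suggests, avoids having to enumerate them.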
You are mixing abstraction levels here.
If, as seems to be the case after a careful reread of the question, you are after a way to match all whitespace characters, meaning standard ASCII whitespace plus the Unicode whitespace code points, \p{Z} or \p{Zs} will do the work.
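To see what each class covers, the following sketch tests a handful of code points against \s, \p{Zs}, and \p{Z} using Java's default (non-Unicode) meaning of \s. The class name `SeparatorCategories` and the helper `matches` are invented for the illustration.

```java
class SeparatorCategories {
    // Hypothetical helper: does the single code point match the regex as a whole?
    static boolean matches(String regex, int codePoint) {
        return new String(Character.toChars(codePoint)).matches(regex);
    }

    public static void main(String[] args) {
        // space, tab, no-break space, narrow no-break space,
        // line separator, word joiner, zero width no-break space
        int[] codePoints = {0x20, 0x09, 0xA0, 0x202F, 0x2028, 0x2060, 0xFEFF};
        for (int cp : codePoints) {
            System.out.printf("U+%04X  \\s=%b  \\p{Zs}=%b  \\p{Z}=%b%n",
                    cp,
                    matches("\\s", cp),
                    matches("\\p{Zs}", cp),
                    matches("\\p{Z}", cp));
        }
    }
}
```

Two things stand out when running it: tab belongs to a control category rather than a separator category, so \p{Z} alone does not match it, and U+2060 and U+FEFF are format characters, so none of the three classes match them. Combining \p{Z} with \s handles the first point; the format characters would have to be listed explicitly.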
You should really clarify your question, because it has misled a lot of people (and even earned the correct answer some downvotes).
Answer by Zak
&nbsp; is not whitespace. It is a character entity that represents whitespace in HTML. You most likely want to convert HTML-encoded text into plain text before running your string match against it. If that is the case, go look up javax.swing.text.html
Answer by peter.murray.rust
The regex characters are the only ones independent of encoding. Here is a list of some characters which - in Unicode - are non-printing:
Answer by BalusC
You clarified the question the way I expected: you're actually not looking for the String literal &nbsp;, as many here seem to think and for which the solution is too obvious.
Well, unfortunately, there's no way to match them using regex. Best is to include the particular codepoints in the pattern, for example: "[\\s\\xA0]".
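A minimal sketch of the explicit-codepoint approach follows. The first pattern is the one from the answer; the second extends it with the other code points named in the question, which is an assumption about what a caller might want rather than a recommended list. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class ExplicitCodePoints {
    // Pattern from the answer: regex whitespace plus U+00A0.
    private static final Pattern SPACE_OR_NBSP = Pattern.compile("[\\s\\xA0]");

    // Illustrative extension: also U+202F, U+2060 and U+FEFF from the question.
    private static final Pattern EXTENDED =
            Pattern.compile("[\\s\\xA0\\u202F\\u2060\\uFEFF]");

    static String strip(String s) {
        return SPACE_OR_NBSP.matcher(s).replaceAll("");
    }

    static String stripExtended(String s) {
        return EXTENDED.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("a\u00A0b c\td"));            // prints "abcd"
        System.out.println(stripExtended("a\u202Fb\u2060c\uFEFFd")); // prints "abcd"
    }
}
```

The drawback of this approach is that every additional character has to be added by hand.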
Edit: as it turned out in one of the comments, you could use the undocumented "\\p{Z}" for this. Alan, can you please leave a comment on how you found that out? This one is quite useful.
Answer by Kevin Bourrillion
Here's a summary I made of several competing definitions of "whitespace":
http://spreadsheets.google.com/pub?key=pd8dAQyHbdewRsnE5x5GzKQ
You might end up having to explicitly list the additional ones you care about that aren't matched by one of the prefab ones.
Answer by skia.heliou
In case anyone runs into this question again looking for help, I suggest pursuing the following answer: https://stackoverflow.com/a/6255512/1678392
The short version: \\p{javaSpaceChar}
Why: Per the Pattern class, this maps to the Character.isSpaceChar method:
Categories that behave like the java.lang.Character boolean ismethodname methods (except for the deprecated ones) are available through the same \p{prop} syntax where the specified property has the name javamethodname.
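The mapping described in the quoted documentation can be checked with a short sketch like the one below, which compares \p{javaSpaceChar} against Character.isSpaceChar for every code point in the Basic Multilingual Plane. The class name `JavaSpaceCharDemo` and its helper methods are made up for the example.

```java
import java.util.regex.Pattern;

class JavaSpaceCharDemo {
    private static final Pattern SPACE_CHAR = Pattern.compile("\\p{javaSpaceChar}");

    // Hypothetical helper: does the regex property accept this code point?
    static boolean regexSaysSpace(int codePoint) {
        return SPACE_CHAR.matcher(new String(Character.toChars(codePoint))).matches();
    }

    // Checks that the regex property and the Character method agree on the BMP.
    static boolean agreesWithCharacter() {
        for (int cp = 0; cp <= 0xFFFF; cp++) {
            if (regexSaysSpace(cp) != Character.isSpaceChar(cp)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(regexSaysSpace(0xA0)); // no-break space
        System.out.println(regexSaysSpace(0x09)); // tab
        System.out.println(agreesWithCharacter());
    }
}
```

Character.isSpaceChar accepts only the Unicode space, line, and paragraph separator categories, so tab and newline are not matched by \p{javaSpaceChar} on its own. If those are needed as well, one option is to combine it with \s in a character class, as in [\p{javaSpaceChar}\s].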