Java: How does string.split("\\S") work
Source: http://stackoverflow.com/questions/26280879/
License: CC BY-SA 4.0. You are free to use and share this content, but you must attribute it to the original authors on StackOverflow.
Asked by Frank Brosnan
I was doing a question out of the book oracle_certified_professional_java_se_7_programmer_exams_1z0-804_and_1z0-805 by Ganesh and Sharma.
One question is:
Consider the following program and predict the output:
class Test {
    public static void main(String args[]) {
        String test = "I am preparing for OCPJP";
        String[] tokens = test.split("\\S");
        System.out.println(tokens.length);
    }
}
a) 0
b) 5
c) 12
d) 16
I understand that \S is a regex that treats non-whitespace characters as the delimiters. But I was puzzled as to how the regex does its matching and what tokens split actually produces.
I added code to print out the tokens as follows:
for (String str : tokens) {
    System.out.println("<" + str + ">");
}
and I got the following output:
16
<>
< >
<>
< >
<>
<>
<>
<>
<>
<>
<>
<>
< >
<>
<>
< >
So a lot of empty string tokens. I just do not understand this.
I would have thought that if the delimiters are non-space characters, then every alphabetic character in the above text serves as a delimiter, so there should perhaps be 21 tokens if tokens that are empty strings are counted too. I just don't understand how Java's regex engine works this out. Are there any regex gurus out there who can shed light on this code for me?
Accepted answer by PeterK
First things first: start with \s (lower case), which is the regular expression character class for whitespace, that is space ' ', tab '\t', the newline characters '\n' and '\r', vertical tab (\x0B), and form feed '\f'.
\S (upper case) is the opposite of this, so it matches any non-whitespace character.
So when you split the String "I am preparing for OCPJP" using \S, you are effectively splitting the string at every letter. That, together with the removal of trailing empty strings, is the reason your token array has a length of 16.
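To make "splitting at every letter" concrete, here is a minimal sketch that counts how often each character class matches in the string from the question. The class name CountDelimiters and the helper countMatches are illustrative names, not part of the original post.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CountDelimiters {
    // Counts how many times the regex matches in the input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String test = "I am preparing for OCPJP";
        System.out.println(countMatches("\\S", test)); // 20 non-whitespace characters
        System.out.println(countMatches("\\s", test)); // 4 spaces
    }
}
```

Each of the 20 matches of \S acts as one delimiter, so before any trimming the string is cut into 21 pieces.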
Now, as for why these tokens are empty.
Consider the following String: "Hello,World". If we were to split it using ",", we would end up with a String array of length 2, with the following contents: "Hello" and "World". Notice that the "," is not in either of the Strings; it has been erased.
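The comma example can be tried directly. The sketch below also adds a second input, "a,,b", which is an assumption chosen for illustration and not from the original answer; it shows that two adjacent delimiters leave an empty string between them.

```java
import java.util.Arrays;

class CommaSplit {
    // Splits the input and renders the pieces, e.g. [Hello, World].
    static String show(String input, String regex) {
        return Arrays.toString(input.split(regex));
    }

    public static void main(String[] args) {
        System.out.println(show("Hello,World", ",")); // [Hello, World]
        // Adjacent delimiters: the piece between the two commas is "".
        System.out.println(show("a,,b", ","));        // [a, , b]
        System.out.println("a,,b".split(",").length); // 3
    }
}
```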
The same thing has happened with the "I am preparing for OCPJP" String: it has been split, and the characters matched by your regex are not in any of the returned values. Because most of the letters in that String are followed by another letter, you end up with a lot of Strings of length zero; only the whitespace characters are preserved.
Answer by Pablo Lozano
Copied from the API documentation (emphasis mine):
public String[] split(String regex)
Splits this string around matches of the given regular expression. This method works as if by invoking the two-argument split method with the given expression and a limit argument of zero. Trailing empty strings are therefore not included in the resulting array.
The string "boo:and:foo", for example, yields the following results with these expressions:
Regex    Result
:        { "boo", "and", "foo" }
o        { "b", "", ":and:f" }
Check the second example, where the last two "o"s are simply removed. The answer to your question is that the "OCPJP" substring is treated as a run of separators that is not followed by any non-empty string, so that part is trimmed.
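The two rows of the documentation example can be reproduced with a short sketch. The class name BooAndFoo and the helper show are illustrative names only.

```java
import java.util.Arrays;

class BooAndFoo {
    // Splits the documentation's sample string with the given regex.
    static String show(String regex) {
        return Arrays.toString("boo:and:foo".split(regex));
    }

    public static void main(String[] args) {
        System.out.println(show(":")); // [boo, and, foo]
        // The two trailing "o" delimiters only produce trailing empty
        // strings, which the one-argument split drops.
        System.out.println(show("o")); // [b, , :and:f]
        System.out.println("boo:and:foo".split("o").length); // 3
    }
}
```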
Answer by ajb
The reason the result is 16 and not 21 is this, from the javadoc for split:
Trailing empty strings are therefore not included in the resulting array.
This means, for example, that if you say
"/abc//def/ghi///".split("/")
the result will have five elements. The first will be "", since it's not a trailing empty string; the others will be "abc", "", "def", and "ghi". But the remaining empty strings are removed from the array.
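As a quick check of the slash example, the sketch below compares the default behaviour (limit 0) with a negative limit, which, per the javadoc for the two-argument split, keeps trailing empty strings. The class name TrailingSlashes and the helper count are illustrative names.

```java
class TrailingSlashes {
    // Number of pieces produced for the given limit argument.
    static int count(String input, int limit) {
        return input.split("/", limit).length;
    }

    public static void main(String[] args) {
        String input = "/abc//def/ghi///";
        System.out.println(count(input, 0));  // 5: "", "abc", "", "def", "ghi"
        System.out.println(count(input, -1)); // 8: the three trailing "" are kept
        System.out.println(input.split("/")[0].isEmpty()); // true: leading "" survives
    }
}
```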
In the posted case:
"I am preparing for OCPJP".split("\\S")
it's the same thing. Since non-space characters are delimiters, each letter is a delimiter, but the OCPJP letters essentially don't count, because those delimiters result in trailing empty strings that are then discarded. So, since there are 15 letters in "I am preparing for", they are treated as delimiting 16 substrings (the first is "" and the last is " ").
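The 16 versus 21 count can be verified with a small sketch that applies the two-argument split to the string from the question. The class name OcpjpSplit and the helper tokens are illustrative names; the negative limit is used here only to make the discarded trailing strings visible.

```java
class OcpjpSplit {
    // Splits the question's string on non-whitespace with the given limit.
    static String[] tokens(int limit) {
        return "I am preparing for OCPJP".split("\\S", limit);
    }

    public static void main(String[] args) {
        String[] trimmed = tokens(0); // same as split("\\S")
        String[] full = tokens(-1);   // keeps trailing empty strings
        System.out.println(trimmed.length); // 16
        System.out.println(full.length);    // 21
        System.out.println("<" + trimmed[0] + ">");  // <>
        System.out.println("<" + trimmed[15] + ">"); // < >
    }
}
```

The difference of five corresponds to the five letters of OCPJP, each of which only contributes a trailing empty string.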