Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me), citing the original address and author information. Original StackOverflow question: http://stackoverflow.com/questions/9760909/

Time: 2020-08-16 07:20:48  Source: igfitidea

Split String with regex \w \w*? \w+?

java regex

Asked by Kennet

I'm learning regular expressions and thought I was starting to get a grip on them, but then...

I tried to split a string and I need help to understand such a simple thing as:

// requires: import java.util.Arrays;
String input = "abcde";
// In a Java string literal the backslash itself must be escaped: "\\w", not "\w".
System.out.println("[a-z] " + Arrays.toString(input.split("[a-z]")));
System.out.println("\\w " + Arrays.toString(input.split("\\w")));
System.out.println("\\w*? " + Arrays.toString(input.split("\\w*?")));
System.out.println("\\w+? " + Arrays.toString(input.split("\\w+?")));

The output is
[a-z] - []
\w    - []
\w*?  - [, a, b, c, d, e]
\w+?  - []

Why does neither of the first two lines split the String on any character? The third expression, \w*? (the question mark prevents greediness), works as I expected, splitting the String on every character. The fourth, \w+? (plus: one or more matches), returns an empty array.

I've tried the expression in Notepad++ and in a program, and it shows 5 matches, as in:

Scanner ls = new Scanner(input);
while (ls.hasNext())
    System.out.format("%s ", ls.findInLine("\\w"));

Output is: a b c d e

This really puzzles me.

Accepted answer by Joey

If you split a string with a regex, you essentially tell where the string should be cut. This necessarily cuts away what you match with the regex. Which means if you split at \w, then every character is a split point and the substrings between them (all empty) are returned. Java automatically removes trailing empty strings, as described in the documentation.
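
The trailing-empty-string rule is easier to see on a string that is not made up entirely of delimiters. The following is a minimal sketch, not part of the original answer; the class name TrailingEmptyDemo, its helper methods, and the sample string "a,b,,c,," are made up for illustration.

```java
import java.util.Arrays;

public class TrailingEmptyDemo {
    // Default split: behaves like split(regex, 0), which drops trailing empty strings.
    static String[] splitDefault(String s, String regex) {
        return s.split(regex);
    }

    // A negative limit keeps every piece, including trailing empty strings.
    static String[] splitKeepAll(String s, String regex) {
        return s.split(regex, -1);
    }

    public static void main(String[] args) {
        String sample = "a,b,,c,,"; // hypothetical sample input
        System.out.println(Arrays.toString(splitDefault(sample, ","))); // [a, b, , c]
        System.out.println(Arrays.toString(splitKeepAll(sample, ","))); // [a, b, , c, , ]

        // With "abcde" every character is a delimiter, so all pieces are empty.
        System.out.println(splitDefault("abcde", "\\w").length); // 0
        System.out.println(splitKeepAll("abcde", "\\w").length); // 6
    }
}
```

Note that the empty string between "b" and "c" survives the default split; only the empty strings at the end are removed.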

This also explains why the lazy match \w*? will give you every character: it matches the zero-width position between (and before and after) every character. What's left are the characters of the string themselves.

Let's break it down:

  1. [a-z], \w, \w+?

    Your string is

    abcde
    

    And the matches are as follows:

     a  b  c  d  e
    └─┘└─┘└─┘└─┘└─┘
    

    which leaves you with the substrings between the matches, all of which are empty.

    The above three regexes behave the same in this regard, as each of them matches only a single character at a time. \w+? will do so because it lacks any other constraint that might make the +? try matching more than the bare minimum (it's lazy, after all).

  2. \w*?

      a  b  c  d  e
    └┘ └┘ └┘ └┘ └┘ └┘
    

    In this case the matches are between the characters, leaving you with the following substrings:

    "", "a", "b", "c", "d", "e", ""
    

    Java throws the trailing empty one away, though.
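
The two diagrams above can be reproduced by listing where each match starts and ends. This is a minimal sketch rather than anything from the original answer; the class name MatchSpansDemo and the helper spans are hypothetical, and each match is printed as "start-end" using the standard java.util.regex.Matcher methods.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchSpansDemo {
    // Collects "start-end" for every match of regex in input.
    static List<String> spans(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start() + "-" + m.end());
        }
        return out;
    }

    public static void main(String[] args) {
        // Case 1: each match consumes exactly one character.
        System.out.println(spans("\\w+?", "abcde")); // [0-1, 1-2, 2-3, 3-4, 4-5]
        // Case 2: each match is zero-width and sits between characters.
        System.out.println(spans("\\w*?", "abcde")); // [0-0, 1-1, 2-2, 3-3, 4-4, 5-5]
    }
}
```

The split pieces are whatever lies between consecutive spans: nothing in case 1, one character in case 2.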

Answer by Gumbo

String.split cuts the string at each match of the pattern:

The array returned by this method contains each substring of this string that is terminated by another substring that matches the given expression or is terminated by the end of the string.

So whenever a pattern like [a-z] is matched, the string is cut at that match. As every character in your string is matched by the pattern, the resulting array is empty (trailing empty strings are removed).

The same applies to \w and \w+? (one or more \w, but as few repetitions as possible). That \w*? produces the result you expected is due to the *? quantifier, which matches zero repetitions whenever possible, that is, an empty string. And an empty string is found at each position in the given string.
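
To make the effect of the lazy quantifiers concrete, one can count how many separate matches each pattern finds in "abcde". This is a small sketch under the assumption that counting Matcher.find() calls is a fair stand-in for "number of cut points"; the class name LazyVsGreedyDemo and the helper countMatches are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LazyVsGreedyDemo {
    // Counts how many times regex matches in input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String input = "abcde";
        System.out.println(countMatches("\\w+", input));  // 1: greedy, takes the whole string
        System.out.println(countMatches("\\w+?", input)); // 5: lazy, one character per match
        System.out.println(countMatches("\\w*?", input)); // 6: lazy, an empty match at every position
    }
}
```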

Answer by maerics

Let's break down each of those calls to String#split(String). The key thing to notice in the Java docs is that this "method works as if by invoking the two-argument split method with the given expression and a limit argument of zero. Trailing empty strings are therefore not included in the resulting array."

"abcde".split("[a-z]"); // => []

This one matches every character (a, b, c, d, e) and results in only the empty strings between them, which are omitted.

"abcde".split("\\w"); // => []

Again, every character in the string is a word character (\w), so the result is empty strings, which are omitted.

"abcde".split("\\w*?"); // => ["", "a", "b", "c", "d", "e"]

In this case, the * means "zero or more of the preceding item" (\w), and the lazy ? makes it prefer zero, so it matches the empty string six times (once at the beginning of the string, once between each pair of adjacent characters, and once at the end). So we get the first empty string, then each character; the trailing empty string is dropped.

"abcde".split("\\w+?"); // => []

Here the + means "one or more of the preceding item" (\w), and the lazy ? makes each match stop after a single character. Every character is therefore matched on its own, leaving only empty strings, which are omitted.

Try these examples again with input.split(regex, -1) and you should see all of the empty strings.
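
That suggestion could look like the sketch below; the class name SplitAllLimitsDemo and the helper keepAll are made up for illustration. One caveat that goes beyond the original answers: since Java 8, a zero-width match at the very beginning of the input no longer produces a leading empty string, so on current JDKs the \w*? case starts with "a" rather than "". For that reason the comments only state the lengths observed on Java 8 and later.

```java
import java.util.Arrays;

public class SplitAllLimitsDemo {
    // Splits with a negative limit so that no trailing empty strings are discarded.
    static String[] keepAll(String input, String regex) {
        return input.split(regex, -1);
    }

    public static void main(String[] args) {
        String input = "abcde";
        String[] regexes = { "[a-z]", "\\w", "\\w*?", "\\w+?" };
        for (String regex : regexes) {
            String[] pieces = keepAll(input, regex);
            // On Java 8+: every regex here yields 6 pieces. For [a-z], \w and \w+?
            // all 6 are empty; for \w*? they are a, b, c, d, e and one trailing "".
            System.out.println(regex + " -> " + pieces.length + " pieces "
                    + Arrays.toString(pieces));
        }
    }
}
```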