原文地址: http://stackoverflow.com/questions/18261566/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Java Regex is including new line in match
Asked by Paul Nelson Baker
I'm trying to match a regular expression against textbook definitions that I get from a website. Each entry always has the word, then a new line, followed by the definition. For example:
Zither
Definition: An instrument of music used in Austria and Germany It has from thirty to forty wires strung across a shallow sounding board which lies horizontally on a table before the performer who uses both hands in playing on it Not to be confounded with the old lute shaped cittern or cithern
In my attempts to get just the word (in this case "Zither") I keep getting the newline character.
I tried both ^(\w+)\s and ^(\S+)\s without much luck. I thought that maybe ^(\S+)$ would work, but that doesn't seem to match the word at all. I've been testing with rubular (http://rubular.com/r/LPEHCnS0ri), which seems to match all my attempts the way I want, despite the fact that Java doesn't.
Here's my snippet:
String str = ...; // Here the string is assigned a word and definition taken from the internet, as in the example above.
Pattern rgx = Pattern.compile("^(\\S+)$"); // backslashes must be doubled inside a Java string literal
Matcher mtch = rgx.matcher(str);
if (mtch.find()) {
    String result = mtch.group();
    terms.add(new SearchTerm(result, System.nanoTime()));
}
This is easily solved by trimming the resulting string, but that seems like it should be unnecessary if I'm already using a regular expression.
All help is greatly appreciated. Thanks in advance!
Accepted answer by Adrian Pronk
Try using the Pattern.MULTILINE option:
Pattern rgx = Pattern.compile("^(\\S+)$", Pattern.MULTILINE);
This causes the regex to recognise line delimiters in your string; otherwise ^ and $ just match the start and end of the string.
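As a minimal sketch of the difference described above, the following compares the same pattern with and without Pattern.MULTILINE. The class name MultilineDemo, the helper firstMatch, and the sample text are illustrative assumptions, not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineDemo {
    // Without a flag, ^ and $ anchor to the whole input.
    static final Pattern WHOLE_INPUT = Pattern.compile("^(\\S+)$");
    // With MULTILINE, ^ and $ also anchor at each line boundary.
    static final Pattern PER_LINE = Pattern.compile("^(\\S+)$", Pattern.MULTILINE);

    /** Returns capture group 1 of the first match, or null if nothing matches. */
    static String firstMatch(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String entry = "Zither\nDefinition: An instrument of music";
        System.out.println(firstMatch(WHOLE_INPUT, entry)); // prints "null"
        System.out.println(firstMatch(PER_LINE, entry));    // prints "Zither"
    }
}
```

Without the flag the pattern fails because $ cannot match before the newline in the middle of the input, which is consistent with the behaviour reported in the question.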
Although it makes no difference for this pattern, the Matcher.group() method returns the entire match, whereas the Matcher.group(int) method returns the match of the particular capture group (...) based on the number you specify. Your pattern specifies one capture group, which is what you want captured. If you'd included \s in your pattern, as you wrote you tried, then Matcher.group() would have included that whitespace in its return value.
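To make the group() versus group(int) distinction concrete, here is a small sketch using the ^(\w+)\s pattern from the question. The class name GroupDemo and the helper wholeAndGroup are hypothetical names chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    static final Pattern WORD_THEN_SPACE = Pattern.compile("^(\\w+)\\s");

    /** Returns {whole match, capture group 1}, or null if nothing matches. */
    static String[] wholeAndGroup(String text) {
        Matcher m = WORD_THEN_SPACE.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(), m.group(1) };
    }

    public static void main(String[] args) {
        String[] parts = wholeAndGroup("Zither\nDefinition: An instrument of music");
        System.out.println(parts[0].length()); // prints "7": "Zither" plus the newline
        System.out.println(parts[1].length()); // prints "6": "Zither" only
    }
}
```

The lengths are printed rather than the strings themselves so the trailing newline in the whole match is visible.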
Answer by Paul Vargas
Try the following:
import java.util.regex.Pattern;

public class ZitherExample { // class name is illustrative
    /* The regex pattern: ^(\w+)\r?\n(.*)$ */
    private static final Pattern REGEX_PATTERN =
            Pattern.compile("^(\\w+)\\r?\\n(.*)$");

    public static void main(String[] args) {
        String input = "Zither\n Definition: An instrument of music";
        // matches() requires the pattern to cover the whole input
        System.out.println(
            REGEX_PATTERN.matcher(input).matches()
        ); // prints "true"
        // $1 is the word, $2 is the definition line (it keeps its leading space)
        System.out.println(
            REGEX_PATTERN.matcher(input).replaceFirst("$1 = $2")
        ); // prints "Zither =  Definition: An instrument of music"
        System.out.println(
            REGEX_PATTERN.matcher(input).replaceFirst("$1")
        ); // prints "Zither"
    }
}
Answer by Mike Dinescu
With regular expressions, group 0 is always the complete matched string. In your case you want group 1, not group 0.
So changing mtch.group() to mtch.group(1) should do the trick:
String str = ...; // Here the string is assigned a word and definition taken from the internet, as in the example above.
Pattern rgx = Pattern.compile("^(\\w+)\\s");
Matcher mtch = rgx.matcher(str);
if (mtch.find()) {
    String result = mtch.group(1); // group 1 is the word without the trailing whitespace
    terms.add(new SearchTerm(result, System.nanoTime()));
}
Answer by Anthony Accioly
Just replace:
String result = mtch.group();
With:
String result = mtch.group(1);
This will limit your output to the contents of the capturing group (e.g. (\\w+)).
Answer by Varun Garg
A late response, but if you are not using Pattern and Matcher directly, you can enable the equivalent of DOTALL inside your regex string:
(?s)[Your Expression]
Basically, (?s) tells the dot to match all characters, including line breaks.
Detailed information: http://www.vogella.com/tutorials/JavaRegularExpressions/article.html
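As a rough illustration of the inline (?s) flag with plain String methods (no explicit Pattern or Matcher), consider the sketch below. The class name InlineFlagDemo, the helper firstLine, and the sample text are assumptions made for this example.

```java
class InlineFlagDemo {
    /** Drops everything from the first line break onward, using the inline (?s) flag. */
    static String firstLine(String text) {
        // With (?s), the dot also matches line terminators, so .* runs to the end of the input.
        return text.replaceFirst("(?s)\\r?\\n.*", "");
    }

    public static void main(String[] args) {
        String entry = "Zither\nDefinition: An instrument\nof music";
        System.out.println(entry.matches("Zither.*"));     // prints "false": the dot stops at the newline
        System.out.println(entry.matches("(?s)Zither.*")); // prints "true"
        System.out.println(firstLine(entry));              // prints "Zither"
    }
}
```

Note that (?s) only changes what the dot matches; it does not change how ^ and $ behave, which is what (?m) or Pattern.MULTILINE controls.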