Java Regex including new line in match

Question

Asked by Paul Nelson Baker

I'm trying to match a regular expression against textbook definitions that I get from a website. Each entry always has the word on its own line, followed by the definition on the next line. For example:

Zither
 Definition: An instrument of music used in Austria and Germany It has from thirty to forty wires strung across a shallow sounding board which lies horizontally on a table before the performer who uses both hands in playing on it Not to be confounded with the old lute shaped cittern or cithern

In my attempts to get just the word (in this case "Zither"), I keep getting the newline character as well.

I tried both ^(\w+)\s and ^(\S+)\s without much luck. I thought that maybe ^(\S+)$ would work, but that doesn't seem to match the word at all. I've been testing with Rubular, http://rubular.com/r/LPEHCnS0ri, which matches all my attempts the way I want, even though Java doesn't.

Here's my snippet:

String str = ...; // A word and its definition taken from the internet, as in the example above.
Pattern rgx = Pattern.compile("^(\\S+)$");
Matcher mtch = rgx.matcher(str);
if (mtch.find()) {
    String result = mtch.group();
    terms.add(new SearchTerm(result, System.nanoTime()));
}

This is easily solved by trimming the resulting string, but that seems like it should be unnecessary if I'm already using a regular expression.

All help is greatly appreciated. Thanks in advance!

Answer 1

Accepted answer by Adrian Pronk

Try using the Pattern.MULTILINE option:

Pattern rgx = Pattern.compile("^(\\S+)$", Pattern.MULTILINE);

This causes the regex to recognise line delimiters in your string. Without it, ^ and $ match only at the start and end of the whole string.
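To illustrate the difference, here is a minimal sketch that runs the same pattern with and without the flag. The class name MultilineDemo and the helper firstLine are hypothetical, and the input is assumed to be the two-line "Zither" example from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineDemo {
    // Returns group 1 of the first match of ^(\S+)$ in text,
    // or null if the pattern does not match at all.
    static String firstLine(String text, int flags) {
        Matcher m = Pattern.compile("^(\\S+)$", flags).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "Zither\n Definition: An instrument of music";

        // Default mode: $ only matches at the end of the input, so nothing matches.
        System.out.println(firstLine(text, 0));                 // prints "null"

        // MULTILINE: $ also matches just before the line terminator.
        System.out.println(firstLine(text, Pattern.MULTILINE)); // prints "Zither"
    }
}
```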

Although it makes no difference for this pattern, the Matcher.group() method returns the entire match, whereas the Matcher.group(int) method returns the match of the particular capture group (...) identified by the number you specify. Your pattern specifies one capture group, which is what you want captured. If you had included \s in your pattern, as you wrote you tried, then Matcher.group() would have included that whitespace in its return value.
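The following sketch shows that difference for the ^(\w+)\s attempt from the question. The class name GroupDemo and the helper groups are hypothetical; the brackets in the output are only there to make the trailing newline visible.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Returns { whole match, capture group 1 } for ^(\w+)\s,
    // or an empty array if the pattern does not match.
    static String[] groups(String text) {
        Matcher m = Pattern.compile("^(\\w+)\\s").matcher(text);
        if (!m.find()) {
            return new String[0];
        }
        return new String[] { m.group(), m.group(1) };
    }

    public static void main(String[] args) {
        String[] g = groups("Zither\n Definition: An instrument of music");
        // group() includes the newline consumed by \s, so the closing bracket
        // is printed on the following line.
        System.out.println("[" + g[0] + "]");
        // group(1) holds only what the parentheses captured.
        System.out.println("[" + g[1] + "]"); // prints "[Zither]"
    }
}
```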

Answer 2

Answered by Paul Vargas

Try the following:

import java.util.regex.Pattern;

/* The regex pattern: ^(\w+)\r?\n(.*)$ */
private static final Pattern REGEX_PATTERN =
        Pattern.compile("^(\\w+)\\r?\\n(.*)$");

public static void main(String[] args) {
    String input = "Zither\n Definition: An instrument of music";

    System.out.println(
        REGEX_PATTERN.matcher(input).matches()
    );  // prints "true"

    // $1 is the word, $2 is the definition line
    System.out.println(
        REGEX_PATTERN.matcher(input).replaceFirst("$1 = $2")
    );  // prints "Zither =  Definition: An instrument of music"

    System.out.println(
        REGEX_PATTERN.matcher(input).replaceFirst("$1")
    );  // prints "Zither"
}

Answer 3

Answered by Mike Dinescu

With regular expressions, group 0 is always the complete matching string, and the capture groups are numbered from 1. In your case you want group 1, not group 0.

So changing mtch.group() to mtch.group(1) should do the trick:

String str = ...; // A word and its definition taken from the internet, as in the example above.
Pattern rgx = Pattern.compile("^(\\w+)\\s");
Matcher mtch = rgx.matcher(str);
if (mtch.find()) {
    String result = mtch.group(1); // group 1 excludes the whitespace matched by \s
    terms.add(new SearchTerm(result, System.nanoTime()));
}

Answer 4

Answered by Anthony Accioly

Just replace:

String result = mtch.group();

with:

String result = mtch.group(1);

This will limit your output to the contents of the capturing group, e.g. (\\w+).

Answer 5

Answered by Varun Garg

A late response, but if you are not using Pattern and Matcher directly, you can enable the equivalent of DOTALL inside the regex string itself:

(?s)[Your Expression]

Basically, (?s) tells the dot to match all characters, including line breaks.
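As a rough sketch of how the inline flag behaves with plain String methods, the example below uses String.matches and String.replaceAll. The class name DotallDemo, the helpers spansInput and headword, and the patterns themselves are assumptions made for illustration, not part of the original answer.

```java
class DotallDemo {
    // True if regex matches the entire text (String.matches always anchors both ends).
    static boolean spansInput(String text, String regex) {
        return text.matches(regex);
    }

    // Drops everything from the first whitespace character onward,
    // relying on (?s) so that .* also consumes any later line breaks.
    static String headword(String text) {
        return text.replaceAll("(?s)\\s.*", "");
    }

    public static void main(String[] args) {
        String text = "Zither\n Definition: An instrument of music";

        // Without (?s) the dot stops at the newline, so the whole input cannot match.
        System.out.println(spansInput(text, "Zither.*music"));     // prints "false"

        // With (?s) the dot crosses the newline.
        System.out.println(spansInput(text, "(?s)Zither.*music")); // prints "true"

        System.out.println(headword(text));                        // prints "Zither"
    }
}
```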

Detailed information: http://www.vogella.com/tutorials/JavaRegularExpressions/article.html
