Match multiline text using regular expression (Java)
Original question: http://stackoverflow.com/questions/3651725/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Asked by Nivas
I am trying to match multi-line text using Java. When I use the Pattern class with the Pattern.MULTILINE modifier, I am able to match, but I am not able to do so with (?m).
The same pattern with (?m) and String.matches does not seem to work.
I am sure I am missing something, but I have no idea what. I am not very good at regular expressions.
This is what I tried:
String test = "User Comments: This is \t a\ta \n test \n\n message \n";
String pattern1 = "User Comments: (\\W)*(\\S)*";
Pattern p = Pattern.compile(pattern1, Pattern.MULTILINE);
System.out.println(p.matcher(test).find());  //true
String pattern2 = "(?m)User Comments: (\\W)*(\\S)*";
System.out.println(test.matches(pattern2));  //false - why?
Accepted answer by Tim Pietzcker
First, you're using the modifiers under an incorrect assumption.
Pattern.MULTILINE or (?m) tells Java to let the anchors ^ and $ match at the start and end of each line (otherwise they only match at the start/end of the entire string).
Pattern.DOTALL or (?s) tells Java to allow the dot to match newline characters, too.
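As a minimal sketch of the difference between the two flags, the following compares each one against the same two-line input. The class name FlagDemo, the helper found and the sample text are made up for illustration.

```java
import java.util.regex.Pattern;

class FlagDemo {
    // Returns true if the regex, compiled with the given flags, is found anywhere in the input.
    static boolean found(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String text = "first\nsecond";

        // Without MULTILINE, ^ matches only at the very start of the input.
        System.out.println(found("^second", 0, text));                 // false
        // With MULTILINE, ^ also matches right after a line terminator.
        System.out.println(found("^second", Pattern.MULTILINE, text)); // true

        // Without DOTALL, the dot does not match the newline between the words.
        System.out.println(found("first.second", 0, text));              // false
        // With DOTALL, the dot matches the newline as well.
        System.out.println(found("first.second", Pattern.DOTALL, text)); // true
    }
}
```

Note that neither flag affects the other's behavior: MULTILINE leaves the dot alone, and DOTALL leaves the anchors alone.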
Second, in your case, the regex fails because you're using the matches() method, which expects the regex to match the entire string. That of course doesn't work, since there are some characters left after (\\W)*(\\S)* has matched.
So if you're simply looking for a string that starts with User Comments:, use the regex
^\s*User Comments:\s*(.*)
with the Pattern.DOTALL option:
// subjectString is the input text; ResultString is a String variable declared elsewhere
Pattern regex = Pattern.compile("^\\s*User Comments:\\s*(.*)", Pattern.DOTALL);
Matcher regexMatcher = regex.matcher(subjectString);
if (regexMatcher.find()) {
    ResultString = regexMatcher.group(1);
}
ResultString will then contain the text after User Comments:
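For reference, here is a self-contained sketch of that approach applied to the test string from the question. The class name UserCommentsDemo and the helper extractComment are hypothetical names chosen for this example. It also shows that the inline flag (?s) makes String.matches succeed, since (.*) can then consume everything up to the end of the input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UserCommentsDemo {
    // Returns the text following "User Comments:", or null if the input does not start with it.
    static String extractComment(String input) {
        Pattern regex = Pattern.compile("^\\s*User Comments:\\s*(.*)", Pattern.DOTALL);
        Matcher m = regex.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String test = "User Comments: This is \t a\ta \n test \n\n message \n";

        // Prints everything after the label, newlines included.
        System.out.println(extractComment(test));

        // (?s) is the inline form of Pattern.DOTALL, so the whole string can be matched.
        System.out.println(test.matches("(?s)\\s*User Comments:\\s*(.*)")); // true
    }
}
```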
Answer by Amarghosh
str.matches(regex) behaves like Pattern.matches(regex, str), which attempts to match the entire input sequence against the pattern and returns
true if, and only if, the entire input sequence matches this matcher's pattern
Whereas matcher.find() attempts to find the next subsequence of the input sequence that matches the pattern and returns
true if, and only if, a subsequence of the input sequence matches this matcher's pattern
Thus the problem is with the regex. Try the following.
String test = "User Comments: This is \t a\ta \ntest\n\n message \n";
String pattern1 = "User Comments: [\\s\\S]*^test$[\\s\\S]*";
Pattern p = Pattern.compile(pattern1, Pattern.MULTILINE);
System.out.println(p.matcher(test).find());  //true
String pattern2 = "(?m)User Comments: [\\s\\S]*^test$[\\s\\S]*";
System.out.println(test.matches(pattern2));  //true
Thus, in short, the (\\W)*(\\S)* portion of your first regex matches very little, because * means zero or more occurrences: (\\W)* matches the empty string (the next character is T, a word character, and \\W matches only non-word characters, i.e. [^a-zA-Z0-9_]), and (\\S)* stops at the first whitespace character. The real matched string is therefore User Comments: This and not the whole string as you'd expect. The second one fails because matches() tries to match the whole string, and the pattern cannot get past that whitespace.
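Printing what find() actually matched makes this concrete. The sketch below runs the question's pattern against the question's test string and shows that only a short prefix of the input is covered. The class name MatchedTextDemo and the helper firstMatch are invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchedTextDemo {
    // Returns the text of the first match found in the input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String test = "User Comments: This is \t a\ta \n test \n\n message \n";
        String regex = "User Comments: (\\W)*(\\S)*";

        // find() succeeds, but the match ends at the first whitespace after "This".
        System.out.println(firstMatch(regex, test)); // User Comments: This

        // matches() requires the pattern to cover the entire input, so it fails.
        System.out.println(test.matches(regex)); // false
    }
}
```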
Answer by Alan Moore
This has nothing to do with the MULTILINE flag; what you're seeing is the difference between the find() and matches() methods. find() succeeds if a match can be found anywhere in the target string, while matches() expects the regex to match the entire string.
Pattern p = Pattern.compile("xyz");
Matcher m = p.matcher("123xyzabc");
System.out.println(m.find());    // true
System.out.println(m.matches()); // false
m = p.matcher("xyz");            // reuse the variable with a new input
System.out.println(m.matches()); // true
Furthermore, MULTILINE doesn't mean what you think it does. Many people seem to jump to the conclusion that you have to use that flag if your target string contains newlines--that is, if it contains multiple logical lines. I've seen several answers here on SO to that effect, but in fact, all that flag does is change the behavior of the anchors, ^ and $.
Normally ^ matches the very beginning of the target string, and $ matches the very end (or before a newline at the end, but we'll leave that aside for now). But if the string contains newlines, you can choose for ^ and $ to match at the start and end of any logical line, not just the start and end of the whole string, by setting the MULTILINE flag.
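To illustrate the anchor behavior described here, this small sketch counts how many times an anchored pattern matches with and without the flag. The class name AnchorDemo, the helper countMatches and the sample strings are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorDemo {
    // Counts the non-overlapping matches of the regex, compiled with the given flags, in the input.
    static int countMatches(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "alpha\nbeta\ngamma";

        // Default mode: ^ and $ refer to the whole string, which is not a single word.
        System.out.println(countMatches("^\\w+$", 0, text));                 // 0
        // MULTILINE: ^ and $ match at the start and end of every line.
        System.out.println(countMatches("^\\w+$", Pattern.MULTILINE, text)); // 3

        // Even in default mode, $ can match just before a newline at the very end.
        System.out.println(countMatches("gamma$", 0, "gamma\n"));            // 1
    }
}
```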
So forget about what MULTILINE means and just remember what it does: changes the behavior of the ^ and $ anchors. DOTALL mode was originally called "single-line" (and still is in some flavors, including Perl and .NET), and it has always caused similar confusion. We're fortunate that the Java devs went with the more descriptive name in that case, but there was no reasonable alternative for "multiline" mode.
In Perl, where all this madness started, they've admitted their mistake and gotten rid of both "multiline" and "single-line" modes in Perl 6 regexes. In another twenty years, maybe the rest of the world will have followed suit.
Answer by Yehuda Schwartz
The multiline flag tells the regex engine to match the pattern against each line as opposed to the entire string. For your purposes, a wildcard will suffice.

