RegEx in Java: how to deal with newline
Source: http://stackoverflow.com/questions/3445326/
This page reproduces a Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute the content to the original authors on Stack Overflow.
Asked by user415663
I am currently trying to learn how to use regular expressions so please bear with my simple question. For example, say I have an input file containing a bunch of links separated by a newline:
www.foo.com/Archives/monkeys.htm
Description of Monkey's website.

www.foo.com/Archives/pigs.txt
Description of Pig's website.

www.foo.com/Archives/kitty.txt
Description of Kitty's website.

www.foo.com/Archives/apple.htm
Description of Apple's website.
If I wanted to get one website along with its description, this regex seems to work on a testing tool: .*www.*\\s.*Pig.*
However, when I try running it within my code it doesn't seem to work. Is this expression correct? I tried replacing "\s" with "\n" and it doesn't seem to work still.
Answered by maerics
Works for me:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Foo {
    public static void main(String[] args) {
        // "\\s" in a Java string literal produces the regex token \s.
        Pattern p = Pattern.compile(".*www.*\\s.*Pig.*");
        String s = "www.foo.com/Archives/monkeys.htm\n"
                 + "Description of Monkey's website.\n"
                 + "\n"
                 + "www.foo.com/Archives/pigs.txt\n"
                 + "Description of Pig's website.\n"
                 + "\n"
                 + "www.foo.com/Archives/kitty.txt\n"
                 + "Description of Kitty's website.\n"
                 + "\n"
                 + "www.foo.com/Archives/apple.htm\n"
                 + "Description of Apple's website.\n";
        Matcher m = p.matcher(s);
        // find() looks for the pattern anywhere in the input.
        if (m.find()) {
            System.out.println(m.group());
        } else {
            System.out.println("ERR: no match");
        }
    }
}
Perhaps the problem was with the way you were using the Pattern and Matcher objects?
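One way the Pattern and Matcher usage could go wrong, offered only as a guess since the question does not show its code, is calling matches() instead of find(). matches() requires the pattern to cover the entire input, and because "." does not cross line terminators by default, that cannot succeed on a multi-line string. The sketch below contrasts the two calls; the class and method names are illustrative only.

```java
import java.util.regex.Pattern;

class FindVersusMatches {
    // The pattern from the question, written as a Java string literal.
    static final Pattern PIG = Pattern.compile(".*www.*\\s.*Pig.*");

    // True only if the pattern covers the whole input.
    static boolean wholeInputMatches(String input) {
        return PIG.matcher(input).matches();
    }

    // True if the pattern matches some substring of the input.
    static boolean substringMatches(String input) {
        return PIG.matcher(input).find();
    }

    public static void main(String[] args) {
        String s = "www.foo.com/Archives/monkeys.htm\n"
                 + "Description of Monkey's website.\n"
                 + "\n"
                 + "www.foo.com/Archives/pigs.txt\n"
                 + "Description of Pig's website.\n";
        System.out.println("matches(): " + wholeInputMatches(s));
        System.out.println("find():    " + substringMatches(s));
    }
}
```

On this input matches() reports false and find() reports true.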
Answered by Alan Moore
The lines are probably separated by \r\n in your file. Both \r (carriage return) and \n (linefeed) are considered line-separator characters in Java regexes, and the . metacharacter won't match either of them. \s will match those characters, so it consumes the \r, but that leaves .* to match the \n, which fails. Your tester probably used just \n to separate the lines, which was consumed by \s.
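The behaviour described above can be checked with a small sketch. It assumes the file really does use \r\n, which the question does not confirm; the class and method names are made up for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineSeparatorDemo {
    // The pattern from the question: a single \s between the two lines.
    static final Pattern SINGLE_WS = Pattern.compile(".*www.*\\s.*Pig.*");

    // Returns the first match, or null if there is none.
    static String firstMatch(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String unix = "www.foo.com/Archives/pigs.txt\n"
                    + "Description of Pig's website.\n";
        // Same text with Windows-style separators.
        String windows = unix.replace("\n", "\r\n");

        System.out.println(firstMatch(SINGLE_WS, unix) != null);
        System.out.println(firstMatch(SINGLE_WS, windows) != null);
    }
}
```

The first line printed is true and the second is false: with \r\n, \s consumes the \r and nothing in the rest of the pattern can consume the \n.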
If I'm right, changing the \s to \s+ or [\r\n]+ should get it to work. That's probably all you need to do in this case, but sometimes you have to match exactly one line separator, or at least keep track of how many you're matching. In that case you need a regex that matches exactly one of any of the three most common line separator types: \r\n (Windows/DOS), \n (Unix/Linux/OSX) and \r (older Macs). Either of these will do:
\r\n|[\r\n]
\r\n|\n|\r
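As a rough illustration of both suggestions, the sketch below applies the \s+ fix to the question's pattern and uses the first of the two alternations to count line separators one at a time. The class and method names are hypothetical, and the sample strings are invented for the demonstration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OneLineSeparator {
    // Exactly one separator per match; \r\n is tried first so it counts once.
    static final Pattern LINE_BREAK = Pattern.compile("\\r\\n|[\\r\\n]");

    // The question's pattern with \s relaxed to \s+.
    static final Pattern PIG = Pattern.compile(".*www.*\\s+.*Pig.*");

    static int countLineBreaks(String input) {
        Matcher m = LINE_BREAK.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    static boolean containsPigEntry(String input) {
        return PIG.matcher(input).find();
    }

    public static void main(String[] args) {
        // One Windows, one Unix and one old-Mac separator.
        System.out.println(countLineBreaks("one\r\ntwo\nthree\rfour"));

        String windows = "www.foo.com/Archives/pigs.txt\r\n"
                       + "Description of Pig's website.\r\n";
        System.out.println(containsPigEntry(windows));
    }
}
```

This prints 3 and then true. Putting \r\n before the character class matters: with the order reversed, a Windows separator would be counted as two.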
Update: As of Java 8 we have another option, \R. It matches any line separator, including not just \r\n, but several others as defined by the Unicode standard. It's equivalent to this:
\r\n|[\n\x0B\x0C\r\u0085\u2028\u2029]
Here's how you might use it:
(?im)^.*www.*\R.*Pig.*$
The i option makes it case-insensitive, and the m puts it in multiline mode, allowing ^ and $ to match at line boundaries.
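A minimal sketch of that regex in Java code follows, assuming Java 8 or later so that \R is available. The class and method names are illustrative, and the input is a shortened version of the question's data with \r\n separators.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackslashRDemo {
    // (?i) = case-insensitive, (?m) = multiline; \R matches one line separator.
    static final Pattern ENTRY = Pattern.compile("(?im)^.*www.*\\R.*Pig.*$");

    // Returns the URL line plus the description line, or null if not found.
    static String findEntry(String input) {
        Matcher m = ENTRY.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String s = "www.foo.com/Archives/monkeys.htm\r\n"
                 + "Description of Monkey's website.\r\n"
                 + "\r\n"
                 + "www.foo.com/Archives/pigs.txt\r\n"
                 + "Description of Pig's website.\r\n";
        System.out.println(findEntry(s));
    }
}
```

The match covers the pigs.txt line, the \r\n after it, and the description line, without the trailing separator.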
Answered by user414661
Try this:
([^\r]+\r[^\r])+
Answered by Gary
This version matches newlines that may be either Windows (\r\n) or Unix (\n):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Foo {
    public static void main(String[] args) {
        // Group 1 is the URL line, group 5 the description line;
        // groups 2-4 capture the line separator (\r\n or \n).
        Pattern p = Pattern.compile("(www.*)((\r\n)|(\n))(.*Pig.*)");
        String s = "www.foo.com/Archives/monkeys.htm\n"
                 + "Description of Monkey's website.\n"
                 + "\r\n"
                 + "www.foo.com/Archives/pigs.txt\r\n"
                 + "Description of Pig's website.\n"
                 + "\n"
                 + "www.foo.com/Archives/kitty.txt\n"
                 + "Description of Kitty's website.\n"
                 + "\n"
                 + "www.foo.com/Archives/apple.htm\n"
                 + "Description of Apple's website.\n";
        Matcher m = p.matcher(s);
        if (m.find()) {
            System.out.println("found: " + m.group());
            System.out.println("website: " + m.group(1));
            System.out.println("description: " + m.group(5));
        }
        System.out.println("done");
    }
}
Answered by javaPhobic
For future reference, one can also use the Pattern.DOTALL flag for "." to match even \r or \n.
Example:
Say we are parsing a single string of HTTP header lines like this (each line ends with \r\n):
HTTP/1.1 302 Found
Server: Apache-Coyote/1.1
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Pragma: no-cache
Expires: 0
X-Frame-Options: SAMEORIGIN
Location: http://localhost:8080/blah.htm
Content-Length: 0
This pattern:
static final Pattern PATTERN_LOCATION = Pattern.compile(".*?Location\\: (.*?)\r.*?", Pattern.DOTALL);
can parse the location value using matcher.group(1).
The "." in the above pattern will match \r and \n, so the above pattern can actually parse the 'Location' from the http header lines, where there might be other headers before or after the target line (not that this is a recommended way to parse http headers).
Also, you can use "(?s)" inside the pattern to achieve the same effect.
If you are doing this, you might be better off using Matcher.find().
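To see what DOTALL changes, the sketch below runs the same header text through matches() with and without the flag, and through find() with the inline (?s) form. It is an illustration under assumptions, not a recommended HTTP parser: the class and method names are hypothetical, the header text is a shortened copy of the example above, and the colon is left unescaped because it needs no escaping in a regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LocationHeaderDemo {
    static final String REGEX = ".*?Location: (.*?)\\r.*?";

    // matches() must cover the whole input, so "." has to cross line breaks.
    static String viaMatches(String headers, int flags) {
        Matcher m = Pattern.compile(REGEX, flags).matcher(headers);
        return m.matches() ? m.group(1) : null;
    }

    // find() with the inline (?s) flag, the embedded form of Pattern.DOTALL.
    static String viaFind(String headers) {
        Matcher m = Pattern.compile("(?s)Location: (.*?)\\r").matcher(headers);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String headers = "HTTP/1.1 302 Found\r\n"
                       + "Server: Apache-Coyote/1.1\r\n"
                       + "Location: http://localhost:8080/blah.htm\r\n"
                       + "Content-Length: 0\r\n";
        System.out.println(viaMatches(headers, Pattern.DOTALL));
        System.out.println(viaMatches(headers, 0));
        System.out.println(viaFind(headers));
    }
}
```

With Pattern.DOTALL the first call returns the URL; with no flags, matches() fails and the method returns null; find() returns the URL as well, since it only needs a substring to match.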