Python multiline regular expressions
Notice: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/18943223/
Asked by AKASH
How do I extract all the characters (including newline characters) up to the first occurrence of a given sequence of words? For example, with the following input text:
"shantaram is an amazing novel.
It is one of the best novels i have read.
the novel is written by gregory david roberts.
He is an australian"
And the sequence "the".
I want to extract the text from "shantaram" to the first occurrence of "the", which is in the second line.
The output must be:
shantaram is an amazing novel.
It is one of the
I have been trying all morning. I can write an expression that extracts all characters until it encounters a specific character, but here, if I use an expression like:
re.search("shantaram[\s\S]*the", string)
It doesn't match across newlines.
Answered by Chris Seymour
You want to use the DOTALL option to match across newlines. From docs.python.org:
re.DOTALL
Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.
Demo:
In [1]: import re
In [2]: s="""shantaram is an amazing novel.
It is one of the best novels i have read.
the novel is written by gregory david roberts.
He is an australian"""
In [3]: print(re.findall('^.*?the', s, re.DOTALL)[0])
shantaram is an amazing novel.
It is one of the
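For readers who want to see what the flag changes, here is a minimal Python 3 sketch that runs the same pattern with and without re.DOTALL. It is an illustration, not part of the original answer; the names text, without_flag, and with_flag are made up for this example.

```python
import re

text = """shantaram is an amazing novel.
It is one of the best novels i have read.
the novel is written by gregory david roberts.
He is an australian"""

# Without re.DOTALL, '.' cannot cross the newline, and the first line
# contains no "the", so the search anchored at the start finds nothing.
without_flag = re.search(r"^.*?the", text)

# With re.DOTALL, '.' also matches '\n', so the lazy match can run on
# into the second line and stop at the first "the".
with_flag = re.search(r"^.*?the", text, re.DOTALL)

print(without_flag)
print(with_flag.group(0))
```

The lazy quantifier `*?` still matters here: with a greedy `.*` and re.DOTALL, the match would extend to the last "the" in the text instead of the first.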
Answered by rlms
A solution not using regex:
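from itertools import takewhile

def upto(a_string, stop):
    # Keep space-separated words until one equals stop (or stop preceded by a newline); stop itself is not included.
    words = a_string.split(" ")
    return " ".join(takewhile(lambda x: x != stop and x != "\n{}".format(stop), words))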
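If the stop sequence should be part of the result, as in the question's expected output, a plain substring search is another regex-free option. The following is a minimal sketch under that assumption; the function name upto_inclusive is hypothetical and does not come from the original answer.

```python
def upto_inclusive(text, stop):
    """Return text up to and including the first occurrence of stop.

    Returns None when stop does not occur in text.
    """
    index = text.find(stop)
    if index == -1:
        return None
    return text[:index + len(stop)]


sample = """shantaram is an amazing novel.
It is one of the best novels i have read.
the novel is written by gregory david roberts.
He is an australian"""

result = upto_inclusive(sample, "the")
print(result)
```

Note that str.find looks for a substring, not a whole word, so a stop value of "the" would also stop inside a word such as "other".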
Answered by lancif
Use this regex:
re.search("shantaram[\s\S]*?the", string)
instead of
re.search("shantaram[\s\S]*the", string)
The only difference is the '?'. Adding '?' after a quantifier (e.g. *?, +?) makes it non-greedy: it matches as few characters as possible, so the match ends at the first "the" rather than the last one.
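To make the difference concrete, here is a small Python 3 sketch that runs both patterns on the question's input. It is an illustration only; the names text, greedy, and lazy are chosen for this example, and raw strings are used so that \s and \S reach the regex engine unchanged.

```python
import re

text = """shantaram is an amazing novel.
It is one of the best novels i have read.
the novel is written by gregory david roberts.
He is an australian"""

# Greedy: [\s\S]* grabs as much as it can, then backtracks to the LAST "the".
greedy = re.search(r"shantaram[\s\S]*the", text).group(0)

# Non-greedy: [\s\S]*? grows one character at a time and stops at the FIRST "the".
lazy = re.search(r"shantaram[\s\S]*?the", text).group(0)

print(repr(greedy))
print(repr(lazy))
```

Both patterns cross newlines, because the class [\s\S] matches any character. The greedy version simply runs past the wanted "the" to the one at the start of the third line.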