Python multiline regular expressions

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it under the same license, but you must attribute it to the original authors (not me) and cite the original address: http://stackoverflow.com/questions/18943223/

Date: 2020-08-19 12:20:58  Source: igfitidea

python regex

Asked by AKASH

How do I extract all the characters (including newline characters) up to the first occurrence of a given sequence of words? For example, with the following input:

Input text:

"shantaram is an amazing novel.
It is one of the best novels i have read.
the novel is written by gregory david roberts.
He is an australian"

The sequence is "the". I want to extract the text from "shantaram" to the first occurrence of "the", which is in the second line.

The output must be:

shantaram is an amazing novel.
It is one of the

I have been trying all morning. I can write an expression that extracts all characters until it encounters a specific character, but here, if I use an expression like:

re.search("shantaram[\s\S]*the", string)

It doesn't match across newlines.

Answered by Chris Seymour

You want to use the DOTALL option to match across newlines. From docs.python.org:

re.DOTALL

Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.

Demo:

In [1]: import re

In [2]: s="""shantaram is an amazing novel.
It is one of the best novels i have read.
the novel is written by gregory david roberts.
He is an australian"""

In [3]: print(re.findall('^.*?the', s, re.DOTALL)[0])
shantaram is an amazing novel.
It is one of the
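As a rough illustration of what the flag changes, the sketch below runs the same pattern with and without re.DOTALL on the question's text. The variable names (text, pattern, without_flag, with_flag) are chosen here for illustration and are not part of the answer.

```python
import re

text = (
    "shantaram is an amazing novel.\n"
    "It is one of the best novels i have read.\n"
    "the novel is written by gregory david roberts.\n"
    "He is an australian"
)

pattern = r"^.*?the"

# Without DOTALL, '.' cannot cross the newline, and the first line
# contains no "the", so nothing matches from the start of the string.
without_flag = re.search(pattern, text)

# With DOTALL, '.' also consumes newlines, so the lazy match can run
# into the second line and stop at the first "the".
with_flag = re.search(pattern, text, re.DOTALL)

print(without_flag)
print(with_flag.group(0))
```

re.search is used here instead of re.findall only because a single match is wanted; either call accepts the flag in the same position.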

Answered by rlms

A solution not using regex:

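from itertools import takewhile
def upto(a_string, stop):
    # Split on spaces and keep whole words until the stop word (excluded).
    # Newlines stay attached to words, so a stop word next to a line break is not detected.
    return " ".join(takewhile(lambda x: x != stop, a_string.split(" ")))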
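For comparison, here is a minimal non-regex sketch that includes the stop sequence in the result, as the question's expected output does. It swaps in str.find instead of takewhile, so it is not the answerer's method; the function name upto_inclusive and the choice to return the whole text when the stop sequence is absent are assumptions made for this example.

```python
def upto_inclusive(text, stop):
    """Return text up to and including the first occurrence of stop."""
    end = text.find(stop)
    if end == -1:
        # Assumption: hand back the whole text when stop never occurs.
        return text
    return text[: end + len(stop)]


text = (
    "shantaram is an amazing novel.\n"
    "It is one of the best novels i have read.\n"
    "the novel is written by gregory david roberts.\n"
    "He is an australian"
)

result = upto_inclusive(text, "the")
print(result)
```

Note that str.find matches substrings, so "the" inside a longer word such as "other" would also end the extraction.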

Answered by lancif

Use this regex,

re.search("shantaram[\s\S]*?the", string)

instead of

re.search("shantaram[\s\S]*the", string)

The only difference is the '?'. Appending '?' to a quantifier (e.g. *?, +?) makes it non-greedy, so it matches as few characters as possible instead of producing the longest possible match.
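To make the difference concrete, the sketch below applies both patterns to the question's text. The variable names are illustrative, and raw strings (the r prefix) are used here so that \s and \S reach the regex engine without Python treating them as string escapes.

```python
import re

text = (
    "shantaram is an amazing novel.\n"
    "It is one of the best novels i have read.\n"
    "the novel is written by gregory david roberts.\n"
    "He is an australian"
)

# Greedy: [\s\S]* grabs as much as it can, then backtracks to the LAST "the".
greedy = re.search(r"shantaram[\s\S]*the", text).group(0)

# Non-greedy: [\s\S]*? expands one character at a time and stops at the FIRST "the".
lazy = re.search(r"shantaram[\s\S]*?the", text).group(0)

print(repr(greedy))
print(repr(lazy))
```

Both patterns cross the newline, since [\s\S] matches any character; only the amount of text consumed differs.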