Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/2199552/

Time: 2020-11-04 00:03:03  Source: igfitidea

Matching multiple lines in a Python regular expression

python

Asked by Sreejith Sasidharan

I want to extract the data between <tr> tags from an HTML page. I used the following code, but I didn't get any result. The HTML between the <tr> tags spans multiple lines.

category = re.findall('<tr>(.*?)</tr>', data)

Please suggest a fix for this problem.

Answered by SilentGhost

Just to clear up the issue: despite all those links to re.M, it wouldn't work here, as a quick skim of its explanation would reveal. You'd need re.S (if you weren't trying to parse HTML, of course):

>>> import re
>>> doc = """<table border="1">
    <tr>
        <td>row 1, cell 1</td>
        <td>row 1, cell 2</td>
    </tr>
    <tr>
        <td>row 2, cell 1</td>
        <td>row 2, cell 2</td>
    </tr>
</table>"""

>>> re.findall('<tr>(.*?)</tr>', doc, re.S)
['\n        <td>row 1, cell 1</td>\n        <td>row 1, cell 2</td>\n    ', 
 '\n        <td>row 2, cell 1</td>\n        <td>row 2, cell 2</td>\n    ']
>>> re.findall('<tr>(.*?)</tr>', doc, re.M)
[]
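
The difference between the two flags can be checked directly. The following is a minimal sketch using only the standard `re` module; the sample string `text` is made up for illustration. `re.S` (`re.DOTALL`) lets `.` match newline characters, while `re.M` (`re.MULTILINE`) only changes where `^` and `$` match.

```python
import re

# Hypothetical sample: one <tr> element spread over three lines.
text = "<tr>\nrow 1\n</tr>"

# Without flags, "." stops at the first newline, so nothing matches.
print(re.findall(r"<tr>(.*?)</tr>", text))        # []

# re.M only affects the anchors ^ and $, so it does not help either.
print(re.findall(r"<tr>(.*?)</tr>", text, re.M))  # []

# re.S (alias re.DOTALL) lets "." match newlines as well.
print(re.findall(r"<tr>(.*?)</tr>", text, re.S))  # ['\nrow 1\n']

# The inline flag (?s) is equivalent to passing re.S.
print(re.findall(r"(?s)<tr>(.*?)</tr>", text))    # ['\nrow 1\n']

# What re.M does change: ^ and $ now match at every line boundary.
print(re.findall(r"^row \d$", text, re.M))        # ['row 1']
```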

Answered by Mark Byers

Don't use regex; use an HTML parser such as BeautifulSoup:

html = '<html><body>foo<tr>bar</tr>baz<tr>qux</tr></body></html>'

from bs4 import BeautifulSoup  # provided by the beautifulsoup4 package

soup = BeautifulSoup(html, "html.parser")
print(soup.findAll("tr"))

Result:

[<tr>bar</tr>, <tr>qux</tr>]

If you just want the contents, without the tr tags:

for tr in soup.findAll("tr"):
    print(tr.string)  # the text inside each <tr>

Result:

bar
qux

Using an HTML parser isn't as scary as it sounds! And it will work more reliably than any regex that will be posted here.
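
If installing a third-party package is not an option, the same idea can be sketched with the standard library's `html.parser` module. This is only an illustration, not what the answer above uses: the class name `TrCollector` and its attributes are hypothetical, and it assumes the rows are not nested inside one another.

```python
from html.parser import HTMLParser


class TrCollector(HTMLParser):
    """Collects the text found inside each <tr> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one string per completed <tr>
        self._current = None  # text pieces of the <tr> being read, or None

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attributes are ignored here.
        if tag == "tr":
            self._current = []

    def handle_endtag(self, tag):
        if tag == "tr" and self._current is not None:
            self.rows.append("".join(self._current).strip())
            self._current = None

    def handle_data(self, data):
        # Only keep text that appears while a <tr> is open.
        if self._current is not None:
            self._current.append(data)


html = '<html><body>foo<tr>bar</tr>baz<tr>qux</tr></body></html>'
parser = TrCollector()
parser.feed(html)
parser.close()
print(parser.rows)  # ['bar', 'qux']
```

Because the parser reports tag names separately from their attributes, rows such as `<tr class="odd">` or `<TR>` are collected the same way.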

Answered by Ignacio Vazquez-Abrams

Do not use regular expressions to parse HTML. Use an HTML parser such as lxml or BeautifulSoup.

Answered by ghostdog74

pat = re.compile('<tr>(.*?)</tr>', re.DOTALL | re.M)
print(pat.findall(data))

Or, a non-regex way:

for item in data.split("</tr>"):
    if "<tr>" in item:
        print(item[item.find("<tr>") + len("<tr>"):])

Answered by Tendayi Mawushe

As others have suggested, the specific problem you are having can be resolved by letting . match newlines with re.DOTALL (re.MULTILINE on its own only changes how ^ and $ match, so it does not help here).

However, you are going down a treacherous path parsing HTML with regular expressions. Use an XML/HTML parser instead; BeautifulSoup works great for this!

doc = """<table border="1">
    <tr>
        <td>row 1, cell 1</td>
        <td>row 1, cell 2</td>
    </tr>
    <tr>
        <td>row 2, cell 1</td>
        <td>row 2, cell 2</td>
    </tr>
</table>"""

from bs4 import BeautifulSoup  # provided by the beautifulsoup4 package

soup = BeautifulSoup(doc, "html.parser")
all_trs = soup.findAll("tr")  # list of the two <tr> elements
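
To illustrate why the regex route is treacherous, here is a small sketch with made-up inputs (the variable names are hypothetical). Even with `re.S`, the pattern from the question only matches a bare, lowercase `<tr>`, so rows that a browser would accept without complaint are silently skipped.

```python
import re

pattern = re.compile(r"<tr>(.*?)</tr>", re.S)

# Hypothetical inputs that a browser would treat as ordinary table rows.
plain = "<tr>\n<td>a</td>\n</tr>"
with_attr = '<tr class="odd">\n<td>b</td>\n</tr>'
upper = "<TR>\n<td>c</td>\n</TR>"

print(pattern.findall(plain))      # ['\n<td>a</td>\n']
print(pattern.findall(with_attr))  # []  (the attribute breaks the literal "<tr>")
print(pattern.findall(upper))      # []  (the pattern is case-sensitive)
```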