Original source: http://stackoverflow.com/questions/2199552/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Matching multiple lines in a Python regular expression
Asked by Sreejith Sasidharan
I want to extract the data between <tr> tags from an HTML page. I used the following code, but I didn't get any result. The HTML between the <tr> tags spans multiple lines.
category = re.findall('<tr>(.*?)</tr>', data)
Please suggest a fix for this problem.
Answered by SilentGhost
Just to clear up the issue: despite all those links to re.M, it wouldn't work here, as a quick skim of its documentation would reveal. You'd need re.S (if you weren't trying to parse HTML, of course):
>>> doc = """<table border="1">
<tr>
<td>row 1, cell 1</td>
<td>row 1, cell 2</td>
</tr>
<tr>
<td>row 2, cell 1</td>
<td>row 2, cell 2</td>
</tr>
</table>"""
>>> re.findall('<tr>(.*?)</tr>', doc, re.S)
['\n <td>row 1, cell 1</td>\n <td>row 1, cell 2</td>\n ',
'\n <td>row 2, cell 1</td>\n <td>row 2, cell 2</td>\n ']
>>> re.findall('<tr>(.*?)</tr>', doc, re.M)
[]
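The difference between the two flags in the session above can be checked in isolation. Here is a minimal Python 3 sketch; the `doc` string is a shortened stand-in invented for illustration, not the asker's actual page. As I understand the `re` documentation, re.M (re.MULTILINE) only changes where `^` and `$` match, while re.S (re.DOTALL) is what lets `.` match a newline.

```python
import re

# Shortened stand-in for the table above (illustrative data only).
doc = "<table>\n<tr>\n<td>row 1</td>\n</tr>\n<tr>\n<td>row 2</td>\n</tr>\n</table>"

pattern = r"<tr>(.*?)</tr>"

# Default: '.' stops at newlines, so a row spanning several lines never matches.
no_flag = re.findall(pattern, doc)

# re.M (re.MULTILINE) only affects '^' and '$'; '.' still stops at newlines.
multiline = re.findall(pattern, doc, re.M)

# re.S (re.DOTALL) makes '.' match newlines too, so each row is captured.
dotall = re.findall(pattern, doc, re.S)

print(no_flag)
print(multiline)
print(dotall)
```

Only the re.S call should return the two rows; the other two calls should come back empty.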
Answered by Mark Byers
Don't use regex; use an HTML parser such as BeautifulSoup:
html = '<html><body>foo<tr>bar</tr>baz<tr>qux</tr></body></html>'
import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(html)
print soup.findAll("tr")
Result:
[<tr>bar</tr>, <tr>qux</tr>]
If you just want the contents, without the tr tags:
for tr in soup.findAll("tr"):
    print tr.contents
Result:
bar
qux
Using an HTML parser isn't as scary as it sounds! And it will work more reliably than any regex that will be posted here.
Answered by Ignacio Vazquez-Abrams
Do not use regular expressions to parse HTML. Use an HTML parser such as lxml or BeautifulSoup.
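If installing lxml or BeautifulSoup is not an option, the same idea can be sketched with Python 3's standard-library `html.parser`. This is a swapped-in technique, not what the answers here use. The `RowCollector` class and its attribute names are hypothetical, made up for this example. The sketch assumes flat, non-nested tables.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text found inside each <tr> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one list of text fragments per <tr>
        self._in_row = False  # True while between <tr> and </tr>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._in_row = True
            self.rows.append([])

    def handle_endtag(self, tag):
        if tag == "tr":
            self._in_row = False

    def handle_data(self, data):
        # Keep non-blank text that appears inside a row.
        text = data.strip()
        if self._in_row and text:
            self.rows[-1].append(text)


doc = """<table border="1">
<tr>
<td>row 1, cell 1</td>
<td>row 1, cell 2</td>
</tr>
<tr>
<td>row 2, cell 1</td>
<td>row 2, cell 2</td>
</tr>
</table>"""

collector = RowCollector()
collector.feed(doc)
collector.close()
print(collector.rows)
```

Because the parser tracks tags rather than characters, line breaks inside a row need no special flag. Nested tables would need a depth counter, which this sketch leaves out.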
Answered by ghostdog74
pat=re.compile('<tr>(.*?)</tr>',re.DOTALL|re.M)
print pat.findall(data)
Or, a non-regex way:
for item in data.split("</tr>"):
    if "<tr>" in item:
        print item[item.find("<tr>")+len("<tr>"):]
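For anyone on Python 3, the split-based loop above can be wrapped in a small function that returns the rows instead of printing them. This is only a sketch: the name `rows_by_split` and the sample `data` string are invented for illustration. It assumes every row opens with a bare `<tr>` (no attributes).

```python
def rows_by_split(data):
    """Return the text after each '<tr>' up to the next '</tr>' (hypothetical helper)."""
    rows = []
    # Each chunk ends where a '</tr>' used to be.
    for item in data.split("</tr>"):
        start = item.find("<tr>")
        if start != -1:
            # Keep only what follows the opening tag.
            rows.append(item[start + len("<tr>"):])
    return rows


# Illustrative data only.
data = "<table>\n<tr>\n<td>a</td>\n</tr>\n<tr>\n<td>b</td>\n</tr>\n</table>"
rows = rows_by_split(data)
print(rows)
```

Plain string splitting is indifferent to newlines, which is why it sidesteps the flag question. It breaks as soon as a tag carries attributes or different casing.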
Answered by Tendayi Mawushe
As others have suggested, the specific problem you are having can be resolved by allowing multi-line matching using re.MULTILINE.
However, you are going down a treacherous path parsing HTML with regular expressions. Use an XML/HTML parser instead; BeautifulSoup works great for this!
doc = """<table border="1">
<tr>
<td>row 1, cell 1</td>
<td>row 1, cell 2</td>
</tr>
<tr>
<td>row 2, cell 1</td>
<td>row 2, cell 2</td>
</tr>
</table>"""
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(doc)
all_trs = soup.findAll("tr")