Python: match multiple lines in a Python regular expression

Question

Asked by Sreejith Sasidharan

I want to extract the data between <tr> tags from an HTML page. I used the following code, but I didn't get any results. The HTML between the <tr> tags spans multiple lines.

category = re.findall('<tr>(.*?)</tr>', data)

Please suggest a fix for this problem.


Answer 1

Answered by SilentGhost

Just to clear up the issue: despite all those links to re.M, it wouldn't work here, as a quick skim of its documentation would reveal. You'd need re.S, assuming, of course, that you aren't trying to parse HTML:

>>> import re
>>> doc = """<table border="1">
    <tr>
        <td>row 1, cell 1</td>
        <td>row 1, cell 2</td>
    </tr>
    <tr>
        <td>row 2, cell 1</td>
        <td>row 2, cell 2</td>
    </tr>
</table>"""

>>> re.findall('<tr>(.*?)</tr>', doc, re.S)
['\n        <td>row 1, cell 1</td>\n        <td>row 1, cell 2</td>\n    ', 
 '\n        <td>row 2, cell 1</td>\n        <td>row 2, cell 2</td>\n    ']
>>> re.findall('<tr>(.*?)</tr>', doc, re.M)
[]
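
To make the difference between the two flags concrete, here is a small sketch. The sample string `text` is made up for illustration and is not from the question. re.M only changes where ^ and $ may match, while re.S changes what . may match; the inline form (?s) and the character class [\s\S] are other ways to let a pattern span lines.

```python
import re

# Hypothetical sample input, not taken from the question.
text = "<tr>\nfirst\n</tr>\n<tr>\nsecond\n</tr>"

pattern = r"<tr>(.*?)</tr>"

# Default: '.' stops at newlines, so nothing matches.
no_flag = re.findall(pattern, text)

# re.M (MULTILINE) only lets ^ and $ match at line boundaries; '.' is unchanged.
with_m = re.findall(pattern, text, re.M)

# re.S (DOTALL) lets '.' match newlines, so each row is captured.
with_s = re.findall(pattern, text, re.S)

# Two spellings that need no flags argument.
inline_s = re.findall(r"(?s)<tr>(.*?)</tr>", text)
char_class = re.findall(r"<tr>([\s\S]*?)</tr>", text)

# What re.M is actually for: anchoring at the start of every line.
line_starts = re.findall(r"^<tr>", text, re.M)

print(no_flag, with_m, with_s, line_starts)
```

Only the re.S variants (and their flag-free equivalents) return the two rows; re.M returns an empty list for this pattern because the pattern contains no ^ or $.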

Answer 2

Answered by Mark Byers

Don't use regex; use an HTML parser such as BeautifulSoup:

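from bs4 import BeautifulSoup  # BeautifulSoup 4 package (pip install beautifulsoup4)

html = '<html><body>foo<tr>bar</tr>baz<tr>qux</tr></body></html>'

# "html.parser" is the standard-library backend; findAll is the legacy spelling of find_all
soup = BeautifulSoup(html, "html.parser")
print(soup.findAll("tr"))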

Result:


[<tr>bar</tr>, <tr>qux</tr>]

If you just want the contents, without the tr tags:


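for tr in soup.findAll("tr"):
    # .contents is a list of child nodes, so join them to print the inner markup
    print("".join(str(child) for child in tr.contents))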

Result:


bar
qux

Using an HTML parser isn't as scary as it sounds! And it will work more reliably than any regex that will be posted here.
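
To illustrate the reliability point, here is a minimal sketch of how the pattern from the question behaves on slightly different markup. The sample `html` string and the `patched` pattern are assumptions made up for this illustration; they do not come from any of the answers.

```python
import re

# Hypothetical markup: one plain row, one with an attribute, one in upper case.
html = """<table>
<tr><td>a</td></tr>
<tr class="odd"><td>b</td></tr>
<TR><td>c</td></TR>
</table>"""

# The pattern from the question, with re.S so it can span lines.
# It only recognises the literal, lower-case, attribute-free "<tr>".
rows = re.findall(r"<tr>(.*?)</tr>", html, re.S)

# A patched pattern that tolerates attributes and letter case.
# It still knows nothing about comments, nested tables, or unclosed tags.
patched = re.findall(r"<tr\b[^>]*>(.*?)</tr\s*>", html, re.S | re.I)

print(rows)
print(patched)
```

The first pattern finds only the plain row. The patched one finds all three here, but each new variation in the markup needs another patch, which is the work an HTML parser already does.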


Answer 3

Answered by Ignacio Vazquez-Abrams

Do not use regular expressions to parse HTML. Use an HTML parser such as lxml or BeautifulSoup.
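
If installing lxml or BeautifulSoup is not an option, Python's standard library ships a basic parser in html.parser. The following is a minimal sketch of that alternative, not something proposed in this answer; the class name `RowTextCollector` and the sample `doc` are hypothetical, and the sketch does not try to handle nested tables.

```python
from html.parser import HTMLParser


class RowTextCollector(HTMLParser):
    """Collect the text found inside each <tr> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one list of text fragments per <tr>
        self._in_row = False

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased, and attributes are already split off.
        if tag == "tr":
            self._in_row = True
            self.rows.append([])

    def handle_endtag(self, tag):
        if tag == "tr":
            self._in_row = False

    def handle_data(self, data):
        # Keep only non-blank text that appears inside a row.
        if self._in_row and data.strip():
            self.rows[-1].append(data.strip())


doc = """<table border="1">
    <tr>
        <td>row 1, cell 1</td>
        <td>row 1, cell 2</td>
    </tr>
    <TR class="odd">
        <td>row 2, cell 1</td>
        <td>row 2, cell 2</td>
    </TR>
</table>"""

collector = RowTextCollector()
collector.feed(doc)
collector.close()
print(collector.rows)
```

Because the parser normalises tag case and separates attributes from tag names, the second row is found even though it is written as `<TR class="odd">`.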

Answer 4

Answered by ghostdog74

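import re
# re.DOTALL lets '.' match newlines; re.M has no effect here (the pattern has no ^ or $)
pat = re.compile('<tr>(.*?)</tr>', re.DOTALL | re.M)
print(pat.findall(data))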

Or, without a regex:

for item in data.split("</tr>"):
    if "<tr>" in item:
        print(item[item.find("<tr>") + len("<tr>"):])
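
For a version that can be run as-is, the split approach can be wrapped in a function and checked against the regex. This is a sketch under assumptions: the function name `rows_by_split` and the sample `data` string are made up for illustration, since the question does not show its input.

```python
import re


def rows_by_split(data):
    """Return the text between each literal <tr> and the following </tr>."""
    rows = []
    for item in data.split("</tr>"):
        start = item.find("<tr>")
        if start != -1:
            rows.append(item[start + len("<tr>"):])
    return rows


# Hypothetical sample input.
data = "<table>\n<tr>\n<td>1</td>\n</tr>\n<tr>\n<td>2</td>\n</tr>\n</table>"

split_rows = rows_by_split(data)
regex_rows = re.findall("<tr>(.*?)</tr>", data, re.DOTALL)

print(split_rows)
print(split_rows == regex_rows)
```

Both approaches look for the literal strings `<tr>` and `</tr>`, so they share the same limitation: a row written as `<tr class="...">` or `<TR>` is not recognised.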

Answer 5

Answered by Tendayi Mawushe

As others have suggested, the specific problem you are having can be resolved by allowing matches to span multiple lines. Note that this requires re.DOTALL (re.S) rather than re.MULTILINE, which only affects how ^ and $ behave.

However, you are going down a treacherous path parsing HTML with regular expressions. Use an XML/HTML parser instead; BeautifulSoup works great for this!

doc = """<table border="1">
    <tr>
        <td>row 1, cell 1</td>
        <td>row 1, cell 2</td>
    </tr>
    <tr>
        <td>row 2, cell 1</td>
        <td>row 2, cell 2</td>
    </tr>
</table>"""

from bs4 import BeautifulSoup  # BeautifulSoup 4 package (pip install beautifulsoup4)
soup = BeautifulSoup(doc, "html.parser")
all_trs = soup.findAll("tr")
