Python regex to match closing HTML tags

Question

Asked by kevin628

I'm working on a small Python script to clean up HTML documents. It works by accepting a list of tags to KEEP and then parsing through the HTML code, trashing tags that are not in the list. I've been using regular expressions to do it, and I've been able to match opening tags and self-closing tags but not closing tags.

The pattern I've been experimenting with to match closing tags is </(?!a)>. This seems logical to me, so why is it not working? The (?!a) should match anything that is NOT an anchor tag (note that the "a" can be anything; it's just an example).

Edit: AGG! I guess the regex didn't show!
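For readers wondering why the pattern above matches nothing: a negative lookahead such as (?!a) is zero-width, so </(?!a)> can only ever match the literal text "</>". The sketch below illustrates this. The second pattern, closing_not_a, is not from the question; it is one assumed way to keep the lookahead and still consume the tag name.

```python
import re

html = '<p>Read <a href="#">this</a> and <b>that</b>.</p>'

# The lookahead (?!a) consumes no characters, so this pattern only
# matches the three-character string "</>".
broken = re.compile(r"</(?!a)>")

# Assumed fix (not from the question): keep the lookahead, then consume
# the tag name. (?!a\b) rejects the name "a" but still allows "abbr".
closing_not_a = re.compile(r"</(?!a\b)([A-Za-z][A-Za-z0-9]*)\s*>")

print(broken.findall(html))         # []
print(closing_not_a.findall(html))  # ['b', 'p']
```

This only shows how the lookahead behaves. It does not make regular expressions a reliable way to process arbitrary HTML.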

Answer 1

Accepted answer by NullUserException

Read:
- RegEx match open tags except XHTML self-contained tags
- Can you provide some examples of why it is hard to parse XML and HTML with a regex?

Repent.

Use a real HTML parser, like BeautifulSoup.

Answer 2

Answer by NullUserException

Don't use regex to parse HTML. It will only give you headaches.

Use an XML parser instead. Try BeautifulSoup or lxml.

Answer 3

Answer by pavanlimo

<TAG\b[^>]*>(.*?)</TAG>

Matches the opening and closing pair of a specific HTML tag.

<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>

Will match the opening and closing pair of any HTML tag.

See here.
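As a rough illustration of how the two patterns behave in Python, here is a minimal sketch. It assumes the second pattern closes with the backreference \1, that "b" stands in for TAG, and that the flags re.IGNORECASE and re.DOTALL are wanted. The sample strings are made up.

```python
import re

html = "<div><b>bold</b> and <i>italic</i></div>"

# Pair of one specific tag; "b" stands in for TAG.
specific = re.compile(r"<b\b[^>]*>(.*?)</b>", re.IGNORECASE | re.DOTALL)

# Pair of any tag; \1 must repeat the name captured by group 1.
any_tag = re.compile(
    r"<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>", re.IGNORECASE | re.DOTALL
)

print(specific.findall(html))                        # ['bold']
print([m.group(1) for m in any_tag.finditer(html)])  # ['div']

# Nested tags with the same name pair up wrongly: the lazy (.*?)
# stops at the first </div>, not at the matching one.
nested = "<div><div>x</div></div>"
print(any_tag.search(nested).group(0))               # <div><div>x</div>
```

The last line shows the kind of mismatch the other answers warn about: the pattern has no notion of nesting depth.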

Answer 4

Answer by Ama Aje My Fren

You may also consider using the HTML parser that is built into Python (documentation for Python 2 and Python 3).

This will help you home in on the specific area of the HTML document you would like to work on, and then use regular expressions on it.
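A minimal sketch of the keep-list cleanup from the question, built on the standard library's html.parser in Python 3. The names TagFilter and strip_tags are hypothetical, and comments, doctypes and processing instructions are simply dropped here, which may or may not be what you want.

```python
from html.parser import HTMLParser


class TagFilter(HTMLParser):
    """Re-emit a document, dropping every tag whose name is not in `keep`."""

    def __init__(self, keep):
        # convert_charrefs=False so entities are passed through unchanged.
        super().__init__(convert_charrefs=False)
        self.keep = {name.lower() for name in keep}
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.keep:
            # get_starttag_text() returns the tag as written in the source.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: emit it once, with no end tag.
        if tag in self.keep:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.keep:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def strip_tags(html, keep):
    parser = TagFilter(keep)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = '<div><p>Hi <a href="#">there</a> &amp; <b>bye</b></p></div>'
print(strip_tags(sample, ["p", "a"]))
# <p>Hi <a href="#">there</a> &amp; bye</p>
```

The text inside removed tags is kept; only the tags themselves are discarded. Dropping the content of removed elements as well would need extra state, for example a depth counter.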
