Python regular expression to match closing HTML tags

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow, original URL: http://stackoverflow.com/questions/3524364/

Time: 2020-08-18 11:34:02  Source: igfitidea

Regular expression to match closing HTML tags

python html regex

Asked by kevin628

I'm working on a small Python script to clean up HTML documents. It works by accepting a list of tags to KEEP and then parsing through the HTML code, trashing tags that are not in the list. I've been using regular expressions to do it, and I've been able to match opening tags and self-closing tags but not closing tags.

The pattern I've been experimenting with to match closing tags is </(?!a)>. This seems logical to me, so why is it not working? The (?!a) should match on anything that is NOT an anchor tag (note that the "a" can be anything; it's just an example).

Edit: AGG! I guess the regex didn't show!
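
For readers following along, here is a minimal sketch of why the pattern in the question matches nothing on real markup. It assumes the intent is "any closing tag except </a>". The sample string and the names broken and closing_not_a are made up for illustration. A lookahead such as (?!a) checks the next character without consuming it, so the tag name itself still has to be matched afterwards.

```python
import re

# Illustrative input, not taken from the original question.
html = '<p>Hello <a href="#">link</a> and <b>bold</b></p>'

# (?!a) is zero-width: it only asserts that the next character is not 'a'.
# Nothing consumes a tag name, so this pattern can only match the text '</>'.
broken = re.compile(r'</(?!a)>')
print(broken.findall(html))  # []

# Assumed fix: keep the lookahead, then actually consume a tag name.
# The \b keeps 'a' from also excluding longer names such as 'abbr'.
closing_not_a = re.compile(r'</(?!a\b)[A-Za-z][A-Za-z0-9]*\s*>')
print(closing_not_a.findall(html))  # ['</b>', '</p>']
```

To exclude several tags, the single name inside the lookahead could be replaced with an alternation such as (?:a|p), built from the keep list.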

Answered by NullUserException

Don't use regex to parse HTML. It will only give you headaches.

Use an XML parser instead. Try BeautifulSoup or lxml.

Answered by pavanlimo

<TAG\b[^>]*>(.*?)</TAG> 

Matches the opening and closing pair of a specific HTML tag.

<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>

Will match the opening and closing pair of any HTML tag.
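
A minimal sketch of this pair-matching pattern in Python, assuming the closing tag is written with the backreference \1 to the captured tag name. The sample strings and the name pair are illustrative only. re.IGNORECASE is added because the character classes are uppercase, and re.DOTALL lets the content span multiple lines.

```python
import re

# \1 refers back to whatever tag name group 1 captured.
pair = re.compile(r'<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>',
                  re.IGNORECASE | re.DOTALL)

# Illustrative input, not taken from the original answer.
sample = '<div class="x">text</div><span>more</span>'
matches = [(m.group(1), m.group(2)) for m in pair.finditer(sample)]
print(matches)  # [('div', 'text'), ('span', 'more')]

# Nested tags of the same name are a known weak spot: the lazy (.*?)
# stops at the first matching close tag, not the balanced one.
nested = pair.search('<div><div>a</div></div>')
print(nested.group(2))  # <div>a
```

The nested case is one reason the other answers suggest a real parser for anything beyond flat markup.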

See here.

Answered by Ama Aje My Fren

You may also consider using the HTML parser that is built into Python (documentation for Python 2 and Python 3).

This will help you home in on the specific area of the HTML document you would like to work on, and then use regular expressions on it.
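
A minimal sketch of how the built-in parser (html.parser in Python 3) might be applied to the cleanup task from the question: re-emit the document while dropping every tag that is not in a keep list. TagFilter and strip_tags are hypothetical names introduced here, and the sample input is made up. Comments and doctype declarations are simply dropped in this sketch.

```python
from html.parser import HTMLParser


class TagFilter(HTMLParser):
    """Hypothetical helper: re-emit the input, dropping tags not in `keep`."""

    def __init__(self, keep):
        # convert_charrefs=False so entities are passed through unchanged.
        super().__init__(convert_charrefs=False)
        self.keep = {name.lower() for name in keep}
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.keep:
            # get_starttag_text() returns the tag as written in the source.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>: emit once, with no extra end tag.
        if tag in self.keep:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.keep:
            self.out.append('</%s>' % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append('&%s;' % name)

    def handle_charref(self, name):
        self.out.append('&#%s;' % name)


def strip_tags(html, keep):
    """Return `html` with every tag outside `keep` removed (text is kept)."""
    parser = TagFilter(keep)
    parser.feed(html)
    parser.close()
    return ''.join(parser.out)


cleaned = strip_tags(
    '<div><p>Hi <a href="#">there</a> &amp; <b>bye</b></p></div>',
    ['p', 'a'],
)
print(cleaned)  # <p>Hi <a href="#">there</a> &amp; bye</p>
```

Because the parser reports opening, self-closing, and closing tags through separate callbacks, no lookahead tricks are needed to tell them apart.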
