Python: regex for links in HTML text
Notice: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Original question: http://stackoverflow.com/questions/430966/
Regex for links in html text
Asked by Adam Matan
I hope this question is not a RTFM one.
I am trying to write a Python script that extracts links from a standard HTML webpage (the <link href... tags).
I have searched the web for matching regexen and found many different patterns. Is there any agreed, standard regex to match links?
Adam
UPDATE: I am actually looking for two different answers:
- What's the library solution for parsing HTML links? Beautiful Soup seems to be a good solution (thanks, Igal Serban and cletus!)
- Can a link be defined using a regex?
Answer by cletus
Regexes with HTML get messy. Just use a DOM parser like Beautiful Soup.
Answer by Triptych
As others have suggested, if real-time-like performance isn't necessary, BeautifulSoup is a good solution:
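# Python 2 with BeautifulSoup 3, as in the original answer
import urllib2
from BeautifulSoup import BeautifulSoup

html = urllib2.urlopen("http://www.google.com").read()  # fetch the raw page
soup = BeautifulSoup(html)                              # build a parse tree
all_links = soup.findAll("a")                           # every <a> element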
As for the second question, yes, HTML links ought to be well-defined, but the HTML you actually encounter is very unlikely to be standard. The beauty of BeautifulSoup is that it uses browser-like heuristics to try to parse the non-standard, malformed HTML that you are likely to actually come across.
If you are certain to be working on standard XHTML, you can use (much) faster XML parsers like expat.
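For the strict-XHTML case, a minimal sketch using the expat bindings in Python's standard library (`xml.parsers.expat`) might look like the following. The function name `extract_hrefs` and the sample document are made up for illustration, and the sketch assumes the input is well-formed XML; expat raises `ExpatError` on anything else.

```python
import xml.parsers.expat


def extract_hrefs(xhtml):
    """Collect href attributes of <a> and <link> elements from well-formed XHTML."""
    hrefs = []

    def on_start(name, attrs):
        # attrs is a dict mapping attribute names to values
        if name in ("a", "link") and "href" in attrs:
            hrefs.append(attrs["href"])

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start
    parser.Parse(xhtml, True)  # True marks this as the final chunk of input
    return hrefs


# Hypothetical sample document, not taken from the original thread
sample = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><link rel="stylesheet" href="style.css" /></head>
<body><p><a href="http://stackoverflow.com"><b>stackoverflow</b></a></p></body>
</html>"""

print(extract_hrefs(sample))  # ['style.css', 'http://stackoverflow.com']
```

The trade-off is the one described above: this only works when the markup really is well-formed, which real-world pages often are not.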
Regex will never be a general solution, because a parser must maintain state and a regex can't do that.
Answer by Igal Serban
No, there isn't.
You can consider using Beautiful Soup. You can call it the standard for parsing HTML files.
Answer by bobince
Shouldn't a link be a well-defined regex?
No, [X]HTML is not in the general case parseable with regex. Consider examples like:
<link title='hello">world' href="x">link</link>
<!-- <link href="x">not a link</link> -->
<![CDATA[ ><link href="x">not a link</link> ]]>
<script>document.write('<link href="x">not a link</link>')</script>
and that's just a few random valid examples; if you have to cope with real-world tag-soup HTML there are a million malformed possibilities.
If you know and can rely on the exact output format of the target page you can get away with regex. Otherwise it is completely the wrong choice for scraping web pages.
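To make the examples above concrete, here is a rough sketch comparing a naive pattern with the standard library's `html.parser`. The pattern, the class name `LinkHrefs`, and the distinct file names are assumptions chosen for illustration (the CDATA case is left out); they are adapted from the examples above, not copied from them.

```python
import re
from html.parser import HTMLParser

# Three of the tricky cases, with distinct hrefs so the results can be told apart
TRICKY = """<link title='hello">world' href="real.css">
<!-- <link href="commented.css"> -->
<script>document.write('<link href="scripted.css">')</script>
"""


class LinkHrefs(HTMLParser):
    """Record the href of every <link> start tag the parser actually sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "link":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


# A naive pattern of the kind one finds on the web (an assumption, not a recommendation)
naive = re.findall(r'<link[^>]*href="([^"]*)"', TRICKY)

parser = LinkHrefs()
parser.feed(TRICKY)
parser.close()

print(naive)         # ['commented.css', 'scripted.css']
print(parser.hrefs)  # ['real.css']
```

In this sketch the regex misses the one real link, because the `>` inside the quoted title ends its `[^>]*` run early, and it reports the two that are not links at all. The parser tracks whether it is inside a comment, a script, or an attribute value, so it reports only the real one.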
Answer by Federico A. Ramponi
Shouldn't a link be a well-defined regex? This is a rather theoretical question,
I second PEZ's answer:
I don't think HTML lends itself to "well defined" regular expressions since it's not a regular language.
As far as I know, any HTML tag may contain any number of nested tags. For example:
<a href="http://stackoverflow.com">stackoverflow</a>
<a href="http://stackoverflow.com"><i>stackoverflow</i></a>
<a href="http://stackoverflow.com"><b><i>stackoverflow</i></b></a>
...
Thus, in principle, to match a tag properly you must be able at least to match strings of the form:
BE
BBEE
BBBEEE
...
BBBBBBBBBBEEEEEEEEEE
...
where B means the beginning of a tag and E means the end. That is, you must be able to match strings formed by any number of B's followed by the same number of E's. To do that, your matcher must be able to "count", and regular expressions (i.e. finite state automata) simply cannot do that (in order to count, an automaton needs at least a stack). Referring to PEZ's answer, HTML is a context-free grammar, not a regular language.
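The counting argument can be sketched in a few lines. The pattern `B+E+` below is one plausible attempt at a regular expression for these strings; it can check the shape (some B's, then some E's) but not that the two counts agree. The helper `is_bn_en` is a hypothetical name for a matcher that keeps a counter, which is the extra state a finite automaton lacks.

```python
import re


def is_bn_en(s):
    """True if s is n 'B's followed by exactly n 'E's, for some n >= 1."""
    depth = 0
    seen_end = False
    for ch in s:
        if ch == "B":
            if seen_end:        # a 'B' after an 'E' breaks the B...E shape
                return False
            depth += 1
        elif ch == "E":
            seen_end = True
            depth -= 1
            if depth < 0:       # more E's than B's so far
                return False
        else:
            return False
    return seen_end and depth == 0


shape_only = re.compile(r"B+E+")  # checks the shape, cannot compare the counts

for s in ["BE", "BBEE", "BBBEE", "BEE"]:
    print(s, bool(shape_only.fullmatch(s)), is_bn_en(s))
# BE True True
# BBEE True True
# BBBEE True False
# BEE True False
```

A regex with a fixed list of alternatives (`BE|BBEE|BBBEEE`) handles nesting only up to the depth someone wrote out by hand, which is why the argument is about arbitrary n.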
Answer by PEZ
It depends a bit on how the HTML is produced. If it's somewhat controlled you can get away with:
re.findall(r'''<link\s+.*?href=['"](.*?)['"].*?(?:</link|/)>''', html, re.I)
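As a quick illustration of what "somewhat controlled" means for the pattern above, the sketch below runs it on two made-up inputs. The sample strings and variable names are assumptions for the example only.

```python
import re

# The pattern from the answer above, compiled once for reuse
LINK_RE = re.compile(r'''<link\s+.*?href=['"](.*?)['"].*?(?:</link|/)>''', re.I)

# Controlled markup: every <link> is either self-closed or explicitly closed
controlled = """<link rel="stylesheet" href="style.css" />
<LINK href='print.css'></LINK>"""

# Plain HTML style: the tag is never closed, so the pattern finds no terminator
unclosed = '<link rel="stylesheet" href="plain.css">'

print(LINK_RE.findall(controlled))  # ['style.css', 'print.css']
print(LINK_RE.findall(unclosed))    # []
```

The second result shows the catch: an unclosed `<link>` is silently skipped because the pattern insists on `/>` or `</link>`.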
Answer by PEZ
Answering your two subquestions there.
- I've sometimes subclassed SGMLParser (included in the core Python distribution) and must say it's straightforward.
- I don't think HTML lends itself to "well defined" regular expressions since it's not a regular language.
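The `sgmllib` module that provided SGMLParser was removed in Python 3, so the subclassing approach cannot be shown as written on a current interpreter. The closest standard-library analogue is `html.parser.HTMLParser`; the sketch below assumes that substitution. The class name `LinkCollector`, the base URL, and the sample markup are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values from <a> and <link> tags, resolved against a base URL."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this
        if tag in ("a", "link"):
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


# Hypothetical page fragment with mixed case, an unquoted value, and an anchor without href
page = '<a href="/questions/430966/">q</a> <A HREF=faq>faq</A> <a name="top">no href</a>'

collector = LinkCollector("http://stackoverflow.com/")
collector.feed(page)
collector.close()
print(collector.links)
# ['http://stackoverflow.com/questions/430966/', 'http://stackoverflow.com/faq']
```

Overriding one handler method is all the subclass needs, which is roughly the "straightforward" experience the answer describes.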
Answer by JaredPar
In response to question #2 (shouldn't a link be a well defined regular expression) the answer is ... no.
An HTML link structure is recursive, much like parens and braces in programming languages. There must be an equal number of start and end constructs, and the "link" expression can be nested within itself.
To properly match a "link" expression, a regex would be required to count the start and end tags. Regular expressions are equivalent to finite automata, and by definition a finite automaton cannot "count" constructs within a pattern. A grammar is required to describe a recursive data structure such as this. The inability of a regex to "count" is why you see programming languages described with grammars as opposed to regular expressions.
So it is not possible to create a regex that will positively match 100% of all "link" expressions. There are certainly regexes that will match a good deal of "links" with a high degree of accuracy, but they won't ever be perfect.
I wrote a blog article about this problem recently: Regular Expression Limitations