python - Getting the value of href attributes in all <a> tags in an HTML file with Python
Original URL: http://stackoverflow.com/questions/671323/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Getting the value of href attributes in all <a> tags on a html file with Python
Asked by rogeriopvl
I'm building an app in python, and I need to get the URL of all links in one webpage. I already have a function that uses urllib to download the html file from the web, and transform it to a list of strings with readlines().
Currently I have this code that uses regex (I'm not very good at it) to search for links in every line:
for line in lines:
    result = re.match('/href="(.*)"/iU', line)
    print result
This is not working, as it only prints "None" for every line in the file, but I'm sure there are at least 3 links in the file I'm opening.
Can someone give me a hint on this?
Thanks in advance
Answered by Ignacio Vazquez-Abrams
Beautiful Soup can do this almost trivially:
from BeautifulSoup import BeautifulSoup as soup
html = soup('<body><a href="123">qwe</a><a href="456">asd</a></body>')
print [tag.attrMap['href'] for tag in html.findAll('a', {'href': True})]
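For readers using the newer bs4 package (BeautifulSoup 4) rather than the BeautifulSoup 3 API shown above, a roughly equivalent sketch (assuming bs4 is installed) might look like this:

# Minimal sketch assuming the bs4 package; tag['href'] replaces tag.attrMap['href']
from bs4 import BeautifulSoup

html = BeautifulSoup('<body><a href="123">qwe</a><a href="456">asd</a></body>', 'html.parser')
print([tag['href'] for tag in html.find_all('a', href=True)])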
Answered by adw
Another alternative to BeautifulSoup is lxml (http://lxml.de/):
import lxml.html

links = lxml.html.parse("http://stackoverflow.com/").xpath("//a/@href")
for link in links:
    print link
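The hrefs returned this way can be relative. If absolute URLs are wanted, lxml.html can rewrite them against a base URL before the XPath query; a small sketch, assuming the page's own address as the base:

import lxml.html

doc = lxml.html.parse("http://stackoverflow.com/").getroot()
doc.make_links_absolute("http://stackoverflow.com/")  # rewrite relative hrefs against the base URL
for link in doc.xpath("//a/@href"):
    print(link)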
Answered by eduffy
There's an HTML parser that comes standard in Python. Check out htmllib.
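Note that htmllib (and sgmllib, which it builds on) was removed in Python 3. A minimal sketch of the same standard-library idea using the HTMLParser class (html.parser in Python 3); the class name LinkCollector and the sample markup are my own:

try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2

class LinkCollector(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

parser = LinkCollector()
parser.feed('<body><a href="123">qwe</a><a href="456">asd</a></body>')
print(parser.links)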
Answered by GetFree
What others haven't told you is that using regular expressions for this is not a reliable solution.
Using regular expressions will give you wrong results in many situations: if there are <A> tags that are commented out, if there is text in the page that includes the string "href=", if there are <textarea> elements with HTML code in them, and many others. Plus, the href attribute may exist on tags other than the anchor tag.
What you need for this is XPath, which is a query language for DOM trees, i.e. it lets you retrieve any set of nodes satisfying the conditions you specify (HTML attributes are nodes in the DOM).
XPath is a well-standardized language nowadays (W3C), and is well supported by all major languages. I strongly suggest you use XPath and not regexps for this.
adw's answer shows one example of using XPath for your particular case.
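To make those failure cases concrete, here is a small sketch (the HTML snippet is made up for illustration) showing that a DOM/XPath approach ignores commented-out tags and literal "href=" text, and can also pick up href attributes on non-anchor tags when asked:

import lxml.html

snippet = """<html><body>
  <!-- <a href="commented-out.html">old link</a> -->
  <p>The text href="not-a-link" is just text here.</p>
  <a href="real.html">a real link</a>
  <area href="map.html" alt="href on a non-anchor tag">
</body></html>"""

doc = lxml.html.fromstring(snippet)
print(doc.xpath("//a/@href"))   # ['real.html'] -- only genuine <a> tags
print(doc.xpath("//@href"))     # every href attribute, regardless of tag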
Answered by bobince
As previously mentioned: regex does not have the power to parse HTML. Do not use regex for parsing HTML. Do not pass Go. Do not collect £200.
Use an HTML parser.
But for completeness, the primary problem is:
re.match ('/href="(.*)"/iU', line)
You don't use the “/.../flags” syntax for decorating regexes in Python. Instead put the flags in a separate argument:
re.match('href="(.*)"', line, re.I|re.U)
Another problem is the greedy '.*' pattern. If you have two hrefs in a line, it'll happily suck up all the content between the opening " of the first match and the closing " of the second match. You can use the non-greedy '.*?' or, more simply, '[^"]*' to only match up to the first closing quote.
But don't use regexes for parsing HTML. Really.
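Putting both fixes together, a corrected version of the question's loop might look like the sketch below (still regex-based, so all the caveats above apply; re.finditer is used instead of re.match so that several links on one line are all found):

import re

for line in lines:
    # Flags go in an argument, and [^"]* stops at the first closing quote
    for match in re.finditer(r'href="([^"]*)"', line, re.I):
        print(match.group(1))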
Answered by rogeriopvl
Well, just for completeness I will add here what I found to be the best answer, which I found in the book Dive Into Python, by Mark Pilgrim.
Here follows the code to list all URLs from a webpage:
from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        # Called for every <a> start tag; collect its href value, if present
        href = [v for k, v in attrs if k == 'href']
        if href:
            self.urls.extend(href)

# The import below assumes the URLLister class above has been saved
# in its own module, urllister.py
import urllib, urllister

usock = urllib.urlopen("http://diveintopython.net/")
parser = urllister.URLLister()
parser.feed(usock.read())
usock.close()
parser.close()
for url in parser.urls: print url
Thanks for all the replies.
Answered by Jiayao Yu
Don't divide the html content into lines, as there may be multiple matches in a single line. Also, don't assume there are always quotes around the URL.
Do something like this:
import re

links = re.finditer(r' href="?([^\s^"]+)', content)
for link in links:
    print link.group(1)  # group(1) holds the captured URL, not the match object
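Tying this back to the question's setup, a sketch that runs the same pattern over the whole downloaded page instead of line by line (urllib.urlopen is the Python 2 call the question already uses; the URL is just borrowed from the earlier answer):

import re
import urllib

content = urllib.urlopen("http://diveintopython.net/").read()
for match in re.finditer(r' href="?([^\s^"]+)', content):
    print(match.group(1))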