How to extract links from a webpage using lxml, XPath and Python?
Original source: http://stackoverflow.com/questions/2084670/
Note: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
Asked by torger
I've got this xpath query:
/html/body//tbody/tr[*]/td[*]/a[@title]/@href
It extracts all the links that have a title attribute and returns their href values in Firefox's XPath Checker add-on.
However, I cannot seem to use it with lxml:
from lxml import etree

# `page` holds the HTML source of the page as a string.
parsedPage = etree.HTML(page)  # Create parse tree from valid page.

# XPath query
hyperlinks = parsedPage.xpath("/html/body//tbody/tr[*]/td[*]/a[@title]/@href")
for x in hyperlinks:
    print(x)  # Print the href of each <a> tag that has a title attribute
This produces no result from lxml (an empty list).
How would one grab the href text (the link) of a hyperlink that has a title attribute, using lxml under Python?
Answered by jkp
I was able to make it work with the following code:
from io import StringIO

from lxml import etree

html_string = '''<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
<head/>
<body>
<table border="1">
<tbody>
<tr>
<td><a href="http://stackoverflow.com/foobar" title="Foobar">A link</a></td>
</tr>
<tr>
<td><a href="http://stackoverflow.com/baz" title="Baz">Another link</a></td>
</tr>
</tbody>
</table>
</body>
</html>'''

# The sample markup contains a literal <tbody>, so the query matches.
tree = etree.parse(StringIO(html_string))
print(tree.xpath('/html/body//tbody/tr/td/a[@title]/@href'))
# Output: ['http://stackoverflow.com/foobar', 'http://stackoverflow.com/baz']
Answered by mrmagooey
Firefox adds extra HTML tags to the document when it renders the page, so the XPath reported by the Firebug tool is inconsistent with the actual HTML returned by the server (which is what urllib/urllib2 will return).
Removing the <tbody> tag from the XPath expression generally does the trick.
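To illustrate the point, here is a minimal sketch that runs the query with and without the tbody step. The `page` string is a made-up stand-in for what a server might return (no <tbody> in the markup), and the variable names are only for illustration.

```python
from lxml import etree

# Hypothetical server response: the raw markup has no <tbody> element,
# even though a browser would insert one when rendering the table.
page = """
<html><body>
<table>
<tr><td><a href="http://stackoverflow.com/foobar" title="Foobar">A link</a></td></tr>
<tr><td><a href="http://stackoverflow.com/baz">No title</a></td></tr>
</table>
</body></html>
"""

tree = etree.HTML(page)

# The XPath copied from the browser expects a tbody step and finds nothing.
with_tbody = tree.xpath("/html/body//tbody/tr/td/a[@title]/@href")

# Dropping the tbody step matches the markup as actually served.
without_tbody = tree.xpath("/html/body//tr/td/a[@title]/@href")

print(with_tbody)     # []
print(without_tbody)  # ['http://stackoverflow.com/foobar']
```

If the table structure does not matter for your use case, a looser query such as //a[@title]/@href avoids depending on the table layout at all.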