How do I extract links from a web page using lxml, XPath, and Python?

Question

Asked by torger

I've got this XPath query:

/html/body//tbody/tr[*]/td[*]/a[@title]/@href

It extracts all the links that have a title attribute and returns their href values in Firefox's XPath Checker add-on.

However, I cannot seem to use it with lxml.

from lxml import etree
parsedPage = etree.HTML(page)  # Build a parse tree from the page's HTML source.

# XPath query
hyperlinks = parsedPage.xpath("/html/body//tbody/tr[*]/td[*]/a[@title]/@href")
for x in hyperlinks:
    print(x)  # Print the href of each <a> tag that has a title attribute

This produces no result from lxml (an empty list).

How would one grab the href value (the link) of a hyperlink that has a title attribute, using lxml under Python?

Answer 1

Answered by jkp

I was able to make it work with the following code:

from io import StringIO

from lxml import etree

html_string = '''<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
   "http://www.w3.org/TR/html4/loose.dtd">

<html lang="en">
<head/>
<body>
    <table border="1">
      <tbody>
        <tr>
          <td><a href="http://stackoverflow.com/foobar" title="Foobar">A link</a></td>
        </tr>
        <tr>
          <td><a href="http://stackoverflow.com/baz" title="Baz">Another link</a></td>
        </tr>
      </tbody>
    </table>
</body>
</html>'''

# etree.parse uses the XML parser by default; this markup is well-formed XML,
# and its source already contains an explicit <tbody>.
tree = etree.parse(StringIO(html_string))
print(tree.xpath('/html/body//tbody/tr/td/a[@title]/@href'))

>>> ['http://stackoverflow.com/foobar', 'http://stackoverflow.com/baz']

Answer 2

Answered by mrmagooey

Firefox adds extra HTML tags to the document when it renders it, so the XPath reported by the Firebug tool is inconsistent with the actual HTML returned by the server (and with what urllib/urllib2 will return).

Removing the <tbody> step from the XPath query generally does the trick.
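The effect described above can be reproduced with a small sketch. The HTML string and the example.com URLs below are made up for illustration, and the markup is assumed to have no <tbody>, as raw server HTML often doesn't. lxml's HTML parser does not insert the missing <tbody>, so the query that works in the browser matches nothing, while the same query without the tbody step finds the link.

```python
from lxml import etree

# Hypothetical server response: the table has no <tbody> in the raw source.
raw_html = """<html><body>
<table>
  <tr><td><a href="http://example.com/a" title="A">A link</a></td></tr>
  <tr><td><a href="http://example.com/b">No title</a></td></tr>
</table>
</body></html>"""

tree = etree.HTML(raw_html)

# Query copied from the browser: expects a <tbody> that is not in the source.
with_tbody = tree.xpath("/html/body//tbody/tr/td/a[@title]/@href")

# Same query with the tbody step removed.
without_tbody = tree.xpath("/html/body//tr/td/a[@title]/@href")

print(with_tbody)
print(without_tbody)
```

If the layout of the table is not important, a looser query such as //a[@title]/@href avoids depending on the table structure at all.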
