How to extract links from a webpage using lxml, XPath and Python?

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/2084670/

Time: 2020-11-03 23:44:06  Source: igfitidea


Tags: python, screen-scraping, hyperlink, lxml, extraction

Asked by torger

I've got this XPath query:

/html/body//tbody/tr[*]/td[*]/a[@title]/@href

It extracts all the links that have a title attribute and returns their href values in Firefox's XPath Checker add-on.

However, I cannot seem to use it with lxml.

from lxml import etree

# `page` holds the HTML source of the page, fetched beforehand.
parsedPage = etree.HTML(page)  # Create a parse tree from the page.

# XPath query
hyperlinks = parsedPage.xpath("/html/body//tbody/tr[*]/td[*]/a[@title]/@href")
for x in hyperlinks:
    print(x)  # Print the href of each <a> tag that has a title attribute

This produces no result from lxml (an empty list).

How would one grab the href text (the link) of a hyperlink that has a title attribute, using lxml under Python?

Answered by jkp

I was able to make it work with the following code:

from io import StringIO

from lxml import etree

html_string = '''<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
   "http://www.w3.org/TR/html4/loose.dtd">

<html lang="en">
<head/>
<body>
    <table border="1">
      <tbody>
        <tr>
          <td><a href="http://stackoverflow.com/foobar" title="Foobar">A link</a></td>
        </tr>
        <tr>
          <td><a href="http://stackoverflow.com/baz" title="Baz">Another link</a></td>
        </tr>
      </tbody>
    </table>
</body>
</html>'''

# The sample is well-formed, so lxml's default XML parser can read it.
tree = etree.parse(StringIO(html_string))
print(tree.xpath('/html/body//tbody/tr/td/a[@title]/@href'))

>>> ['http://stackoverflow.com/foobar', 'http://stackoverflow.com/baz']
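
The code above works because the sample markup happens to be well-formed, and etree.parse uses lxml's XML parser by default. Pages fetched from the web are often less tidy, so a variant based on lxml's HTML parser may be more forgiving. The following is only a minimal sketch: the sloppy sample markup (an unclosed <br>, no doctype) and the variable names are made up for illustration and are not part of the original answer.

```python
from lxml import html

# Hypothetical, deliberately sloppy markup: unclosed <br>, no doctype.
messy_html = '''<html>
<body>
  <table border="1">
    <tbody>
      <tr><td><a href="http://stackoverflow.com/foobar" title="Foobar">A link</a><br></td></tr>
      <tr><td><a href="http://stackoverflow.com/baz" title="Baz">Another link</a></td></tr>
    </tbody>
  </table>
</body>
</html>'''

# html.fromstring goes through lxml's HTML parser rather than the XML parser.
root = html.fromstring(messy_html)

# Same query as in the answer above; <tbody> is present in this sample.
links = root.xpath('/html/body//tbody/tr/td/a[@title]/@href')
print(links)
```

The same query would still return an empty list for a page whose raw HTML has no <tbody>, which is the case discussed in the next answer.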

Answered by mrmagooey

Firefox adds extra HTML tags to the document when it renders it, so the XPath returned by the Firebug tool is inconsistent with the actual HTML returned by the server (which is also what urllib/urllib2 will return).

Removing the <tbody> tag from the XPath generally does the trick.
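
To illustrate the point, the sketch below parses a table written without <tbody>, as a server might send it, and compares the original query with one that drops the tbody step. The sample markup and variable names are assumptions made for illustration. It also assumes that lxml's HTML parser does not insert <tbody> the way a browser does, which is consistent with the empty list reported in the question.

```python
from lxml import etree

# Hypothetical server response: the raw HTML table has no <tbody>.
page = '''<html>
<body>
  <table>
    <tr><td><a href="http://stackoverflow.com/foobar" title="Foobar">A link</a></td></tr>
    <tr><td><a href="http://stackoverflow.com/baz" title="Baz">Another link</a></td></tr>
  </table>
</body>
</html>'''

parsed_page = etree.HTML(page)

# Query copied from a browser tool: it expects a <tbody> that only the browser added.
with_tbody = parsed_page.xpath('/html/body//tbody/tr/td/a[@title]/@href')

# The same query with the tbody step removed.
without_tbody = parsed_page.xpath('/html/body//tr/td/a[@title]/@href')

print(with_tbody)
print(without_tbody)
```

If some pages do contain <tbody> in their raw HTML and others do not, a query such as '/html/body//table//tr/td/a[@title]/@href' should match both cases, since // skips any intermediate elements.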