Parsing HTML with lxml in Python

Note: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/3569152/

Parsing HTML with Lxml

Tags: python, html, parsing, lxml

Asked by imns

I need help parsing some text out of a page with lxml. I tried BeautifulSoup, but the HTML of the page I am parsing is so broken that it wouldn't work, so I have moved on to lxml. The docs are a little confusing, though, and I was hoping someone here could help me.

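lxml's HTML parser is deliberately tolerant of malformed markup, which is what makes it usable here. A minimal sketch of that behavior, on a hypothetical broken fragment (not the real page):

import lxml.html as lh

# Deliberately broken HTML: unclosed <td> and <tr> tags.
broken = '<table><tr><td>Additional Info<td>Some text<tr><td>More text'
root = lh.fromstring(broken)  # lxml repairs the tree instead of raising an error
for td in root.iter('td'):
    print(td.text_content())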

Here is the page I am trying to parse; I need to get the text under the "Additional Info" section. Note that I have a lot of pages like this on the site to parse, and each page's HTML is not always exactly the same (it might contain some extra empty "td" tags). Any suggestions as to how to get at that text would be very much appreciated.

Thanks for the help.

Accepted answer by unutbu

import lxml.html as lh
import urllib2  # Python 2; a Python 3 adaptation is sketched below

def text_tail(node):
    # Yield the node's text and its tail (the text that follows its closing tag).
    yield node.text
    yield node.tail

url = 'http://bit.ly/bf1T12'
doc = lh.parse(urllib2.urlopen(url))
for elt in doc.iter('td'):
    text = elt.text_content()
    if text.startswith('Additional  Info'):  # two spaces, as in the source page
        # Walk every following <td> sibling, collecting the text and tail of
        # each descendant node, skipping empty strings and non-breaking spaces.
        blurb = [text for node in elt.itersiblings('td')
                 for subnode in node.iter()
                 for text in text_tail(subnode) if text and text != u'\xa0']
        break
print('\n'.join(blurb))

yields

For over 65 years, Carl Stirn's Marine has been setting new standards of excellence and service for boating enjoyment. Because we offer quality merchandise, caring, conscientious, sales and service, we have been able to make our customers our good friends.

Our 26,000 sq. ft. facility includes a complete parts and accessories department, full service department (Merc. Premier dealer with 2 full time Mercruiser Master Tech's), and new, used, and brokerage sales.

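The code above is Python 2 (urllib2). A rough Python 3 equivalent, assuming the page is still reachable at the same URL, swaps in urllib.request:

import lxml.html as lh
from urllib.request import urlopen

def text_tail(node):
    yield node.text
    yield node.tail

url = 'http://bit.ly/bf1T12'
doc = lh.parse(urlopen(url))
blurb = []
for elt in doc.iter('td'):
    if elt.text_content().startswith('Additional  Info'):
        blurb = [t for node in elt.itersiblings('td')
                 for subnode in node.iter()
                 for t in text_tail(subnode) if t and t != '\xa0']
        break
print('\n'.join(blurb))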

Edit: Here is an alternate solution based on Steven D. Majewski's XPath, which addresses the OP's comment that the number of tags separating 'Additional Info' from the blurb can be unknown:

import lxml.html as lh
import urllib2

url = 'http://bit.ly/bf1T12'
doc = lh.parse(urllib2.urlopen(url))

# Find the <td> whose child element contains the text "Additional  Info",
# then take the text of every <td> sibling that follows it.
blurb = doc.xpath('//td[child::*[text()="Additional  Info"]]/following-sibling::td/text()')

# Drop the non-breaking-space placeholders.
blurb = [text for text in blurb if text != u'\xa0']
print('\n'.join(blurb))
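To see what the following-sibling axis is doing, here is a self-contained sketch on a made-up fragment (the HTML below is hypothetical, not taken from the real page):

import lxml.html as lh

fragment = '''<table><tr>
  <td><b>Additional  Info</b></td>
  <td>&#160;</td>
  <td>First paragraph.</td>
  <td>Second paragraph.</td>
</tr></table>'''

root = lh.fromstring(fragment)
# Select the text of every <td> that follows the one containing "Additional  Info".
texts = root.xpath('//td[child::*[text()="Additional  Info"]]/following-sibling::td/text()')
print([t for t in texts if t != u'\xa0'])  # ['First paragraph.', 'Second paragraph.']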