Python: How can I parse HTML with html5lib, and query the parsed HTML with XPath?

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/2558056/

Time: 2020-11-04 00:55:07  Source: igfitidea

Tags: python, parsing, xpath, lxml, html5lib

Asked by Dan.StackOverflow

I am trying to use html5lib to parse an HTML page into something I can query with XPath. html5lib has close to zero documentation, and I've spent too much time trying to figure this problem out. The ultimate goal is to pull out the second row of a table:

<html>
    <table>
        <tr><td>Header</td></tr>
        <tr><td>Want This</td></tr>
    </table>
</html>

So let's try it:

>>> import html5lib, lxml.etree
>>> doc = html5lib.parse('<html><table><tr><td>Header</td></tr><tr><td>Want This</td> </tr></table></html>', treebuilder='lxml')
>>> doc
<lxml.etree._ElementTree object at 0x1a1c290>

That looks good; let's see what else we have:

>>> root = doc.getroot()
>>> print(lxml.etree.tostring(root))
<html:html xmlns:html="http://www.w3.org/1999/xhtml"><html:head/><html:body><html:table><html:tbody><html:tr><html:td>Header</html:td></html:tr><html:tr><html:td>Want This</html:td></html:tr></html:tbody></html:table></html:body></html:html>

LOL WUT?

Seriously. I was planning on using some XPath to get at the data I want, but that doesn't seem to work. So what can I do? I am willing to try different libraries and approaches.

Answered by Ryan Ginstrom

Lack of documentation is a good reason to avoid a library IMO, no matter how cool it is. Are you wedded to using html5lib? Have you looked at lxml.html?

Here is a way to do this with lxml:

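from lxml import html
# The markup from the question
text = '<html><table><tr><td>Header</td></tr><tr><td>Want This</td></tr></table></html>'
tree = html.fromstring(text)
print([td.text for td in tree.xpath("//td")])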

Result:

['Header', 'Want This']
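
To get only the second row the question asks for, the XPath expression can select it by position. The following is a minimal sketch rather than part of the original answer; the variable names `markup` and `second_row` are illustrative, and it assumes lxml is installed:

```python
from lxml import html

# The markup from the question, on one line
markup = ('<html><table><tr><td>Header</td></tr>'
          '<tr><td>Want This</td></tr></table></html>')

tree = html.fromstring(markup)

# tr[2] picks the second <tr> child of its parent; XPath positions are 1-based.
# The "//" before tr keeps the query working whether or not a <tbody> is present.
second_row = tree.xpath('//table//tr[2]/td/text()')
print(second_row)
```

Using `//table//tr[2]` instead of `//table/tr[2]` is a deliberate choice here: some parsers (html5lib among them) insert a `tbody` element, and the looser path tolerates both tree shapes.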

Answered by sciyoshi

What you want to use is the namespaceHTMLElements argument, which for some reason defaults to True.

import html5lib
import lxml.html

doc = html5lib.parse('''<html>
    <table>
        <tr><td>Header</td></tr>
        <tr><td>Want This</td></tr>
    </table>
</html>
''', treebuilder='lxml', namespaceHTMLElements=False)

print(lxml.html.tostring(doc))

It's probably still easier to use lxml.html, however.
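
With the namespaces switched off, a plain XPath query should work on the resulting lxml tree. This is a hedged sketch, not part of the original answer; it assumes both html5lib and lxml are installed, and the names `markup` and `cells` are illustrative:

```python
import html5lib

markup = ('<html><table><tr><td>Header</td></tr>'
          '<tr><td>Want This</td></tr></table></html>')

# Build an lxml tree without the XHTML namespace on the elements
doc = html5lib.parse(markup, treebuilder='lxml', namespaceHTMLElements=False)

# html5lib inserts a <tbody>, so the rows are not direct children of <table>
cells = doc.xpath('//table//tr[2]/td/text()')
print(cells)
```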

Answered by Ruslan Spivak

I always recommend trying out the lxml library. It's blazingly fast and has many features.

It also supports the html5lib parser, if you need that: html5parser

>>> from lxml.html import fromstring, tostring

>>> html = """
... <html>
...     <table>
...         <tr><td>Header</td></tr>
...         <tr><td>Want This</td></tr>
...     </table>
... </html>
... """
>>> doc = fromstring(html)
>>> tr = doc.cssselect('table tr')[1]
>>> print(tostring(tr, encoding='unicode'))
<tr><td>Want This</td></tr>

Answered by Ismail Badawi

With BeautifulSoup (the bs4 package here), you can do that with:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<html><table><tr><td>Header</td></tr><tr><td>Want This</td></tr></table></html>', 'html.parser')
>>> soup.find_all('td')[1].string
'Want This'
>>> soup.find_all('tr')[1].td.string
'Want This'

(Obviously that's a really crude example, but yeah.)

Answered by z33m

I believe you can do a CSS search on lxml objects, like so:

elements = root.cssselect('div.content')
data = elements[0].text

Answered by maxschlepzig

Since html5lib (by default) creates trees that contain (correct) namespace information, you have to specify (the right) namespaces in your queries as well.

Example with an XPath query:

import html5lib
inp='''<html>
    <table>
        <tr><td>Header</td></tr>
        <tr><td>Want This</td></tr>
    </table>
</html>'''
xns = '{http://www.w3.org/1999/xhtml}'
d = html5lib.parse(inp)
s = d.findall('.//{}td'.format(xns))[-1].text
print(s)

Output:

Want This

The same result without XPath:

s = d.find(xns+'body').find(xns+'table').find(xns+'tbody') \
     .findall(xns+'tr')[-1].find(xns+'td').text

Alternatively, you can also tell html5lib to avoid adding any namespace information during parsing:

d = html5lib.parse(inp, namespaceHTMLElements=False)
s = d.findall('.//td')[-1].text
print(s)

Output:

Want This
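
If you keep the namespaces and use the lxml tree builder, as in the question, full XPath is available as long as the query maps a prefix to the XHTML namespace. The sketch below is an illustration under those assumptions (html5lib and lxml installed); the prefix `h` and the names `markup`, `XHTML_NS` and `cells` are arbitrary choices, not anything the libraries require:

```python
import html5lib

markup = ('<html><table><tr><td>Header</td></tr>'
          '<tr><td>Want This</td></tr></table></html>')

XHTML_NS = 'http://www.w3.org/1999/xhtml'

# Default behaviour: every element is placed in the XHTML namespace
doc = html5lib.parse(markup, treebuilder='lxml')

# An unprefixed query matches nothing, because the elements are namespaced
print(doc.xpath('//td'))

# Bind a prefix of our choosing to the namespace and use it on every step
cells = doc.xpath('//h:table//h:tr[2]/h:td/text()',
                  namespaces={'h': XHTML_NS})
print(cells)
```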

Answered by yamspog

Try using jQuery, and you can retrieve all elements. Alternatively, you can put an id on your row and pull it out.

1) ... ...

$("td")[1].innerHTML will be what you want

2) ... ...

$("#blah").text() will be what you want
