Python lxml.html: parsing with XPath and variables

Question

Asked by duenni

I have this HTML snippet

<div id="dw__toc">
<h3 class="toggle">Table of Contents</h3>
<div>

<ul class="toc">
<li class="level1"><div class="li"><a href="#section">#</a></div>
<ul class="toc">
<li class="level2"><div class="li"><a href="#link1">One</a></div></li>
<li class="level2"><div class="li"><a href="#link2">Two</a></div></li>
<li class="level2"><div class="li"><a href="#link3">Three</a></div></li>

Now I want to parse it with lxml.html. In the end I want a function that takes a search term (e.g. "one") and returns:

One
#link1

For now I'm trying to get a variable in the XPath.

Works:

import lxml.html
html = lxml.html.parse("www.myurl.com/slash/something")

test=html.xpath("//ul[@class='toc']/li[@class='level2']/div[@class='li']/a/text()='One'")

print(test)

Now with a variable: I want to replace the hardcoded 'One' with a variable that I can later pass in as a function argument.

Doesn't work:

import lxml.html
html = lxml.html.parse("www.myurl.com/slash/something")

desiredvars = ['One']
myresultset=((var, html.xpath("//ul[@class='toc']/li[@class='level2']/div[@class='li']/a[text()='%s']"%(var))[0]) for var in desiredvars)

for each in myresultset:
    print(each)

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 1, in <genexpr>
IndexError: list index out of range

This is based on this answer: https://stackoverflow.com/a/10688235/2320453

Any idea why it doesn't work? Is this the "right way" to do something like this?

EDIT: To sum things up: I want to search within the a tags and get their text and attributes, but I don't want a complete list; instead I want to be able to search with a variable. Pseudo-code:

import lxml.html
html = lxml.html.parse("www.myurl.com/slash/something")

searchterm = 'one'

test=html.xpath("...a/text()=searchterm")

print(test)

Expected result

One
#link1

Answer 1

Accepted answer by mata

Your first example works, but probably not how you think it should:

test=html.xpath("//ul[@class='toc']/li[@class='level2']/div[@class='li']/a/text()='One'")

What this returns is a boolean, which is true if the condition ...='One' holds for any of the nodes in the result set on the left side of the XPath expression. And that's why you get the error in your second example: True[0] is not valid.
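
To make the boolean result concrete, here is a minimal sketch. It assumes the snippet from the question is parsed from a string with lxml.html.fromstring; the closing tags and the names SNIPPET and BASE are added for illustration and are not part of the original post.

```python
import lxml.html

# The question's snippet, with closing tags added so it is self-contained.
SNIPPET = """
<div id="dw__toc">
<ul class="toc">
<li class="level1"><div class="li"><a href="#section">#</a></div>
<ul class="toc">
<li class="level2"><div class="li"><a href="#link1">One</a></div></li>
<li class="level2"><div class="li"><a href="#link2">Two</a></div></li>
<li class="level2"><div class="li"><a href="#link3">Three</a></div></li>
</ul></li>
</ul>
</div>
"""

doc = lxml.html.fromstring(SNIPPET)
BASE = "//ul[@class='toc']/li[@class='level2']/div[@class='li']/a"

# A trailing comparison turns the whole expression into a boolean test:
# it is true if at least one selected text node equals the string.
has_one = doc.xpath(BASE + "/text()='One'")
has_four = doc.xpath(BASE + "/text()='Four'")

print(type(has_one).__name__, has_one)
print(type(has_four).__name__, has_four)
```

The result tells you only whether a match exists; it carries no reference to the matching node, so there is nothing to index into.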

You probably want all nodes matching the expression that have 'One' as text. The corresponding expression would be:

test=html.xpath("//ul[@class='toc']/li[@class='level2']/div[@class='li']/a[text()='One']")

This returns a node-set as the result. If you only need the URL as a string:

test=html.xpath("//ul[@class='toc']/li[@class='level2']/div[@class='li']/a[text()='One']/@href")
# returns: ['#link1']
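
Putting these expressions together, the following sketch prints the link text and href that the question lists as the expected result. It is an illustration under assumptions: the snippet is parsed from a string, closing tags are added, and the names SNIPPET, BASE, links and hrefs are made up here. It also checks for an empty result before indexing, because indexing an empty list is what raises "IndexError: list index out of range".

```python
import lxml.html

# The question's snippet, with closing tags added so it is self-contained.
SNIPPET = """
<div id="dw__toc">
<ul class="toc">
<li class="level1"><div class="li"><a href="#section">#</a></div>
<ul class="toc">
<li class="level2"><div class="li"><a href="#link1">One</a></div></li>
<li class="level2"><div class="li"><a href="#link2">Two</a></div></li>
<li class="level2"><div class="li"><a href="#link3">Three</a></div></li>
</ul></li>
</ul>
</div>
"""

doc = lxml.html.fromstring(SNIPPET)
BASE = "//ul[@class='toc']/li[@class='level2']/div[@class='li']/a"

# The predicate form selects the <a> elements themselves.
links = doc.xpath(BASE + "[text()='One']")
# Appending /@href selects the attribute values as strings.
hrefs = doc.xpath(BASE + "[text()='One']/@href")

if links:
    first = links[0]
    print(first.text)         # link text
    print(first.get("href"))  # link target
else:
    print("no match")
```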

Answer 2

Answer by symbiotech

I tried mata's answer, but it didn't work for me:

div_name = 'foo'
my_div = x.xpath(".//div[@id=%s]" %div_name)[0]

I found this on their website (http://lxml.de/xpathxslt.html#the-xpath-method), for those who might have the same problem:

div_name = 'foo'
my_div = x.xpath(".//div[@id=$name]", name=div_name)[0]
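
One likely reason the %s version fails is quoting: the formatted expression becomes .//div[@id=foo], where foo is read as a child element name rather than a string. XPath variables avoid that. As a sketch of the function the question asks for, the code below passes the search term as a variable. The function name find_toc_link, the added closing tags, and the case-insensitive comparison via translate() are assumptions made here for illustration, not something either answer proposes.

```python
import string

import lxml.html

# The question's snippet, with closing tags added so it is self-contained.
SNIPPET = """
<div id="dw__toc">
<ul class="toc">
<li class="level1"><div class="li"><a href="#section">#</a></div>
<ul class="toc">
<li class="level2"><div class="li"><a href="#link1">One</a></div></li>
<li class="level2"><div class="li"><a href="#link2">Two</a></div></li>
<li class="level2"><div class="li"><a href="#link3">Three</a></div></li>
</ul></li>
</ul>
</div>
"""


def find_toc_link(doc, searchterm):
    """Return (text, href) of the first level2 TOC link whose text equals
    searchterm, ignoring ASCII case, or None if nothing matches."""
    expr = (
        "//ul[@class='toc']/li[@class='level2']/div[@class='li']"
        "/a[translate(normalize-space(.), $upper, $lower) = $term]"
    )
    # Keyword arguments become XPath variables ($term, $upper, $lower),
    # so no manual quoting of the search term is needed.
    matches = doc.xpath(
        expr,
        term=searchterm.strip().lower(),
        upper=string.ascii_uppercase,
        lower=string.ascii_lowercase,
    )
    if not matches:
        return None
    link = matches[0]
    return link.text_content().strip(), link.get("href")


doc = lxml.html.fromstring(SNIPPET)
print(find_toc_link(doc, "one"))   # ('One', '#link1')
print(find_toc_link(doc, "four"))  # None
```

XPath 1.0 has no lower-case function, which is why the sketch uses translate() to map upper-case ASCII letters to lower case before comparing. Drop that part if an exact, case-sensitive match is what you want.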
