lxml.html parsing with XPath and variables
Original question: http://stackoverflow.com/questions/16285816/
This content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
Asked by duenni
I have this HTML snippet:
<div id="dw__toc">
<h3 class="toggle">Table of Contents</h3>
<div>
<ul class="toc">
<li class="level1"><div class="li"><a href="#section">#</a></div>
<ul class="toc">
<li class="level2"><div class="li"><a href="#link1">One</a></div></li>
<li class="level2"><div class="li"><a href="#link2">Two</a></div></li>
<li class="level2"><div class="li"><a href="#link3">Three</a></div></li>
Now I want to parse it with lxml.html. In the end I want a function that takes a search term (e.g. "one") and returns
One
#link1
For now I'm trying to use a variable inside the XPath expression.
Works:
import lxml.html
html = lxml.html.parse("www.myurl.com/slash/something")
test=html.xpath("//ul[@class='toc']/li[@class='level2']/div[@class='li']/a/text()='One'")
print(test)
Trying with a variable. I want to replace the hardcoded 'One' with a variable that I can pass to the function later.
Doesn't work:
import lxml.html
html = lxml.html.parse("www.myurl.com/slash/something")
desiredvars = ['One']
myresultset=((var, html.xpath("//ul[@class='toc']/li[@class='level2']/div[@class='li']/a[text()='%s']"%(var))[0]) for var in desiredvars)
for each in myresultset:
    print(each)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 1, in <genexpr>
IndexError: list index out of range
This is based on this answer: https://stackoverflow.com/a/10688235/2320453. Any idea why it doesn't work? Is this the "right way" to do something like this?
EDIT: To sum things up: I want to search within the a tags and get the text and href of the matching one. I don't want a complete list; instead, I want to be able to search with a variable. Pseudo-code:
import lxml.html
html = lxml.html.parse("www.myurl.com/slash/something")
searchterm = 'one'
test=html.xpath("...a/text()=searchterm")
print(test)
Expected result:
One
#link1
Accepted answer by mata
Your first example works, but probably not how you think it should:
test=html.xpath("//ul[@class='toc']/li[@class='level2']/div[@class='li']/a/text()='One'")
What this returns is a boolean, which is true if the condition ...='One' holds for any of the nodes in the result set on the left side of the XPath expression. And that's why you get the error in your second example: True[0] is not valid.
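A minimal sketch of that behavior, assuming the markup is parsed from a string with lxml.html.fromstring instead of being fetched from a URL. The SNIPPET name and the trimmed markup are illustrative and not part of the original post:

```python
import lxml.html

# Trimmed copy of the table of contents from the question (illustrative).
SNIPPET = """
<div id="dw__toc">
<ul class="toc">
<li class="level2"><div class="li"><a href="#link1">One</a></div></li>
<li class="level2"><div class="li"><a href="#link2">Two</a></div></li>
</ul>
</div>
"""

doc = lxml.html.fromstring(SNIPPET)
path = "//ul[@class='toc']/li[@class='level2']/div[@class='li']/a"

# Comparing a node set with a string yields an XPath boolean,
# which lxml hands back as a Python bool rather than a list.
is_present = doc.xpath(path + "/text()='One'")
print(type(is_present), is_present)

# Indexing that bool is what fails.
try:
    is_present[0]
except TypeError as exc:
    print("TypeError:", exc)
```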
You probably want all nodes matching the expression that have 'One' as their text. The corresponding expression would be:
test=html.xpath("//ul[@class='toc']/li[@class='level2']/div[@class='li']/a[text()='One']")
This returns a node set as the result. Or, if you just need the URL as a string:
test=html.xpath("//ul[@class='toc']/li[@class='level2']/div[@class='li']/a[text()='One']/@href")
# returns: ['#link1']
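To get both the link text and the URL that the question asks for, the matched elements can be read directly. This is a sketch under the assumption that the markup is available as a string; SNIPPET and the variable names are illustrative:

```python
import lxml.html

# Trimmed copy of the table of contents from the question (illustrative).
SNIPPET = """
<div id="dw__toc">
<ul class="toc">
<li class="level2"><div class="li"><a href="#link1">One</a></div></li>
<li class="level2"><div class="li"><a href="#link2">Two</a></div></li>
</ul>
</div>
"""

doc = lxml.html.fromstring(SNIPPET)
path = "//ul[@class='toc']/li[@class='level2']/div[@class='li']/a[text()='One']"

links = doc.xpath(path)              # list of matching <a> elements
hrefs = doc.xpath(path + "/@href")   # list of href attribute values

# Each element exposes its text and attributes.
for a in links:
    print(a.text)
    print(a.get("href"))
```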
Answer by symbiotech
I tried mata's answer, but it didn't work for me:
div_name = 'foo'
my_div = x.xpath(".//div[@id=%s]" %div_name)[0]
I found this on the lxml website, http://lxml.de/xpathxslt.html#the-xpath-method, for those who might have the same problem:
div_name = 'foo'
my_div = x.xpath(".//div[@id=$name]", name=div_name)[0]
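Note that the %-formatted version above has no quotes around %s, so XPath reads foo as a child element name instead of a string literal and the result list is empty. Passing the value as an XPath variable avoids quoting altogether. Below is a sketch of the function the question asks for, built on that feature; find_link and SNIPPET are hypothetical names, and the markup is parsed from a string as an assumption:

```python
import lxml.html

# Trimmed copy of the table of contents from the question (illustrative).
SNIPPET = """
<div id="dw__toc">
<ul class="toc">
<li class="level2"><div class="li"><a href="#link1">One</a></div></li>
<li class="level2"><div class="li"><a href="#link2">Two</a></div></li>
</ul>
</div>
"""


def find_link(doc, searchterm):
    """Return (text, href) of the first TOC link whose text equals searchterm, or None."""
    matches = doc.xpath(
        "//ul[@class='toc']/li[@class='level2']/div[@class='li']/a[text()=$term]",
        term=searchterm,  # bound to $term by lxml; no manual quoting needed
    )
    if not matches:
        return None
    a = matches[0]
    return a.text, a.get("href")


doc = lxml.html.fromstring(SNIPPET)
print(find_link(doc, "One"))
print(find_link(doc, "one"))  # no match: the comparison is case-sensitive
```

The comparison is exact, so the lowercase "one" from the question does not match "One". If case-insensitive matching is needed, one option is to lowercase the node text inside the expression with the XPath 1.0 translate() function and pass an already lowercased search term.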

