Get text content of an HTML element using XPath?
Notice: this page is adapted from a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it under the same license, but you must cite the original URL and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/14631590/
Asked by Genghis Khan
Consider this HTML:
<div>
  <p>
    <span class="abc">Monitor</span> <b>$300</b>
  </p>
  <a href="/add">Add to cart</a>
</div>
<div>
  <p>
    <span class="abc">Keyboard</span> <b>$20</b>
  </p>
  <a href="/add">Add to cart</a>
</div>
Using XPath, I want to extract the text Monitor $300 and Keyboard $20. I use this XPath:
//div[a[contains(., "Add to cart")]]/p/text()
But it selects <span class="abc">Monitor</span> <b>$300</b>. I don't want the tags. How do I get only the text?
Answered by Martijn Pieters
You want to select all descendant text, not just child text:
//div[a[contains(., "Add to cart")]]/p//text()
Note the double slash between p and text() there.
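To make the difference between the single and double slash concrete, here is a minimal sketch using lxml on a cut-down version of the question's markup. The variable names (doc, child_text, descendant_text) are illustrative only and do not come from the original answer.

```python
import lxml.etree as ET

# A cut-down version of the markup from the question (one product only).
doc = ET.fromstring(
    '<div><p><span class="abc">Monitor</span> <b>$300</b></p></div>'
)

# /text() selects only text nodes that are direct children of <p>.
# Here that is just the single space between </span> and <b>.
child_text = doc.xpath('//p/text()')

# //text() selects text nodes at any depth below <p>, including the
# text inside <span> and <b>.
descendant_text = doc.xpath('//p//text()')

print(child_text)
print(descendant_text)
```

With the single slash, the text inside the span and b elements is never reached, because those text nodes are grandchildren of p rather than children.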
This will potentially also include a lot of inter-tag whitespace, though, so you'll need to clean that up. Example using lxml:
>>> import lxml.etree as ET
>>> tree = ET.fromstring('''<div>
...   <div>
...     <p>
...       <span class="abc">Monitor</span> <b>$300</b>
...     </p>
...     <a href="/add">Add to cart</a>
...   </div>
...   <div>
...     <p>
...       <span class="abc">Keyboard</span> <b>$20</b>
...     </p>
...     <a href="/add">Add to cart</a>
...   </div>
... </div>''')
>>> tree.xpath('//div[a[contains(., "Add to cart")]]/p//text()')
['\n      ', 'Monitor', ' ', '$300', '\n    ', '\n      ', 'Keyboard', ' ', '$20', '\n    ']
>>> res = _
>>> [txt for txt in (txt.strip() for txt in res) if txt]
['Monitor', '$300', 'Keyboard', '$20']
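If the goal is one string per product (Monitor $300, Keyboard $20) rather than a flat list of fragments, one possible variation, not part of the original answer, is to select the p elements and let XPath's normalize-space() function collapse the whitespace for each one. The sketch below assumes the same markup as the question. The names doc, paragraphs, and labels are illustrative.

```python
import lxml.etree as ET

doc = ET.fromstring('''<div>
  <div>
    <p>
      <span class="abc">Monitor</span> <b>$300</b>
    </p>
    <a href="/add">Add to cart</a>
  </div>
  <div>
    <p>
      <span class="abc">Keyboard</span> <b>$20</b>
    </p>
    <a href="/add">Add to cart</a>
  </div>
</div>''')

# Select the <p> elements themselves instead of their text nodes.
paragraphs = doc.xpath('//div[a[contains(., "Add to cart")]]/p')

# normalize-space(.) takes the element's string value (all descendant
# text concatenated), trims it, and collapses whitespace runs to one space.
labels = [p.xpath('normalize-space(.)') for p in paragraphs]

print(labels)
```

This keeps the name and price of each product together, at the cost of one extra XPath evaluation per p element.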