Python: get all the text inside a tag in lxml

Question

Asked by Kevin Burke

I'd like to write a code snippet that grabs all of the text inside the <content> tag, using lxml, in all three instances below, including the code tags. I've tried tostring(getchildren()), but that misses the text in between the tags. I didn't have much luck searching the API for a relevant function. Could you help me out?

<!--1-->
<content>
<div>Text inside tag</div>
</content>
#should return "<div>Text inside tag</div>"

<!--2-->
<content>
Text with no tag
</content>
#should return "Text with no tag"


<!--3-->
<content>
Text outside tag <div>Text inside tag</div>
</content>
#should return "Text outside tag <div>Text inside tag</div>"

Answer 1

Accepted answer by albertov

Try:

def stringify_children(node):
    from lxml.etree import tostring
    from itertools import chain
    parts = ([node.text] +
            list(chain(*([c.text, tostring(c), c.tail] for c in node.getchildren()))) +
            [node.tail])
    # filter removes possible Nones in texts and tails
    return ''.join(filter(None, parts))

Example:

from lxml import etree
node = etree.fromstring("""<content>
Text outside tag <div>Text <em>inside</em> tag</div>
</content>""")
stringify_children(node)

Produces: '\nText outside tag <div>Text <em>inside</em> tag</div>\n'

Answer 2

Answered by Ed Summers

Does text_content() do what you need?
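
A minimal sketch of what text_content() returns, assuming the markup is parsed with lxml.html (the method is defined on lxml.html elements, not on plain lxml.etree ones). It concatenates the text and drops the tags, so it fits only when the inner <div> markup is not needed. The sample string is the third case from the question:

```python
import lxml.html

# Parse a single element; the markup is the question's third case.
node = lxml.html.fragment_fromstring(
    "<content>Text outside tag <div>Text inside tag</div></content>"
)

# text_content() joins all text in the element and discards the tags.
text = node.text_content()
print(text)
```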

Answer 3

Answered by d3day

from urllib.request import urlopen
from lxml import etree
url = 'some_url'

Getting the URL:

test = urlopen(url)
page = test.read()

Getting all the HTML code, including the table tag:

tree = etree.HTML(page)

XPath selector:

# xpath() returns a list of matching elements, so take the first one
table = tree.xpath("xpath_here")[0]
res = etree.tostring(table)

res is the HTML code of the table. This did the job for me.

So you can extract the text of a tag with an XPath text() query, and a tag together with its content with tostring():

# xpath() returns a list of matches, so take the first element before serializing
div = tree.xpath("//div")[0]
div_res = etree.tostring(div)

# text nodes that are direct children of <content>
text = tree.xpath("//content/text()")

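div_3 = tree.xpath("//content")[0]
div_3_res = etree.tostring(div_3, encoding='unicode').strip('<content>').rstrip('</')

Using strip() in this last line is not nice: it removes individual characters rather than the literal tags, so it can also eat into content that begins or ends with those characters. It worked for my case.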

Answer 4

Answered by David

If this is an <a> tag, you can try:

node.values()

Answer 5

Answered by Arthur Debert

Just use the node.itertext() method, as in:

 ''.join(node.itertext())
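
A quick sketch of this on a variant of the question's third case (the sample string is illustrative). itertext() yields only the text pieces, so the joined result contains no markup:

```python
from lxml import etree

node = etree.fromstring(
    "<content>Text outside tag <div>Text <em>inside</em> tag</div></content>"
)

# itertext() walks the element in document order and yields each text and tail.
joined = "".join(node.itertext())
print(joined)  # Text outside tag Text inside tag
```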

Answer 6

Answered by bwingenroth

In response to @Richard's comment above, if you patch stringify_children to read:

 parts = ([node.text] +
--            list(chain(*([c.text, tostring(c), c.tail] for c in node.getchildren()))) +
++            list(chain(*([tostring(c)] for c in node.getchildren()))) +
           [node.tail])

it seems to avoid the duplication he refers to.

Answer 7

Answered by Percival Ulysses

Defining stringify_children this way may be less complicated:

from lxml import etree

def stringify_children(node):
    s = node.text
    if s is None:
        s = ''
    for child in node:
        s += etree.tostring(child, encoding='unicode')
    return s

Or, in one line:

return (node.text if node.text is not None else '') + ''.join((etree.tostring(child, encoding='unicode') for child in node))

The rationale is the same as in this answer: leave the serialization of child nodes to lxml. The tail part of node isn't interesting in this case, since it comes "behind" the end tag. Note that the encoding argument may be changed according to one's needs.

Another possible solution is to serialize the node itself and afterwards, strip the start and end tag away:

def stringify_children(node):
    s = etree.tostring(node, encoding='unicode', with_tail=False)
    return s[s.index(node.tag) + 1 + len(node.tag): s.rindex(node.tag) - 2]

which is somewhat horrible. This code is correct only if node has no attributes, and I don't think anyone would want to use it even then.

Answer 8

Answered by kazufusa

import re
from lxml import etree

node = etree.fromstring("""
<content>Text before inner tag
    <div>Text
        <em>inside</em>
        tag
    </div>
    Text after inner tag
</content>""")

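# Serialize to str, then capture everything between the outermost opening and closing tags
print(re.search(r"\A<[^<>]*>(.*)</[^<>]*>\Z", etree.tostring(node, encoding='unicode'), re.DOTALL).group(1))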

Answer 9

Answered by anana

A version of albertov's stringify_children that fixes the bugs reported by hoju:

def stringify_children(node):
    from lxml.etree import tostring
    from itertools import chain
    return ''.join(
        chunk for chunk in chain(
            (node.text,),
            chain(*((tostring(child, with_tail=False), child.tail) for child in node.getchildren())),
            (node.tail,)) if chunk)

Answer 10

Answered by Joshmaker

I know that this is an old question, but this is a common problem and I have a solution that seems simpler than the ones suggested so far:

import lxml.html

def stringify_children(node):
    """Given an lxml tag, return its contents as a string

       >>> html = "<p><strong>Sample sentence</strong> with tags.</p>"
       >>> node = lxml.html.fragment_fromstring(html)
       >>> stringify_children(node)
       '<strong>Sample sentence</strong> with tags.'
    """
    if node is None or (len(node) == 0 and not getattr(node, 'text', None)):
        return ""
    # Note: this clears the attributes on the node that was passed in
    node.attrib.clear()
    opening_tag = len(node.tag) + 2
    closing_tag = -(len(node.tag) + 3)
    # encoding='unicode' returns str; with_tail=False keeps trailing text out of the slice
    return lxml.html.tostring(node, encoding='unicode', with_tail=False)[opening_tag:closing_tag]

Unlike some of the other answers to this question, this solution preserves all of the tags contained within the node, and it attacks the problem from a different angle than the other working solutions.
