python xml.dom.minidom: Getting CDATA values

Notice: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/597058/

Time: 2020-11-03 20:25:11  Source: igfitidea

python xml

Asked by Jason Coon

I'm able to get the value in the image tag (see XML below), but not the Category tag. The difference is one is a CDATA section and the other is just a string. Any help would be appreciated.

from xml.dom import minidom

xml = """<?xml version="1.0" ?>
<ProductData>
    <ITEM Id="0471195">
        <Category>
            <![CDATA[Homogenizers]]>        
        </Category>
        <Image>
            0471195.jpg
        </Image>
    </ITEM>
    <ITEM Id="0471195">
        <Category>
            <![CDATA[Homogenizers]]>        
        </Category>
        <Image>
            0471196.jpg
        </Image>
    </ITEM>
</ProductData>
"""

bad_xml_item_count = 0
data = {}
xml_data = minidom.parseString(xml).getElementsByTagName('ProductData')
parts = xml_data[0].getElementsByTagName('ITEM')
for p in parts:
    try:
        part_id = p.attributes['Id'].value.strip()
    except KeyError:
        bad_xml_item_count += 1
        continue
    if not part_id:
        bad_xml_item_count += 1
        continue
    part_image = p.getElementsByTagName('Image')[0].firstChild.nodeValue.strip()
    part_category = p.getElementsByTagName('Category')[0].firstChild.data.strip()
    print('\t'.join([part_id, part_category, part_image]))

Answered by bobince

p.getElementsByTagName('Category')[0].firstChild

minidom does not flatten away <![CDATA[ sections to plain text; it leaves them as DOM CDATASection nodes. (Arguably it should, at least optionally. DOM Level 3 LS defaults to flattening them, for what it's worth, but minidom is much older than DOM L3.)

So the firstChild of Category is a Text node representing the whitespace between the <Category> open tag and the start of the CDATA section. It has two siblings: the CDATASection node, and another trailing whitespace Text node.
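To see the three children described above, here is a minimal Python 3 sketch. It uses a trimmed-down version of the XML from the question (a single ITEM with only a Category element), which is an assumption made for brevity; the variable names are illustrative.

```python
from xml.dom import minidom

# Trimmed-down version of the question's XML: one ITEM, one Category.
doc = minidom.parseString(
    "<ITEM><Category>\n    <![CDATA[Homogenizers]]>\n</Category></ITEM>"
)
category = doc.getElementsByTagName("Category")[0]

# Record (nodeType, raw data) for every child of <Category>.
# nodeType 3 is TEXT_NODE, nodeType 4 is CDATA_SECTION_NODE.
children = [(node.nodeType, node.data) for node in category.childNodes]
for node_type, data in children:
    print(node_type, repr(data))
```

The first and last entries should be whitespace-only Text nodes, which is why calling `.strip()` on `firstChild.data` yields an empty string rather than the category name.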

What you probably want is the textual data of all children of Category. In DOM Level 3 Core you'd just call:

p.getElementsByTagName('Category')[0].textContent

but minidom doesn't support that yet. Recent versions do, however, support another Level 3 method you can use to do the same thing in a more roundabout way:

p.getElementsByTagName('Category')[0].firstChild.wholeText
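As a quick check of the wholeText approach, the following Python 3 sketch applies it to a trimmed-down version of the question's XML (an assumption for brevity). The extra `.strip()` is added here to drop the surrounding whitespace Text nodes that wholeText also includes.

```python
from xml.dom import minidom

# Trimmed-down version of the question's XML: one ITEM, one Category.
doc = minidom.parseString(
    "<ITEM><Category>\n    <![CDATA[Homogenizers]]>\n</Category></ITEM>"
)
category = doc.getElementsByTagName("Category")[0]

# wholeText joins the data of the first child with the data of its
# adjacent Text and CDATASection siblings.
part_category = category.firstChild.wholeText.strip()
print(part_category)
```

As far as I can tell, wholeText only spans logically adjacent text nodes, so it stops at the first sibling that is an element or a comment; for content mixed with such nodes, iterating over childNodes is likely the safer route.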

Answered by ironfroggy

CDATA is its own node, so the Category elements here actually have three children, a whitespace text node, the CDATA node, and another whitespace node. You're just looking at the wrong one, is all. I don't see any more obvious way to query for the CDATA node, but you can pull it out like this:

[n for n in category.childNodes if n.nodeType==category.CDATA_SECTION_NODE][0]
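The list comprehension above raises IndexError when an element has no CDATA child. A small variation, sketched below in Python 3, returns None instead. The helper name `first_cdata` and the trimmed-down XML are hypothetical, introduced only for illustration.

```python
from xml.dom import Node, minidom


def first_cdata(element):
    """Return the data of the first CDATA child of element, or None."""
    for node in element.childNodes:
        if node.nodeType == Node.CDATA_SECTION_NODE:
            return node.data
    return None


# Trimmed-down version of the question's XML: one ITEM.
doc = minidom.parseString(
    "<ITEM><Category>\n    <![CDATA[Homogenizers]]>\n</Category>"
    "<Image>\n    0471195.jpg\n</Image></ITEM>"
)
category = doc.getElementsByTagName("Category")[0]
image = doc.getElementsByTagName("Image")[0]

print(first_cdata(category))  # the CDATA payload
print(first_cdata(image))     # no CDATA child here
```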

Answered by BBog

I've run into a similar problem. My solution was similar to what ironfroggy answered, but implemented in a more general fashion:

for node in parentNode.childNodes:
    if node.nodeType == 4:
        cdataContent = node.data.strip()

CDATA's node type is 4 (CDATA_SECTION_NODE)
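Applied to the loop from the question, the node-type filter might look like the Python 3 sketch below. The XML is shortened to a single ITEM, and the names `rows` and `cdata_parts` are illustrative assumptions rather than part of any answer.

```python
from xml.dom import minidom

xml = """<ProductData>
    <ITEM Id="0471195">
        <Category>
            <![CDATA[Homogenizers]]>
        </Category>
        <Image>
            0471195.jpg
        </Image>
    </ITEM>
</ProductData>"""

rows = []
for item in minidom.parseString(xml).getElementsByTagName("ITEM"):
    category = item.getElementsByTagName("Category")[0]
    image = item.getElementsByTagName("Image")[0]
    # Keep only CDATA children (nodeType 4) and join them in case
    # the element holds more than one CDATA section.
    cdata_parts = [n.data.strip() for n in category.childNodes if n.nodeType == 4]
    rows.append((
        item.getAttribute("Id"),
        "".join(cdata_parts),
        image.firstChild.data.strip(),
    ))

for row in rows:
    print("\t".join(row))
```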
