python xml.dom.minidom: getting a CDATA value

Question

Asked by Jason Coon

I'm able to get the value in the image tag (see XML below), but not the Category tag. The difference is one is a CDATA section and the other is just a string. Any help would be appreciated.

from xml.dom import minidom

xml = """<?xml version="1.0" ?>
<ProductData>
    <ITEM Id="0471195">
        <Category>
            <![CDATA[Homogenizers]]>        
        </Category>
        <Image>
            0471195.jpg
        </Image>
    </ITEM>
    <ITEM Id="0471195">
        <Category>
            <![CDATA[Homogenizers]]>        
        </Category>
        <Image>
            0471196.jpg
        </Image>
    </ITEM>
</ProductData>
"""

bad_xml_item_count = 0
data = {}
xml_data = minidom.parseString(xml).getElementsByTagName('ProductData')
parts = xml_data[0].getElementsByTagName('ITEM')
for p in parts:
    try:
        part_id = p.attributes['Id'].value.strip()
    except KeyError:
        bad_xml_item_count += 1
        continue
    if not part_id:
        bad_xml_item_count += 1
        continue
    part_image = p.getElementsByTagName('Image')[0].firstChild.nodeValue.strip()
    part_category = p.getElementsByTagName('Category')[0].firstChild.data.strip()
    print('\t'.join([part_id, part_category, part_image]))

Answer 1

Answered by bobince

p.getElementsByTagName('Category')[0].firstChild

minidom does not flatten <![CDATA[ ... ]]> sections into plain text; it leaves them as DOM CDATASection nodes. (Arguably it should flatten them, at least optionally. For what it's worth, DOM Level 3 LS flattens them by default, but minidom is much older than DOM L3.)

So the firstChild of Category is a Text node representing the whitespace between the <Category> open tag and the start of the CDATA section. It has two siblings: the CDATASection node, and another trailing whitespace Text node.
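
To see those three children for yourself, here is a minimal sketch. The trimmed-down XML string and the `names` lookup table are made up for illustration and are not part of the question's code:

```python
from xml.dom import minidom, Node

# A cut-down Category element with whitespace around the CDATA section
# (illustrative input, not the full document from the question).
doc = minidom.parseString(
    "<Category>\n    <![CDATA[Homogenizers]]>\n</Category>"
)
category = doc.documentElement

# Map the numeric node types we expect here to readable names.
names = {Node.TEXT_NODE: "Text", Node.CDATA_SECTION_NODE: "CDATASection"}
children = [(names[n.nodeType], n.data) for n in category.childNodes]

for kind, data in children:
    print(kind, repr(data))
```

The first and last entries should be whitespace-only Text nodes, with the CDATASection in between, which is why firstChild.data.strip() in the question comes back empty.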

What you probably want is the textual data of all children of Category. In DOM Level 3 Core you'd just call:

p.getElementsByTagName('Category')[0].textContent

but minidom doesn't support textContent yet. Recent versions do, however, support another Level 3 property, wholeText, which you can use to do the same thing in a more roundabout way:

p.getElementsByTagName('Category')[0].firstChild.wholeText
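
As a rough, self-contained check of the wholeText approach, the sketch below uses a shortened copy of the question's XML. The variable names `item` and `part_category` are chosen for illustration only:

```python
from xml.dom import minidom

xml = """<ITEM Id="0471195">
    <Category>
        <![CDATA[Homogenizers]]>
    </Category>
</ITEM>"""

item = minidom.parseString(xml).documentElement
category = item.getElementsByTagName('Category')[0]

# wholeText joins the first Text node with the Text/CDATASection
# siblings that directly follow it, so the CDATA content is included.
part_category = category.firstChild.wholeText.strip()
print(part_category)  # prints: Homogenizers
```

Note that this assumes Category has at least one child and contains only text and CDATA; an empty element has firstChild set to None, and a nested element would end the run of adjacent text.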

Answer 2

Answered by ironfroggy

CDATA is its own node, so the Category elements here actually have three children, a whitespace text node, the CDATA node, and another whitespace node. You're just looking at the wrong one, is all. I don't see any more obvious way to query for the CDATA node, but you can pull it out like this:

[n for n in category.childNodes if n.nodeType == category.CDATA_SECTION_NODE][0]
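
The one-liner raises IndexError when an element has no CDATA child. A possible variant that returns None instead is sketched here; the helper name `first_cdata` and the sample XML are hypothetical, not from the question:

```python
from xml.dom import minidom, Node

def first_cdata(element):
    """Return the data of the first CDATASection child, or None if absent."""
    for n in element.childNodes:
        if n.nodeType == Node.CDATA_SECTION_NODE:
            return n.data
    return None

# Illustrative input: one element with a CDATA section, one without.
doc = minidom.parseString("<r><a> <![CDATA[x & y]]> </a><b>plain</b></r>")
a, b = doc.documentElement.childNodes

with_cdata = first_cdata(a)      # content of the CDATA section in <a>
without_cdata = first_cdata(b)   # <b> holds only a plain Text node
print(repr(with_cdata), repr(without_cdata))
```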

Answer 3

Answered by BBog

I ran into a similar problem. My solution is similar to ironfroggy's answer, but implemented in a more general fashion:

for node in parentNode.childNodes:
    if node.nodeType == 4:
        cdataContent = node.data.strip()

A CDATA node's type is 4 (CDATA_SECTION_NODE).
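
Applied to the loop from the question, the same node-type check might look like the sketch below. It assumes a shortened copy of the question's XML, uses the named constant Node.CDATA_SECTION_NODE in place of the literal 4, and collects results in a `rows` list that is introduced here purely for illustration:

```python
from xml.dom import minidom, Node

xml = """<ProductData>
    <ITEM Id="0471195">
        <Category>
            <![CDATA[Homogenizers]]>
        </Category>
        <Image>
            0471195.jpg
        </Image>
    </ITEM>
</ProductData>"""

rows = []
for p in minidom.parseString(xml).getElementsByTagName('ITEM'):
    part_id = p.getAttribute('Id').strip()
    part_image = p.getElementsByTagName('Image')[0].firstChild.nodeValue.strip()
    category = p.getElementsByTagName('Category')[0]
    # Keep every CDATA child rather than overwriting a single variable.
    cdata = [n.data.strip() for n in category.childNodes
             if n.nodeType == Node.CDATA_SECTION_NODE]
    rows.append((part_id, ''.join(cdata), part_image))

for row in rows:
    print('\t'.join(row))
```

Collecting into a list means an element with several CDATA sections keeps all of them, whereas the plain loop would retain only the last one.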
