Python xml.dom.minidom: Getting CDATA values
Original question: http://stackoverflow.com/questions/597058/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
xml.dom.minidom: Getting CDATA values
Asked by Jason Coon
I'm able to get the value in the image tag (see XML below), but not the Category tag. The difference is one is a CDATA section and the other is just a string. Any help would be appreciated.
from xml.dom import minidom
xml = """<?xml version="1.0" ?>
<ProductData>
<ITEM Id="0471195">
<Category>
<![CDATA[Homogenizers]]>
</Category>
<Image>
0471195.jpg
</Image>
</ITEM>
<ITEM Id="0471195">
<Category>
<![CDATA[Homogenizers]]>
</Category>
<Image>
0471196.jpg
</Image>
</ITEM>
</ProductData>
"""
bad_xml_item_count = 0
data = {}
xml_data = minidom.parseString(xml).getElementsByTagName('ProductData')
parts = xml_data[0].getElementsByTagName('ITEM')
for p in parts:
    try:
        part_id = p.attributes['Id'].value.strip()
    except KeyError:
        bad_xml_item_count += 1
        continue
    if not part_id:
        bad_xml_item_count += 1
        continue
    part_image = p.getElementsByTagName('Image')[0].firstChild.nodeValue.strip()
    # The line the question is about: this yields an empty string, not the category text.
    part_category = p.getElementsByTagName('Category')[0].firstChild.data.strip()
    print('\t'.join([part_id, part_category, part_image]))
Answered by bobince
p.getElementsByTagName('Category')[0].firstChild
minidom does not flatten away <![CDATA[ sections to plain text, it leaves them as DOM CDATASection nodes. (Arguably it should, at least optionally. DOM Level 3 LS defaults to flattening them, for what it's worth, but minidom is much older than DOM L3.)
So the firstChild of Category is a Text node representing the whitespace between the <Category> open tag and the start of the CDATA section. It has two siblings: the CDATASection node, and another trailing whitespace Text node.
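The three children described above can be inspected directly. This is only a minimal sketch: the XML is a trimmed copy of the question's data, and the `names` lookup table is a helper introduced here for readable output, not something from the original answer.

```python
from xml.dom import minidom

# Trimmed-down copy of the question's XML (a single ITEM, Category only).
xml = """<?xml version="1.0" ?>
<ProductData>
<ITEM Id="0471195">
<Category>
<![CDATA[Homogenizers]]>
</Category>
</ITEM>
</ProductData>
"""

category = minidom.parseString(xml).getElementsByTagName('Category')[0]

# Hypothetical lookup table: map numeric node types to readable names.
names = {
    category.TEXT_NODE: 'Text',
    category.CDATA_SECTION_NODE: 'CDATASection',
}

# Collect (kind, data) for every direct child of <Category>.
children = [(names.get(n.nodeType, n.nodeType), n.data)
            for n in category.childNodes]
for kind, data in children:
    print(kind, repr(data))
```

Printing with `repr` makes the whitespace-only Text nodes visible, which is why `firstChild.data.strip()` in the question comes back empty.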
What you probably want is the textual data of all children of Category. In DOM Level 3 Core you'd just call:
p.getElementsByTagName('Category')[0].textContent
but minidom doesn't support that yet. Recent versions do, however, support another Level 3 method you can use to do the same thing in a more roundabout way:
p.getElementsByTagName('Category')[0].firstChild.wholeText
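Applied to the loop from the question, the `wholeText` approach might look like the sketch below. The function name `item_fields` is made up for this illustration, and the code assumes every ITEM has non-empty Category and Image elements (an empty element has no `firstChild`, so the attribute access would fail).

```python
from xml.dom import minidom

xml = """<?xml version="1.0" ?>
<ProductData>
<ITEM Id="0471195">
<Category>
<![CDATA[Homogenizers]]>
</Category>
<Image>
0471195.jpg
</Image>
</ITEM>
<ITEM Id="0471195">
<Category>
<![CDATA[Homogenizers]]>
</Category>
<Image>
0471196.jpg
</Image>
</ITEM>
</ProductData>
"""

def item_fields(item):
    """Return (id, category, image) for one ITEM element (hypothetical helper)."""
    part_id = item.getAttribute('Id').strip()
    category = item.getElementsByTagName('Category')[0]
    image = item.getElementsByTagName('Image')[0]
    # wholeText joins the first Text node with the adjacent CDATASection
    # and Text siblings, so the CDATA content is included.
    return (part_id,
            category.firstChild.wholeText.strip(),
            image.firstChild.wholeText.strip())

doc = minidom.parseString(xml)
rows = [item_fields(item) for item in doc.getElementsByTagName('ITEM')]
for row in rows:
    print('\t'.join(row))
```

Note that `wholeText` only spans logically adjacent Text and CDATASection nodes; a child element or comment in between would end the run, so this is not a full substitute for `textContent`.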
Answered by ironfroggy
CDATA is its own node, so the Category elements here actually have three children, a whitespace text node, the CDATA node, and another whitespace node. You're just looking at the wrong one, is all. I don't see any more obvious way to query for the CDATA node, but you can pull it out like this:
[n for n in category.childNodes if n.nodeType==category.CDATA_SECTION_NODE][0]
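The one-liner above leaves `category` undefined and raises `IndexError` when an element holds no CDATA section. A self-contained variant, assuming a single `Category` element as the document root purely for illustration, could be:

```python
from xml.dom import minidom

# Assumption: a bare <Category> element as the whole document.
xml = "<Category>\n<![CDATA[Homogenizers]]>\n</Category>"
category = minidom.parseString(xml).documentElement

# Keep only the CDATASection children, using the named constant.
cdata_nodes = [n for n in category.childNodes
               if n.nodeType == category.CDATA_SECTION_NODE]

# Guard against elements that contain no CDATA section at all.
value = cdata_nodes[0].data if cdata_nodes else None
print(value)
```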
Answered by BBog
I ran into a similar problem. My solution was similar to what ironfroggy answered, but implemented in a more general fashion:
for node in parentNode.childNodes:
    if node.nodeType == 4:
        cdataContent = node.data.strip()
CDATA's node type is 4 (CDATA_SECTION_NODE).
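As a sketch of how this loop could be wrapped for reuse, the function below collects every CDATA section under a parent instead of keeping only the last one, and uses the named constant from `xml.dom.Node` rather than the literal 4. The name `cdata_contents` and the `Notes` sample document are hypothetical, introduced only for this example.

```python
from xml.dom import Node, minidom

def cdata_contents(parent):
    """Return the stripped data of each CDATA section directly under parent."""
    return [node.data.strip()
            for node in parent.childNodes
            if node.nodeType == Node.CDATA_SECTION_NODE]

# Hypothetical sample: two CDATA sections separated by a child element.
doc = minidom.parseString(
    "<Notes><![CDATA[ first ]]><b>bold</b><![CDATA[second]]></Notes>")

found = cdata_contents(doc.documentElement)               # both sections
empty = cdata_contents(doc.getElementsByTagName('b')[0])  # no CDATA here
print(found, empty)
```

Returning a list keeps the caller in charge of what to do when there are zero or several CDATA sections, rather than silently overwriting a single variable.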