How to get specific nodes in an XML file with Python
Source: http://stackoverflow.com/questions/2230677/
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): StackOverFlow
Asked by Moayyad Yaghi
I'm searching for a way to get specific tags from a very big XML document with Python's built-in DOM module. For example:
<AssetType longname="characters" shortname="chr" shortnames="chrs">
  <type>
    pub
  </type>
  <type>
    geo
  </type>
  <type>
    rig
  </type>
</AssetType>
<AssetType longname="camera" shortname="cam" shortnames="cams">
  <type>
    cam1
  </type>
  <type>
    cam2
  </type>
  <type>
    cam4
  </type>
</AssetType>
I want to retrieve the values of the children of the AssetType node whose attribute is longname="characters", so that the result is 'pub', 'geo', 'rig'.
Please keep in mind that I have more than 1000 <AssetType> nodes.
Thanks in advance.
Accepted answer by eswald
If you don't mind loading the whole document into memory:
from lxml import etree

data = etree.parse(fname)
result = [node.text.strip()
          for node in data.xpath("//AssetType[@longname='characters']/type")]
You may need to remove the spaces at the beginning of your tags to make this work.
Answer by Tendayi Mawushe
Assuming your document is called assets.xml and has the following structure:
<assets>
  <AssetType>
    ...
  </AssetType>
  <AssetType>
    ...
  </AssetType>
</assets>
Then you can do the following:
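from xml.etree.ElementTree import ElementTree

tree = ElementTree()
root = tree.parse("assets.xml")
# ElementTree paths are relative to the element, so start with ".//" rather than "//"
for asset_type in root.findall(".//AssetType[@longname='characters']"):
    # iterating over an element yields its children (getchildren() is gone in Python 3.9+)
    for type_node in asset_type:
        print(type_node.text.strip())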
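For a quick check without a file on disk, the same lookup can be tried against an inline string. This is only a minimal sketch: it assumes the <assets> wrapper described above, and SAMPLE and types_for are names made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, wrapped in the assumed <assets> root element.
SAMPLE = """
<assets>
  <AssetType longname="characters" shortname="chr" shortnames="chrs">
    <type> pub </type>
    <type> geo </type>
    <type> rig </type>
  </AssetType>
  <AssetType longname="camera" shortname="cam" shortnames="cams">
    <type> cam1 </type>
  </AssetType>
</assets>
"""


def types_for(root, longname):
    """Return the stripped text of every <type> under the matching AssetType."""
    # Building the path with format() is only reasonable for trusted input.
    path = ".//AssetType[@longname='{}']/type".format(longname)
    return [node.text.strip() for node in root.findall(path)]


root = ET.fromstring(SAMPLE)
print(types_for(root, "characters"))
```

Like the answer above, this loads the whole document into memory, which is usually acceptable for a file with around a thousand AssetType nodes.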
Answer by John Montgomery
You could use the pulldom API to handle parsing a large file without loading it all into memory at once. This provides a more convenient interface than SAX, with only a slight loss of performance.
It basically lets you stream the XML file until you find the bit you are interested in, then use regular DOM operations after that.
from xml.dom import pulldom


# http://mail.python.org/pipermail/xml-sig/2005-March/011022.html
def getInnerText(oNode):
    rc = ""
    nodelist = oNode.childNodes
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc = rc + node.data
        elif node.nodeType == node.ELEMENT_NODE:
            rc = rc + getInnerText(node)  # recursive !!!
        elif node.nodeType == node.CDATA_SECTION_NODE:
            rc = rc + node.data
        else:
            # node.nodeType: PROCESSING_INSTRUCTION_NODE, COMMENT_NODE, DOCUMENT_NODE, NOTATION_NODE and so on
            pass
    return rc


# xml_file is either a filename or a file
stream = pulldom.parse(xml_file)
for event, node in stream:
    if event == "START_ELEMENT" and node.nodeName == "AssetType":
        if node.getAttribute("longname") == "characters":
            stream.expandNode(node)  # node now contains a mini-dom tree
            type_nodes = node.getElementsByTagName('type')
            for type_node in type_nodes:
                # type_text will have the value of what's inside the type text
                type_text = getInnerText(type_node)
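The streaming approach above can also be wrapped in a generator and tried on an inline string. The following is a sketch under assumptions: SAMPLE, the <assets> wrapper, and stream_types are invented for illustration, and pulldom.parseString stands in for pulldom.parse on a real file.

```python
from xml.dom import pulldom

# Hypothetical sample data; a real document would go through pulldom.parse().
SAMPLE = """<assets>
<AssetType longname="characters" shortname="chr" shortnames="chrs">
<type>
pub
</type>
<type>
geo
</type>
<type>
rig
</type>
</AssetType>
<AssetType longname="camera" shortname="cam" shortnames="cams">
<type>
cam1
</type>
</AssetType>
</assets>"""


def stream_types(xml_text, longname):
    """Yield the stripped text of each <type> under the matching AssetType."""
    stream = pulldom.parseString(xml_text)
    for event, node in stream:
        if event != pulldom.START_ELEMENT or node.nodeName != "AssetType":
            continue
        if node.getAttribute("longname") != longname:
            continue  # skipped elements are never expanded into a DOM subtree
        stream.expandNode(node)  # node now holds a small DOM subtree
        for type_node in node.getElementsByTagName("type"):
            # The parser may split text into several nodes, so join them all.
            parts = [child.data for child in type_node.childNodes
                     if child.nodeType == child.TEXT_NODE]
            yield "".join(parts).strip()


print(list(stream_types(SAMPLE, "characters")))
```

Only the matching AssetType is expanded into DOM nodes; every other element is read as an event and then discarded.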
Answer by gruszczy
Use the xml.sax module. Build your own handler, and inside startElement check whether the name is AssetType. This way you can act only when an AssetType node is processed.
Here you have an example handler, which shows how to build one (though it's not the prettiest way; at that point I didn't know all the cool tricks with Python ;-)).
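Since the linked handler is not reproduced on this page, here is a minimal sketch of what such a handler might look like. It is not the handler the answer refers to: the class name AssetTypeHandler, its attributes, and the SAMPLE data are all assumptions made for illustration.

```python
import xml.sax


class AssetTypeHandler(xml.sax.ContentHandler):
    """Collect the text of <type> elements inside the matching AssetType."""

    def __init__(self, longname):
        super().__init__()
        self.longname = longname
        self.results = []
        self._in_match = False  # True while inside the AssetType we want
        self._buffer = None     # list of text chunks while inside <type>

    def startElement(self, name, attrs):
        if name == "AssetType":
            self._in_match = attrs.get("longname") == self.longname
        elif name == "type" and self._in_match:
            self._buffer = []

    def characters(self, content):
        # May be called several times per text run, so accumulate chunks.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "type" and self._buffer is not None:
            self.results.append("".join(self._buffer).strip())
            self._buffer = None
        elif name == "AssetType":
            self._in_match = False


# Hypothetical sample data, passed as bytes to xml.sax.parseString.
SAMPLE = b"""<assets>
<AssetType longname="characters" shortname="chr" shortnames="chrs">
<type>
pub
</type>
<type>
geo
</type>
<type>
rig
</type>
</AssetType>
<AssetType longname="camera" shortname="cam" shortnames="cams">
<type>
cam1
</type>
</AssetType>
</assets>"""

handler = AssetTypeHandler("characters")
xml.sax.parseString(SAMPLE, handler)
print(handler.results)
```

For a file on disk, xml.sax.parse(filename, handler) would take the place of parseString, and the document is never held in memory as a tree.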
Answer by ron
You could use XPath, something like "//AssetType[@longname='characters']/xyz".
For XPath libs in Python see http://www.somebits.com/weblog/tech/python/xpath.html
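One detail worth checking when writing such an expression: in an XPath predicate, a bare name tests a child element, while an attribute test needs the @ prefix. The sketch below illustrates the difference with the limited XPath subset in the standard library's xml.etree.ElementTree, not with any of the libraries from the link; the sample data is invented for the purpose.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: the first AssetType carries longname as an attribute,
# the second carries it as a child element.
root = ET.fromstring("""
<assets>
  <AssetType longname="characters"><type>pub</type></AssetType>
  <AssetType><longname>characters</longname><type>cam1</type></AssetType>
</assets>
""")

# [@longname='...'] matches on the attribute value.
by_attribute = root.findall(".//AssetType[@longname='characters']/type")
# [longname='...'] matches on the text of a child element named longname.
by_child = root.findall(".//AssetType[longname='characters']/type")

print([node.text for node in by_attribute])
print([node.text for node in by_child])
```

For the document in the question, where longname is an attribute, the @ form is the one that applies.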
Answer by MattH
Similar to eswald's solution, again stripping whitespace and again loading the document into memory, but returning the three text items at a time.
from lxml import etree

data = """<AssetType longname="characters" shortname="chr" shortnames="chrs">
<type>
pub
</type>
<type>
geo
</type>
<type>
rig
</type>
</AssetType>
"""

doc = etree.XML(data)

for asset in doc.xpath('//AssetType[@longname="characters"]'):
    threetypes = [x.strip() for x in asset.xpath('./type/text()')]
    print(threetypes)