python: How to get specific nodes in an XML file with Python

Notice: this page is a Chinese-English translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/2230677/

Time: 2020-11-04 00:07:35  Source: igfitidea

how to get specific nodes in xml file with python

python xml

Asked by Moayyad Yaghi

I'm searching for a way to get specific tags from a very big XML document with Python's built-in DOM module.
For example:

<AssetType longname="characters" shortname="chr" shortnames="chrs">
  <type>
    pub
  </type>
  <type>
    geo
  </type>
  <type>
    rig
  </type>
</AssetType>

<AssetType longname="camera" shortname="cam" shortnames="cams">
  <type>
    cam1
  </type>
  <type>
    cam2
  </type>
  <type>
    cam4
  </type>
</AssetType>

I want to retrieve the values of the children of the AssetType node that has the attribute longname="characters", to get the result 'pub', 'geo', 'rig'.
Please keep in mind that I have more than 1000 <AssetType> nodes.
Thanks in advance.

Accepted answer by eswald

If you don't mind loading the whole document into memory:

from lxml import etree
data = etree.parse(fname)
result = [node.text.strip() 
    for node in data.xpath("//AssetType[@longname='characters']/type")]

You may need to remove the spaces at the beginning of your tags to make this work.

Answer by Tendayi Mawushe

Assuming your document is called assets.xml and has the following structure:

<assets>
    <AssetType>
        ...
    </AssetType>
    <AssetType>
        ...
    </AssetType>
</assets>

Then you can do the following:

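from xml.etree.ElementTree import ElementTree

tree = ElementTree()
root = tree.parse("assets.xml")  # parse() returns the root element
# findall() on an element needs a relative path, hence the leading "."
for asset_type in root.findall(".//AssetType[@longname='characters']"):
    for type_node in asset_type:  # iterating an element yields its children
        print(type_node.text.strip())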
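
To try this approach without a file on disk, here is a minimal sketch that parses the question's sample data from a string. It assumes the <assets> root element described above; `SAMPLE` and the helper `types_for` are made-up names for illustration, not part of ElementTree.

```python
import xml.etree.ElementTree as ET

# Sample data from the question, wrapped in the assumed <assets> root.
SAMPLE = """<assets>
  <AssetType longname="characters" shortname="chr" shortnames="chrs">
    <type>
      pub
    </type>
    <type>
      geo
    </type>
    <type>
      rig
    </type>
  </AssetType>
  <AssetType longname="camera" shortname="cam" shortnames="cams">
    <type>
      cam1
    </type>
    <type>
      cam2
    </type>
    <type>
      cam4
    </type>
  </AssetType>
</assets>"""


def types_for(root, longname):
    """Return the stripped text of each <type> under matching AssetType nodes.

    Hypothetical helper; longname is pasted into the path as-is, so it
    must not contain a single quote.
    """
    path = ".//AssetType[@longname='{}']/type".format(longname)
    return [(node.text or "").strip() for node in root.findall(path)]


root = ET.fromstring(SAMPLE)
print(types_for(root, "characters"))  # ['pub', 'geo', 'rig']
```

This still loads the whole document into memory, so for a very large file the streaming answers are probably a better fit.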

Answer by John Montgomery

You could use the pulldom API to parse a large file without loading it all into memory at once. This provides a more convenient interface than using SAX, with only a slight loss of performance.

It basically lets you stream the XML file until you find the bit you are interested in, then use regular DOM operations after that.

from xml.dom import pulldom

# http://mail.python.org/pipermail/xml-sig/2005-March/011022.html
def getInnerText(oNode):
    rc = ""
    nodelist = oNode.childNodes
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc = rc + node.data
        elif node.nodeType==node.ELEMENT_NODE:
            rc = rc + getInnerText(node)   # recursive !!!
        elif node.nodeType==node.CDATA_SECTION_NODE:
            rc = rc + node.data
        else:
            # Other node types (PROCESSING_INSTRUCTION_NODE, COMMENT_NODE,
            # DOCUMENT_NODE, NOTATION_NODE and so on) contribute no text
            pass
    return rc


# xml_file is either a filename or a file
stream = pulldom.parse(xml_file) 
for event, node in stream:
    if event == "START_ELEMENT" and node.nodeName == "AssetType":
        if node.getAttribute("longname") == "characters":
            stream.expandNode(node) # node now contains a mini-dom tree
            type_nodes = node.getElementsByTagName('type')
            for type_node in type_nodes:
                # type_text will have the value of what's inside the type text
                type_text = getInnerText(type_node)
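
The loop above computes type_text but does not collect it anywhere. As a rough sketch of one way to gather the values, the same idea can be wrapped in a generator. `SAMPLE` and `iter_type_texts` are names invented here, the <assets> wrapper element is an assumption, and the data is parsed from a string only so the snippet runs on its own; for a real file you would pass `pulldom.parse(xml_file)` instead.

```python
from xml.dom import pulldom

# Sample data from the question, wrapped in an assumed <assets> root.
SAMPLE = """<assets>
  <AssetType longname="characters" shortname="chr" shortnames="chrs">
    <type>
      pub
    </type>
    <type>
      geo
    </type>
    <type>
      rig
    </type>
  </AssetType>
  <AssetType longname="camera" shortname="cam" shortnames="cams">
    <type>
      cam1
    </type>
  </AssetType>
</assets>"""


def iter_type_texts(stream, longname):
    """Yield the stripped text of each <type> under matching AssetType nodes.

    Hypothetical helper: only matching AssetType elements are expanded
    into a mini DOM tree; everything else is streamed past.
    """
    for event, node in stream:
        if event == pulldom.START_ELEMENT and node.tagName == "AssetType":
            if node.getAttribute("longname") == longname:
                stream.expandNode(node)
                for type_node in node.getElementsByTagName("type"):
                    # Join the pieces in case the text arrives as several nodes.
                    pieces = [
                        child.data
                        for child in type_node.childNodes
                        if child.nodeType in (child.TEXT_NODE,
                                              child.CDATA_SECTION_NODE)
                    ]
                    yield "".join(pieces).strip()


result = list(iter_type_texts(pulldom.parseString(SAMPLE), "characters"))
print(result)  # ['pub', 'geo', 'rig']
```

Unlike getInnerText, this sketch only reads the direct text of each type element, which is enough for the sample data but would miss text in nested elements.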

Answer by gruszczy

Use the xml.sax module. Build your own handler, and inside startElement check whether the name is AssetType. This way you can act only when an AssetType node is being processed.

Here you have an example handler that shows how to build one (though it's not the prettiest way; at that point I didn't know all the cool tricks with Python ;-)).
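
For reference, a minimal sketch of what such a handler could look like for the question's data. This is not the linked example: the class name `AssetTypeHandler`, its attributes, and the <assets> wrapper element are all assumptions made for illustration.

```python
import xml.sax

# Sample data from the question, wrapped in an assumed <assets> root.
SAMPLE = """<assets>
  <AssetType longname="characters" shortname="chr" shortnames="chrs">
    <type>
      pub
    </type>
    <type>
      geo
    </type>
    <type>
      rig
    </type>
  </AssetType>
  <AssetType longname="camera" shortname="cam" shortnames="cams">
    <type>
      cam1
    </type>
  </AssetType>
</assets>"""


class AssetTypeHandler(xml.sax.ContentHandler):
    """Hypothetical handler: collects <type> text under one AssetType."""

    def __init__(self, longname):
        super().__init__()
        self.longname = longname
        self.in_asset = False   # inside a matching <AssetType>?
        self.in_type = False    # inside a <type> of a matching node?
        self.buffer = []
        self.types = []

    def startElement(self, name, attrs):
        if name == "AssetType":
            self.in_asset = attrs.get("longname") == self.longname
        elif name == "type" and self.in_asset:
            self.in_type = True
            self.buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so buffer it.
        if self.in_type:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "type" and self.in_type:
            self.types.append("".join(self.buffer).strip())
            self.in_type = False
        elif name == "AssetType":
            self.in_asset = False


handler = AssetTypeHandler("characters")
xml.sax.parseString(SAMPLE.encode("utf-8"), handler)
print(handler.types)  # ['pub', 'geo', 'rig']
```

For a file on disk, `xml.sax.parse(filename, handler)` should work the same way without holding the whole document in memory.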

Answer by ron

You could use XPath, something like "//AssetType[@longname='characters']/xyz" (longname is an attribute, so it needs the @ prefix).

For XPath libs in Python see http://www.somebits.com/weblog/tech/python/xpath.html

Answer by MattH

Similar to eswald's solution: again stripping whitespace, again loading the document into memory, but returning the three text items together as one list per AssetType node.

from lxml import etree

data = """<AssetType longname="characters" shortname="chr" shortnames="chrs">
  <type>
    pub
  </type>
  <type>
    geo
  </type>
  <type>
    rig
  </type>
</AssetType>
"""

doc = etree.XML(data)

for asset in doc.xpath('//AssetType[@longname="characters"]'):
    # text() selects the text nodes directly, so each x is a string
    threetypes = [x.strip() for x in asset.xpath('./type/text()')]
    print(threetypes)