Python: how to get specific nodes from an XML file with Python

Question

Asked by Moayyad Yaghi

I'm searching for a way to get specific tags from a very big XML document using Python's built-in DOM module.
For example:

<AssetType longname="characters" shortname="chr" shortnames="chrs">
  <type>
    pub
  </type>
  <type>
    geo
  </type>
  <type>
    rig
  </type>
</AssetType>

<AssetType longname="camera" shortname="cam" shortnames="cams">
  <type>
    cam1
  </type>
  <type>
    cam2
  </type>
  <type>
    cam4
  </type>
</AssetType>

I want to retrieve the values of the children of the AssetType node whose attribute longname is "characters", to get the result 'pub', 'geo', 'rig'.
Please keep in mind that I have more than 1000 <AssetType> nodes.
Thanks in advance.

Answer 1

Accepted answer by eswald

If you don't mind loading the whole document into memory:

from lxml import etree
data = etree.parse(fname)
result = [node.text.strip() 
    for node in data.xpath("//AssetType[@longname='characters']/type")]

You may need to remove the spaces at the beginning of your tags to make this work.

Answer 2

Answer by Tendayi Mawushe

Assuming your document is called assets.xml and has the following structure:

<assets>
    <AssetType>
        ...
    </AssetType>
    <AssetType>
        ...
    </AssetType>
</assets>

Then you can do the following:

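from xml.etree.ElementTree import ElementTree

tree = ElementTree()
root = tree.parse("assets.xml")
# ElementTree rejects absolute paths on an element, so search relative to the root with ".//".
for asset_type in root.findall(".//AssetType[@longname='characters']"):
    # Iterate over the element directly; getchildren() was removed in Python 3.9.
    for child in asset_type:
        print(child.text.strip())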

Answer 3

Answer by John Montgomery

You could use the pulldom API to handle parsing a large file, without loading it all into memory at once. This provides a more convenient interface than using SAX, with only a slight loss of performance.

It basically lets you stream the XML file until you find the bit you are interested in, then start using regular DOM operations after that.

from xml.dom import pulldom

# http://mail.python.org/pipermail/xml-sig/2005-March/011022.html
def getInnerText(oNode):
    rc = ""
    nodelist = oNode.childNodes
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc = rc + node.data
        elif node.nodeType==node.ELEMENT_NODE:
            rc = rc + getInnerText(node)   # recursive !!!
        elif node.nodeType==node.CDATA_SECTION_NODE:
            rc = rc + node.data
        else:
            # node.nodeType: PROCESSING_INSTRUCTION_NODE, COMMENT_NODE, DOCUMENT_NODE, NOTATION_NODE and so on
            pass
    return rc


# xml_file is either a filename or a file
stream = pulldom.parse(xml_file) 
for event, node in stream:
    if event == "START_ELEMENT" and node.nodeName == "AssetType":
        if node.getAttribute("longname") == "characters":
            stream.expandNode(node) # node now contains a mini-dom tree
            type_nodes = node.getElementsByTagName('type')
            for type_node in type_nodes:
                # type_text will have the value of what's inside the type text
                type_text = getInnerText(type_node)

Answer 4

Answer by gruszczy

Use the xml.sax module. Build your own handler, and inside startElement check whether the name is AssetType. This way you can act only when an AssetType node is being processed.

Here you have an example handler, which shows how to build one (though it's not the prettiest way; at that point I didn't know all the cool tricks with Python ;-)).
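
As a rough illustration of the handler idea described above, here is a minimal sketch. It is not the handler the answer links to: the class name AssetTypeHandler, its attributes, and the sample data wrapped in an assumed <assets> root element are all hypothetical, chosen only to make the example self-contained.

```python
import xml.sax

# Hypothetical sample data; the <assets> root element is an assumption.
SAMPLE = b"""<assets>
<AssetType longname="characters" shortname="chr" shortnames="chrs">
  <type>
    pub
  </type>
  <type>
    geo
  </type>
  <type>
    rig
  </type>
</AssetType>
<AssetType longname="camera" shortname="cam" shortnames="cams">
  <type>
    cam1
  </type>
</AssetType>
</assets>"""


class AssetTypeHandler(xml.sax.ContentHandler):
    """Collect the text of <type> children of one AssetType (hypothetical helper)."""

    def __init__(self, longname):
        super().__init__()
        self.longname = longname
        self.in_asset = False  # True while inside the matching <AssetType>
        self.in_type = False   # True while inside a <type> of that AssetType
        self.buffer = []       # text chunks of the current <type>
        self.types = []        # collected results

    def startElement(self, name, attrs):
        if name == "AssetType":
            self.in_asset = attrs.get("longname") == self.longname
        elif name == "type" and self.in_asset:
            self.in_type = True
            self.buffer = []

    def characters(self, content):
        # SAX may deliver the text of one element in several chunks.
        if self.in_type:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "type" and self.in_type:
            self.types.append("".join(self.buffer).strip())
            self.in_type = False
        elif name == "AssetType":
            self.in_asset = False


handler = AssetTypeHandler("characters")
xml.sax.parseString(SAMPLE, handler)
print(handler.types)  # ['pub', 'geo', 'rig']
```

For a file on disk, xml.sax.parse(filename, handler) can be used instead of parseString, so the document is streamed rather than loaded into memory.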

Answer 5

Answer by ron

You could use XPath, something like "//AssetType[@longname='characters']/type" (note the @, which selects an attribute rather than a child element).

For XPath libs in Python see http://www.somebits.com/weblog/tech/python/xpath.html
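
If installing a separate XPath library is not an option, the standard library's xml.etree.ElementTree supports a limited XPath subset that is enough for this query. The following is a minimal sketch under that assumption; the sample data and its <assets> root element are hypothetical, and the whole document is loaded into memory.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; the <assets> root element is an assumption.
SAMPLE = """<assets>
<AssetType longname="characters" shortname="chr" shortnames="chrs">
  <type>
    pub
  </type>
  <type>
    geo
  </type>
  <type>
    rig
  </type>
</AssetType>
<AssetType longname="camera" shortname="cam" shortnames="cams">
  <type>
    cam1
  </type>
</AssetType>
</assets>"""

root = ET.fromstring(SAMPLE)

# ElementTree paths are relative to the element, hence the leading ".".
path = ".//AssetType[@longname='characters']/type"
types = [node.text.strip() for node in root.findall(path)]
print(types)  # ['pub', 'geo', 'rig']
```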

Answer 6

Answer by MattH

Similar to eswald's solution: again stripping whitespace and again loading the document into memory, but returning the three text items together, as one list per AssetType:

from lxml import etree

data = """<AssetType longname="characters" shortname="chr" shortnames="chrs">
  <type>
    pub
  </type>
  <type>
    geo
  </type>
  <type>
    rig
  </type>
</AssetType>
"""

doc = etree.XML(data)

for asset in doc.xpath('//AssetType[@longname="characters"]'):
  threetypes = [ x.strip() for x in asset.xpath('./type/text()') ]
  print(threetypes)
