Original URL: http://stackoverflow.com/questions/4695826/
Warning: this is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): Stack Overflow
Efficient way to iterate through xml elements
Asked by nukl
I have an XML document like this:
<a>
    <b>hello</b>
    <b>world</b>
</a>
<x>
    <y></y>
</x>
<a>
    <b>first</b>
    <b>second</b>
    <b>third</b>
</a>
I need to iterate through all <a> and <b> tags, but I don't know how many of them are in the document. So I use xpath to handle that:
from lxml import etree

doc = etree.fromstring(xml)
atags = doc.xpath('//a')
for a in atags:
    btags = a.xpath('b')
    for b in btags:
        print b
It works, but I have pretty big files, and cProfile shows me that xpath is very expensive to use.
I wonder, is there maybe a more efficient way to iterate through an indefinite number of XML elements?
Accepted answer by unutbu
XPath should be fast. You can reduce the number of XPath calls to one:
doc = etree.fromstring(xml)
btags = doc.xpath('//a/b')
for b in btags:
    print b.text
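If you also want to keep the <a> grouping, one variation (a sketch of mine, not part of the answer) is to issue a single XPath call for the <a> elements and then walk their children with plain iteration, so no per-element XPath query is needed. The wrapping <root> element here is an assumption, since the fragments in the question have no single root:

from lxml import etree

# Hypothetical input: the question's fragments wrapped in a single root
# so that fromstring() can parse them.
xml = b'<root><a><b>hello</b><b>world</b></a><x><y></y></x><a><b>first</b></a></root>'

doc = etree.fromstring(xml)
for a in doc.xpath('//a'):      # one XPath call in total
    for b in a.iter('b'):       # plain C-level traversal, no XPath per <a>
        print(b.text)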
If that is not fast enough, you could try Liza Daly's fast_iter. This has the advantage of not requiring that the entire XML be processed with etree.fromstring first, and parent nodes are thrown away after the children have been visited. Both of these things help reduce the memory requirements. Below is a modified version of fast_iter which is more aggressive about removing other elements that are no longer needed.
import io
from lxml import etree

def fast_iter(context, func, *args, **kwargs):
    """
    fast_iter is useful if you need to free memory while iterating through a
    very large XML file.
    http://lxml.de/parsing.html#modifying-the-tree
    Based on Liza Daly's fast_iter
    http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    See also http://effbot.org/zone/element-iterparse.htm
    """
    for event, elem in context:
        func(elem, *args, **kwargs)
        # It's safe to call clear() here because no descendants will be
        # accessed
        elem.clear()
        # Also eliminate now-empty references from the root node to elem
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context

def process_element(elt):
    print(elt.text)

context = etree.iterparse(io.BytesIO(xml), events=('end',), tag='b')
fast_iter(context, process_element)
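The ancestor loop is what makes this version aggressive: elem.clear() only empties the element itself, which stays attached to its parent, so without the loop the root would keep accumulating empty <b> references. Deleting each ancestor's already-processed preceding siblings keeps the in-memory tree down to roughly the single path from the root to the current element.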
Liza Daly's article on parsing large XML files may prove useful reading to you too. According to the article, lxml with fast_iter can be faster than cElementTree's iterparse. (See Table 1).
Answered by user225312
Answered by John Machin
Use iterparse:
import lxml.etree as ET

for event, elem in ET.iterparse(filelike_object):
    if elem.tag == "a":
        process_a(elem)  # process_a() and process_child() are the caller's handlers (placeholders here)
        for child in elem:
            process_child(child)
        elem.clear() # destroy all child elements
    elif elem.tag != "b":
        elem.clear()
Note that this doesn't save all the memory, but I've been able to wade through XML streams of over a Gb using this technique.
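The asymmetry between the branches is deliberate: a <b> element is not cleared when its own end event fires, because its parent <a> has not been processed yet and still needs its text; the <b> children are destroyed in bulk by the elem.clear() on the <a>. Every other element (<x>, <y>, ...) is cleared as soon as it ends.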
Try import xml.etree.cElementTree as ET ... it comes with Python and its iterparse is faster than the lxml.etree iterparse, according to the lxml docs:
"""For applications that require a high parser throughput of large files, and that do little to no serialization, cET is the best choice. Also for iterparse applications that extract small amounts of data or aggregate information from large XML data sets that do not fit into memory. If it comes to round-trip performance, however, lxml tends to be multiple times faster in total. So, whenever the input documents are not considerably larger than the output, lxml is the clear winner."""
"""对于需要大文件的高解析器吞吐量并且很少或不执行序列化的应用程序,cET 是最佳选择。也适用于从大型 XML 数据集中提取少量数据或聚合信息的 iterparse 应用程序适合内存。但是,如果涉及往返性能,lxml 总体上往往要快数倍。因此,只要输入文档不比输出大很多,lxml 就是明显的赢家。"""
Answered by Brandon
bs4 is very useful for this
from bs4 import BeautifulSoup
raw_xml = open(source_file, 'r')
soup = BeautifulSoup(raw_xml)
soup.find_all('tags')
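Applied to the question's markup it might look like this (a sketch; the 'xml' parser argument is an assumption and requires lxml to be installed, and keep in mind that BeautifulSoup builds the whole tree in memory, so it will not help with the very large files discussed above):

from bs4 import BeautifulSoup

soup = BeautifulSoup(open(source_file, 'r'), 'xml')
for a in soup.find_all('a'):
    for b in a.find_all('b'):
        print(b.text)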

