Original URL: http://stackoverflow.com/questions/4695826/
Warning: this is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): Stack Overflow
Efficient way to iterate through xml elements
Asked by nukl
I have an XML document like this:
<a>
    <b>hello</b>
    <b>world</b>
</a>
<x>
    <y></y>
</x>
<a>
    <b>first</b>
    <b>second</b>
    <b>third</b>
</a>
I need to iterate through all <a> and <b> tags, but I don't know how many of them are in the document. So I use xpath to handle that:
from lxml import etree

doc = etree.fromstring(xml)
atags = doc.xpath('//a')
for a in atags:
    btags = a.xpath('b')
    for b in btags:
        print b
It works, but I have pretty big files, and cProfile shows me that xpath is very expensive to use.
I wonder, is there maybe a more efficient way to iterate through an indefinite number of XML elements?
Accepted answer by unutbu
XPath should be fast. You can reduce the number of XPath calls to one:
doc = etree.fromstring(xml)
btags = doc.xpath('//a/b')
for b in btags:
    print b.text
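If you also want to keep the <a> grouping, one variation (a sketch of mine, not part of the answer) is to issue a single XPath call for the <a> elements and then walk their children with plain iteration, so no per-element XPath query is needed. The wrapping <root> element here is an assumption, since the fragments in the question have no single root:

from lxml import etree

# Hypothetical input: the question's fragments wrapped in a single root
# so that fromstring() can parse them.
xml = b'<root><a><b>hello</b><b>world</b></a><x><y></y></x><a><b>first</b></a></root>'

doc = etree.fromstring(xml)
for a in doc.xpath('//a'):      # one XPath call in total
    for b in a.iter('b'):       # plain C-level traversal, no XPath per <a>
        print(b.text)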
If that is not fast enough, you could try Liza Daly's fast_iter. This has the advantage of not requiring that the entire XML be processed with etree.fromstring first, and parent nodes are thrown away after the children have been visited. Both of these things help reduce the memory requirements. Below is a modified version of fast_iter which is more aggressive about removing other elements that are no longer needed.
import io
from lxml import etree

def fast_iter(context, func, *args, **kwargs):
    """
    fast_iter is useful if you need to free memory while iterating through a
    very large XML file.
    http://lxml.de/parsing.html#modifying-the-tree
    Based on Liza Daly's fast_iter
    http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    See also http://effbot.org/zone/element-iterparse.htm
    """
    for event, elem in context:
        func(elem, *args, **kwargs)
        # It's safe to call clear() here because no descendants will be
        # accessed
        elem.clear()
        # Also eliminate now-empty references from the root node to elem
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context

def process_element(elt):
    print(elt.text)

context = etree.iterparse(io.BytesIO(xml), events=('end',), tag='b')
fast_iter(context, process_element)
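The ancestor loop is what makes this version aggressive: elem.clear() only empties the element itself, which stays attached to its parent, so without the loop the root would keep accumulating empty <b> references. Deleting each ancestor's already-processed preceding siblings keeps the in-memory tree down to roughly the single path from the root to the current element.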
Liza Daly's article on parsing large XML files may prove useful reading to you too. According to the article, lxml with fast_iter can be faster than cElementTree's iterparse. (See Table 1).
Answered by user225312
Answered by John Machin
Use iterparse:
import lxml.etree as ET

for event, elem in ET.iterparse(filelike_object):
    if elem.tag == "a":
        process_a(elem)  # process_a() and process_child() are the caller's handlers (placeholders here)
        for child in elem:
            process_child(child)
        elem.clear() # destroy all child elements
    elif elem.tag != "b":
        elem.clear()
Note that this doesn't save all the memory, but I've been able to wade through XML streams of over a Gb using this technique.
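The asymmetry between the branches is deliberate: a <b> element is not cleared when its own end event fires, because its parent <a> has not been processed yet and still needs its text; the <b> children are destroyed in bulk by the elem.clear() on the <a>. Every other element (<x>, <y>, ...) is cleared as soon as it ends.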
Try import xml.etree.cElementTree as ET ... it comes with Python and its iterparse is faster than the lxml.etree iterparse, according to the lxml docs:
"""For applications that require a high parser throughput of large files, and that do little to no serialization, cET is the best choice. Also for iterparse applications that extract small amounts of data or aggregate information from large XML data sets that do not fit into memory. If it comes to round-trip performance, however, lxml tends to be multiple times faster in total. So, whenever the input documents are not considerably larger than the output, lxml is the clear winner."""
"""对于需要大文件的高解析器吞吐量并且很少或不执行序列化的应用程序,cET 是最佳选择。也适用于从大型 XML 数据集中提取少量数据或聚合信息的 iterparse 应用程序适合内存。但是,如果涉及往返性能,lxml 总体上往往要快数倍。因此,只要输入文档不比输出大很多,lxml 就是明显的赢家。"""
Answered by Brandon
bs4 is very useful for this
from bs4 import BeautifulSoup
raw_xml = open(source_file, 'r')
soup = BeautifulSoup(raw_xml)
soup.find_all('tags')
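Applied to the question's markup it might look like this (a sketch; the 'xml' parser argument is an assumption and requires lxml to be installed, and keep in mind that BeautifulSoup builds the whole tree in memory, so it will not help with the very large files discussed above):

from bs4 import BeautifulSoup

soup = BeautifulSoup(open(source_file, 'r'), 'xml')
for a in soup.find_all('a'):
    for b in a.find_all('b'):
        print(b.text)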

