Python ElementTree iterative parsing strategy

Question

Asked by Juan Antonio Gomez Moriano

I have to handle fairly large XML documents (up to 1 GB) and parse them with Python. I am using the iterparse() function (SAX-style parsing).

My concern is the following. Imagine you have an XML document like this:

<?xml version="1.0" encoding="UTF-8" ?>
<families>
  <family>
    <name>Simpson</name>
    <members>
        <name>Homer</name>
        <name>Marge</name>
        <name>Bart</name>
    </members>
  </family>
  <family>
    <name>Griffin</name>
    <members>
        <name>Peter</name>
        <name>Brian</name>
        <name>Meg</name>
    </members>
  </family>
</families>

The problem, of course, is knowing when I am getting a family name (such as Simpson) and when I am getting the name of one of that family's members (for example Homer).

What I have been doing so far is using "switches" that tell me whether I am inside a "members" tag or not. The code looks like this:

# cElementTree was removed in Python 3.9; ElementTree uses the C accelerator automatically
import xml.etree.ElementTree as ET

__author__ = 'moriano'

file_path = "test.xml"
context = ET.iterparse(file_path, events=("start", "end"))

# turn it into an iterator
context = iter(context)
on_members_tag = False
for event, elem in context:
    tag = elem.tag
    # Note: elem.text is only guaranteed to be complete at the 'end' event
    value = elem.text
    if value:
        value = value.strip()

    if event == 'start':
        if tag == "members":
            on_members_tag = True

        elif tag == 'name':
            if on_members_tag:
                print("The member of the family is %s" % value)
            else:
                print("The family is %s " % value)

    if event == 'end' and tag == 'members':
        on_members_tag = False
    elem.clear()

This works fine, as the output is:

The family is Simpson 
The member of the family is Homer
The member of the family is Marge
The member of the family is Bart
The family is Griffin 
The member of the family is Peter
The member of the family is Brian
The member of the family is Meg

My concern is that even in this (simple) example I had to create an extra variable (on_members_tag) to know which tag I was in. The real XML documents I have to handle have many more nested tags.

Also note that this is a very reduced example, so you can assume that I may be facing XML with more tags and more inner tags, and that I may be trying to get different tag names, attributes and so on.

So the question is: am I doing something horribly stupid here? I feel like there must be a more elegant solution to this.

Answer 1

Accepted answer by nneonneo

Here's one possible approach: we maintain a path list of the currently open tags and peek backwards in it to find the parent node(s).

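import xml.etree.ElementTree as ET

file_path = "test.xml"  # same file as in the question

path = []  # tags of the elements that are currently open
for event, elem in ET.iterparse(file_path, events=("start", "end")):
    if event == 'start':
        path.append(elem.tag)
    elif event == 'end':
        # process the tag
        if elem.tag == 'name':
            if 'members' in path:
                print('member')
            else:
                print('nonmember')
        path.pop()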
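The path list can also be used to look only at the direct parent instead of at any ancestor. What follows is a minimal sketch of that idea, not part of the original answer: the function name iter_names and the in-memory SAMPLE document are made up for illustration, and the text is read at the 'end' event, where ElementTree guarantees it is complete.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical in-memory copy of the document from the question
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" ?>
<families>
  <family>
    <name>Simpson</name>
    <members>
        <name>Homer</name>
        <name>Marge</name>
        <name>Bart</name>
    </members>
  </family>
  <family>
    <name>Griffin</name>
    <members>
        <name>Peter</name>
        <name>Brian</name>
        <name>Meg</name>
    </members>
  </family>
</families>
"""


def iter_names(source):
    """Yield ('family' | 'member', text) for each <name>, based on its direct parent."""
    path = []  # tags of the elements that are currently open
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            path.append(elem.tag)
            continue
        # 'end' event: path[-1] is this element, path[-2] is its parent (if any)
        parent = path[-2] if len(path) > 1 else None
        if elem.tag == "name":
            text = (elem.text or "").strip()
            if parent == "members":
                yield ("member", text)
            elif parent == "family":
                yield ("family", text)
        path.pop()
        elem.clear()  # drop text and children that have already been processed


if __name__ == "__main__":
    for kind, text in iter_names(io.BytesIO(SAMPLE)):
        print(kind, text)
```

Checking path[-2] rather than 'members' in path matters once the same tag name can appear at several depths. Note that elem.clear() empties each element but the root still keeps a reference to the emptied elements, so for very large files it may also be worth removing finished children from their parent.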

Answer 2

Answer by Gary van der Merwe

pulldom is excellent for this. You get a SAX stream. You can iterate through the stream, and when you find a node that you are interested in, load that node into a DOM fragment.

import xml.dom.pulldom as pulldom
import xpath # from http://code.google.com/p/py-dom-xpath/

events = pulldom.parse('families.xml')
for event, node in events:
    if event == 'START_ELEMENT' and node.tagName=='family':
        events.expandNode(node) # node now contains a dom fragment
        family_name = xpath.findvalue('name', node)
        members = xpath.findvalues('members/name', node)
        print('family name: {0}, members: {1}'.format(family_name, members))

Output:

family name: Simpson, members: [u'Homer', u'Marge', u'Bart']
family name: Griffin, members: [u'Peter', u'Brian', u'Meg']
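If installing the third-party xpath module is not an option, the same expand-then-query idea can probably be done with the standard library alone. The sketch below is an assumption-laden variant, not the answerer's code: the helper names text_of and iter_families and the in-memory SAMPLE document are hypothetical, and plain DOM traversal stands in for the XPath queries.

```python
import io
import xml.dom.pulldom as pulldom

# Hypothetical in-memory document, shortened from the one in the question
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" ?>
<families>
  <family>
    <name>Simpson</name>
    <members>
        <name>Homer</name>
        <name>Marge</name>
    </members>
  </family>
  <family>
    <name>Griffin</name>
    <members>
        <name>Peter</name>
    </members>
  </family>
</families>
"""


def text_of(node):
    """Join the direct text children of a DOM element and strip whitespace."""
    return "".join(
        child.data for child in node.childNodes
        if child.nodeType == child.TEXT_NODE
    ).strip()


def iter_families(source):
    """Yield (family_name, [member names]) one <family> at a time."""
    events = pulldom.parse(source)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "family":
            events.expandNode(node)  # node now holds the whole <family> subtree
            family_name = None
            members = []
            for child in node.childNodes:
                if child.nodeType != child.ELEMENT_NODE:
                    continue  # skip whitespace text between elements
                if child.tagName == "name":
                    family_name = text_of(child)
                elif child.tagName == "members":
                    members = [text_of(m) for m in child.getElementsByTagName("name")]
            yield family_name, members


if __name__ == "__main__":
    for family_name, members in iter_families(io.BytesIO(SAMPLE)):
        print("family name: {0}, members: {1}".format(family_name, members))
```

Only one family subtree is materialised at a time, so memory use depends on the size of a single record rather than on the size of the whole file.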
