How to remove elements from XML using Python

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/3593204/

Time: 2020-08-18 11:52:28  Source: igfitidea

python xml

Asked by dwich

I am stuck with XML and Python. The task is simple, but I have not been able to solve it so far and have spent a long time on it. I came here for advice on how to solve it in a couple of lines.

Thanks for any help with traversing the tree. I always end up with too many or too few elements. Elements can be nested without limit; the example given is only an illustration. I will accept any solution and am not picky about dom, minidom, sax, or anything else.

I have an XML file similar to this one:

<root>
    <elm>
        <elm>Common content</elm>

        <elm xmlns="http://example.org/ns">
            <elm lang="en">Content EN</elm>
            <elm lang="cs">Žluťoučky koníček</elm>
        </elm>

        <elm xml:id="abc123">Common content</elm>

        <elm lang="en">Content EN</elm>
        <elm lang="cs">Content CS</elm>

        <elm lang="en">
            <elm>Content EN</elm>
            <elm>Content EN</elm>
        </elm>

        <elm lang="cs">
            <elm>Content CS</elm>
            <elm>Content CS</elm>
        </elm>
    </elm>
</root>

What I need: parse the XML and write a new file. The new file should contain all the elements for a given language plus the elements without a lang attribute.

For the "cs" language, the output file should contain this:

<root>
    <elm>
        <elm>Common content</elm>

        <elm xmlns="http://example.org/ns">
            <elm lang="cs">Žluťoučky koníček</elm>
        </elm>

        <elm xml:id="abc123">Common content</elm>

        <elm lang="cs">Content CS</elm>

        <elm lang="cs">
            <elm>Content CS</elm>
            <elm>Content CS</elm>
        </elm>
    </elm>
</root>

If you can make it omit the lang attribute in the new file, even better. But it's not that important.

UPDATE1: Added unicode characters and namespace attribute.

UPDATE2: Using Python 2.5, standard libraries preferred.

Accepted answer by unutbu

Using lxml:

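import lxml.etree as le

# Parse the document; lxml reads the encoding from the file itself.
doc = le.parse('doc.xml')
# Select every element that carries a lang attribute.
for elem in doc.xpath('//*[attribute::lang]'):
    if elem.attrib['lang'] == 'en':
        # Keep the element, but drop its lang attribute.
        elem.attrib.pop('lang')
    else:
        # Any other language: detach the element and its subtree from its parent.
        parent = elem.getparent()
        parent.remove(elem)
print(le.tostring(doc))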

yields

<root>
    <elm>Common content</elm>

    <elm>
        <elm>Content EN</elm>
        </elm>

    <elm>Common content</elm>

    <elm>Content EN</elm>
    <elm>
        <elm>Content EN</elm>
        <elm>Content EN</elm>
    </elm>

    </root>
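
Since the question prefers the standard library, the same idea can be sketched with xml.etree.ElementTree alone. This is only an illustration under some assumptions: it is written for Python 3 rather than 2.5, the helper name keep_lang and the inline sample are made up for the example, and a lang attribute on the root element itself is not handled. ElementTree has no getparent(), so the sketch walks from each parent to its children and iterates over a copy of the child list.

```python
import xml.etree.ElementTree as ET

def keep_lang(root, lang):
    """Drop children whose lang differs from `lang`; strip lang from the rest.

    Hypothetical helper for illustration; the root element is never removed.
    """
    # Collect the elements first so that removals do not disturb the walk.
    for parent in list(root.iter()):
        # Iterate over a copy of the child list while removing from the original.
        for child in list(parent):
            value = child.get('lang')
            if value is None:
                continue                  # language-neutral element: keep as is
            if value == lang:
                del child.attrib['lang']  # wanted language: keep, drop the attribute
            else:
                parent.remove(child)      # other language: remove the whole subtree
    return root

# Inline sample (an assumption, loosely modelled on the question's file).
sample = """<root>
    <elm>Common content</elm>
    <elm xmlns="http://example.org/ns">
        <elm lang="en">Content EN</elm>
        <elm lang="cs">Content CS nested</elm>
    </elm>
    <elm lang="en">Content EN</elm>
    <elm lang="cs"><elm>Content CS</elm></elm>
</root>"""

root = keep_lang(ET.fromstring(sample), 'cs')
print(ET.tostring(root, encoding='unicode'))
```

Note that ElementTree serializes the namespaced elements with a generated prefix (such as ns0) instead of the original default namespace declaration.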

Answer by Alex Martelli

I'm not sure how best to remove the lang attribute, but here's some code that does the other changes (Python 2.7; for 2.5 or 2.6, use getiterator instead of iter), assuming that when you remove an element you also always want to remove everything contained in that element.

This code just prints the result to standard output (you could redirect it as you wish, of course, or directly write it to some new file, and so on):

import sys
from xml.etree import cElementTree as et

def picklang(path, lang='en'):
    tr = et.parse(path)
    for element in tr.iter():
        for subelement in element:
            la = subelement.get('lang')
            if la is not None and la != lang:
                element.remove(subelement)
    return tr

if __name__ == '__main__':
    tr = picklang('la.xml')
    tr.write(sys.stdout)
    print

With la.xml being your example, this writes

<root>
    <elm>Common content</elm>

    <elm>
        <elm lang="en">Content EN</elm>
        </elm>

    <elm>Common content</elm>

    <elm lang="en">Content EN</elm>
    <elm lang="en">
        <elm>Content EN</elm>
        <elm>Content EN</elm>
    </elm>

    </root>
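
As this answer mentions, the result could also be written directly to a new file instead of standard output. A minimal sketch of that step, assuming Python 3 and the standard xml.etree.ElementTree module; the output file name la_cs.xml, the temporary directory, and the one-element sample tree with non-ASCII text are all made up for the example:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Tiny stand-in tree with non-ASCII text (sample data, not the real file).
tree = ET.ElementTree(
    ET.fromstring('<root><elm lang="cs">Žluťoučky koníček</elm></root>')
)

# Hypothetical output location inside a fresh temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), 'la_cs.xml')

# Passing an explicit encoding keeps non-ASCII characters intact;
# xml_declaration=True adds the <?xml ...?> header line.
tree.write(out_path, encoding='utf-8', xml_declaration=True)

with open(out_path, encoding='utf-8') as f:
    written = f.read()
print(written)
```

Without an explicit encoding, write() falls back to US-ASCII and emits non-ASCII characters as numeric character references, which is still valid XML but harder to read.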

Answer by bhuvi

Updating @Alex Martelli's code to fix a bug where the element's child list is modified in place while it is being iterated. The solution above gives a wrong answer if the input is a little more complex.

import sys
from xml.etree import cElementTree as et

def picklang(path, lang='en'):
    tr = et.parse(path)
    for element in tr.iter():
        for subelement in element[:]:
            la = subelement.get('lang')

            if la is not None and la != lang:
                element.remove(subelement)
    return tr

if __name__ == '__main__':
    tr = picklang('la.xml')
    tr.write(sys.stdout)
    print

Line 7, for subelement in element:, is changed to for subelement in element[:]: because it is incorrect to modify a list in place while iterating over it.

This code iterates over a copy of the element's child list and removes children from the original list when their lang attribute is set and differs from "en".
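
The effect of that one-line change can be seen on a tiny input. The sketch below is only an illustration (Python 3; the function names prune_in_place and prune_copy and the three-element sample are made up): two adjacent siblings are both in the unwanted language, and the in-place loop skips the second one because the child list shifts after the first removal.

```python
import xml.etree.ElementTree as ET

# Two adjacent "cs" siblings followed by one "en" sibling (sample data).
XML = '<root><a lang="cs"/><b lang="cs"/><c lang="en"/></root>'

def prune_in_place(root, lang):
    # Buggy variant: removes children while iterating the live child list.
    for sub in root:
        la = sub.get('lang')
        if la is not None and la != lang:
            root.remove(sub)
    return [child.tag for child in root]

def prune_copy(root, lang):
    # Fixed variant: root[:] is a copy, so removals do not shift the iteration.
    for sub in root[:]:
        la = sub.get('lang')
        if la is not None and la != lang:
            root.remove(sub)
    return [child.tag for child in root]

print(prune_in_place(ET.fromstring(XML), 'en'))  # 'b' survives by mistake
print(prune_copy(ET.fromstring(XML), 'en'))      # only 'c' remains
```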
