Python: remove whitespace in an XML string

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/3310614/

Date: 2020-08-18 10:24:16  Source: igfitidea

Remove whitespaces in XML string

Tags: python, xml, xml-serialization, python-2.6, elementtree

Asked by desolat

How can I remove the whitespaces and line breaks in an XML string in Python 2.6? I tried the following packages:

etree: This snippet keeps the original whitespaces:

import xml.etree.ElementTree

xmlStr = '''<root>
    <head></head>
    <content></content>
</root>'''

xmlElement = xml.etree.ElementTree.XML(xmlStr)
xmlStr = xml.etree.ElementTree.tostring(xmlElement, 'UTF-8')
print xmlStr

I cannot use Python 2.7, which would provide the method parameter.

minidom: just the same:

import xml.dom.minidom

xmlDocument = xml.dom.minidom.parseString(xmlStr)
xmlStr = xmlDocument.toprettyxml(indent='', newl='', encoding='UTF-8')

Accepted answer by Steven

The easiest solution is probably using lxml, where you can set a parser option to ignore white space between elements:

>>> from lxml import etree
>>> parser = etree.XMLParser(remove_blank_text=True)
>>> xml_str = '''<root>
...     <head></head>
...     <content></content>
... </root>'''
>>> elem = etree.XML(xml_str, parser=parser)
>>> print etree.tostring(elem)
<root><head/><content/></root>

This will probably be enough for your needs, but some warnings to be on the safe side:

This will just remove whitespace nodes between elements, and try not to remove whitespace nodes inside elements with mixed content:

>>> elem = etree.XML('<p> spam <a>ham</a> <a>eggs</a></p>', parser=parser)
>>> print etree.tostring(elem)
<p> spam <a>ham</a> <a>eggs</a></p>

Leading or trailing whitespace in text nodes will not be removed. In some circumstances, however, it will still remove whitespace nodes from mixed content, namely when the parser has not yet encountered a non-whitespace node at that level.

>>> elem = etree.XML('<p><a> ham</a> <a>eggs</a></p>', parser=parser)
>>> print etree.tostring(elem)
<p><a> ham</a><a>eggs</a></p>

If you don't want that, you can use xml:space="preserve", which will be respected. Another option would be to use a DTD together with etree.XMLParser(load_dtd=True), in which case the parser will use the DTD to determine which whitespace nodes are significant.

Other than that, you will have to write your own code to remove the whitespace you don't want (iterate over the descendants and, where appropriate, set .text and .tail properties that contain only whitespace to None or an empty string).
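A minimal sketch of that last suggestion, using the standard-library ElementTree rather than lxml. The helper name strip_blank_text is made up for illustration, and the code assumes Python 3 (on 2.6 you would use getiterator() instead of iter() and drop the encoding='unicode' argument). It blanks every whitespace-only .text and .tail without looking at mixed content or xml:space, so treat it as a starting point only:

```python
import xml.etree.ElementTree as ET

def strip_blank_text(root):
    """Hypothetical helper: set whitespace-only .text/.tail to None, in place."""
    for elem in root.iter():  # root itself plus all descendants
        if elem.text is not None and not elem.text.strip():
            elem.text = None
        if elem.tail is not None and not elem.tail.strip():
            elem.tail = None
    return root

xml_str = '''<root>
    <head></head>
    <content> keep me </content>
</root>'''

root = strip_blank_text(ET.fromstring(xml_str))
result = ET.tostring(root, encoding='unicode')
print(result)  # <root><head /><content> keep me </content></root>
```

Text that contains anything other than whitespace is left untouched, including its leading and trailing spaces.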

Answer by Tony Veijalainen

xmlStr = ' '.join(xmlStr.split())

This puts all the text on one line, replacing each run of whitespace with a single space.

xmlStr = ''.join(xmlStr.split())

This would remove whitespace completely, including the spaces inside the text (and between attributes), so it cannot be used.

The first form can be used, at some risk (but it is what you asked for), on the input you gave:

xmlStr = '''<root>
    <head></head>
    <content></content>
</root>'''
xmlStr = ' '.join(xmlStr.split())
print xmlStr
""" Output:
<root> <head></head> <content></content> </root>
"""

This would be valid XML, though it might need to be checked with some kind of XML checker. By the way, are you sure you want XML? Have you read the article "Python Is Not Java"?
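To make the risk concrete, here is a small sketch with a made-up input (the <code> element and its contents are purely illustrative, and the snippet assumes Python 3). The split/join trick cannot tell indentation from data, so whitespace inside text content is collapsed too:

```python
import xml.etree.ElementTree as ET

xml_str = '''<root>
    <code>a  =  1
b  =  2</code>
</root>'''

# Collapse every run of whitespace, including runs inside text content
collapsed = ' '.join(xml_str.split())
print(collapsed)  # <root> <code>a = 1 b = 2</code> </root>

# The result still parses, but the text is no longer what was written
before = ET.fromstring(xml_str).find('code').text
after = ET.fromstring(collapsed).find('code').text
print(before == after)  # False
```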

Answer by Thanatos

Whitespace is significant within an XML document. Using whitespace for indentation is a poor use of XML, as it introduces significant data where there really is none; sadly, this is the norm. Any programmatic approach you take to stripping out whitespace will be, at best, a guess. You need better knowledge of what the XML is conveying to properly remove whitespace without stepping on some piece of data's toes.
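A quick way to see this with the standard library (a sketch assuming Python 3 and ElementTree; the variable names are only for illustration): the indentation shows up in the parsed tree as ordinary text, and the parser has no way of knowing that it is not data.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('''<root>
    <head></head>
</root>''')

# The newline and indentation before <head> are stored as root's text
root_text = root.text
# The newline after </head> is stored as head's tail
head_tail = root[0].tail

print(repr(root_text))  # '\n    '
print(repr(head_tail))  # '\n'
```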

Answer by Brabitom

A slightly clumsy solution without lxml :-)

data = """<root>

    <head></head>    <content></content>

</root>"""

data3 = []
data2 = data.split('\n')
for x in data2:
    y = x.strip()
    if y: data3.append(y)
data4 = ''.join(data3)
data5 = data4.replace("  ","").replace("> <","><")

print data5

Output: <root><head></head><content></content></root>

Answer by jimk

If whitespace in "non-leaf" nodes is what we're trying to remove then the following function will do it (recursively if specified):

from xml.dom import Node

def stripNode(node, recurse=False):
    nodesToRemove = []
    nodeToBeStripped = False

    for childNode in node.childNodes:
        # list empty text nodes (to remove if any should be)
        if (childNode.nodeType == Node.TEXT_NODE and childNode.nodeValue.strip() == ""):
            nodesToRemove.append(childNode)

        # only remove empty text nodes if not a leaf node (i.e. a child element exists)
        if childNode.nodeType == Node.ELEMENT_NODE:
            nodeToBeStripped = True

    # remove flagged text nodes
    if nodeToBeStripped:
        for childNode in nodesToRemove:
            node.removeChild(childNode)

    # recurse if specified
    if recurse:
        for childNode in node.childNodes:
            stripNode(childNode, True)

However, Thanatos is correct. Whitespace can represent data in XML so use with caution.

Answer by jimk

Here's something quick I came up with because I didn't want to use lxml:

from xml.dom import minidom
from xml.dom.minidom import Node

def remove_blanks(node):
    for x in node.childNodes:
        if x.nodeType == Node.TEXT_NODE:
            if x.nodeValue:
                x.nodeValue = x.nodeValue.strip()
        elif x.nodeType == Node.ELEMENT_NODE:
            remove_blanks(x)

xml = minidom.parse('file.xml')
remove_blanks(xml)
xml.normalize()
with open('file.xml', 'w') as result:
    result.write(xml.toprettyxml(indent = '  '))

I really only needed this to re-indent an XML file whose indentation was otherwise broken. It doesn't respect the preserve directive but, honestly, neither does a lot of other software that deals with XML, so it's rather a funny requirement :) Also, you could easily add that sort of functionality to the code above (just check for the space attribute, and don't recurse if its value is 'preserve').
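One possible way to add that check, as a sketch rather than a tested drop-in. It assumes Python 3 and that minidom exposes the attribute under its qualified name 'xml:space'. It ignores the case where a nested element switches back with xml:space="default", and the sample document is made up for illustration:

```python
from xml.dom import minidom
from xml.dom.minidom import Node

def remove_blanks(node):
    for child in node.childNodes:
        if child.nodeType == Node.TEXT_NODE:
            if child.nodeValue:
                child.nodeValue = child.nodeValue.strip()
        elif child.nodeType == Node.ELEMENT_NODE:
            # Skip the whole subtree when the element asks to keep whitespace
            if child.getAttribute('xml:space') != 'preserve':
                remove_blanks(child)

doc = minidom.parseString(
    '<root> <a> x </a> <b xml:space="preserve"> y </b> </root>')
remove_blanks(doc)
doc.normalize()  # drops the text nodes that were stripped to empty strings
cleaned = doc.documentElement.toxml()
print(cleaned)  # <root><a>x</a><b xml:space="preserve"> y </b></root>
```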

Answer by cmelx

import re

xmlStr = xmlDocument.toprettyxml(indent='\t', newl='\n', encoding='UTF-8')
fix = re.compile(r'((?<=>)(\n[\t]*)(?=[^<\t]))|(?<=[^>\t])(\n[\t]*)(?=<)')
newXmlStr = re.sub(fix, '', xmlStr )

from this source

Answer by Steve Goossens

The only thing that bothers me about xml.dom.minidom's toprettyxml() is that it adds blank lines. I don't seem to get the split components, so I just wrote a simple function to remove the blank lines:

#!/usr/bin/env python

import xml.dom.minidom

# toprettyxml() without the blank lines
def prettyPrint(x):
    for line in x.toprettyxml().split('\n'):
        if not line.strip() == '':
            print line

xml_string = "<monty>\n<example>something</example>\n<python>parrot</python>\n</monty>"

# parse XML
x = xml.dom.minidom.parseString(xml_string)

# clean
prettyPrint(x)

And this is what the code outputs:

<?xml version="1.0" ?>
<monty>
        <example>something</example>
        <python>parrot</python>
</monty>

If I use toprettyxml() by itself, i.e. print(x.toprettyxml()), it adds unnecessary blank lines:

<?xml version="1.0" ?>
<monty>


        <example>something</example>


        <python>parrot</python>


</monty>
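If you would rather get the cleaned result back as a string than print it line by line, the same idea can be wrapped up as below. This is a sketch assuming Python 3; the function name pretty_without_blank_lines and its indent parameter are made up for illustration:

```python
import xml.dom.minidom

def pretty_without_blank_lines(xml_string, indent='\t'):
    """Hypothetical helper: pretty-print, then drop whitespace-only lines."""
    doc = xml.dom.minidom.parseString(xml_string)
    pretty = doc.toprettyxml(indent=indent)
    return '\n'.join(line for line in pretty.split('\n') if line.strip())

xml_string = "<monty>\n<example>something</example>\n<python>parrot</python>\n</monty>"
pretty = pretty_without_blank_lines(xml_string)
print(pretty)
```

Like the original, this filters the serialized text rather than the DOM, so it would also drop blank lines that occur inside multi-line text content.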