Python lxml etree xmlparser: remove unwanted namespace

Question

Asked by Mark

I have an XML document that I am trying to parse using lxml.etree:

<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header>
    <Version>1</Version>
  </Header>
  <Body>
    some stuff
  </Body>
</Envelope>

My code is:

path = "path to xml file"
from lxml import etree as ET
parser = ET.XMLParser(ns_clean=True)
dom = ET.parse(path, parser)
dom.getroot()

When I call dom.getroot() I get:

<Element {http://www.example.com/zzz/yyy}Envelope at 28adacac>

However, I only want:

<Element Envelope at 28adacac>

When I do

dom.getroot().find("Body")

I get nothing returned. However, when I do

dom.getroot().find("{http://www.example.com/zzz/yyy}Body")

I get a result.

I thought passing ns_clean=True to the parser would prevent this.

Any ideas?

Answer 1

Accepted answer by unutbu

import io
import lxml.etree as ET

# A bytes literal, so that io.BytesIO accepts it on Python 3
content = b'''\
<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header>
    <Version>1</Version>
  </Header>
  <Body>
    some stuff
  </Body>
</Envelope>
'''
dom = ET.parse(io.BytesIO(content))

You can find namespace-aware nodes using the xpath method:

body=dom.xpath('//ns:Body',namespaces={'ns':'http://www.example.com/zzz/yyy'})
print(body)
# [<Element {http://www.example.com/zzz/yyy}Body at 90b2d4c>]
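
The same prefix mapping also works with find(), findall() and findtext(), which accept a namespaces argument in lxml. A minimal sketch, assuming the sample document from the question; the prefix "ns" and the variable name NS are arbitrary choices made for illustration:

```python
import io
import lxml.etree as ET

content = b'''\
<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header>
    <Version>1</Version>
  </Header>
  <Body>
    some stuff
  </Body>
</Envelope>
'''
root = ET.parse(io.BytesIO(content)).getroot()

# Hypothetical prefix map: "ns" is just a label, only the URI has to match.
NS = {"ns": "http://www.example.com/zzz/yyy"}

# Every step of the path needs the prefix, since all elements are namespaced.
body = root.find("ns:Body", namespaces=NS)
version = root.findtext("ns:Header/ns:Version", namespaces=NS)

print(body.tag)
print(version)
```

This keeps the document untouched, which is usually preferable when the XML has to be written back out in its original namespace.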

If you really want to remove namespaces, you could use an XSL transformation:

# http://wiki.tei-c.org/index.php/Remove-Namespaces.xsl
xslt = b'''<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="no"/>

<xsl:template match="/|comment()|processing-instruction()">
    <xsl:copy>
      <xsl:apply-templates/>
    </xsl:copy>
</xsl:template>

<xsl:template match="*">
    <xsl:element name="{local-name()}">
      <xsl:apply-templates select="@*|node()"/>
    </xsl:element>
</xsl:template>

<xsl:template match="@*">
    <xsl:attribute name="{local-name()}">
      <xsl:value-of select="."/>
    </xsl:attribute>
</xsl:template>
</xsl:stylesheet>
'''

xslt_doc=ET.parse(io.BytesIO(xslt))
transform=ET.XSLT(xslt_doc)
dom=transform(dom)

Here we see the namespace has been removed:

print(ET.tostring(dom).decode())
# <Envelope>
#   <Header>
#     <Version>1</Version>
#   </Header>
#   <Body>
#     some stuff
#   </Body>
# </Envelope>

So you can now find the Body node this way:

print(dom.find("Body"))
# <Element Body at 8506cd4>
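
If an XSL transformation feels heavy, a commonly used alternative is to rewrite the tags in place. This is not part of the answer above, only a sketch under the assumption that the document can be modified: it replaces each element's tag with its local name via ET.QName and then calls ET.cleanup_namespaces to drop declarations that are no longer used. The helper name strip_namespaces is hypothetical, and the loop does not touch namespaced attributes.

```python
import lxml.etree as ET

content = b'''\
<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header>
    <Version>1</Version>
  </Header>
  <Body>
    some stuff
  </Body>
</Envelope>
'''


def strip_namespaces(root):
    """Hypothetical helper: rename every element to its local name, in place."""
    for elem in root.iter():
        # Comments and processing instructions have a non-string tag; skip them.
        if isinstance(elem.tag, str):
            elem.tag = ET.QName(elem).localname
    # Remove namespace declarations that no element refers to any more.
    ET.cleanup_namespaces(root)
    return root


root = strip_namespaces(ET.fromstring(content))
body = root.find("Body")
print(root.tag)
print(body.tag)
```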

Answer 2

Answer by robert

You're showing the result of the repr() call. When you programmatically move through the tree, you can simply choose to ignore the namespace.
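
One way to read "ignore the namespace" is to compare local names while walking the tree. The sketch below assumes the sample document from the question and uses ET.QName to split off the namespace part of each tag; the helper name children_named is hypothetical.

```python
import lxml.etree as ET

content = b'''\
<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header>
    <Version>1</Version>
  </Header>
  <Body>
    some stuff
  </Body>
</Envelope>
'''
root = ET.fromstring(content)


def children_named(parent, local_name):
    """Hypothetical helper: direct children whose local name matches."""
    return [
        child
        for child in parent
        # Skip comments and processing instructions, whose tag is not a string.
        if isinstance(child.tag, str)
        and ET.QName(child).localname == local_name
    ]


bodies = children_named(root, "Body")
print(ET.QName(root).localname)
print(bodies[0].text.strip())
```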

Answer 3

Answer by dusan

Try using XPath:

dom.xpath("//*[local-name() = 'Body']")

Taken (and simplified) from this page, under the "The xpath() method" section.
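
For completeness, a self-contained version of that query, assuming the sample document from the question. The local-name() function returns the tag without its namespace, so the predicate matches Body regardless of which namespace it lives in:

```python
import lxml.etree as ET

content = b'''\
<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header>
    <Version>1</Version>
  </Header>
  <Body>
    some stuff
  </Body>
</Envelope>
'''
root = ET.fromstring(content)

# xpath() always returns a list for a node-set expression, even for one match.
matches = root.xpath("//*[local-name() = 'Body']")
for elem in matches:
    # The element itself still carries its namespace in .tag.
    print(elem.tag, elem.text.strip())
```

Note that the matched elements keep their namespace; only the lookup ignores it.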

Answer 4

Answer by Andrei

The last solution from https://bitbucket.org/olauzanne/pyquery/issue/17 can help you avoid namespaces with little effort:

Apply xml.replace(' xmlns:', ' xmlnamespace:') to your XML before using pyquery, so that lxml will ignore namespaces.

In your case, try xml.replace(' xmlns="', ' xmlnamespace="'). However, you might need something more complex if that string can also appear in the element content.
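
A minimal sketch of this string-replacement workaround applied to the question's document, assuming the XML is available as a str in a variable named xml (a name chosen here for illustration). After the replacement the former namespace declaration is an ordinary attribute, so the elements parse without a namespace:

```python
import lxml.etree as ET

xml = '''\
<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header>
    <Version>1</Version>
  </Header>
  <Body>
    some stuff
  </Body>
</Envelope>
'''

# Rename the default namespace declaration so the parser sees a plain attribute.
# Caveat from the answer: this also rewrites any matching text in element content.
patched = xml.replace(' xmlns="', ' xmlnamespace="')

root = ET.fromstring(patched)
body = root.find("Body")
print(root.tag)
print(root.get("xmlnamespace"))
```

This is a blunt textual edit rather than a namespace-aware operation, so it is best kept to documents whose content is known not to contain the replaced string.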
