Python: how do I use XML namespaces with find/findall in lxml?

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/4210730/

Date: 2020-08-18 14:46:59  Source: igfitidea

How do I use xml namespaces with find/findall in lxml?

Tags: python, xml, lxml, xml-namespaces, elementtree

Asked by saffsd

I'm trying to parse content in an OpenOffice ODS spreadsheet. The ods format is essentially just a zipfile with a number of documents. The content of the spreadsheet is stored in 'content.xml'.

import zipfile
from lxml import etree

zf = zipfile.ZipFile('spreadsheet.ods')
root = etree.parse(zf.open('content.xml'))

The content of the spreadsheet is in a table:

table = root.find('.//{urn:oasis:names:tc:opendocument:xmlns:table:1.0}table')

We can also go straight for the rows:

rows = root.findall('.//{urn:oasis:names:tc:opendocument:xmlns:table:1.0}table-row')

The individual elements know about the namespaces:

>>> table.nsmap['table']
'urn:oasis:names:tc:opendocument:xmlns:table:1.0'

How do I use the namespaces directly in find/findall?

The obvious solution does not work.

Trying to get the rows from the table:

>>> root.findall('.//table:table')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pyx", line 1792, in lxml.etree._ElementTree.findall (src/lxml/lxml.etree.c:41770)
  File "lxml.etree.pyx", line 1297, in lxml.etree._Element.findall (src/lxml/lxml.etree.c:37027)
  File "/usr/lib/python2.6/dist-packages/lxml/_elementpath.py", line 225, in findall
    return list(iterfind(elem, path))
  File "/usr/lib/python2.6/dist-packages/lxml/_elementpath.py", line 200, in iterfind
    selector = _build_path_iterator(path)
  File "/usr/lib/python2.6/dist-packages/lxml/_elementpath.py", line 184, in _build_path_iterator
    selector.append(ops[token[0]](_next, token))
KeyError: ':'

Accepted answer by jfs

If root.nsmap contains the table namespace prefix, then you could:

root.xpath('.//table:table', namespaces=root.nsmap)

findall(path) accepts the {namespace}name syntax instead of namespace:name. Therefore, path should be preprocessed with a namespace dictionary into the {namespace}name form before it is passed to findall().
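
The preprocessing step described here could be sketched as follows. This is only an illustration, not part of the original answer: expand_path is a hypothetical helper name, it handles plain prefix:name steps only (paths that already contain {namespace}name parts are not supported), and the sample document is parsed with the standard library's xml.etree.ElementTree so the snippet runs without extra dependencies. lxml's findall() accepts the same {namespace}name paths.

```python
import re
import xml.etree.ElementTree as ET

# Matches "prefix:name" steps. Assumption: the path contains no
# {uri}name parts yet, since a URI would confuse this simple pattern.
_QNAME = re.compile(r'([A-Za-z_][\w.-]*):([A-Za-z_][\w.-]*)')


def expand_path(path, nsmap):
    """Rewrite prefix:name steps to {uri}name using nsmap (prefix -> URI).

    Raises KeyError if the path uses a prefix that nsmap does not define.
    """
    def repl(match):
        prefix, local = match.groups()
        return '{%s}%s' % (nsmap[prefix], local)
    return _QNAME.sub(repl, path)


TABLE_NS = 'urn:oasis:names:tc:opendocument:xmlns:table:1.0'
nsmap = {'table': TABLE_NS}

# A tiny stand-in for the table element found in content.xml.
doc = ET.fromstring(
    '<table:table xmlns:table="%s">'
    '<table:table-row/><table:table-row/>'
    '</table:table>' % TABLE_NS
)

expanded = expand_path('table:table-row', nsmap)
rows = doc.findall(expanded)
```

With the nsmap taken from the real document, the same call would look like root.findall(expand_path('.//table:table-row', nsmap)).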

Answer by ChrisR

Here's a way to get all the namespaces in the XML document (assuming there is no prefix conflict).

I use this when parsing XML documents where I do not know the namespace URLs in advance, only the prefixes.

from lxml import etree

doc = etree.XML(XML_string)  # XML_string holds the document source

# Collect every namespace declared anywhere in the document.
nsmap = {}
for prefix, uri in doc.xpath('//namespace::*'):
    if prefix:  # Skip the None (default) namespace, neither needed nor supported.
        nsmap[prefix] = uri
doc.xpath('//prefix:element', namespaces=nsmap)
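
As an alternative to the xpath namespace axis, the same prefix map can be gathered while the document is being parsed, using the 'start-ns' events of iterparse. The sketch below is not ChrisR's method: it uses the standard library's xml.etree.ElementTree so it runs without lxml, and collect_namespaces, XML_BYTES and the urn:example:* URIs are made-up names for illustration. Like the answer above, it assumes there is no prefix conflict.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document with a default and a prefixed namespace.
XML_BYTES = (
    b'<doc xmlns="urn:example:default" xmlns:t="urn:example:table">'
    b'<t:table><t:row/><t:row/></t:table>'
    b'</doc>'
)


def collect_namespaces(xml_bytes):
    """Return a prefix -> URI dict of every prefixed namespace declaration."""
    nsmap = {}
    # Each 'start-ns' event carries a (prefix, uri) pair.
    events = ET.iterparse(io.BytesIO(xml_bytes), events=('start-ns',))
    for _event, (prefix, uri) in events:
        if prefix:  # skip the default namespace, which has an empty prefix here
            nsmap[prefix] = uri  # a later declaration of the same prefix wins
    return nsmap


nsmap = collect_namespaces(XML_BYTES)
doc = ET.fromstring(XML_BYTES)

# find/findall take the prefix map as their second argument.
rows = doc.findall('.//t:row', nsmap)
```

This reads the input twice (once for the namespaces, once for the tree), which keeps the sketch short but may not suit very large files.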

Answer by RockyRoad

Maybe the first thing to notice is that the namespaces are defined at Element level, not Document level.

Most often though, all namespaces are declared in the document's root element (office:document-content here), which saves us parsing the whole document to collect inner xmlns scopes.

An element's nsmap then includes:

  • a default namespace, with the None prefix (not always)
  • all ancestors' namespaces, unless overridden.

If, as ChrisR mentioned, the default namespace is not supported, you can use a dict comprehension to filter it out in a more compact expression.
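
A small self-contained sketch of these points, assuming lxml is installed. The inline document and the urn:example:* URIs are invented for illustration. It shows a child element inheriting its ancestors' namespaces through nsmap, and the dict comprehension that drops the None prefix.

```python
from lxml import etree

# Hypothetical document: a default namespace plus a "t" prefix on the root.
doc = etree.fromstring(
    b'<doc xmlns="urn:example:default" xmlns:t="urn:example:table">'
    b'<t:table><t:row/><t:row/></t:table>'
    b'</doc>'
)

table = doc[0]
# <t:table> declares nothing itself but inherits both namespaces from <doc>.
inherited = table.nsmap

# Drop the default namespace (None prefix), which xpath() does not accept.
nsmap = {prefix: uri for prefix, uri in doc.nsmap.items() if prefix}

# xpath() takes the map as a keyword argument ...
rows_xpath = doc.xpath('//t:row', namespaces=nsmap)
# ... while find/findall accept it as the second argument.
rows_find = doc.findall('.//t:row', nsmap)
```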

The syntax differs slightly between xpath and ElementPath (find/findall).



So here's the code you could use to get all the rows of your first table (tested with lxml 3.4.2):

import zipfile
from lxml import etree

# Open and parse the document
zf = zipfile.ZipFile('spreadsheet.ods')
tree = etree.parse(zf.open('content.xml'))

# Get the root element
root = tree.getroot()

# get its namespace map, excluding the default namespace (None prefix)
nsmap = {k: v for k, v in root.nsmap.items() if k}  # items() works on Python 2 and 3

# use defined prefixes to access elements
table = tree.find('.//table:table', nsmap)
rows = table.findall('table:table-row', nsmap)

# or, if xpath is needed:
table = tree.xpath('//table:table', namespaces=nsmap)[0]
rows = table.xpath('table:table-row', namespaces=nsmap)

Answer by dsummersl

Etree won't find namespaced elements if there are no xmlns definitions in the XML file. For instance:

import lxml.etree as etree

xml_doc = '<ns:root><ns:child></ns:child></ns:root>'

tree = etree.fromstring(xml_doc)

# finds nothing:
tree.find('.//ns:root', {'ns': 'foo'})
tree.find('.//{foo}root', {'ns': 'foo'})
tree.find('.//ns:root')

Sometimes that is the data you are given. So, what can you do when there is no namespace?

My solution: add one.

import lxml.etree as etree

xml_doc = '<ns:root><ns:child></ns:child></ns:root>'
xml_doc_with_ns = '<ROOT xmlns:ns="foo">%s</ROOT>' % xml_doc

tree = etree.fromstring(xml_doc_with_ns)

# finds what you're looking for:
tree.find('.//{foo}root')
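
The same wrapping trick can be tried without lxml. The sketch below uses the standard library's xml.etree.ElementTree (an assumption made so the example has no dependencies), and the flag name parsed_unwrapped is invented for illustration. With this parser the unwrapped fragment is rejected outright rather than parsed; behaviour in lxml may differ depending on version and parser options.

```python
import xml.etree.ElementTree as ET

xml_doc = '<ns:root><ns:child></ns:child></ns:root>'

# Parsing the fragment as-is fails here because the "ns" prefix is unbound.
try:
    ET.fromstring(xml_doc)
    parsed_unwrapped = True
except ET.ParseError:
    parsed_unwrapped = False

# Wrap the fragment in an element that declares the prefix.
# "ROOT" and the URI "foo" are arbitrary, as in the answer above.
wrapped = '<ROOT xmlns:ns="foo">%s</ROOT>' % xml_doc
tree = ET.fromstring(wrapped)

# Both notations now resolve to the same namespace.
root = tree.find('.//{foo}root')
child = tree.find('.//ns:child', {'ns': 'foo'})
```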