Python: How do I use XML namespaces with find/findall in lxml?

Question

Asked by saffsd

I'm trying to parse content in an OpenOffice ODS spreadsheet. The ods format is essentially just a zipfile with a number of documents. The content of the spreadsheet is stored in 'content.xml'.

import zipfile
from lxml import etree

zf = zipfile.ZipFile('spreadsheet.ods')
root = etree.parse(zf.open('content.xml'))

The content of the spreadsheet is in a cell:

table = root.find('.//{urn:oasis:names:tc:opendocument:xmlns:table:1.0}table')

We can also go straight for the rows:

rows = root.findall('.//{urn:oasis:names:tc:opendocument:xmlns:table:1.0}table-row')

The individual elements know about the namespaces:

>>> table.nsmap['table']
'urn:oasis:names:tc:opendocument:xmlns:table:1.0'

How do I use the namespaces directly in find/findall?

The obvious solution does not work.

Trying to get the rows from the table:

>>> root.findall('.//table:table')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pyx", line 1792, in lxml.etree._ElementTree.findall (src/lxml/lxml.etree.c:41770)
  File "lxml.etree.pyx", line 1297, in lxml.etree._Element.findall (src/lxml/lxml.etree.c:37027)
  File "/usr/lib/python2.6/dist-packages/lxml/_elementpath.py", line 225, in findall
    return list(iterfind(elem, path))
  File "/usr/lib/python2.6/dist-packages/lxml/_elementpath.py", line 200, in iterfind
    selector = _build_path_iterator(path)
  File "/usr/lib/python2.6/dist-packages/lxml/_elementpath.py", line 184, in _build_path_iterator
    selector.append(ops[token[0]](_next, token))
KeyError: ':'

Answer 1

Accepted answer by jfs

If root.nsmap contains the table namespace prefix, then you could:

root.xpath('.//table:table', namespaces=root.nsmap)

findall(path) accepts the {namespace}name syntax instead of namespace:name. Therefore, path should be preprocessed into the {namespace}name form, using a namespace dictionary, before it is passed to findall().
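
The preprocessing step described above could look something like the sketch below. expand_prefixes is a hypothetical helper, not part of lxml. It assumes prefixes and local names contain only letters, digits, underscores, dots and hyphens, and it leaves already-expanded {uri} parts and quoted predicate values untouched.

```python
import re

# Hypothetical helper for illustration; lxml does not ship this function.
_TOKEN = re.compile(
    r"\{[^}]*\}"                                # already-expanded {uri}: keep as-is
    r"|'[^']*'|\"[^\"]*\""                      # quoted literal in a predicate: keep as-is
    r"|([A-Za-z_][\w.-]*):([A-Za-z_][\w.-]*)"   # prefix:name, the part to rewrite
)


def expand_prefixes(path, nsmap):
    """Rewrite every prefix:name step in `path` to the {uri}name form."""
    def replace(match):
        prefix, name = match.group(1), match.group(2)
        if prefix is None:
            # One of the "keep as-is" alternatives matched.
            return match.group(0)
        # An unknown prefix raises KeyError rather than silently matching nothing.
        return "{%s}%s" % (nsmap[prefix], name)

    return _TOKEN.sub(replace, path)


ns = {"table": "urn:oasis:names:tc:opendocument:xmlns:table:1.0"}
path = expand_prefixes(".//table:table/table:table-row", ns)
print(path)
```

The resulting string can then be passed to find() or findall() in the usual way.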

Answer 2

Answer by ChrisR

Here's a way to get all the namespaces in the XML document (and supposing there's no prefix conflict).

I use this when parsing XML documents where I know only the prefixes in advance, not the namespace URLs.

doc = etree.XML(XML_string)

# Getting all the namespaces.
nsmap = {}
for ns in doc.xpath('//namespace::*'):
    if ns[0]:  # Removes the None namespace, neither needed nor supported.
        nsmap[ns[0]] = ns[1]
doc.xpath('//prefix:element', namespaces=nsmap)
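
As a rough alternative sketch, the same prefix map can also be gathered while parsing, through the start-ns events of iterparse. The snippet falls back to the standard library's xml.etree.ElementTree when lxml is not installed. SAMPLE and collect_prefixes are made-up names for illustration, and like the XPath version this assumes that no prefix is bound to two different URIs.

```python
import io

try:
    from lxml import etree
except ImportError:  # fall back to the standard library
    import xml.etree.ElementTree as etree

# Made-up sample document: one default namespace and one prefixed namespace.
SAMPLE = (
    b'<doc xmlns="urn:example:default" xmlns:t="urn:example:table">'
    b'<t:table><t:row/><t:row/></t:table>'
    b'</doc>'
)


def collect_prefixes(xml_bytes):
    """Return {prefix: uri} for every prefixed namespace declaration."""
    nsmap = {}
    source = io.BytesIO(xml_bytes)
    # Each "start-ns" event carries a (prefix, uri) pair.
    for _event, (prefix, uri) in etree.iterparse(source, events=("start-ns",)):
        if prefix:  # skip the default namespace, which has no usable prefix
            nsmap.setdefault(prefix, uri)  # the first declaration wins
    return nsmap


nsmap = collect_prefixes(SAMPLE)
root = etree.fromstring(SAMPLE)
rows = root.findall(".//t:row", nsmap)
```

Compared with the //namespace::* query, this works in a single streaming pass, but it does mean reading the document once more if it has already been parsed.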

Answer 3

Answer by RockyRoad

Maybe the first thing to notice is that the namespaces are defined at Element level, not Document level.

Most often, though, all namespaces are declared in the document's root element (office:document-content here), which saves us from parsing the whole document to collect inner xmlns scopes.

An element's nsmap then includes:

a default namespace, with the None prefix (not always present)
all ancestors' namespaces, unless overridden.

If, as ChrisR mentioned, the default namespace is not supported, you can use a dict comprehension to filter it out in a more compact expression.

You have a slightly different syntax for xpath and ElementPath.

So here's the code you could use to get all of your first table's rows (tested with lxml 3.4.2):

import zipfile
from lxml import etree

# Open and parse the document
zf = zipfile.ZipFile('spreadsheet.ods')
tree = etree.parse(zf.open('content.xml'))

# Get the root element
root = tree.getroot()

# get its namespace map, excluding default namespace
nsmap = {k: v for k, v in root.nsmap.items() if k}

# use defined prefixes to access elements
table = tree.find('.//table:table', nsmap)
rows = table.findall('table:table-row', nsmap)

# or, if xpath is needed:
table = tree.xpath('//table:table', namespaces=nsmap)[0]
rows = table.xpath('table:table-row', namespaces=nsmap)

Answer 4

Answer by dsummersl

Etree won't find namespaced elements if there are no xmlns definitions in the XML file. For instance:

import lxml.etree as etree

xml_doc = '<ns:root><ns:child></ns:child></ns:root>'

tree = etree.fromstring(xml_doc)

# finds nothing:
tree.find('.//ns:root', {'ns': 'foo'})
tree.find('.//{foo}root', {'ns': 'foo'})
tree.find('.//ns:root')

Sometimes that is the data you are given. So, what can you do when there is no namespace?

My solution: add one.

import lxml.etree as etree

xml_doc = '<ns:root><ns:child></ns:child></ns:root>'
xml_doc_with_ns = '<ROOT xmlns:ns="foo">%s</ROOT>' % xml_doc

tree = etree.fromstring(xml_doc_with_ns)

# finds what you're looking for:
tree.find('.//{foo}root')
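
The wrapping trick can be packaged as a small function. This is only a sketch: parse_with_declared_prefixes and the wrapper element name are hypothetical, the fragment is assumed to carry no XML declaration of its own, the URIs are assumed to contain no quotes or ampersands, and the import falls back to xml.etree.ElementTree if lxml is unavailable.

```python
try:
    from lxml import etree
except ImportError:  # fall back to the standard library
    import xml.etree.ElementTree as etree


def parse_with_declared_prefixes(fragment, nsmap, wrapper="ROOT"):
    """Wrap `fragment` in an element that declares every prefix in `nsmap`."""
    declarations = " ".join(
        'xmlns:%s="%s"' % (prefix, uri) for prefix, uri in sorted(nsmap.items())
    )
    wrapped = "<%s %s>%s</%s>" % (wrapper, declarations, fragment, wrapper)
    return etree.fromstring(wrapped)


prefixes = {"ns": "foo"}
tree = parse_with_declared_prefixes(
    "<ns:root><ns:child></ns:child></ns:root>", prefixes
)

# Both lookup styles now work on the wrapped tree.
found = tree.find(".//{foo}root")
child = tree.find(".//ns:child", prefixes)
```

Note that the returned element is the wrapper, so the original root is one level down.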
