Original question: http://stackoverflow.com/questions/14853243/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Parsing XML with namespace in Python via 'ElementTree'
Asked by Sudar
I have the following XML which I want to parse using Python's ElementTree:
<rdf:RDF xml:base="http://dbpedia.org/ontology/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:owl="http://www.w3.org/2002/07/owl#"
xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns="http://dbpedia.org/ontology/">
<owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
<rdfs:label xml:lang="en">basketball league</rdfs:label>
<rdfs:comment xml:lang="en">
a group of sports teams that compete against each other
in Basketball
</rdfs:comment>
</owl:Class>
</rdf:RDF>
I want to find all owl:Class tags and then extract the value of all rdfs:label instances inside them. I am using the following code:
import xml.etree.ElementTree as ET
tree = ET.parse("filename")
root = tree.getroot()
root.findall('owl:Class')
Because of the namespace, I am getting the following error.
SyntaxError: prefix 'owl' not found in prefix map
I tried reading the document at http://effbot.org/zone/element-namespaces.htm but I am still not able to get this working, since the above XML has multiple nested namespaces.
Kindly let me know how to change the code to find all the owl:Class tags.
Accepted answer by Martijn Pieters
ElementTree is not too smart about namespaces. You need to give the .find(), findall() and iterfind() methods an explicit namespace dictionary. This is not documented very well:
namespaces = {'owl': 'http://www.w3.org/2002/07/owl#'} # add more as needed
root.findall('owl:Class', namespaces)
Prefixes are only looked up in the namespaces parameter you pass in. This means you can use any namespace prefix you like; the API splits off the owl: part, looks up the corresponding namespace URL in the namespaces dictionary, then changes the search to look for the XPath expression {http://www.w3.org/2002/07/owl#}Class instead. You can use the same syntax yourself too, of course:
root.findall('{http://www.w3.org/2002/07/owl#}Class')
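Putting the two forms together, here is a minimal sketch that runs against a trimmed copy of the XML from the question and pulls out the rdfs:label text. The variable names (XML, labels, clark) are only for illustration, and the XML is parsed from a string rather than from "filename":

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the XML from the question.
XML = """<rdf:RDF xml:base="http://dbpedia.org/ontology/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns="http://dbpedia.org/ontology/">
  <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
    <rdfs:label xml:lang="en">basketball league</rdfs:label>
  </owl:Class>
</rdf:RDF>"""

# Prefixes here are our own choice; only the URIs have to match the document.
namespaces = {
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}

root = ET.fromstring(XML)

# Prefixed search: every owl:Class child, then every rdfs:label inside it.
labels = []
for cls in root.findall("owl:Class", namespaces):
    for label in cls.findall("rdfs:label", namespaces):
        labels.append(label.text)

# The same search in {uri}tag form, which needs no dictionary.
clark = root.findall("{http://www.w3.org/2002/07/owl#}Class")

print(labels)
print(len(clark))
```

Both searches find the same single owl:Class element; the dictionary form is mostly easier to read once more than one namespace is involved.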
If you can switch to the lxml library, things are better; that library supports the same ElementTree API, but collects namespaces for you in a .nsmap attribute on elements.
Answer by Brad Dre
Here's how to do this with lxml without having to hard-code the namespaces or scan the text for them (as Martijn Pieters mentions):
from lxml import etree
tree = etree.parse("filename")
root = tree.getroot()
root.findall('owl:Class', root.nsmap)
UPDATE:
5 years later I'm still running into variations of this issue. lxml helps as I showed above, but not in every case. The commenters may have a valid point regarding this technique when it comes to merging documents, but I think most people are having difficulty simply searching documents.
Here's another case and how I handled it:
<?xml version="1.0" ?><Tag1 xmlns="http://www.mynamespace.com/prefix">
<Tag2>content</Tag2></Tag1>
An xmlns attribute without a prefix means that unprefixed tags get this default namespace. This means that when you search for Tag2, you need to include the namespace to find it. However, lxml creates an nsmap entry with None as the key, and I couldn't find a way to search for it. So, I created a new namespace dictionary like this:
namespaces = {}
# response uses a default namespace, and tags don't mention it
# create a new ns map using an identifier of our choice
for k, v in root.nsmap.items():  # .iteritems() on Python 2
    if not k:
        namespaces['myprefix'] = v
e = root.find('myprefix:Tag2', namespaces)
Answer by Davide Brunato
Note: This is an answer useful for Python's ElementTree standard library without using hardcoded namespaces.
To extract the namespace prefixes and URIs from XML data, you can use the ElementTree.iterparse function, parsing only namespace start events (start-ns):
>>> from io import StringIO
>>> from xml.etree import ElementTree
>>> my_schema = u'''<rdf:RDF xml:base="http://dbpedia.org/ontology/"
... xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
... xmlns:owl="http://www.w3.org/2002/07/owl#"
... xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
... xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
... xmlns="http://dbpedia.org/ontology/">
...
... <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
... <rdfs:label xml:lang="en">basketball league</rdfs:label>
... <rdfs:comment xml:lang="en">
... a group of sports teams that compete against each other
... in Basketball
... </rdfs:comment>
... </owl:Class>
...
... </rdf:RDF>'''
>>> my_namespaces = dict([
... node for _, node in ElementTree.iterparse(
... StringIO(my_schema), events=['start-ns']
... )
... ])
>>> from pprint import pprint
>>> pprint(my_namespaces)
{'': 'http://dbpedia.org/ontology/',
'owl': 'http://www.w3.org/2002/07/owl#',
'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
'rdfs': 'http://www.w3.org/2000/01/rdf-schema#',
'xsd': 'http://www.w3.org/2001/XMLSchema#'}
Then the dictionary can be passed as argument to the search functions:
root.findall('owl:Class', my_namespaces)
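As a script rather than an interpreter session, the whole round trip might look like the sketch below. It reuses a trimmed copy of the question's XML; the names XML, ns and labels are illustrative. One caveat worth keeping in mind: if a document redeclares the same prefix with a different URI further down, a plain dict keeps only the last declaration.

```python
from io import StringIO
import xml.etree.ElementTree as ET

# Trimmed copy of the XML from the question.
XML = """<rdf:RDF xml:base="http://dbpedia.org/ontology/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns="http://dbpedia.org/ontology/">
  <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
    <rdfs:label xml:lang="en">basketball league</rdfs:label>
  </owl:Class>
</rdf:RDF>"""

# Each 'start-ns' event yields a (prefix, uri) tuple; collect them into a dict.
ns = dict(node for _, node in ET.iterparse(StringIO(XML), events=["start-ns"]))

# Parse again to get the tree, then search with the collected prefixes.
root = ET.fromstring(XML)
labels = [
    label.text
    for cls in root.findall("owl:Class", ns)
    for label in cls.findall("rdfs:label", ns)
]

print(ns["owl"])
print(labels)
```

The document is read twice here (once for the namespaces, once for the tree), which keeps the sketch simple at the cost of some extra work on large files.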
Answer by MJM
I've been using similar code to this and have found it's always worth reading the documentation... as usual!
findall() with a plain tag name only finds elements that are direct children of the current element. So, not really ALL.
It might be worth your while to get your code working with the following instead, especially if you're dealing with big and complex XML files, so that sub-sub-elements (etc.) are also included. If you already know where the elements are in your XML, then I suppose findall() will be fine. I just thought this was worth remembering.
root.iter()
ref: https://docs.python.org/3/library/xml.etree.elementtree.html#finding-interesting-elements "Element.findall() finds only elements with a tag which are direct children of the current element. Element.find() finds the first child with a particular tag, and Element.text accesses the element's text content. Element.get() accesses the element's attributes:"
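To make the difference concrete, here is a small sketch on a made-up document (the root, group and id names are hypothetical) with one owl:Class at the top level and one nested a level deeper. Note that iter() takes no namespace dictionary, so the tag has to be given in {uri}tag form:

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one owl:Class directly under the root, one nested deeper.
XML = """<root xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class id="top"/>
  <group>
    <owl:Class id="nested"/>
  </group>
</root>"""

ns = {"owl": "http://www.w3.org/2002/07/owl#"}
root = ET.fromstring(XML)

# Plain tag name: direct children only.
direct = [e.get("id") for e in root.findall("owl:Class", ns)]

# './/' prefix: descendants at any depth.
anywhere = [e.get("id") for e in root.findall(".//owl:Class", ns)]

# iter() walks the whole subtree; the tag must be in {uri}tag form.
via_iter = [e.get("id") for e in root.iter("{http://www.w3.org/2002/07/owl#}Class")]

print(direct)
print(anywhere)
print(via_iter)
```

So findall() can search the whole subtree too, as long as the path starts with './/'.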
Answer by Bram Vanroy
To get the namespace in its namespace format, e.g. {myNameSpace}, you can do the following:
import re
root = tree.getroot()
ns = re.match(r'{.*}', root.tag).group(0)
This way, you can use it later on in your code to find nodes, e.g. using string interpolation (f-strings, Python 3.6+).
link = root.find(f"{ns}link")
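A self-contained sketch of this approach, on a made-up document (the feed element and the example.com URLs are hypothetical). It also guards against a root tag that has no namespace at all, where re.match would return None:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical document with a default namespace.
XML = """<feed xmlns="http://www.example.com/default-schema">
  <link>http://www.example.com/page</link>
</feed>"""

root = ET.fromstring(XML)

# root.tag looks like '{http://www.example.com/default-schema}feed'.
match = re.match(r"\{.*\}", root.tag)
ns = match.group(0) if match else ""  # empty string when the root has no namespace

link = root.find(f"{ns}link")

print(ns)
print(link.text)
```

This only recovers the namespace of the root element, so it fits documents where everything of interest lives in that one namespace; elements in other namespaces still need their own URI.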
Answer by peter.slizik
My solution is based on @Martijn Pieters' comment:
register_namespace only influences serialisation, not search.
So the trick here is to use different dictionaries for serialization and for searching.
namespaces = {
'': 'http://www.example.com/default-schema',
'spec': 'http://www.example.com/specialized-schema',
}
Now, register all namespaces for parsing and writing:
for name, value in namespaces.items():  # .iteritems() on Python 2
    ET.register_namespace(name, value)
For searching (find(), findall(), iterfind()) we need a non-empty prefix. Pass these functions a modified dictionary (here I modify the original dictionary, but this must be done only after the namespaces are registered).
namespaces['default'] = namespaces['']
Now, the functions from the find() family can be used with the default prefix:
print(root.find('default:myelem', namespaces))
but
tree.write(destination)
does not use any prefixes for elements in the default namespace.
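The whole idea can be sketched end to end as follows. The document, the spec:other element and the search_ns name are made up for illustration, and the tree is serialised with ET.tostring() instead of tree.write() so the result can be inspected as a string. Keep in mind that register_namespace() changes a global, process-wide mapping.

```python
import xml.etree.ElementTree as ET

DEFAULT_URI = "http://www.example.com/default-schema"
SPEC_URI = "http://www.example.com/specialized-schema"

# Hypothetical document using both namespaces.
XML = f"""<root xmlns="{DEFAULT_URI}" xmlns:spec="{SPEC_URI}">
  <myelem>hello</myelem>
  <spec:other>world</spec:other>
</root>"""

# Serialisation: register the prefixes, with '' for the default namespace.
ET.register_namespace("", DEFAULT_URI)
ET.register_namespace("spec", SPEC_URI)

# Searching: a separate dictionary with a non-empty prefix for the default namespace.
search_ns = {"default": DEFAULT_URI, "spec": SPEC_URI}

root = ET.fromstring(XML)
elem = root.find("default:myelem", search_ns)

# Elements in the default namespace are written without a prefix.
serialized = ET.tostring(root, encoding="unicode")

print(elem.text)
print(serialized)
```

Keeping the search dictionary separate from the registered prefixes avoids having to mutate one dictionary in the right order, which is the step the answer warns about.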

