Python: How to correctly parse UTF-8 XML with ElementTree?
Note: this page reproduces a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/21698024/
Asked by minerals
I need help understanding why parsing my XML file* with xml.etree.ElementTree produces the following errors.
*My test XML file contains Arabic characters.
Task: Open and parse the utf8_file.xml file.
My first try:
import codecs
import xml.etree.ElementTree as etree

with codecs.open('utf8_file.xml', 'r', encoding='utf-8') as utf8_file:
    xml_tree = etree.parse(utf8_file)
Result 1:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 236-238: ordinal not in range(128)
My second try:
import codecs
import xml.etree.ElementTree as etree

with codecs.open('utf8_file.xml', 'r', encoding='utf-8') as utf8_file:
    xml_string = etree.tostring(utf8_file, encoding='utf-8', method='xml')
    xml_tree = etree.fromstring(xml_string)
Result 2:
AttributeError: 'file' object has no attribute 'getiterator'
Please explain the errors above and comment on the possible solution.
Accepted answer by Martijn Pieters
Leave decoding the bytes to the parser; do not decode first:
import xml.etree.ElementTree as etree

# Open in binary mode so the parser receives raw, undecoded bytes
with open('utf8_file.xml', 'rb') as xml_file:
    xml_tree = etree.parse(xml_file)
An XML file must contain enough information in its first line (the XML header) for the parser to handle decoding. If that header is missing, the parser must assume UTF-8 is used.
Because it is the XML header that holds this information, it is the responsibility of the parser to do all decoding.
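To illustrate the point about the header, here is a minimal sketch that feeds the parser raw bytes in two different encodings and lets it decode each one from its own header. The sample text, the `parse_encoded` helper, and the in-memory `io.BytesIO` objects (standing in for files opened in binary mode) are assumptions made for this example; they are not part of the original question.

```python
import io
import xml.etree.ElementTree as etree

# Hypothetical sample text: the Arabic word for "hello".
text = u'\u0645\u0631\u062d\u0628\u0627'

def parse_encoded(encoding):
    """Build a small document declaring `encoding`, encode it, and parse the raw bytes."""
    document = u'<?xml version="1.0" encoding="%s"?><root>%s</root>' % (encoding, text)
    # BytesIO stands in for a file opened in binary mode.
    return etree.parse(io.BytesIO(document.encode(encoding))).getroot()

# The caller never decodes anything; the parser reads the header and decodes.
utf8_root = parse_encoded('utf-8')
utf16_root = parse_encoded('utf-16')

# No header at all: the parser falls back to UTF-8.
bare_root = etree.fromstring((u'<root>%s</root>' % text).encode('utf-8'))
```

In all three cases the element text should come back as the same Unicode string, even though the bytes on the wire differ.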
Your first attempt failed because Python was trying to encode the Unicode values again so that the parser could handle byte strings as it expected. The second attempt failed because etree.tostring() expects a parsed tree as its first argument, not a file object or a unicode string.
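For comparison with the second attempt, the following sketch shows the direction each function works in: `etree.fromstring()` takes serialized XML and returns a parsed element, while `etree.tostring()` takes a parsed element and returns serialized XML. The inline document is a hypothetical stand-in for the contents of utf8_file.xml.

```python
import xml.etree.ElementTree as etree

# Hypothetical stand-in for the bytes read from utf8_file.xml.
item_text = u'\u0645\u0631\u062d\u0628\u0627'
raw = (u'<root><item>%s</item></root>' % item_text).encode('utf-8')

# Bytes in, parsed Element out.
root = etree.fromstring(raw)

# Parsed Element in, encoded bytes out.
xml_bytes = etree.tostring(root, encoding='utf-8', method='xml')

# The serialized bytes can be parsed again without any manual decoding.
reparsed = etree.fromstring(xml_bytes)
```

Passing an open file to `etree.tostring()`, as in the second attempt, skips the parsing step that produces the element it needs.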

