Original question: http://stackoverflow.com/questions/1139090/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Alternative XML parser for ElementTree to ease UTF-8 woes?
Asked by Kekoa
I am parsing some XML with the elementtree.parse() function. It works, except for some utf-8 characters (single-byte characters above 128). I see that the default parser is XMLTreeBuilder, which is based on expat.
Is there an alternative parser that I can use that may be less strict and allow utf-8 characters?
This is the error I'm getting with the default parser:
ExpatError: not well-formed (invalid token): line 311, column 190
The character causing this is a single byte x92 (in hex). I'm not certain this is even a valid utf-8 character. But it would be nice to handle it because most text editors display this as: í
EDIT: The context of the character is: canít, where I assume it is supposed to be a fancy apostrophe, but in the hex editor, that same sequence is: 63 61 6E 92 74
Answered by John Machin
I'll start from the question: "Is there an alternative parser that I can use that may be less strict and allow utf-8 characters?"
All XML parsers will accept data encoded in UTF-8. In fact, UTF-8 is the default encoding.
An XML document may start with a declaration like this:
`<?xml version="1.0" encoding="UTF-8"?>`
or like this:
<?xml version="1.0"?>
or not have a declaration at all ... in each case the parser will decode the document using UTF-8.
However your data is NOT encoded in UTF-8 ... it's probably Windows-1252 aka cp1252.
If the encoding is not UTF-8, then either the creator should include a declaration (or the recipient can prepend one) or the recipient can transcode the data to UTF-8. The following showcases what works and what doesn't:
>>> import xml.etree.ElementTree as ET
>>> from StringIO import StringIO as sio
>>> raw_text = '<root>can\x92t</root>' # text encoded in cp1252, no XML declaration
>>> t = ET.parse(sio(raw_text))
[tracebacks omitted]
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 9
# parser is expecting UTF-8
>>> t = ET.parse(sio('<?xml version="1.0" encoding="UTF-8"?>' + raw_text))
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 47
# parser is expecting UTF-8 again
>>> t = ET.parse(sio('<?xml version="1.0" encoding="cp1252"?>' + raw_text))
>>> t.getroot().text
u'can\u2019t'
# parser was told to expect cp1252; it works
>>> import unicodedata
>>> unicodedata.name(u'\u2019')
'RIGHT SINGLE QUOTATION MARK'
# not quite an apostrophe, but better than an exception
>>> fixed_text = raw_text.decode('cp1252').encode('utf8')
# alternative: we transcode the data to UTF-8
>>> t = ET.parse(sio(fixed_text))
>>> t.getroot().text
u'can\u2019t'
# UTF-8 is the default; no declaration needed
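The session above is Python 2 (StringIO, the str.decode method, u'' literals). For readers on Python 3, here is a rough equivalent of the same three cases. It is only a sketch: it assumes the input arrives as bytes and uses io.BytesIO in place of StringIO. The variable names are mine, not from the answer.

```python
import io
import xml.etree.ElementTree as ET

raw = b"<root>can\x92t</root>"  # cp1252 bytes, no XML declaration

# 1. No declaration: the parser assumes UTF-8 and rejects the lone 0x92 byte.
try:
    ET.parse(io.BytesIO(raw))
    parsed_without_declaration = True
except ET.ParseError:
    parsed_without_declaration = False

# 2. Prepend a declaration that names the real encoding.
declared = b'<?xml version="1.0" encoding="cp1252"?>' + raw
text_declared = ET.parse(io.BytesIO(declared)).getroot().text

# 3. Transcode to UTF-8 first; no declaration is needed in that case.
transcoded = raw.decode("cp1252").encode("utf-8")
text_transcoded = ET.parse(io.BytesIO(transcoded)).getroot().text

print(parsed_without_declaration, ascii(text_declared), ascii(text_transcoded))
```

In Python 3 the failure surfaces as xml.etree.ElementTree.ParseError rather than a bare ExpatError, which is why the sketch catches that class.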
Answered by Glenn Maynard
It looks like you have CP1252 text. If so, it should be specified at the top of the file, e.g.:
<?xml version="1.0" encoding="CP1252" ?>
This does work with ElementTree.
If you're creating these files yourself, don't write them in this encoding. Save them as UTF-8 and do your part to help kill obsolete text encodings.
If you're receiving CP1252 data with no encoding specification, and you know for sure that it's always going to be CP1252, you can just convert it to UTF-8 before sending it to the parser:
s.decode("CP1252").encode("UTF-8")
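That one-liner is Python 2, where s is a byte string. In Python 3 only bytes objects have decode(), and ElementTree will also accept the decoded str directly, so the re-encode step becomes optional. A minimal sketch, assuming the data really is CP1252 and carries no encoding declaration (parse_cp1252_xml is a made-up helper name):

```python
import xml.etree.ElementTree as ET

def parse_cp1252_xml(data: bytes) -> ET.Element:
    """Parse XML bytes that are known to be CP1252 and have no declaration."""
    # Strict decoding raises UnicodeDecodeError on the few byte values that
    # Python's cp1252 codec leaves undefined (0x81, for example), which is a
    # useful hint that the "always CP1252" assumption does not hold.
    text = data.decode("cp1252")
    return ET.fromstring(text)

root = parse_cp1252_xml(b"<root>can\x92t</root>")
print(ascii(root.text))
```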
Answered by Lennart Regebro
Ah. That is "can′t", obviously, and indeed, 0x92 is an apostrophe in many Windows code pages. Your editor assumes instead that it's a Mac file. ;)
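To see why the same byte shows up as two different characters, you can decode it under both encodings. This is just an illustration using Python 3's built-in cp1252 and mac_roman codecs:

```python
word = b"can\x92t"  # the bytes 63 61 6E 92 74 from the question

as_cp1252 = word.decode("cp1252")        # Windows reading of the bytes
as_mac_roman = word.decode("mac_roman")  # classic Mac OS reading of the bytes

# Print the code point that byte 0x92 maps to under each encoding.
print(hex(ord(as_cp1252[3])), hex(ord(as_mac_roman[3])))
```

Under cp1252 the byte is U+2019 (right single quotation mark). Under mac_roman it is U+00ED, the í the editor displays.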
If it's a one-off, fixing the file is the right thing to do. But almost always, when you need to import other people's XML, there are a lot of things that simply do not agree with the stated encoding. I've found that the best solution is to decode with error setting 'xmlcharrefreplace', and in severe cases do your own custom character replacement that fixes the most common problems for that particular customer.
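One caveat when trying this in Python: as far as I know, the built-in 'xmlcharrefreplace' error handler is only accepted when encoding, not when decoding. So one way to approximate the approach is to decode leniently, apply the custom replacements, and then encode to ASCII with 'xmlcharrefreplace'. The following is a sketch under those assumptions. The helper names, the UTF-8-then-cp1252 fallback order, and the CUSTOM_FIXES table are all hypothetical, not something from the answer.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-customer fixes: map known-bad characters to replacements.
CUSTOM_FIXES = {"\u2019": "'"}

def decode_lenient(data: bytes) -> str:
    """Try UTF-8 first, then fall back to cp1252 with undecodable bytes replaced."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("cp1252", errors="replace")

def to_ascii_xml(data: bytes) -> bytes:
    """Return pure-ASCII XML bytes that any parser can read as UTF-8."""
    text = decode_lenient(data)
    for bad, good in CUSTOM_FIXES.items():
        text = text.replace(bad, good)
    # Remaining non-ASCII characters become numeric references such as &#233;
    return text.encode("ascii", errors="xmlcharrefreplace")

cleaned = to_ascii_xml(b"<root>can\x92t caf\xe9</root>")
root = ET.fromstring(cleaned)
print(cleaned, ascii(root.text))
```

The result contains only ASCII bytes, so it parses the same way whether or not the file's encoding declaration was correct. Markup is left untouched because the replacement only applies to non-ASCII characters.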
I'd also recommend lxml as an XML library for Python, but that's not the problem here.
Answered by Jon Skeet
Byte 0x92 is never valid as the first byte of a UTF-8 character. It can be valid as a subsequent byte, however. See this UTF-8 guide for a table of valid byte sequences.
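A quick way to check both halves of that statement in Python 3. This is only an illustration; U+2192 is just one character whose UTF-8 encoding happens to end in 0x92:

```python
lone = b"\x92"

# On its own, 0x92 is a continuation byte with no lead byte: invalid UTF-8.
try:
    lone.decode("utf-8")
    lone_is_valid = True
except UnicodeDecodeError:
    lone_is_valid = False

# As a subsequent byte it is fine: U+2192 encodes to E2 86 92.
arrow_bytes = "\u2192".encode("utf-8")

# The curly apostrophe itself takes three bytes in real UTF-8: E2 80 99.
apostrophe_bytes = "\u2019".encode("utf-8")

print(lone_is_valid, arrow_bytes.hex(), apostrophe_bytes.hex())
```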
Could you give us an idea of what bytes are surrounding 0x92? Does the XML declaration include a character encoding?