lxml.etree.XML ValueError for Unicode string in Python

Notice: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/28534460/

Time: 2020-08-19 03:24:25  Source: igfitidea


Tags: python, python-3.x, unicode, python-3.4

Asked by Papouche Guinslyzinho

I'm transforming an XML document with XSLT. When I run the script under Python 3 I get the following error, but the same script runs without errors under Python 2.

-> % python3 cstm/artefact.py
Traceback (most recent call last):
  File "cstm/artefact.py", line 98, in <module>
    simplify_this_dataset('fisheries-service-des-peches.xml')
  File "cstm/artefact.py", line 85, in simplify_this_dataset
    xslt_root = etree.XML(xslt_content)
  File "lxml.etree.pyx", line 3012, in lxml.etree.XML (src/lxml/lxml.etree.c:67861)
  File "parser.pxi", line 1780, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:102420)
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

#!/usr/bin/env python3
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
# -*- coding: utf-8 -*-

import os

from lxml import etree

def simplify_this_dataset(dataset):
    """Create a simplified version of an XML file.

    It removes all the attributes and assigns them as elements instead.
    """
    module_path = os.path.dirname(os.path.abspath(__file__))
    data = open(module_path+'/data/ex-fire.xslt')
    xslt_content = data.read()
    xslt_root = etree.XML(xslt_content)
    dom = etree.parse(module_path+'/../CanSTM_dataset/'+dataset)
    transform = etree.XSLT(xslt_root)
    result = transform(dom)
    f = open(module_path+ '/../CanSTM_dataset/otra.xml', 'w')
    f.write(str(result))
    f.close()

Accepted answer by bobince

data = open(module_path+'/data/ex-fire.xslt')
xslt_content = data.read()

Opening the file in text mode implicitly decodes its bytes to Unicode text, using the platform's default encoding. (This can give wrong results if the XML file isn't actually in that encoding.)
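A small sketch of that implicit decoding, using only the standard library. The temporary file, its ISO-8859-1 content, and the explicit encoding='utf-8' argument (standing in for whatever the platform default happens to be) are assumptions made for illustration, not part of the question's code.

```python
import os
import tempfile

# Hypothetical XML content saved as ISO-8859-1 (assumption for this demo).
raw = '<?xml version="1.0" encoding="ISO-8859-1"?><a>caf\u00e9</a>'.encode('iso-8859-1')

with tempfile.NamedTemporaryFile(delete=False, suffix='.xml') as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Binary mode: no decoding happens, the bytes come back untouched.
    with open(path, 'rb') as f:
        as_bytes = f.read()

    # Text mode: Python decodes for you. encoding='utf-8' stands in for the
    # platform default; errors='replace' keeps the demo from raising.
    with open(path, encoding='utf-8', errors='replace') as f:
        as_text = f.read()
finally:
    os.remove(path)

print(type(as_bytes).__name__, type(as_text).__name__)
print('caf\u00e9' in as_text)  # the accented character did not survive
```

In binary mode the parser later receives exactly what is on disk; in text mode the accented character has already been damaged before any XML parser sees it.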

xslt_root = etree.XML(xslt_content)

XML has its own mechanism for handling and signalling encodings: the <?xml encoding="..."?> prolog. If you pass a Unicode string starting with <?xml encoding="..."?> to a parser, the parser would like to reinterpret the rest of the byte string using that encoding... but it can't, because you've already decoded the byte input to a Unicode string.
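The behaviour can be reproduced in a few lines. This is a minimal sketch that assumes lxml is installed and behaves like the version in the traceback; the one-element document is made up for the demonstration.

```python
from lxml import etree

# Hypothetical document, used only to trigger the behaviour.
doc = '<?xml version="1.0" encoding="UTF-8"?><root>caf\u00e9</root>'

# str input with an encoding declaration: lxml refuses rather than guess.
try:
    etree.XML(doc)
    outcome = 'parsed'
except ValueError as exc:
    outcome = 'ValueError'
    print(exc)

# bytes input with an encoding declaration: the parser decodes the bytes
# itself, guided by the declaration.
root_from_bytes = etree.XML(doc.encode('utf-8'))

# str input without any declaration (an "XML fragment") is also accepted.
root_from_fragment = etree.XML('<root>caf\u00e9</root>')

print(outcome, root_from_bytes.text, root_from_fragment.text)
```

Both accepted forms correspond to the two options named in the error message: bytes input, or a fragment without a declaration.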

Instead, you should either open the file in binary mode and pass the undecoded byte string to the parser:

# 'rb' returns bytes, so no implicit decoding takes place
with open(module_path+'/data/ex-fire.xslt', 'rb') as data:
    xslt_content = data.read()
xslt_root = etree.XML(xslt_content)

or, better, just have the parser read straight from the file:

xslt_root = etree.parse(module_path+'/data/ex-fire.xslt')
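Putting it together, here is one way the question's function might look when the parser reads both files itself. This is a sketch under assumptions, not the asker's actual code: the function name transform_file, its three path parameters, and the small demo stylesheet (which turns attributes into child elements, in the spirit of the question) are all hypothetical.

```python
import os
import tempfile

from lxml import etree


def transform_file(xslt_path, source_path, output_path):
    """Apply the stylesheet at xslt_path to source_path and write the result."""
    # etree.parse reads the raw bytes itself, so the XML declaration is honoured.
    transform = etree.XSLT(etree.parse(xslt_path))
    result = transform(etree.parse(source_path))
    with open(output_path, 'w', encoding='utf-8') as out:
        out.write(str(result))


# Hypothetical stylesheet: copy each element, emitting its attributes as children.
STYLESHEET = b'''<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes"/>
  <xsl:template match="*">
    <xsl:copy>
      <xsl:for-each select="@*">
        <xsl:element name="{name()}"><xsl:value-of select="."/></xsl:element>
      </xsl:for-each>
      <xsl:apply-templates select="node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
'''

with tempfile.TemporaryDirectory() as tmp:
    xslt_path = os.path.join(tmp, 'demo.xslt')
    source_path = os.path.join(tmp, 'in.xml')
    output_path = os.path.join(tmp, 'out.xml')
    with open(xslt_path, 'wb') as f:
        f.write(STYLESHEET)
    with open(source_path, 'wb') as f:
        f.write(b'<item id="7">x</item>')
    transform_file(xslt_path, source_path, output_path)
    with open(output_path, encoding='utf-8') as f:
        output = f.read()

print(output)
```

Because no Python code ever holds the stylesheet as a str, the ValueError cannot occur, and the with blocks close the output file even if the transform fails.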

Answer by Josh Allemon

You can also decode the UTF-8 byte string and re-encode it as ASCII before passing it to etree.XML. This requires data.read() to return bytes, and the encode step raises UnicodeEncodeError if the stylesheet contains any non-ASCII characters:

 xslt_content = data.read()
 xslt_content = xslt_content.decode('utf-8').encode('ascii')
 xslt_root = etree.XML(xslt_content)

Answer by Loki

I made it work by simply re-encoding with the default options (in Python 3, str.encode() with no arguments produces UTF-8):

xslt_content = data.read().encode()
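Re-encoding this way is safe when the declaration also says UTF-8 or the content is pure ASCII. A sketch of what can happen otherwise, using the standard library's xml.etree.ElementTree as a stand-in parser (an assumption; lxml is not involved here) and a made-up document whose declaration claims ISO-8859-1:

```python
import xml.etree.ElementTree as ET

# Hypothetical text that was already decoded, but whose declaration still
# claims ISO-8859-1.
text = '<?xml version="1.0" encoding="ISO-8859-1"?><a>caf\u00e9</a>'

# Default encode() gives UTF-8 bytes, which the parser then reads as
# ISO-8859-1 because that is what the declaration says.
mismatched = ET.fromstring(text.encode()).text

# Encoding with the declared charset keeps bytes and declaration in agreement.
matched = ET.fromstring(text.encode('iso-8859-1')).text

print(repr(mismatched), repr(matched))
```

The mismatched case parses without any error but silently corrupts the accented character, which is why reading the original bytes is the more robust option.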