Python: Unicode and ElementTree.parse

Question

Asked by Santa

I'm trying to move to Python 2.7, and since Unicode is a Big Deal there, I thought I'd try handling it with XML files and text, parsing them with the xml.etree.cElementTree library. But I ran across this error:

>>> import xml.etree.cElementTree as ET
>>> from io import StringIO
>>> source = """\
... <?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
... <root>
...   <Parent>
...     <Child>
...       <Element>Text</Element>
...     </Child>
...   </Parent>
... </root>
... """
>>> srcbuf = StringIO(source.decode('utf-8'))
>>> doc = ET.parse(srcbuf)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 56, in parse
  File "<string>", line 35, in parse
cElementTree.ParseError: no element found: line 1, column 0

The same thing happens when I pass the result of io.open('filename.xml', encoding='utf-8') to ET.parse:

>>> with io.open('test.xml', mode='w', encoding='utf-8') as fp:
...     fp.write(source.decode('utf-8'))
...
150L
>>> with io.open('test.xml', mode='r', encoding='utf-8') as fp:
...     fp.read()
...
u'<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>\n<root>\n  <Parent>\n
    <Child>\n      <Element>Text</Element>\n    </Child>\n  </Parent>\n</root>\n
'
>>> with io.open('test.xml', mode='r', encoding='utf-8') as fp:
...     ET.parse(fp)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "<string>", line 56, in parse
  File "<string>", line 35, in parse
cElementTree.ParseError: no element found: line 1, column 0

Is there something about unicode and ET parsing that I am missing here?

Edit: Apparently the ET parser does not play well with a unicode input stream? The following works:

>>> with io.open('test.xml', mode='rb') as fp:
...     ET.parse(fp)
...
<ElementTree object at 0x0180BC10>

But this also means I cannot use io.StringIO if I want to parse from in-memory text, unless I first encode it into an in-memory buffer?

Answer 1

Accepted answer by Andre Holzner

Can't you use

doc = ET.fromstring(source)

in your first example?
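
To make the suggestion concrete, here is a minimal sketch. It assumes a modern Python 3 interpreter, where xml.etree.ElementTree takes the place of cElementTree, and it writes the source as a bytes literal to mirror what a plain Python 2 str would be. The element path used in find is only for illustration.

```python
import xml.etree.ElementTree as ET

# Bytes, as a Python 2 str would be; the XML declaration names the encoding.
source = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<root>
  <Parent>
    <Child>
      <Element>Text</Element>
    </Child>
  </Parent>
</root>
"""

# fromstring parses the whole document from memory and returns the root Element.
root = ET.fromstring(source)
element = root.find("./Parent/Child/Element")
print(root.tag, element.text)
```

Note that fromstring returns the root Element rather than an ElementTree object, so there is no need to call getroot() on the result.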

Answer 2

Answer by Xiangju

I encountered the same problem as you in Python 2.6.

It seems that cElementTree.parse handles "utf-8" encoding differently in Python 2.x and 3.x. In Python 2.x, we can use XMLParser to specify the encoding. For example:

import xml.etree.cElementTree as etree

parser = etree.XMLParser(encoding="utf-8")
targetTree = etree.parse( "./targetPageID.xml", parser=parser )
pageIds = targetTree.find("categorymembers")
print "pageIds:",etree.tostring(pageIds)

You can refer to this page for the XMLParser method (Section "XMLParser"): http://effbot.org/zone/elementtree-13-intro.htm
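
As a rough illustration of what the encoding argument does, here is a minimal sketch. It assumes a modern Python 3 interpreter (so xml.etree.ElementTree instead of cElementTree) and a made-up byte string that carries no XML declaration; neither comes from the original answer.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: ISO-8859-1 bytes with no XML declaration,
# so the parser has no way to discover the encoding on its own.
data = u"<root>caf\u00e9</root>".encode("iso-8859-1")

# The encoding argument overrides whatever the document declares (here: nothing).
parser = ET.XMLParser(encoding="iso-8859-1")
parser.feed(data)
root = parser.close()  # with the default builder, close() returns the root Element

print(root.text == u"caf\u00e9")
```

The parser still consumes bytes here; the argument only tells it how to decode them.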

The following method, on the other hand, works in Python 3.x:

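import codecs
# xml.etree.cElementTree was removed in Python 3.9; ElementTree uses the C accelerator automatically.
import xml.etree.ElementTree as etree

# Open the file as decoded text; the Python 3 parser accepts it.
target_file = codecs.open("./targetPageID.xml", mode='r', encoding='utf-8')

targetTree = etree.parse(target_file)
pageIds = targetTree.find("categorymembers")
print("pageIds:", etree.tostring(pageIds))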

Hope this can help you.

Answer 3

Answer by Glyph

Your problem is that you are feeding ElementTree unicode, but it prefers to consume bytes. It will provide you with unicode in any case.

In Python 2.x, it can only consume bytes. You can tell it what encoding those bytes are in, but that's it. So, if you literally have to work with an object that represents a text file, like io.StringIO, first you will need to convert it into something else.

If you are literally starting with a 2.x str (AKA bytes) in UTF-8 encoding, in memory, as in your example, use xml.etree.cElementTree.XML to parse it into XML in one fell swoop and don't worry about any of this :-).

If you want an interface that can deal with data that is incrementally read from a file, use xml.etree.cElementTree.parse with an io.BytesIO to convert it into an in-memory stream of bytes rather than an in-memory string of characters. If you want to use io.open, use it with the b flag, so that you get streams of bytes.
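
The io.BytesIO route described above might look like the following sketch. It assumes a modern Python 3 interpreter, hence xml.etree.ElementTree rather than cElementTree; the sample document and variable names are invented for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical in-memory text (unicode), e.g. something already decoded elsewhere.
text = u'<?xml version="1.0" encoding="UTF-8"?>\n<root><Element>Text</Element></root>\n'

# Encode to bytes first, then wrap the bytes in a file-like object for parse().
buf = io.BytesIO(text.encode("utf-8"))
doc = ET.parse(buf)

element_text = doc.getroot().find("Element").text
print(element_text)
```

The encoding passed to encode should match the one named in the XML declaration, since the parser will trust the declaration when it reads the bytes.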

In Python 3.x, you can pass unicode directly to ElementTree, which is a bit more convenient, and arguably the newer version of ElementTree is more correct to allow this. However, you still might not want to, and Python 3's version does still accept bytes as input. You're always starting with bytes anyway: by passing them directly from your input source to ElementTree, you let it do its decoding intelligently inside the XML parsing engine, as well as on-the-fly detection of encoding declarations within the input stream, which you can do with XML but not with arbitrary textual data. So the XML parser is the right place to put the responsibility for decoding.
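
A small sketch of why handing bytes to the parser is attractive: the parser reads the encoding declaration itself, while a manual decode has to guess. This assumes a Python 3 interpreter, and the ISO-8859-1 sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document stored as ISO-8859-1 bytes, with a matching declaration.
text = u'<?xml version="1.0" encoding="ISO-8859-1"?><root>caf\u00e9</root>'
raw = text.encode("iso-8859-1")

# The parser honours the declaration and decodes the bytes correctly.
root = ET.fromstring(raw)

# Decoding by hand with a guessed encoding fails on the same bytes.
try:
    raw.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(root.text == u"caf\u00e9", decode_failed)
```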
