Original question: http://stackoverflow.com/questions/2352840/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Parsing broken XML with lxml.etree.iterparse
Asked by erikcw
I'm trying to parse a huge xml file with lxml in a memory efficient manner (ie streaming lazily from disk instead of loading the whole file in memory). Unfortunately, the file contains some bad ascii characters that break the default parser. The parser works if I set recover=True, but the iterparse method doesn't take the recover parameter or a custom parser object. Does anyone know how to use iterparse to parse broken xml?
#this works, but loads the whole file into memory
parser = lxml.etree.XMLParser(recover=True) #recovers from bad characters.
tree = lxml.etree.parse(filename, parser)
#how do I do the equivalent with iterparse? (using iterparse so the file can be streamed lazily from disk)
context = lxml.etree.iterparse(filename, tag='RECORD')
#record contains 6 elements that I need to extract the text from
Thanks for your help!
EDIT -- Here is an example of the types of encoding errors I'm running into:
In [17]: data
Out[17]: '\t<articletext><p>The cafeteria rang with excited voices. Our barbershop quartet, The Bell \r Tones was asked to perform at the local Home for the Blind in the next town. We, of course, were glad to entertain such a worthy group and immediately agreed . One wag joked, "Which uniform should we wear?" followed with, "Oh, that\'s right, they\'ll never notice." The others didn\'t respond to this, in fact, one said that we should wear the nicest outfit we had.</p><p>A small stage was set up for us and a pretty decent P.A. system was donated for the occasion. The audience was made up of blind persons of every age, from the thirties to the nineties. Some sported sighted companions or nurses who stood or sat by their side, sharing the moment equally. I observed several German shepherds lying at their feet, adoration showing in their eyes as they wondered what was going on. After a short introduction in which we identified ourselves, stating our voice part and a little about our livelihood, we began our program. Some songs were completely familiar and others, called "Oh, yeah" songs, only the chorus came to mind. We didn\'t mind at all that some sang along \x1e they enjoyed it so much.</p><p>In fact, a popular part of our program is when the audience gets to sing some of the old favorites. The harmony parts were quite evident as they tried their voices to the different parts. I think there was more group singing in the old days than there is now, but to blind people, sound and music is more important. We received a big hand at the finale and were made to promise to return the following year. Everyone was treated to coffee and cake, our quartet going around to the different circles of friends to sing a favorite song up close and personal. As we approached a new group, one blind lady amazed me by turning to me saying, "You\'re the baritone, aren\'t you?" 
Previously no one had ever been able to tell which singer sang which part but this lady was listening with her whole heart.</p><p>Retired portrait photographer. Main hobby - quartet singing.</p></articletext>\n'
In [18]: lxml.etree.from
lxml.etree.fromstring lxml.etree.fromstringlist
In [18]: lxml.etree.fromstring(data)
---------------------------------------------------------------------------
XMLSyntaxError Traceback (most recent call last)
/mnt/articles/<ipython console> in <module>()
/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree.fromstring (src/lxml/lxml.etree.c:48270)()
/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:71812)()
/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._parseDoc (src/lxml/lxml.etree.c:70673)()
/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:67442)()
/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63824)()
/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)()
/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64088)()
XMLSyntaxError: PCDATA invalid Char value 30, line 1, column 1190
In [19]: chardet.detect(data)
Out[19]: {'confidence': 1.0, 'encoding': 'ascii'}
As you can see, chardet thinks it is an ascii file, but there is a "\x1e" right in the middle of this example which is making lxml raise an exception.
Accepted answer by erikcw
I solved the problem by creating a class with a file-like object interface. The class's read() method reads a line from the file and replaces any "bad characters" before returning the line to iterparse.
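# pseudo code
import lxml.etree
class myFile(object):
    def __init__(self, filename):
        self.f = open(filename, 'rb')  # binary mode, so read() hands bytes to iterparse
    def read(self, size=None):
        # size is ignored: return one line at a time (b'' at end of file) with the bad bytes removed
        return self.f.readline().replace(b'\x1e', b'').replace(b'some other bad character...', b'')
# iterparse
context = lxml.etree.iterparse(myFile('bigfile.xml'), tag='RECORD')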
I had to edit the myFile class a few times adding some more replace() calls for a few other characters that were making lxml choke. I think lxml's SAX parsing would have worked as well (seems to support the recover option), but this solution worked like a charm!
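Rather than adding one replace() call per offending character, the same idea can be sketched with a single delete table that covers every control byte XML 1.0 forbids. This is only a sketch under some assumptions: the names CleanedFile and iter_record_texts are made up for illustration, the file is assumed to be in an ASCII-compatible encoding such as ASCII or UTF-8 (it would corrupt UTF-16), and the tag name RECORD is taken from the question.

```python
# Bytes that XML 1.0 forbids: every C0 control except tab, LF and CR.
_BAD_BYTES = bytes(b for b in range(0x20) if b not in (0x09, 0x0A, 0x0D))


class CleanedFile(object):
    """File-like wrapper (hypothetical name) that drops forbidden control bytes."""

    def __init__(self, fileobj):
        self._f = fileobj  # must be opened in binary mode

    def read(self, size=-1):
        # Deleting bytes can shrink a chunk to b'' before the real end of
        # file, so keep reading until something survives or EOF is reached.
        while True:
            chunk = self._f.read(size)
            if not chunk:
                return b''
            cleaned = chunk.translate(None, _BAD_BYTES)
            if cleaned:
                return cleaned


def iter_record_texts(path, tag='RECORD'):
    """Yield the text of each child of every <tag> element, streaming from disk."""
    import lxml.etree  # imported here so the wrapper itself has no dependency
    with open(path, 'rb') as f:
        for _event, elem in lxml.etree.iterparse(CleanedFile(f), tag=tag):
            yield [child.text for child in elem]
            elem.clear()  # release the subtree that was just processed
```

Reading fixed-size chunks instead of lines also avoids pulling a very long line into memory in one go, which matters if the dump has few line breaks.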
Answer by Purrell
Edit:
This is an older answer and I would have done it differently today. And I'm not just referring to the dumb snark ... since then BeautifulSoup4 has become available and it's really quite nice. I recommend that to anyone who stumbles over here.
The currently accepted answer is, well, not what one should do. The question itself also has a bad assumption:
parser = lxml.etree.XMLParser(recover=True) #recovers from bad characters.
Actually, recover=True is for recovering from malformed XML. There is, however, an "encoding" option which would have fixed your issue.
parser = lxml.etree.XMLParser(
    encoding='utf-8',  # Your encoding issue.
    recover=True,  # I assume you probably still want to recover from bad xml, it's quite nice. If not, remove.
)
That's it, that's the solution.
BTW, for anyone struggling with parsing XML in Python, especially from third-party sources: I know, I know, the documentation is bad and there are a lot of SO red herrings and a lot of bad advice.
- lxml.etree.fromstring()? That's for perfectly formed XML, silly.
- BeautifulStoneSoup? Slow, and has a way-stupid policy for self-closing tags.
- lxml.etree.HTMLParser()? (Because the XML is broken.) Here's a secret: HTMLParser() is... a parser with recover=True.
- lxml.html.soupparser? The encoding detection is supposed to be better, but it has the same failings as BeautifulSoup for self-closing tags. Perhaps you can combine XMLParser with BeautifulSoup's UnicodeDammit.
- UnicodeDammit and other cockamamie stuff to fix encodings? Well, UnicodeDammit is kind of cute, I like the name and it's useful for stuff beyond XML, but things are usually fixed if you do the right thing with XMLParser().
You could be trying all sorts of stuff from what's available online. lxml documentation could be better. The code above is what you need for 90% of your XML parsing cases. Here I'll restate it:
from io import StringIO  # cStringIO on Python 2
from lxml import etree
magical_parser = etree.XMLParser(encoding='utf-8', recover=True)
tree = etree.parse(StringIO(your_xml_string), magical_parser)  # or pass in an open file object
You're welcome. My headaches == your sanity. Plus it has other features you might need for, you know, XML.
Answer by John Machin
Edit your question, stating what happens (exact error message and traceback (copy/paste, don't type from memory)) to make you think that "bad unicode" is the problem.
Get chardet and feed it your MySQL dump. Tell us what it says.
Show us the first 200 to 300 bytes of your dump, using e.g. print repr(dump[:300])
Update: You wrote """As you can see, chardet thinks it is an ascii file, but there is a "\x1e" right in the middle of this example which is making lxml raise an exception."""
I see no "bad unicode" here.
chardet is correct. What makes you think that "\x1e" is not ASCII? It is an ASCII character, a C0 control character named "RECORD SEPARATOR".
The error message says that you have an invalid character. That is also correct. The only control characters that are valid in XML are "\t", "\r" and "\n". MySQL should be grumbling about that and/or offering you a way of escaping it, e.g. _x001e_ (yuk!)
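To check a dump for such characters before deciding whether they are all vanishable, something like the following could be used. It is a sketch, not anything from the thread: the function names are hypothetical, and the character class is my transcription of the Char production in the XML 1.0 specification (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF), negated.

```python
import re

# Anything outside the XML 1.0 "Char" production is invalid in a document.
_INVALID_XML_CHAR = re.compile(
    r'[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]'
)


def find_invalid_xml_chars(text):
    """Return (offset, code point) pairs for every character XML 1.0 rejects."""
    return [(m.start(), ord(m.group())) for m in _INVALID_XML_CHAR.finditer(text)]


def strip_invalid_xml_chars(text):
    """Return text with every invalid character deleted."""
    return _INVALID_XML_CHAR.sub('', text)


sample = '<article><p>t1</p><p>t2\x1et3</p></article>'
for offset, codepoint in find_invalid_xml_chars(sample):
    # The code point matches the "Char value" in lxml's error message.
    print('invalid Char value %d at offset %d' % (codepoint, offset))
```

Listing the offenders first, and only then stripping them, makes it easier to confirm that nothing meaningful is being thrown away.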
Given the context, it looks like that character could be deleted with no loss. You may wish to fix your database, or you may wish to remove suchlike characters from your dump (after checking that they are all vanishable), or you may wish to choose a less picky and less voluminous output format than XML.
Update 2: You presumably want to use iterparse() not because it's your end goal but because you want to save memory. If you used a format like CSV you wouldn't have a memory problem.
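For comparison, here is roughly what streaming a CSV dump looks like with the standard library. It is a minimal sketch assuming the dump were exported as CSV; the function name iter_rows and the sample columns are invented for illustration.

```python
import csv
import io


def iter_rows(fileobj):
    """Yield one row (a list of strings) at a time; csv.reader reads lazily."""
    for row in csv.reader(fileobj):
        yield row


# In-memory stand-in for a file opened with open(path, newline='').
sample = io.StringIO('id,title\r\n1,"Quartet, singing"\r\n')
for row in iter_rows(sample):
    print(row)
```

Only the current row is held in memory, so the size of the file does not matter.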
Update 3: In response to a comment by @Purrell:
try it yourself, dude. pastie.org/3280965
Here's the contents of that pastie; it deserves preservation:
from lxml.etree import etree
data = '\t<articletext><p>The cafeteria rang with excited voices. Our barbershop quartet, The Bell \r Tones was asked to perform at the local Home for the Blind in the next town. We, of course, were glad to entertain such a worthy group and immediately agreed . One wag joked, "Which uniform should we wear?" followed with, "Oh, that\'s right, they\'ll never notice." The others didn\'t respond to this, in fact, one said that we should wear the nicest outfit we had.</p><p>A small stage was set up for us and a pretty decent P.A. system was donated for the occasion. The audience was made up of blind persons of every age, from the thirties to the nineties. Some sported sighted companions or nurses who stood or sat by their side, sharing the moment equally. I observed several German shepherds lying at their feet, adoration showing in their eyes as they wondered what was going on. After a short introduction in which we identified ourselves, stating our voice part and a little about our livelihood, we began our program. Some songs were completely familiar and others, called "Oh, yeah" songs, only the chorus came to mind. We didn\'t mind at all that some sang along \x1e they enjoyed it so much.</p><p>In fact, a popular part of our program is when the audience gets to sing some of the old favorites. The harmony parts were quite evident as they tried their voices to the different parts. I think there was more group singing in the old days than there is now, but to blind people, sound and music is more important. We received a big hand at the finale and were made to promise to return the following year. Everyone was treated to coffee and cake, our quartet going around to the different circles of friends to sing a favorite song up close and personal. As we approached a new group, one blind lady amazed me by turning to me saying, "You\'re the baritone, aren\'t you?" 
Previously no one had ever been able to tell which singer sang which part but this lady was listening with her whole heart.</p><p>Retired portrait photographer. Main hobby - quartet singing.</p></articletext>\n'
magical_parser = etree.XMLParser(encoding='utf-8', recover=True)
tree = etree.parse(StringIO(data), magical_parser)
To get it to run, one import needs to be fixed, and another supplied. The data is monstrous. There is no output to show the result. Here's a replacement with the data cut down to the bare essentials. The 5 pieces of ASCII text (excluding < and >) that are all valid XML characters are replaced by t1, ..., t5. The offending \x1e is flanked by t2 and t3.
[output wraps at column 80]
Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win
32
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml import etree
>>> from cStringIO import StringIO
>>> data = '<article><p>t1</p><p>t2\x1et3</p><p>t4
</p><p>t5</p></article>'
>>> magical_parser = etree.XMLParser(encoding='utf-8', recover=True)
>>> tree = etree.parse(StringIO(data), magical_parser)
>>> print(repr(tree.getroot().text))
'<p>t1</p><p>t2t3/ppt4/ppt5/p'
Not what I'd call "recovery"; after the bad character, the < and > characters disappear.
The pastie was in response to my question "What gives you the idea that encoding='utf-8' will solve his problem?". This was triggered by the statement 'There is however an "encoding" option which would have fixed your issue.' But encoding=ascii produces the same output. So does omitting the encoding arg. It's NOT an encoding problem. Case closed.
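The point can be reproduced without lxml. The sketch below swaps in the standard library's xml.etree.ElementTree as a stand-in parser (my substitution, not what the thread used) to show that the data decodes cleanly as ASCII and is still rejected as XML until the control character is removed.

```python
import xml.etree.ElementTree as ET

data = b'<article><p>t2\x1et3</p></article>'

# Every byte is below 0x80, so decoding as ASCII succeeds.
text = data.decode('ascii')


def parses(xml_text):
    """Return True if xml_text is accepted as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(parses(text))                      # rejected: \x1e is not a legal XML character
print(parses(text.replace('\x1e', '')))  # accepted once the character is gone
```

No choice of encoding changes the first result, because the character is valid in the encoding and invalid in XML.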