Error with parse function in lxml on Windows
Original question: http://stackoverflow.com/questions/3116269/
Note: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Asked by silentNinJa
I have installed lxml 2.2.2 on Windows (I'm using Python 2.6.5). I tried this simple command:
from lxml.html import parse
p = parse('http://www.google.com').getroot()
But I am getting the following error:
Traceback (most recent call last):
  File "", line 1, in
    p=parse('http://www.google.com').getroot()
  File "C:\Python26\lib\site-packages\lxml-2.2.2-py2.6-win32.egg\lxml\html\__init__.py", line 661, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "lxml.etree.pyx", line 2698, in lxml.etree.parse (src/lxml/lxml.etree.c:49590)
  File "parser.pxi", line 1491, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71205)
  File "parser.pxi", line 1520, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:71488)
  File "parser.pxi", line 1420, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:70583)
  File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:67736)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63820)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64741)
  File "parser.pxi", line 563, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64056)
IOError: Error reading file 'http://www.google.com': failed to load external entity "http://www.google.com"
I am clueless as to what to do next, as I am a newbie to Python. Please guide me to solve this error. Thanks in advance!! :)
Answered by MattH
lxml.html.parse does not fetch URLs.
Here's how to do it with urllib2:
>>> from urllib2 import urlopen
>>> from lxml.html import parse
>>> page = urlopen('http://www.google.com')
>>> p = parse(page)
>>> p.getroot()
<Element html at 1304050>
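For readers on Python 3, where urllib2 no longer exists, the same fetch-then-parse idea can be sketched with urllib.request. This is only a sketch under assumptions: the helper name fetch_root and its opener parameter are made up for illustration and are not part of lxml or of the original answer.

```python
# Python 3 sketch: urllib2.urlopen became urllib.request.urlopen.
from urllib.request import urlopen

from lxml.html import parse


def fetch_root(url, opener=urlopen):
    """Download `url` with `opener`, then hand the response to lxml.

    `fetch_root` and `opener` are hypothetical names for this sketch;
    `opener` only has to return a file-like object with read() and close().
    """
    response = opener(url)
    try:
        # lxml.html.parse accepts an open file-like object, so lxml itself
        # never has to perform the HTTP request.
        return parse(response).getroot()
    finally:
        response.close()
```

Calling fetch_root('http://www.google.com') should give the same kind of root element as the session above. The opener parameter is only there so that a different fetcher (or a stub for testing) can be passed in.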
Update
Steven is right. lxml.etree.parse should accept and load URLs. I missed that. I've tried deleting this answer, but I'm not allowed to.
I retract my statement about it not fetching URLs.
Answered by Steven
According to the API docs it should work: http://lxml.de/api/lxml.html-module.html#parse
This seems to be a bug in lxml 2.2.2. I just tested on Windows with Python 2.6 and 2.7, and it does work with lxml 2.3.0.
So: upgrade your lxml and you'll be fine.
I don't know exactly which versions of lxml have the problem, but I believe it was not so much lxml itself as the version of libxml2 used to build the Windows binary (certain versions of libxml2 had a problem with HTTP on Windows).
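To see which lxml and libxml2 versions an installation is actually using, something like the following may help. It is a minimal sketch: describe_versions and format_version are hypothetical helper names, while the LXML_VERSION, LIBXML_VERSION and LIBXML_COMPILED_VERSION constants come from lxml.etree.

```python
from lxml import etree


def describe_versions():
    """Collect the version tuples that lxml.etree exposes.

    `describe_versions` is a hypothetical helper name for this sketch.
    """
    return {
        "lxml": etree.LXML_VERSION,
        # libxml2 version found at runtime
        "libxml2 (runtime)": etree.LIBXML_VERSION,
        # libxml2 version that lxml was compiled against
        "libxml2 (compiled)": etree.LIBXML_COMPILED_VERSION,
    }


def format_version(version_tuple):
    """Turn a tuple such as (2, 3, 0) into the string '2.3.0'."""
    return ".".join(str(part) for part in version_tuple)


for name, version in describe_versions().items():
    print(name, format_version(version))
```

If the reported lxml version is older than 2.3.0, upgrading as suggested above is the first thing to try.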
Answered by bmaupin
Since line breaks are not allowed in comments, here's my implementation of MattH's answer:
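from urllib2 import urlopen
from lxml.html import parse

site_url = 'http://www.google.com'

try:
    # Let lxml fetch and parse the URL itself.
    page = parse(site_url).getroot()
except IOError:
    # If lxml cannot load the URL, fetch it with urllib2 and parse the response.
    page = parse(urlopen(site_url)).getroot()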