lxml error "IOError: Error reading file" when parsing facebook mobile in a python scraper script
Original question: http://stackoverflow.com/questions/9593990/
This page reproduces a Stack Overflow question and its accepted answer under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors on Stack Overflow.
Asked by Gilles Quenot
I use a modified script from the post "Logging into facebook with python":
#!/usr/bin/python2 -u
# -*- coding: utf8 -*-
facebook_email = "[email protected]"
facebook_passwd = "YOUR_PASSWORD"
import cookielib, urllib2, urllib, time, sys
from lxml import etree
jar = cookielib.CookieJar()
cookie = urllib2.HTTPCookieProcessor(jar)
opener = urllib2.build_opener(cookie)
headers = {
    "User-Agent" : "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_0 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8A293 Safari/6531.22.7",
    "Accept" : "text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,text/png,*/*;q=0.5",
    "Accept-Language" : "en-us,en;q=0.5",
    "Accept-Charset" : "utf-8",
    "Content-type": "application/x-www-form-urlencoded",
    "Host": "m.facebook.com"
}
try:
    params = urllib.urlencode({'email':facebook_email,'pass':facebook_passwd,'login':'Log+In'})
    req = urllib2.Request('http://m.facebook.com/login.php?m=m&refsrc=m.facebook.com%2F', params, headers)
    res = opener.open(req)
    html = res.read()
except urllib2.HTTPError, e:
    print e.msg
except urllib2.URLError, e:
    print e.reason[1]
def fetch(url):
    req = urllib2.Request(url,None,headers)
    res = opener.open(req)
    return res.read()
body = unicode(fetch("http://www.facebook.com/photo.php?fbid=404284859586659&set=a.355112834503862.104278.354259211255891&type=1"), errors='ignore')
tree = etree.parse(body)
r = tree.xpath('/see_prev')
print r.text
When I execute the code, the following error appears:
$ ./facebook_fetch_coms.py
Traceback (most recent call last):
File "./facebook_fetch_coms_classic_test.py", line 42, in <module>
tree = etree.parse(body)
File "lxml.etree.pyx", line 2957, in lxml.etree.parse (src/lxml/lxml.etree.c:56230)
File "parser.pxi", line 1533, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:82313)
File "parser.pxi", line 1562, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:82606)
File "parser.pxi", line 1462, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:81645)
File "parser.pxi", line 1002, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:78554)
File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74498)
File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75389)
File "parser.pxi", line 588, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74691)
IOError: Error reading file '<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//WAPFORUM//DTD XHTML Mobile 1.0//EN" "http://www.wapforum.org/DTD/xhtml-mobile10.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head><title>Facebook</title><meta name="description" content="Facebook helps you connect and share with the people in your life."
The goal is first to get the link with id=see_prev using lxml, then to use a while loop to open all comments, and finally to write all the messages to a file. Any help would be much appreciated!
Edit: I use Python 2.7.2 on Arch Linux x86_64 with lxml 2.3.3.
Accepted answer by kindall
This is your problem:
tree = etree.parse(body)
The documentation says that "source is a filename or file object containing XML data." You have provided a string, so lxml is taking the text of your HTTP response body as the name of the file you wish to open. No such file exists, so you get an IOError.
The error message you get even says "Error reading file" and then gives your XML string as the name of the file it's trying to read, which is a mighty big hint about what's going on.
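To make the failure mode concrete, here is a minimal sketch that reproduces it without any network access. It is written for Python 3 rather than the Python 2 of the question, the MARKUP bytes and the try_parse helper are hypothetical stand-ins, and the fallback import assumes that the standard library's ElementTree treats a string argument to parse() the same way when lxml is not installed.

```python
import io

try:
    from lxml import etree  # the library discussed above
except ImportError:
    # Assumption: the stdlib module follows the same parse() convention.
    import xml.etree.ElementTree as etree

# Hypothetical markup standing in for the downloaded page.
MARKUP = b'<html><body><p id="see_prev">older comments</p></body></html>'


def try_parse(source):
    """Return (True, root_tag) on success or (False, exception_name) on failure."""
    try:
        tree = etree.parse(source)
    except OSError as exc:  # IOError is an alias of OSError on Python 3
        return False, type(exc).__name__
    return True, tree.getroot().tag


# A text string is taken to be a file name, so parsing fails.
print(try_parse(MARKUP.decode("utf-8")))

# A file-like object wrapping the same bytes parses fine.
print(try_parse(io.BytesIO(MARKUP)))
```

The first call fails for the same reason as the traceback in the question: the markup itself is looked up as a path on disk.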
You probably want etree.XML(), which takes input from a string. Or you could just do tree = etree.parse(res) to read directly from the HTTP response into lxml (the result of opener.open() is a file-like object, and etree.parse() should be perfectly happy to consume it).
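Both options can be sketched as follows, again for Python 3 and with assumptions: SAMPLE is a made-up stand-in for the response body, io.BytesIO plays the role of the file-like response object, and find_by_id is a hypothetical helper. The sketch passes bytes rather than a decoded string, because lxml rejects Unicode strings that carry an XML encoding declaration. It also looks the link up by its id attribute, since the question's '/see_prev' path would match a root element named see_prev instead.

```python
import io

try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib module offers the same XML()/parse()/find() calls.
    import xml.etree.ElementTree as etree

# Hypothetical stand-in for the bytes returned by res.read().
SAMPLE = (
    b'<?xml version="1.0" encoding="utf-8"?>'
    b'<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    b'<a id="see_prev" href="/photo.php?p=10">See previous comments</a>'
    b'</body></html>'
)


def find_by_id(root, element_id):
    """Return the first descendant whose id attribute matches, or None."""
    return root.find(".//*[@id='%s']" % element_id)


# Option 1: parse markup that is already in memory.
root_from_string = etree.XML(SAMPLE)

# Option 2: parse from a file-like object, as etree.parse(res) would.
root_from_file = etree.parse(io.BytesIO(SAMPLE)).getroot()

link = find_by_id(root_from_string, "see_prev")
print(link.get("href"), link.text)
```

Note that etree.XML() returns the root element directly, while etree.parse() returns a tree object whose root is obtained with getroot().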