Source: Stack Overflow, http://stackoverflow.com/questions/9593990/. Content is provided under the CC BY-SA 4.0 license; you are free to use and share it, but you must attribute it to the original authors.

lxml error "IOError: Error reading file" when parsing facebook mobile in a python scraper script

Tags: python, linux, facebook, web-scraping, lxml

Asked by Gilles Quenot

I use a modified script from the post "Logging into facebook with python":

#!/usr/bin/python2 -u
# -*- coding: utf8 -*-

facebook_email = "[email protected]"
facebook_passwd = "YOUR_PASSWORD"


import cookielib, urllib2, urllib, time, sys
from lxml import etree

jar = cookielib.CookieJar()
cookie = urllib2.HTTPCookieProcessor(jar)       
opener = urllib2.build_opener(cookie)

headers = {
    "User-Agent" : "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_0 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8A293 Safari/6531.22.7",
    "Accept" : "text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,text/png,*/*;q=0.5",
    "Accept-Language" : "en-us,en;q=0.5",
    "Accept-Charset" : "utf-8",
    "Content-type": "application/x-www-form-urlencoded",
    "Host": "m.facebook.com"
}

try:
    params = urllib.urlencode({'email':facebook_email,'pass':facebook_passwd,'login':'Log+In'})
    req = urllib2.Request('http://m.facebook.com/login.php?m=m&refsrc=m.facebook.com%2F', params, headers)
    res = opener.open(req)
    html = res.read()

except urllib2.HTTPError, e:
    print e.msg
except urllib2.URLError, e:
    print e.reason[1]

def fetch(url):
    req = urllib2.Request(url,None,headers)
    res = opener.open(req)
    return res.read()

body = unicode(fetch("http://www.facebook.com/photo.php?fbid=404284859586659&set=a.355112834503862.104278.354259211255891&type=1"), errors='ignore')
tree = etree.parse(body)
r = tree.xpath('/see_prev')
print r.text

When I execute the code, a problem appears:

$ ./facebook_fetch_coms.py
Traceback (most recent call last):
  File "./facebook_fetch_coms_classic_test.py", line 42, in <module>
    tree = etree.parse(body)
  File "lxml.etree.pyx", line 2957, in lxml.etree.parse (src/lxml/lxml.etree.c:56230)
  File "parser.pxi", line 1533, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:82313)
  File "parser.pxi", line 1562, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:82606)
  File "parser.pxi", line 1462, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:81645)
  File "parser.pxi", line 1002, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:78554)
  File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74498)
  File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75389)
  File "parser.pxi", line 588, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74691)
IOError: Error reading file '<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//WAPFORUM//DTD XHTML Mobile 1.0//EN" "http://www.wapforum.org/DTD/xhtml-mobile10.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head><title>Facebook</title><meta name="description" content="Facebook helps you connect and share with the people in your life."

The goal is first to get the link with id=see_prev with lxml, then to use a while loop to open all comments, and finally to save all messages to a file. Any help would be much appreciated!

Edit: I use Python 2.7.2 on archlinux x86_64 and lxml 2.3.3.

Accepted answer by kindall

This is your problem:

tree = etree.parse(body)

The documentation says that "source is a filename or file object containing XML data." You have provided a string, so lxml is taking the text of your HTTP response body as the name of the file you wish to open. No such file exists, so you get an IOError.

The error message you get even says "Error reading file" and then gives your XML string as the name of the file it's trying to read, which is a mighty big hint about what's going on.

You probably want etree.XML(), which takes input from a string. Or you could just do tree = etree.parse(res) to read directly from the HTTP response into lxml (the result of opener.open() is a file-like object, and etree.parse() should be perfectly happy to consume it).
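To make the difference concrete, here is a minimal sketch of the three calls side by side. It is written for Python 3 rather than the Python 2 of the question, and it rests on some assumptions: the markup in body is a hypothetical stand-in for the real Facebook response, io.BytesIO stands in for the open response object, and the import falls back to the standard library's xml.etree.ElementTree (which follows the same filename-versus-string convention) if lxml is not installed. The lookup by id at the end is not part of the answer above; it is only one possible way to reach the see_prev link the question mentions.

```python
import io

try:
    from lxml import etree  # the library discussed above
except ImportError:
    # Assumption: fall back to the standard library, whose parse() and
    # fromstring() follow the same filename-vs-string convention.
    import xml.etree.ElementTree as etree

# Hypothetical stand-in for the HTTP response body (bytes, as res.read() returns).
body = b'<html><body><a id="see_prev" href="/comments?page=2">See previous</a></body></html>'

# 1. Passing the markup itself to parse() treats it as a file name and fails.
try:
    etree.parse(body.decode("utf-8"))
    parse_error = None
except (IOError, OSError) as exc:
    parse_error = exc

# 2. fromstring() (or XML()) parses markup held in memory and returns the root element.
root = etree.fromstring(body)

# 3. parse() accepts a file-like object; BytesIO stands in for the open response.
tree = etree.parse(io.BytesIO(body))

# Look the element up by its id attribute rather than by tag name.
link = root.find(".//*[@id='see_prev']")

print(type(parse_error).__name__ if parse_error else "no error")
print(link.get("href"))
```

Note that fromstring() returns the root element while parse() returns a tree object, so the two results are not interchangeable without calling tree.getroot(). Real pages served as XHTML also carry a namespace, which this simplified markup leaves out.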