Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/27803503/

Get html using Python requests?
Asked by Rich Thompson
I am trying to teach myself some basic web scraping. Using Python's requests module, I was able to grab html for various websites until I tried this:
>>> r = requests.get('http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')
Instead of the basic html that is the source for this page, I get:
>>> r.text
'\x1f\ufffd\x08\x00\x00\x00\x00\x00\x00\x03\ufffd]o\u06f8\x12\ufffd\ufffd\ufffd+\ufffd]...
>>> r.content
b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xed\x9d]o\xdb\xb8\x12\x86\xef\xfb+\x88]\x14h...
I have tried many combinations of get/post with every syntax I can guess from the documentation and from SO and other examples. I don't understand what I am seeing above, haven't been able to turn it into anything I can read, and can't figure out how to get what I actually want. My question is, how do I get the html for the above page?
Accepted answer by Martijn Pieters
The server in question is giving you a gzipped response. The server is also very broken; it sends the following headers:
$ curl -D - -o /dev/null -s -H 'Accept-Encoding: gzip, deflate' http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F
HTTP/1.1 200 OK
Date: Tue, 06 Jan 2015 17:46:49 GMT
Server: Apache
<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml" lang="en-US">
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 3659
Content-Type: text/html
The <!DOCTYPE..> line there is not a valid HTTP header. As such, the remaining headers past Server are ignored. Why the server interjects that is unclear; in all likelihood WRCCWrappers.py is a CGI script that doesn't output headers but does include a double newline after the doctype line, duping the Apache server into inserting additional headers there.
As such, requests also doesn't detect that the data is gzip-encoded. The data is all there, you just have to decode it. Or you could, if it weren't rather incomplete.
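If you did want to decode that body by hand, a minimal sketch (my addition, not part of the original answer) could look like the following. It assumes the server is still sending the broken gzip body; zlib's decompressobj is used rather than zlib.decompress so a truncated stream still yields whatever can be recovered:

import zlib

import requests

url = ('http://www.wrcc.dri.edu/WRCCWrappers.py'
       '?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')
r = requests.get(url)

# 16 + zlib.MAX_WBITS tells zlib to expect a gzip header and trailer.
decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
html = decomp.decompress(r.content).decode('utf-8', 'replace')
print(html[:200])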
The work-around is to tell the server not to bother with compression:
headers = {'Accept-Encoding': 'identity'}
r = requests.get(url, headers=headers)
and an uncompressed response is returned.
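A quick way to check the work-around (a sketch of my own, not part of the original answer): with identity encoding negotiated, r.text should start with readable HTML rather than gzip bytes:

import requests

url = ('http://www.wrcc.dri.edu/WRCCWrappers.py'
       '?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')
headers = {'Accept-Encoding': 'identity'}  # ask for an uncompressed body
r = requests.get(url, headers=headers)

print(r.headers.get('content-encoding'))  # expected: None
print(r.text[:100])                       # expected: plain HTML, not gzip bytes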
Incidentally, on Python 2 the HTTP header parser is not so strict and manages to declare the doctype a header:
>>> pprint(dict(r.headers))
{'<!doctype html public "-//w3c//dtd xhtml 1.0 transitional//en" "dtd/xhtml1-transitional.dtd"><html xmlns="http': '//www.w3.org/1999/xhtml" lang="en-US">',
'connection': 'Keep-Alive',
'content-encoding': 'gzip',
'content-length': '3659',
'content-type': 'text/html',
'date': 'Tue, 06 Jan 2015 17:42:06 GMT',
'keep-alive': 'timeout=5, max=100',
'server': 'Apache',
'vary': 'Accept-Encoding'}
and the content-encoding information survives, so there requests decodes the content for you, as expected.
Answered by Grant
The HTTP headers for this URL have now been fixed.
>>> import requests
>>> print requests.__version__
2.5.1
>>> r = requests.get('http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')
>>> r.text[:100]
u'\n<!DOCTYPE html>\n<HTML>\n<HEAD><TITLE>Monthly Average of Precipitation, Station id: 028815</TITLE></H'
>>> r.headers
{'content-length': '3672', 'content-encoding': 'gzip', 'vary': 'Accept-Encoding', 'keep-alive': 'timeout=5, max=100', 'server': 'Apache', 'connection': 'Keep-Alive', 'date': 'Thu, 12 Feb 2015 18:59:37 GMT', 'content-type': 'text/html; charset=utf-8'}