Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/27803503/

Get html using Python requests?
Asked by Rich Thompson
I am trying to teach myself some basic web scraping. Using Python's requests module, I was able to grab html for various websites until I tried this:
>>> r = requests.get('http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')
Instead of the basic html that is the source for this page, I get:
>>> r.text
'\x1f\ufffd\x08\x00\x00\x00\x00\x00\x00\x03\ufffd]o\u06f8\x12\ufffd\ufffd\ufffd+\ufffd]...
>>> r.content
b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xed\x9d]o\xdb\xb8\x12\x86\xef\xfb+\x88]\x14h...
I have tried many combinations of get/post with every syntax I can guess from the documentation and from SO and other examples. I don't understand what I am seeing above, haven't been able to turn it into anything I can read, and can't figure out how to get what I actually want. My question is, how do I get the html for the above page?
Accepted answer by Martijn Pieters
The server in question is giving you a gzipped response. The server is also very broken; it sends the following headers:
$ curl -D - -o /dev/null -s -H 'Accept-Encoding: gzip, deflate' http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F
HTTP/1.1 200 OK
Date: Tue, 06 Jan 2015 17:46:49 GMT
Server: Apache
<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml" lang="en-US">
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 3659
Content-Type: text/html
The <!DOCTYPE..> line there is not a valid HTTP header. As such, the remaining headers past Server are ignored. Why the server interjects that is unclear; in all likelihood WRCCWrappers.py is a CGI script that doesn't output headers but does include a double newline after the doctype line, duping the Apache server into inserting additional headers there.
As such, requests also doesn't detect that the data is gzip-encoded. The data is all there, you just have to decode it. Or you could, if it weren't rather incomplete.
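If you did want to decode that body by hand, a minimal sketch (my addition, not part of the original answer) could look like the following. It assumes the server is still sending the broken gzip body; zlib's decompressobj is used rather than zlib.decompress so a truncated stream still yields whatever can be recovered:

import zlib

import requests

url = ('http://www.wrcc.dri.edu/WRCCWrappers.py'
       '?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')
r = requests.get(url)

# 16 + zlib.MAX_WBITS tells zlib to expect a gzip header and trailer.
decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
html = decomp.decompress(r.content).decode('utf-8', 'replace')
print(html[:200])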
The work-around is to tell the server not to bother with compression:
headers = {'Accept-Encoding': 'identity'}
r = requests.get(url, headers=headers)
and an uncompressed response is returned.
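A quick way to check the work-around (a sketch of my own, not part of the original answer): with identity encoding negotiated, r.text should start with readable HTML rather than gzip bytes:

import requests

url = ('http://www.wrcc.dri.edu/WRCCWrappers.py'
       '?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')
headers = {'Accept-Encoding': 'identity'}  # ask for an uncompressed body
r = requests.get(url, headers=headers)

print(r.headers.get('content-encoding'))  # expected: None
print(r.text[:100])                       # expected: plain HTML, not gzip bytes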
Incidentally, on Python 2 the HTTP header parser is not so strict and manages to declare the doctype a header:
>>> pprint(dict(r.headers))
{'<!doctype html public "-//w3c//dtd xhtml 1.0 transitional//en" "dtd/xhtml1-transitional.dtd"><html xmlns="http': '//www.w3.org/1999/xhtml" lang="en-US">',
'connection': 'Keep-Alive',
'content-encoding': 'gzip',
'content-length': '3659',
'content-type': 'text/html',
'date': 'Tue, 06 Jan 2015 17:42:06 GMT',
'keep-alive': 'timeout=5, max=100',
'server': 'Apache',
'vary': 'Accept-Encoding'}
and the content-encoding information survives, so there requests decodes the content for you, as expected.
Answered by Grant
The HTTP headers for this URL have now been fixed.
>>> import requests
>>> print requests.__version__
2.5.1
>>> r = requests.get('http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')
>>> r.text[:100]
u'\n<!DOCTYPE html>\n<HTML>\n<HEAD><TITLE>Monthly Average of Precipitation, Station id: 028815</TITLE></H'
>>> r.headers
{'content-length': '3672', 'content-encoding': 'gzip', 'vary': 'Accept-Encoding', 'keep-alive': 'timeout=5, max=100', 'server': 'Apache', 'connection': 'Keep-Alive', 'date': 'Thu, 12 Feb 2015 18:59:37 GMT', 'content-type': 'text/html; charset=utf-8'}