Does python urllib2 automatically uncompress gzip data fetched from webpage?
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, citing the original address and author information, and you must attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/3947120/
Asked by mlzboy
I'm using
data = urllib2.urlopen(url).read()
I want to know:
How can I tell if the data at a URL is gzipped?
Does urllib2 automatically uncompress the data if it is gzipped? Will the data always be a string?
Accepted answer by ars
- How can I tell if the data at a URL is gzipped?
This checks if the content is gzipped and decompresses it:
from StringIO import StringIO
import gzip
import urllib2

request = urllib2.Request('http://example.com/')
request.add_header('Accept-Encoding', 'gzip')  # opt in to a compressed response
response = urllib2.urlopen(request)

# The server reports compression via the Content-Encoding response header.
if response.info().get('Content-Encoding') == 'gzip':
    buf = StringIO(response.read())
    f = gzip.GzipFile(fileobj=buf)
    data = f.read()
else:
    data = response.read()
- Does urllib2 automatically uncompress the data if it is gzipped? Will the data always be a string?
No. urllib2 does not automatically uncompress the data, because the 'Accept-Encoding' header is not set by urllib2; you set it yourself, using: request.add_header('Accept-Encoding', 'gzip, deflate')
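For readers on Python 3, here is a rough equivalent of the same check (not part of the original answer; urllib.request likewise leaves decompression to the caller):

import gzip
import urllib.request

req = urllib.request.Request('http://example.com/',
                             headers={'Accept-Encoding': 'gzip'})
resp = urllib.request.urlopen(req)
body = resp.read()
# As with urllib2, decompression is our job, not the library's.
if resp.headers.get('Content-Encoding') == 'gzip':
    body = gzip.decompress(body)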
Answered by bobince
If you are talking about a simple .gz file, no, urllib2 will not decode it; you will get the unchanged .gz file as output.
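A minimal sketch of handling that case yourself (Python 2; the URL here is hypothetical), decompressing the raw .gz bytes with the gzip module:

import gzip
import urllib2
from StringIO import StringIO

# urllib2 hands back the gzip bytes unchanged, so decompress them manually.
raw = urllib2.urlopen('http://example.com/data.gz').read()
text = gzip.GzipFile(fileobj=StringIO(raw)).read()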
If you are talking about automatic HTTP-level compression using Content-Encoding: gzip or deflate, then that has to be deliberately requested by the client using an Accept-Encoding header.
urllib2 doesn't set this header, so the response it gets back will not be compressed. You can safely fetch the resource without having to worry about compression (though since compression isn't supported the request may take longer).
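A quick way to confirm this (a sketch, assuming a server that honours Accept-Encoding): since urllib2 sends no Accept-Encoding header by default, the response carries no Content-Encoding header either.

import urllib2

response = urllib2.urlopen('http://example.com/')
# No Accept-Encoding was sent, so this is normally None (uncompressed body).
print(response.info().get('Content-Encoding'))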
Answered by RuiDC
Your question has been answered, but for a more comprehensive implementation, take a look at Mark Pilgrim's implementation of this: it covers gzip, deflate, safe URL parsing and much, much more. It was written for a widely-used RSS parser, but is nevertheless a useful reference.
Answered by RobotHumans
It appears urllib3 handles this automatically now.
Reference headers:
HTTPHeaderDict({'ETag': '"112d13e-574c64196bcd9-gzip"', 'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip', 'X-Frame-Options': 'sameorigin', 'Server': 'Apache', 'Last-Modified': 'Sat, 01 Sep 2018 02:42:16 GMT', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '1; mode=block', 'Content-Type': 'text/plain; charset=utf-8', 'Strict-Transport-Security': 'max-age=315360000; includeSubDomains', 'X-UA-Compatible': 'IE=edge', 'Date': 'Sat, 01 Sep 2018 14:20:16 GMT', 'Accept-Ranges': 'bytes', 'Transfer-Encoding': 'chunked'})
Reference code:
import gzip
import io
import urllib3


class EDDBMultiDataFetcher():
    def __init__(self):
        self.files_dict = {
            'Populated Systems': 'http://eddb.io/archive/v5/systems_populated.jsonl',
            'Stations': 'http://eddb.io/archive/v5/stations.jsonl',
            'Minor factions': 'http://eddb.io/archive/v5/factions.jsonl',
            'Commodities': 'http://eddb.io/archive/v5/commodities.json'
        }
        self.http = urllib3.PoolManager()

    def fetch_all(self):
        for item, url in self.files_dict.items():
            self.fetch(item, url)

    def fetch(self, item, url, save_file=None):
        print("Fetching: " + item)
        request = self.http.request(
            'GET',
            url,
            headers={
                'Accept-encoding': 'gzip, deflate, sdch'
            })
        # urllib3 decompresses the gzip-encoded body transparently,
        # so request.data already holds the uncompressed bytes.
        data = request.data.decode('utf-8')
        print("Fetch complete")
        print(data)
        print(request.headers)
        quit()  # stop after the first fetch (debugging aid in the original)


if __name__ == '__main__':
    print("Fetching files from eddb.io")
    fetcher = EDDBMultiDataFetcher()
    fetcher.fetch_all()
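A smaller sketch of the same point (httpbin.org/gzip is assumed here as a convenient test endpoint): the response headers still report Content-Encoding: gzip, but r.data is already decompressed.

import urllib3

http = urllib3.PoolManager()
r = http.request('GET', 'http://httpbin.org/gzip',
                 headers={'Accept-Encoding': 'gzip'})
print(r.headers.get('Content-Encoding'))  # 'gzip'
print(r.data[:60])                        # plain, already-decompressed bytes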

