Does python urllib2 automatically uncompress gzip data fetched from webpage?
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, citing the original address and author information, and you must attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/3947120/
Asked by mlzboy
I'm using
data = urllib2.urlopen(url).read()
I want to know:
How can I tell if the data at a URL is gzipped?
Does urllib2 automatically uncompress the data if it is gzipped? Will the data always be a string?
Accepted answer by ars
- How can I tell if the data at a URL is gzipped?
This checks if the content is gzipped and decompresses it:
from StringIO import StringIO
import gzip
import urllib2

request = urllib2.Request('http://example.com/')
request.add_header('Accept-Encoding', 'gzip')  # opt in to a compressed response
response = urllib2.urlopen(request)

# The server reports compression via the Content-Encoding response header.
if response.info().get('Content-Encoding') == 'gzip':
    buf = StringIO(response.read())
    f = gzip.GzipFile(fileobj=buf)
    data = f.read()
else:
    data = response.read()
- Does urllib2 automatically uncompress the data if it is gzipped? Will the data always be a string?
No. urllib2 does not automatically uncompress the data, because the 'Accept-Encoding' header is not set by urllib2; you set it yourself, using: request.add_header('Accept-Encoding', 'gzip, deflate')
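For readers on Python 3, here is a rough equivalent of the same check (not part of the original answer; urllib.request likewise leaves decompression to the caller):

import gzip
import urllib.request

req = urllib.request.Request('http://example.com/',
                             headers={'Accept-Encoding': 'gzip'})
resp = urllib.request.urlopen(req)
body = resp.read()
# As with urllib2, decompression is our job, not the library's.
if resp.headers.get('Content-Encoding') == 'gzip':
    body = gzip.decompress(body)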
Answered by bobince
If you are talking about a simple .gz file, no, urllib2 will not decode it; you will get the unchanged .gz file as output.
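A minimal sketch of handling that case yourself (Python 2; the URL here is hypothetical), decompressing the raw .gz bytes with the gzip module:

import gzip
import urllib2
from StringIO import StringIO

# urllib2 hands back the gzip bytes unchanged, so decompress them manually.
raw = urllib2.urlopen('http://example.com/data.gz').read()
text = gzip.GzipFile(fileobj=StringIO(raw)).read()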
If you are talking about automatic HTTP-level compression using Content-Encoding: gzip or deflate, then that has to be deliberately requested by the client using an Accept-Encoding header.
urllib2 doesn't set this header, so the response it gets back will not be compressed. You can safely fetch the resource without having to worry about compression (though since compression isn't supported the request may take longer).
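A quick way to confirm this (a sketch, assuming a server that honours Accept-Encoding): since urllib2 sends no Accept-Encoding header by default, the response carries no Content-Encoding header either.

import urllib2

response = urllib2.urlopen('http://example.com/')
# No Accept-Encoding was sent, so this is normally None (uncompressed body).
print(response.info().get('Content-Encoding'))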
Answered by RuiDC
Your question has been answered, but for a more comprehensive implementation, take a look at Mark Pilgrim's implementation of this: it covers gzip, deflate, safe URL parsing and much, much more. It was written for a widely-used RSS parser, but is nevertheless a useful reference.
Answered by RobotHumans
It appears urllib3 handles this automatically now.
Reference headers:
HTTPHeaderDict({'ETag': '"112d13e-574c64196bcd9-gzip"', 'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip', 'X-Frame-Options': 'sameorigin', 'Server': 'Apache', 'Last-Modified': 'Sat, 01 Sep 2018 02:42:16 GMT', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '1; mode=block', 'Content-Type': 'text/plain; charset=utf-8', 'Strict-Transport-Security': 'max-age=315360000; includeSubDomains', 'X-UA-Compatible': 'IE=edge', 'Date': 'Sat, 01 Sep 2018 14:20:16 GMT', 'Accept-Ranges': 'bytes', 'Transfer-Encoding': 'chunked'})
Reference code:
import gzip
import io
import urllib3


class EDDBMultiDataFetcher():
    def __init__(self):
        self.files_dict = {
            'Populated Systems': 'http://eddb.io/archive/v5/systems_populated.jsonl',
            'Stations': 'http://eddb.io/archive/v5/stations.jsonl',
            'Minor factions': 'http://eddb.io/archive/v5/factions.jsonl',
            'Commodities': 'http://eddb.io/archive/v5/commodities.json'
        }
        self.http = urllib3.PoolManager()

    def fetch_all(self):
        for item, url in self.files_dict.items():
            self.fetch(item, url)

    def fetch(self, item, url, save_file=None):
        print("Fetching: " + item)
        request = self.http.request(
            'GET',
            url,
            headers={
                'Accept-encoding': 'gzip, deflate, sdch'
            })
        # urllib3 decompresses the gzip-encoded body transparently,
        # so request.data already holds the uncompressed bytes.
        data = request.data.decode('utf-8')
        print("Fetch complete")
        print(data)
        print(request.headers)
        quit()  # stop after the first fetch (debugging aid in the original)


if __name__ == '__main__':
    print("Fetching files from eddb.io")
    fetcher = EDDBMultiDataFetcher()
    fetcher.fetch_all()
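A smaller sketch of the same point (httpbin.org/gzip is assumed here as a convenient test endpoint): the response headers still report Content-Encoding: gzip, but r.data is already decompressed.

import urllib3

http = urllib3.PoolManager()
r = http.request('GET', 'http://httpbin.org/gzip',
                 headers={'Accept-Encoding': 'gzip'})
print(r.headers.get('Content-Encoding'))  # 'gzip'
print(r.data[:60])                        # plain, already-decompressed bytes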

