Caching in urllib2?
Is there an easy way to cache things when using urllib2 that I'm overlooking, or do I have to roll my own?
Solution
This ActiveState Python recipe may be helpful:
http://code.activestate.com/recipes/491261/
If you don't mind working at a slightly lower level, httplib2 (https://github.com/httplib2/httplib2) is an excellent HTTP library that includes caching functionality.
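As a minimal sketch of how that looks (the URL is just a placeholder), httplib2 takes a cache directory and will transparently serve repeat requests from it, honoring HTTP caching headers:

import httplib2

h = httplib2.Http(".cache")                        # directory used as the on-disk cache
resp, content = h.request("http://example.com/")   # first call hits the network
resp, content = h.request("http://example.com/")   # may be served from the cache
print resp.fromcache                               # True if the cached copy was used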
You can use a decorator function such as:
class cache(object):
    def __init__(self, fun):
        self.fun = fun
        self.cache = {}

    def __call__(self, *args, **kwargs):
        key = str(args) + str(kwargs)
        try:
            return self.cache[key]
        except KeyError:
            self.cache[key] = rval = self.fun(*args, **kwargs)
            return rval
        except TypeError:  # in case key isn't a valid key - don't cache
            return self.fun(*args, **kwargs)
and define your function along these lines:
import urllib

@cache
def get_url_src(url):
    return urllib.urlopen(url).read()
This assumes you're not paying attention to HTTP cache controls, and just want to cache each page for the duration of the application.
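Usage is then transparent; repeated calls with the same URL are answered from the in-memory dict instead of the network. A quick sketch (http://example.com/ is just a placeholder URL):

src_first = get_url_src("http://example.com/")   # fetched over the network
src_again = get_url_src("http://example.com/")   # returned from the decorator's dict
assert src_first == src_again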
I was looking for something similar, and came across "Recipe 491261: Caching and throttling for urllib2" which danivo posted. The problem is I really dislike the caching code (lots of duplication, lots of manually joining file paths instead of using os.path.join, use of staticmethods, not very PEP8-ish, and other things I try to avoid).
This code is a little nicer (in my opinion, anyway) and is functionally much the same, with a few additions - mainly the "recache" method (example usage can be seen in the `if __name__ == "__main__":` section at the end of the code).
The latest version can be found at http://github.com/dbr/tvdb_api/blob/master/cache.py, and I'll paste it here for posterity (with my application-specific headers removed):
#!/usr/bin/env python
"""
urllib2 caching handler
Modified from http://code.activestate.com/recipes/491261/ by dbr
"""

import os
import time
import httplib
import urllib2
import StringIO
from hashlib import md5

def calculate_cache_path(cache_location, url):
    """Checks if [cache_location]/[hash_of_url].headers and .body exist
    """
    thumb = md5(url).hexdigest()
    header = os.path.join(cache_location, thumb + ".headers")
    body = os.path.join(cache_location, thumb + ".body")
    return header, body

def check_cache_time(path, max_age):
    """Checks if a file has been created/modified in the [last max_age] seconds.
    False means the file is too old (or doesn't exist), True means it is
    up-to-date and valid"""
    if not os.path.isfile(path):
        return False
    cache_modified_time = os.stat(path).st_mtime
    time_now = time.time()
    if cache_modified_time < time_now - max_age:
        # Cache is old
        return False
    else:
        return True

def exists_in_cache(cache_location, url, max_age):
    """Returns if header AND body cache file exist (and are up-to-date)"""
    hpath, bpath = calculate_cache_path(cache_location, url)
    if os.path.exists(hpath) and os.path.exists(bpath):
        return (
            check_cache_time(hpath, max_age)
            and check_cache_time(bpath, max_age)
        )
    else:
        # File does not exist
        return False

def store_in_cache(cache_location, url, response):
    """Tries to store response in cache."""
    hpath, bpath = calculate_cache_path(cache_location, url)
    try:
        outf = open(hpath, "w")
        headers = str(response.info())
        outf.write(headers)
        outf.close()

        outf = open(bpath, "w")
        outf.write(response.read())
        outf.close()
    except IOError:
        return True
    else:
        return False

class CacheHandler(urllib2.BaseHandler):
    """Stores responses in a persistent on-disk cache.

    If a subsequent GET request is made for the same URL, the stored
    response is returned, saving time, resources and bandwidth
    """
    def __init__(self, cache_location, max_age=21600):
        """The location of the cache directory"""
        self.max_age = max_age
        self.cache_location = cache_location
        if not os.path.exists(self.cache_location):
            os.mkdir(self.cache_location)

    def default_open(self, request):
        """Handles GET requests, if the response is cached it returns it
        """
        if request.get_method() != "GET":
            return None  # let the next handler try to handle the request

        if exists_in_cache(
            self.cache_location, request.get_full_url(), self.max_age
        ):
            return CachedResponse(
                self.cache_location,
                request.get_full_url(),
                set_cache_header=True
            )
        else:
            return None

    def http_response(self, request, response):
        """Gets a HTTP response, if it was a GET request and the status code
        starts with 2 (200 OK etc) it caches it and returns a CachedResponse
        """
        if (request.get_method() == "GET"
            and str(response.code).startswith("2")
        ):
            if 'x-local-cache' not in response.info():
                # Response is not cached
                set_cache_header = store_in_cache(
                    self.cache_location,
                    request.get_full_url(),
                    response
                )
            else:
                set_cache_header = True
            #end if x-local-cache in response

            return CachedResponse(
                self.cache_location,
                request.get_full_url(),
                set_cache_header=set_cache_header
            )
        else:
            return response

class CachedResponse(StringIO.StringIO):
    """An urllib2.response-like object for cached responses.

    To determine if a response is cached or coming directly from
    the network, check the x-local-cache header rather than the
    object type.
    """
    def __init__(self, cache_location, url, set_cache_header=True):
        self.cache_location = cache_location
        hpath, bpath = calculate_cache_path(cache_location, url)

        StringIO.StringIO.__init__(self, file(bpath).read())

        self.url = url
        self.code = 200
        self.msg = "OK"
        headerbuf = file(hpath).read()
        if set_cache_header:
            headerbuf += "x-local-cache: %s\r\n" % (bpath)
        self.headers = httplib.HTTPMessage(StringIO.StringIO(headerbuf))

    def info(self):
        """Returns headers
        """
        return self.headers

    def geturl(self):
        """Returns original URL
        """
        return self.url

    def recache(self):
        new_request = urllib2.urlopen(self.url)
        set_cache_header = store_in_cache(
            self.cache_location,
            new_request.url,
            new_request
        )
        CachedResponse.__init__(self, self.cache_location, self.url, True)


if __name__ == "__main__":
    def main():
        """Quick test/example of CacheHandler"""
        opener = urllib2.build_opener(CacheHandler("/tmp/"))
        response = opener.open("http://google.com")
        print response.headers
        print "Response:", response.read()

        response.recache()
        print response.headers
        print "After recache:", response.read()
    main()
This article on Yahoo Developer Network - http://developer.yahoo.com/python/python-caching.html - describes how to cache HTTP calls made through urllib to either memory or disk.
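The article's code isn't reproduced here, but the in-memory variant of the idea boils down to something like the sketch below (cached_urlopen is a made-up helper name for illustration, not the article's API):

import urllib2

_page_cache = {}  # url -> response body, kept for the lifetime of the process

def cached_urlopen(url):
    """Hypothetical helper: fetch a URL once, serve repeat requests from memory."""
    if url not in _page_cache:
        _page_cache[url] = urllib2.urlopen(url).read()
    return _page_cache[url]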
I've always been torn between using httplib2, which does a solid job of handling HTTP caching and authentication, and urllib2, which is in the stdlib, has an extensible interface, and supports HTTP proxy servers.
The ActiveState recipe starts to add caching support to urllib2, but only in a very primitive fashion. It fails to allow for extensibility in storage mechanisms, hard-coding a file-system-backed store. It also does not honor HTTP cache headers.
In an attempt to bring together the best features of httplib2 caching and urllib2 extensibility, I've adapted the ActiveState recipe to implement most of the same caching functionality found in httplib2. The module lives in jaraco.net as jaraco.net.http.caching. The link points to the module as it exists at the time of this writing. While that module is currently part of the larger jaraco.net package, it has no intra-package dependencies, so feel free to pull it out and use it in your own projects.
Alternatively, if you have Python 2.6 or later, you can `easy_install jaraco.net>=1.3` and then use the CachingHandler with something like the code in `caching.quick_test()`:
"""Quick test/example of CacheHandler""" import logging import urllib2 from httplib2 import FileCache from jaraco.net.http.caching import CacheHandler logging.basicConfig(level=logging.DEBUG) store = FileCache(".cache") opener = urllib2.build_opener(CacheHandler(store)) urllib2.install_opener(opener) response = opener.open("http://www.google.com/") print response.headers print "Response:", response.read()[:100], '...\n' response.reload(store) print response.headers print "After reload:", response.read()[:100], '...\n'
Note that jaraco.util.http.caching does not prescribe a particular backing store for the cache, but instead follows the interface used by httplib2. For this reason, httplib2.FileCache can be used directly with urllib2 and the CacheHandler. Likewise, other backing caches designed for httplib2 should be usable by the CacheHandler.
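In other words, anything exposing the get/set/delete methods that httplib2's caches provide should work as a backing store. A hypothetical in-memory version (MemoryCache is not part of either library, just an illustration of the interface) might look like:

class MemoryCache(object):
    """Hypothetical in-memory store exposing the httplib2 cache interface
    (get/set/delete), so it could be passed to CacheHandler instead of FileCache."""
    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

    def delete(self, key):
        if key in self._data:
            del self._data[key]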
@dbr: you may also want to add caching of https responses with:
    def https_response(self, request, response):
        return self.http_response(request, response)