Python: how to handle 302 redirect in scrapy
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/22795416/
how to handle 302 redirect in scrapy
Asked by mrki
I am receiving a 302 response from the server while scraping a website:
2014-04-01 21:31:51+0200 [ahrefs-h] DEBUG: Redirecting (302) to <GET http://www.domain.com/Site_Abuse/DeadEnd.htm> from <GET http://domain.com/wps/showmodel.asp?Type=15&make=damc&a=664&b=51&c=0>
I want to send requests to GET the URLs instead of being redirected. Now I found this middleware:
https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/downloadermiddleware/redirect.py#L31
I added this redirect code to my middleware.py file and I added this into settings.py:
DOWNLOADER_MIDDLEWARES = {
    'street.middlewares.RandomUserAgentMiddleware': 400,
    'street.middlewares.RedirectMiddleware': 100,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
}
But I am still getting redirected. Is that all I have to do to get this middleware working? Am I missing something?
Answered by warvariuc
I added this redirect code to my middleware.py file and I added this into settings.py:
DOWNLOADER_MIDDLEWARES_BASE says that RedirectMiddleware is already enabled by default, so what you did didn't matter.
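For reference, the relevant entry in Scrapy's default settings looks roughly like this (approximate; the module path and priority value depend on your Scrapy version, so check scrapy/settings/default_settings.py for your install):

DOWNLOADER_MIDDLEWARES_BASE = {
    # ... other default middlewares ...
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': 600,
    # ... other default middlewares ...
}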
I want to send request to GET urls instead of being redirected.
How? The server responds with a 302 to your GET request. If you GET the same URL again, you will be redirected again.
What are you trying to achieve?
If you do not want to be redirected, see these questions:
Answered by mrki
Forget about middlewares in this scenario; this will do the trick:
meta = {'dont_redirect': True, 'handle_httpstatus_list': [302]}
That said, you will need to include the meta parameter when you yield your request:
yield Request(item['link'], meta={
    'dont_redirect': True,
    'handle_httpstatus_list': [302]
}, callback=self.your_callback)
Answered by Ivan Chaer
I had an issue with an infinite redirect loop when using HTTPCACHE_ENABLED = True. I managed to avoid the problem by setting HTTPCACHE_IGNORE_HTTP_CODES = [301, 302].
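A minimal settings.py sketch of that fix (both are standard Scrapy settings):

# settings.py
HTTPCACHE_ENABLED = True
# Do not cache redirect responses, so a stale cached 301/302
# cannot keep bouncing the crawler in a loop.
HTTPCACHE_IGNORE_HTTP_CODES = [301, 302]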
Answered by Steven Almeroth
You can disable the RedirectMiddleware by setting REDIRECT_ENABLED to False in settings.py.
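In settings.py that is a one-line change (REDIRECT_ENABLED is a standard Scrapy setting). Note that with redirects disabled, 3xx responses may still be filtered out by HttpErrorMiddleware before they reach your callbacks unless you also allow those codes:

# settings.py
REDIRECT_ENABLED = False          # do not follow 3xx responses
HTTPERROR_ALLOWED_CODES = [302]   # let 302 responses reach spider callbacks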
Answered by Gallaecio
An inexplicable 302 response, such as a redirect from a page that loads fine in a web browser to the home page or some fixed page, usually indicates a server-side measure against undesired activity.
You must either reduce your crawl rate or use a smart proxy (e.g. Crawlera) or a proxy-rotation service and retry your requests when you get such a response.
To retry such a response, add 'handle_httpstatus_list': [302] to the meta of the source request, and check if response.status == 302 in the callback. If it is, retry your request by yielding response.request.replace(dont_filter=True).
When retrying, you should also make your code limit the maximum number of retries of any given URL. You could keep a dictionary to track retries:
from scrapy import Request, Spider


class MySpider(Spider):
    name = 'my_spider'

    max_retries = 2

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.retries = {}

    def start_requests(self):
        yield Request(
            'https://example.com',
            callback=self.parse,
            meta={
                'handle_httpstatus_list': [302],
            },
        )

    def parse(self, response):
        if response.status == 302:
            retries = self.retries.setdefault(response.url, 0)
            if retries < self.max_retries:
                self.retries[response.url] += 1
                yield response.request.replace(dont_filter=True)
            else:
                self.logger.error('%s still returns 302 responses after %s retries',
                                  response.url, retries)
            return
Depending on the scenario, you might want to move this code to a downloader middleware.
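A hedged sketch of that middleware variant (the class name and retry cap here are illustrative, not from the answer; it assumes requests still carry 'handle_httpstatus_list': [302] in meta so that RedirectMiddleware leaves the 302 response alone):

# middlewares.py -- illustrative sketch, not the answer author's code.
class Retry302Middleware:
    """Retry 302 responses a limited number of times per URL."""

    max_retries = 2

    def __init__(self):
        self.retries = {}

    def process_response(self, request, response, spider):
        if response.status != 302:
            return response
        retries = self.retries.setdefault(request.url, 0)
        if retries < self.max_retries:
            self.retries[request.url] += 1
            # Returning a Request from process_response tells Scrapy to
            # reschedule it instead of passing the response to the spider.
            return request.replace(dont_filter=True)
        spider.logger.error('%s still returns 302 responses after %s retries',
                            request.url, retries)
        return response

Enable it in settings.py, for example (the project path is hypothetical):

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.Retry302Middleware': 543,
}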

