Python: how to handle 302 redirect in scrapy
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/22795416/
how to handle 302 redirect in scrapy
Asked by mrki
I am receiving a 302 response from the server while scraping a website:
2014-04-01 21:31:51+0200 [ahrefs-h] DEBUG: Redirecting (302) to <GET http://www.domain.com/Site_Abuse/DeadEnd.htm> from <GET http://domain.com/wps/showmodel.asp?Type=15&make=damc&a=664&b=51&c=0>
I want to send requests to GET the URLs instead of being redirected. Now I found this middleware:
https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/downloadermiddleware/redirect.py#L31
I added this redirect code to my middleware.py file and I added this into settings.py:
DOWNLOADER_MIDDLEWARES = {
    'street.middlewares.RandomUserAgentMiddleware': 400,
    'street.middlewares.RedirectMiddleware': 100,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
}
But I am still getting redirected. Is that all I have to do to get this middleware working? Am I missing something?
Answered by warvariuc
I added this redirect code to my middleware.py file and I added this into settings.py:
DOWNLOADER_MIDDLEWARES_BASE says that RedirectMiddleware is already enabled by default, so what you did didn't matter.
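For reference, the relevant entry in Scrapy's default settings looks roughly like this (approximate; the module path and priority value depend on your Scrapy version, so check scrapy/settings/default_settings.py for your install):

DOWNLOADER_MIDDLEWARES_BASE = {
    # ... other default middlewares ...
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': 600,
    # ... other default middlewares ...
}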
I want to send request to GET urls instead of being redirected.
How? The server responds with a 302 to your GET request. If you GET the same URL again, you will be redirected again.
What are you trying to achieve?
If you do not want to be redirected, see these questions:
Answered by mrki
Forget about middlewares in this scenario; this will do the trick:
meta = {'dont_redirect': True, 'handle_httpstatus_list': [302]}
That said, you will need to include the meta parameter when you yield your request:
yield Request(item['link'], meta={
    'dont_redirect': True,
    'handle_httpstatus_list': [302]
}, callback=self.your_callback)
Answered by Ivan Chaer
I had an issue with an infinite redirect loop when using HTTPCACHE_ENABLED = True. I managed to avoid the problem by setting HTTPCACHE_IGNORE_HTTP_CODES = [301, 302].
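A minimal settings.py sketch of that fix (both are standard Scrapy settings):

# settings.py
HTTPCACHE_ENABLED = True
# Do not cache redirect responses, so a stale cached 301/302
# cannot keep bouncing the crawler in a loop.
HTTPCACHE_IGNORE_HTTP_CODES = [301, 302]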
Answered by Steven Almeroth
You can disable the RedirectMiddleware by setting REDIRECT_ENABLED to False in settings.py.
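In settings.py that is a one-line change (REDIRECT_ENABLED is a standard Scrapy setting). Note that with redirects disabled, 3xx responses may still be filtered out by HttpErrorMiddleware before they reach your callbacks unless you also allow those codes:

# settings.py
REDIRECT_ENABLED = False          # do not follow 3xx responses
HTTPERROR_ALLOWED_CODES = [302]   # let 302 responses reach spider callbacks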
Answered by Gallaecio
An inexplicable 302 response, such as a redirect from a page that loads fine in a web browser to the home page or some fixed page, usually indicates a server-side measure against undesired activity.
You must either reduce your crawl rate or use a smart proxy (e.g. Crawlera) or a proxy-rotation service and retry your requests when you get such a response.
To retry such a response, add 'handle_httpstatus_list': [302] to the meta of the source request, and check if response.status == 302 in the callback. If it is, retry your request by yielding response.request.replace(dont_filter=True).
When retrying, you should also make your code limit the maximum number of retries of any given URL. You could keep a dictionary to track retries:
from scrapy import Request, Spider


class MySpider(Spider):
    name = 'my_spider'

    max_retries = 2

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.retries = {}

    def start_requests(self):
        yield Request(
            'https://example.com',
            callback=self.parse,
            meta={
                'handle_httpstatus_list': [302],
            },
        )

    def parse(self, response):
        if response.status == 302:
            retries = self.retries.setdefault(response.url, 0)
            if retries < self.max_retries:
                self.retries[response.url] += 1
                yield response.request.replace(dont_filter=True)
            else:
                self.logger.error('%s still returns 302 responses after %s retries',
                                  response.url, retries)
            return
Depending on the scenario, you might want to move this code to a downloader middleware.
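A hedged sketch of that middleware variant (the class name and retry cap here are illustrative, not from the answer; it assumes requests still carry 'handle_httpstatus_list': [302] in meta so that RedirectMiddleware leaves the 302 response alone):

# middlewares.py -- illustrative sketch, not the answer author's code.
class Retry302Middleware:
    """Retry 302 responses a limited number of times per URL."""

    max_retries = 2

    def __init__(self):
        self.retries = {}

    def process_response(self, request, response, spider):
        if response.status != 302:
            return response
        retries = self.retries.setdefault(request.url, 0)
        if retries < self.max_retries:
            self.retries[request.url] += 1
            # Returning a Request from process_response tells Scrapy to
            # reschedule it instead of passing the response to the spider.
            return request.replace(dont_filter=True)
        spider.logger.error('%s still returns 302 responses after %s retries',
                            request.url, retries)
        return response

Enable it in settings.py, for example (the project path is hypothetical):

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.Retry302Middleware': 543,
}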

