
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow, original URL: http://stackoverflow.com/questions/14220174/

Date: 2020-08-18 10:47:37 · Source: igfitidea

How to add Headers to Scrapy CrawlSpider Requests?

python, scrapy

Asked by CatShoes

I'm working with the CrawlSpider class to crawl a website and I would like to modify the headers that are sent in each request. Specifically, I would like to add the referer to the request.


As per this question, I checked


response.request.headers.get('Referer', None)

in my response parsing function, and the Referer header is not present. I assume that means the Referer is not being submitted in the request (unless the website doesn't return it; I'm not sure about that).


I haven't been able to figure out how to modify the headers of a request. Again, my spider is derived from CrawlSpider. Overriding CrawlSpider's `_requests_to_follow` or specifying a `process_request` callback for a rule will not work, because the referer is not in scope at those points.


Does anyone know how to modify request headers dynamically?


Accepted answer by CatShoes

I hate to answer my own question, but I found out how to do it. You have to enable the SpiderMiddleware that populates the referer for responses. See the documentation for `scrapy.contrib.spidermiddleware.referer.RefererMiddleware`.


In short, you need to add this middleware to your project's settings file.


SPIDER_MIDDLEWARES = {
    'scrapy.contrib.spidermiddleware.referer.RefererMiddleware': True,
}
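For reference, the `scrapy.contrib` paths were deprecated in Scrapy 1.0; the same middleware now lives under `scrapy.spidermiddlewares` and is enabled by default, so on a modern version the equivalent setting is usually unnecessary. A sketch of what it would look like (700 is the priority Scrapy assigns to this middleware by default):

```python
# settings.py — modern Scrapy (>= 1.0). RefererMiddleware is enabled
# by default, so this entry is only needed if it was previously disabled.
SPIDER_MIDDLEWARES = {
    'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,
}
```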

Then, in your response parsing method, you can use `response.request.headers.get('Referer', None)` to get the referer.


If you don't understand these middlewares right away, read them again, take a break, and then read them again. I found them to be very confusing.


Answered by warvariuc

You can pass a Referer manually with each request using the `headers` argument:


yield Request(url, callback=..., headers={'Referer': ...})

RefererMiddleware does the same thing, automatically taking the referrer URL from the previous response.
