
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow, original URL: http://stackoverflow.com/questions/14220174/

Date: 2020-08-18 10:47:37 · Source: igfitidea

How to add Headers to Scrapy CrawlSpider Requests?

python, scrapy

Asked by CatShoes

I'm working with the CrawlSpider class to crawl a website and I would like to modify the headers that are sent in each request. Specifically, I would like to add the referer to the request.


As per this question, I checked


response.request.headers.get('Referer', None)

in my response parsing function, and the Referer header is not present. I assume that means the Referer is not being submitted in the request (unless the website doesn't return it; I'm not sure about that).


I haven't been able to figure out how to modify the headers of a request. Again, my spider is derived from CrawlSpider. Overriding CrawlSpider's `_requests_to_follow` or specifying a `process_request` callback for a rule will not work, because the referer is not in scope at those points.


Does anyone know how to modify request headers dynamically?


Accepted answer by CatShoes

I hate to answer my own question, but I found out how to do it. You have to enable the SpiderMiddleware that populates the referer for responses. See the documentation for `scrapy.contrib.spidermiddleware.referer.RefererMiddleware`.


In short, you need to add this middleware to your project's settings file.


SPIDER_MIDDLEWARES = {
    'scrapy.contrib.spidermiddleware.referer.RefererMiddleware': True,
}
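For reference, the `scrapy.contrib` paths were deprecated in Scrapy 1.0; the same middleware now lives under `scrapy.spidermiddlewares` and is enabled by default, so on a modern version the equivalent setting is usually unnecessary. A sketch of what it would look like (700 is the priority Scrapy assigns to this middleware by default):

```python
# settings.py — modern Scrapy (>= 1.0). RefererMiddleware is enabled
# by default, so this entry is only needed if it was previously disabled.
SPIDER_MIDDLEWARES = {
    'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,
}
```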

Then, in your response parsing method, you can use `response.request.headers.get('Referer', None)` to get the referer.


If you don't understand these middlewares right away, read them again, take a break, and then read them again. I found them to be very confusing.


Answered by warvariuc

You can pass a Referer manually with each request using the `headers` argument:


yield Request(url, callback=..., headers={'Referer': ...})

RefererMiddleware does the same thing, automatically taking the referrer URL from the previous response.
