Python: How to bypass Cloudflare bot/DDoS protection in Scrapy?
Original question: http://stackoverflow.com/questions/33247662/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
How to bypass cloudflare bot/ddos protection in Scrapy?
Asked by Kulbi
I used to occasionally scrape an e-commerce site to get product price information. I had not used the scraper, which is built with Scrapy, in a while, and when I tried to use it yesterday I ran into a problem with bot protection.
The site is using CloudFlare's DDoS protection, which basically uses JavaScript evaluation to filter out browsers (and therefore scrapers) that have JS disabled. Once the function is evaluated, a response containing the calculated number is generated. In return, the service sends back two authentication cookies which, when attached to each request, allow the site to be crawled normally. Here's the description of how it works.
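To make "cookies attached to each request" concrete, here is a minimal sketch using only the standard library. The cookie names are an assumption taken from what the cloudflare-scrape README showed at the time; the values, the user agent, and the URL are placeholders, and nothing is actually sent. As far as I know, the User-Agent has to match the one that was used to solve the challenge.

```python
from urllib.request import Request

# Assumed cookie names (per the cloudflare-scrape README of that era);
# the values and the user agent below are placeholders, not real tokens.
cookies = {"__cfduid": "placeholder-id", "cf_clearance": "placeholder-clearance"}
user_agent = "Mozilla/5.0 (placeholder user agent)"


def build_request(url, cookies, user_agent):
    """Attach the clearance cookies and the matching User-Agent to a request."""
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    return Request(url, headers={"Cookie": cookie_header, "User-Agent": user_agent})


# Build (but do not send) a request that carries both cookies.
request = build_request("http://example.com/products", cookies, user_agent)
print(request.get_header("Cookie"))
```

Scrapy does the same thing for you when you pass `cookies=` and `headers=` to a request, so this is only meant to show what ends up on the wire.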
I have also found the cloudflare-scrape Python module, which uses an external JS evaluation engine to calculate the number and send the request back to the server. I'm not sure how to integrate it into Scrapy, though. Or maybe there's a smarter way that doesn't require JS execution? In the end, it's a form...
I'd appreciate any help.
Accepted answer by Kulbi
So I executed the JavaScript from Python with the help of cloudflare-scrape.
To your scraper, you need to add the following code:
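# Module-level imports this method relies on:
#   import cfscrape
#   from scrapy import Request
def start_requests(self):
    for url in self.start_urls:
        # Solve the challenge; returns the cookies and the user agent that was used.
        token, agent = cfscrape.get_tokens(url, 'Your preferable user agent, _optional_')
        yield Request(url=url, cookies=token, headers={'User-Agent': agent})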
alongside parsing functions. And that's it!
Of course, you need to install cloudflare-scrape first and import it into your spider. You also need a JS execution engine installed. I already had Node.js, and it worked without complaints.
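If you want to try the wiring without hitting a protected site, the token step can be pulled into a small helper. This is only a sketch: `cloudflare_request_kwargs` and its `get_tokens` parameter are names I made up for illustration, and the stand-in function below replaces `cfscrape.get_tokens` so the example runs offline. With the default arguments it falls back to the real `cfscrape.get_tokens`, which needs cloudflare-scrape (`pip install cfscrape`) and a JS engine installed.

```python
def cloudflare_request_kwargs(url, user_agent=None, get_tokens=None):
    """Return keyword arguments for scrapy.Request that carry the Cloudflare tokens.

    get_tokens is a hypothetical injection point for testing; when omitted,
    cfscrape.get_tokens is used (third-party, needs a JS engine such as Node.js).
    """
    if get_tokens is None:
        import cfscrape  # imported lazily so the helper can be tried without it
        get_tokens = cfscrape.get_tokens
    tokens, agent = get_tokens(url, user_agent)
    return {"url": url, "cookies": tokens, "headers": {"User-Agent": agent}}


# Inside a spider this would be used roughly as:
#
#     def start_requests(self):
#         for url in self.start_urls:
#             yield scrapy.Request(**cloudflare_request_kwargs(url))


# Offline demonstration with a stand-in for cfscrape.get_tokens.
def fake_get_tokens(url, user_agent=None):
    return {"cf_clearance": "placeholder"}, user_agent or "placeholder-agent"


kwargs = cloudflare_request_kwargs("http://example.com", get_tokens=fake_get_tokens)
print(kwargs)
```

The helper keeps the user agent returned by the token call and sends it back unchanged, the same way the snippet in the answer does.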
Answer by mjsa
Obviously the best way to do this would be to whitelist your IP in CloudFlare; if that isn't an option, let me recommend the cloudflare-scrape library. You can use it to get the cookie token, then provide that cookie token in your Scrapy request back to the server.
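Since the token is a cookie, one option is to fetch it once per host and reuse it for later requests instead of solving the challenge for every URL. The sketch below assumes that reuse is acceptable for your site; `TokenCache` is a hypothetical helper, not part of cloudflare-scrape or Scrapy, and the counting stand-in replaces `cfscrape.get_tokens` so the example runs offline. Clearance cookies expire, so a real crawler would also need some way to refresh them.

```python
from urllib.parse import urlparse


class TokenCache:
    """Hypothetical helper: fetch the cookie token once per host and reuse it."""

    def __init__(self, get_tokens):
        # get_tokens is any callable shaped like cfscrape.get_tokens:
        # (url, user_agent) -> (cookies_dict, user_agent_string)
        self._get_tokens = get_tokens
        self._by_host = {}

    def for_url(self, url, user_agent=None):
        host = urlparse(url).netloc
        if host not in self._by_host:
            self._by_host[host] = self._get_tokens(url, user_agent)
        return self._by_host[host]


# Offline stand-in that counts how often the challenge would have been solved.
calls = []

def fake_get_tokens(url, user_agent=None):
    calls.append(url)
    return {"cf_clearance": "placeholder"}, user_agent or "placeholder-agent"


cache = TokenCache(fake_get_tokens)
first = cache.for_url("http://example.com/page/1")
second = cache.for_url("http://example.com/page/2")
print(len(calls))
```

In a spider you would build the cache with `cfscrape.get_tokens` and pass the returned cookies and user agent to each `Request`, as in the accepted answer.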