Python getting Forbidden by robots.txt: scrapy

Notice: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/37274835/


getting Forbidden by robots.txt: scrapy

Tags: python, scrapy, web-crawler

Asked by deepak kumar

While crawling a website like https://www.netflix.com, I am getting Forbidden by robots.txt: <GET https://www.netflix.com/>


ERROR: No response downloaded for: https://www.netflix.com/


Answered by Rafael Almeida

In the new version (Scrapy 1.1), released 2016-05-11, the crawler first downloads robots.txt before crawling. To change this behavior, set ROBOTSTXT_OBEY in your settings.py:


ROBOTSTXT_OBEY = False

Here are the release notes

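As an aside, here is a minimal sketch (untested; the spider name and start URL are placeholders) of scoping this override to a single spider via custom_settings instead of changing the project-wide setting:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"                       # placeholder spider name
    start_urls = ["https://example.com/"]  # placeholder start URL

    # Per-spider override: skip the robots.txt check for this spider only
    custom_settings = {"ROBOTSTXT_OBEY": False}

    def parse(self, response):
        self.log("Downloaded %s" % response.url)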

Answered by Ketan Patel

The first thing you need to ensure is that you change the user agent in your request; otherwise the default user agent will almost certainly be blocked.

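For example, a minimal sketch (the user-agent string below is only an illustrative placeholder) of setting a browser-like user agent, either project-wide in settings.py or per request:

# In settings.py (placeholder user-agent string):
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0 Safari/537.36"

# Or per request, inside a spider callback (url is whatever page you are requesting):
yield scrapy.Request(url, headers={"User-Agent": "Mozilla/5.0 ..."})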

Answered by CubeOfCheese

Netflix's Terms of Use state:


You also agree not to circumvent, remove, alter, deactivate, degrade or thwart any of the content protections in the Netflix service; use any robot, spider, scraper or other automated means to access the Netflix service;


They have their robots.txt set up to block web scrapers. If you override the setting in settings.py to ROBOTSTXT_OBEY = False, then you are violating their terms of use, which can result in a lawsuit.
