Python getting Forbidden by robots.txt: scrapy

Notice: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/37274835/


getting Forbidden by robots.txt: scrapy

Tags: python, scrapy, web-crawler

Asked by deepak kumar

While crawling a website like https://www.netflix.com, I am getting Forbidden by robots.txt: <GET https://www.netflix.com/>


ERROR: No response downloaded for: https://www.netflix.com/


Answered by Rafael Almeida

In the new version (Scrapy 1.1), released 2016-05-11, the crawler first downloads robots.txt before crawling. To change this behavior, set ROBOTSTXT_OBEY in your settings.py:


ROBOTSTXT_OBEY = False

Here are the release notes

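As an aside, here is a minimal sketch (untested; the spider name and start URL are placeholders) of scoping this override to a single spider via custom_settings instead of changing the project-wide setting:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"                       # placeholder spider name
    start_urls = ["https://example.com/"]  # placeholder start URL

    # Per-spider override: skip the robots.txt check for this spider only
    custom_settings = {"ROBOTSTXT_OBEY": False}

    def parse(self, response):
        self.log("Downloaded %s" % response.url)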

Answered by Ketan Patel

The first thing you need to ensure is that you change the user agent in your request; otherwise the default user agent will almost certainly be blocked.

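For example, a minimal sketch (the user-agent string below is only an illustrative placeholder) of setting a browser-like user agent, either project-wide in settings.py or per request:

# In settings.py (placeholder user-agent string):
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0 Safari/537.36"

# Or per request, inside a spider callback (url is whatever page you are requesting):
yield scrapy.Request(url, headers={"User-Agent": "Mozilla/5.0 ..."})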

Answered by CubeOfCheese

Netflix's Terms of Use state:


You also agree not to circumvent, remove, alter, deactivate, degrade or thwart any of the content protections in the Netflix service; use any robot, spider, scraper or other automated means to access the Netflix service;


They have their robots.txt set up to block web scrapers. If you override the setting in settings.py to ROBOTSTXT_OBEY = False, then you are violating their terms of use, which can result in a lawsuit.
