Python: force my scrapy spider to stop crawling
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, keep the link to the original, and attribute it to the original authors (not me) on StackOverflow.
Original question: http://stackoverflow.com/questions/4448724/
Force my scrapy spider to stop crawling
Asked by no1
Is there a way to stop crawling when a specific condition is true (for example, scrap_item_id == predefine_value)? My problem is similar to Scrapy - how to identify already scraped urls, but I want to 'force' my scrapy spider to stop crawling after it discovers the last scraped item.
Answer by alukach
This question was asked 8 months ago, but I was wondering the same thing and have found another (not great) solution. Hopefully this can help future readers.
I'm connecting to a database in my pipeline file, and if the database connection is unsuccessful I want the Spider to stop crawling (there's no point in collecting data if there's nowhere to send it). What I ended up doing was using:
from scrapy.project import crawler
crawler._signal_shutdown(9, 0)  # Run this if the cnxn fails.
This causes the Spider to do the following:
[scrapy] INFO: Received SIGKILL, shutting down gracefully. Send again to force unclean shutdown.
I just kind of pieced this together after reading your comment and looking through the "/usr/local/lib/python2.7/dist-packages/Scrapy-0.12.0.2543-py2.7.egg/scrapy/crawler.py" file. I'm not totally sure what it's doing; the first number delivered to the function is the signame (for example, using 3,0 instead of 9,0 returns the error [scrapy] INFO: Received SIGKILL...).
Seems to work well enough, though. Happy scraping.
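For context, here is a minimal sketch of the pipeline pattern described above. It reuses the two lines quoted in this answer (which rely on the old Scrapy 0.12 singleton API, removed in later versions); the DatabasePipeline class and the connect_to_database() helper are hypothetical placeholders for whatever database code you actually use.

from scrapy.project import crawler  # Scrapy 0.12-era singleton, gone in later releases


def connect_to_database():
    # Hypothetical helper: return a connection object, or None on failure.
    return None


class DatabasePipeline(object):
    def __init__(self):
        self.cnxn = connect_to_database()
        if self.cnxn is None:
            # Nowhere to send the data, so ask the crawler to shut down.
            crawler._signal_shutdown(9, 0)

    def process_item(self, item, spider):
        # Normally the item would be written to the database here.
        return item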
EDIT: I also suppose that you could just force your program to shut down with something like:
import sys
sys.exit("SHUT DOWN EVERYTHING!")
Answer by Sjaak Trekhaak
In the latest version of Scrapy, available on GitHub, you can raise a CloseSpider exception to manually close a spider.
In the 0.14 release notes doc it is mentioned: "Added CloseSpider exception to manually close spiders (r2691)"
Example as per the docs:
from scrapy.exceptions import CloseSpider  # import not shown in the quoted docs snippet

def parse_page(self, response):
    if 'Bandwidth exceeded' in response.body:
        raise CloseSpider('bandwidth_exceeded')
See also: http://readthedocs.org/docs/scrapy/en/latest/topics/exceptions.html?highlight=closeSpider
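Tying this back to the original question, here is a sketch of a spider callback that raises CloseSpider once a predefined item id turns up; the spider class, start URL, PREDEFINE_VALUE, and the way the id is pulled from the URL are assumptions for illustration, not part of the original answer.

from scrapy.spider import BaseSpider        # 0.14-era base spider class
from scrapy.exceptions import CloseSpider

PREDEFINE_VALUE = '12345'  # hypothetical id of the last item already scraped


class ExampleSpider(BaseSpider):
    name = 'example'
    start_urls = ['http://www.example.com/items/1']

    def parse(self, response):
        # For illustration, take the item id from the last URL segment.
        scrap_item_id = response.url.rstrip('/').rsplit('/', 1)[-1]
        if scrap_item_id == PREDEFINE_VALUE:
            # Everything from here on was scraped in a previous run, so stop.
            raise CloseSpider('reached_already_scraped_item')
        # ...otherwise extract items and follow links as usual...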
Answer by Macbric
From a pipeline, I prefer the following solution.
class MongoDBPipeline(object):
    def process_item(self, item, spider):
        # Pass the spider itself (not the pipeline) as the first argument.
        spider.crawler.engine.close_spider(spider, reason='duplicate')
Source: Force spider to stop in scrapy
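As a follow-up, here is a sketch of how this pipeline-side call could implement the "stop at the last already-scraped item" idea from the question; the scrap_item_id field, PREDEFINE_VALUE, and the reason string are assumptions for illustration.

from scrapy.exceptions import DropItem

PREDEFINE_VALUE = '12345'  # hypothetical id of the last item scraped on a previous run


class StopAtKnownItemPipeline(object):
    def process_item(self, item, spider):
        if item.get('scrap_item_id') == PREDEFINE_VALUE:
            # Ask the engine to finish the crawl, then discard this item.
            spider.crawler.engine.close_spider(spider, reason='reached_known_item')
            raise DropItem('Item already scraped; stopping the crawl')
        return item

Note that close_spider only asks the engine to finish up gracefully, so requests already in flight may still be processed; raising DropItem keeps this particular item out of any later pipeline stages in the meantime.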

