Python: force my scrapy spider to stop crawling
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, keep the link to the original, and attribute it to the original authors (not me) on StackOverflow.
Original question: http://stackoverflow.com/questions/4448724/
Force my scrapy spider to stop crawling
Asked by no1
Is there a way to stop crawling when a specific condition is true (for example, scrap_item_id == predefine_value)? My problem is similar to Scrapy - how to identify already scraped urls, but I want to 'force' my scrapy spider to stop crawling after it discovers the last scraped item.
Answer by alukach
This question was asked 8 months ago, but I was wondering the same thing and have found another (not great) solution. Hopefully this can help future readers.
I'm connecting to a database in my pipeline file, and if the database connection is unsuccessful I want the Spider to stop crawling (there's no point in collecting data if there's nowhere to send it). What I ended up doing was using:
from scrapy.project import crawler
crawler._signal_shutdown(9, 0)  # Run this if the cnxn fails.
This causes the Spider to do the following:
[scrapy] INFO: Received SIGKILL, shutting down gracefully. Send again to force unclean shutdown.
I just kind of pieced this together after reading your comment and looking through the "/usr/local/lib/python2.7/dist-packages/Scrapy-0.12.0.2543-py2.7.egg/scrapy/crawler.py" file. I'm not totally sure what it's doing; the first number delivered to the function is the signame (for example, using 3,0 instead of 9,0 returns the error [scrapy] INFO: Received SIGKILL...).
Seems to work well enough, though. Happy scraping.
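For context, here is a minimal sketch of the pipeline pattern described above. It reuses the two lines quoted in this answer (which rely on the old Scrapy 0.12 singleton API, removed in later versions); the DatabasePipeline class and the connect_to_database() helper are hypothetical placeholders for whatever database code you actually use.

from scrapy.project import crawler  # Scrapy 0.12-era singleton, gone in later releases


def connect_to_database():
    # Hypothetical helper: return a connection object, or None on failure.
    return None


class DatabasePipeline(object):
    def __init__(self):
        self.cnxn = connect_to_database()
        if self.cnxn is None:
            # Nowhere to send the data, so ask the crawler to shut down.
            crawler._signal_shutdown(9, 0)

    def process_item(self, item, spider):
        # Normally the item would be written to the database here.
        return item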
EDIT: I also suppose that you could just force your program to shut down with something like:
import sys
sys.exit("SHUT DOWN EVERYTHING!")
Answer by Sjaak Trekhaak
In the latest version of Scrapy, available on GitHub, you can raise a CloseSpider exception to manually close a spider.
In the 0.14 release notes doc it is mentioned: "Added CloseSpider exception to manually close spiders (r2691)"
Example as per the docs:
from scrapy.exceptions import CloseSpider  # import not shown in the quoted docs snippet

def parse_page(self, response):
    if 'Bandwidth exceeded' in response.body:
        raise CloseSpider('bandwidth_exceeded')
See also: http://readthedocs.org/docs/scrapy/en/latest/topics/exceptions.html?highlight=closeSpider
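Tying this back to the original question, here is a sketch of a spider callback that raises CloseSpider once a predefined item id turns up; the spider class, start URL, PREDEFINE_VALUE, and the way the id is pulled from the URL are assumptions for illustration, not part of the original answer.

from scrapy.spider import BaseSpider        # 0.14-era base spider class
from scrapy.exceptions import CloseSpider

PREDEFINE_VALUE = '12345'  # hypothetical id of the last item already scraped


class ExampleSpider(BaseSpider):
    name = 'example'
    start_urls = ['http://www.example.com/items/1']

    def parse(self, response):
        # For illustration, take the item id from the last URL segment.
        scrap_item_id = response.url.rstrip('/').rsplit('/', 1)[-1]
        if scrap_item_id == PREDEFINE_VALUE:
            # Everything from here on was scraped in a previous run, so stop.
            raise CloseSpider('reached_already_scraped_item')
        # ...otherwise extract items and follow links as usual...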
Answer by Macbric
From a pipeline, I prefer the following solution.
class MongoDBPipeline(object):
    def process_item(self, item, spider):
        # Pass the spider itself (not the pipeline) as the first argument.
        spider.crawler.engine.close_spider(spider, reason='duplicate')
Source: Force spider to stop in scrapy
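As a follow-up, here is a sketch of how this pipeline-side call could implement the "stop at the last already-scraped item" idea from the question; the scrap_item_id field, PREDEFINE_VALUE, and the reason string are assumptions for illustration.

from scrapy.exceptions import DropItem

PREDEFINE_VALUE = '12345'  # hypothetical id of the last item scraped on a previous run


class StopAtKnownItemPipeline(object):
    def process_item(self, item, spider):
        if item.get('scrap_item_id') == PREDEFINE_VALUE:
            # Ask the engine to finish the crawl, then discard this item.
            spider.crawler.engine.close_spider(spider, reason='reached_known_item')
            raise DropItem('Item already scraped; stopping the crawl')
        return item

Note that close_spider only asks the engine to finish up gracefully, so requests already in flight may still be processed; raising DropItem keeps this particular item out of any later pipeline stages in the meantime.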

