Python: run a Scrapy spider from a script

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/21662689/

Date: 2020-08-18 23:21:41 · Source: igfitidea

scrapy run spider from script

Tags: python, python-2.7, scrapy

Asked by Marco Dinatsoli

I want to run my spider from a script rather than with scrapy crawl.

I found this page

http://doc.scrapy.org/en/latest/topics/practices.html

but actually it doesn't say where to put that script.

Any help please?

Answered by Guy Gavriely

Luckily the Scrapy source is open, so you can follow the way the crawl command works and do the same in your code:

...
# excerpt from Scrapy's own crawl command (an older, pre-1.0 Scrapy API):
crawler = self.crawler_process.create_crawler()
spider = crawler.spiders.create(spname, **opts.spargs)
crawler.crawl(spider)
self.crawler_process.start()

Answered by Elias Dorneles

You can just create a normal Python script, and then use Scrapy's command line option runspider, which allows you to run a spider without having to create a project.

For example, you can create a single file stackoverflow_spider.py with something like this:

import scrapy

# Note: this answer uses pre-1.0 module paths (scrapy.item, scrapy.spider,
# scrapy.contrib.loader); in current Scrapy the equivalents are
# scrapy.Item, scrapy.Field, scrapy.Spider and scrapy.loader.ItemLoader.

class QuestionItem(scrapy.item.Item):
    idx = scrapy.item.Field()
    title = scrapy.item.Field()

class StackoverflowSpider(scrapy.spider.Spider):
    name = 'SO'
    start_urls = ['http://stackoverflow.com']

    def parse(self, response):
        sel = scrapy.selector.Selector(response)
        questions = sel.css('#question-mini-list .question-summary')
        for i, elem in enumerate(questions):
            l = scrapy.contrib.loader.ItemLoader(QuestionItem(), elem)
            l.add_value('idx', i)
            l.add_xpath('title', ".//h3/a/text()")
            yield l.load_item()

Then, provided you have Scrapy properly installed, you can run it using:

scrapy runspider stackoverflow_spider.py -t json -o questions-items.json
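The -o option exports the scraped items as a JSON array, so the feed can be read back with the standard library. As a sketch (the sample string below stands in for the real questions-items.json, which only exists after a crawl), note that ItemLoader stores each field as a list of collected values by default:

```python
import json

# ItemLoader fields come out as lists by default, so an exported item
# looks like {"idx": [0], "title": ["..."]} rather than plain scalars.
sample_feed = '[{"idx": [0], "title": ["How do I do X?"]}]'

items = json.loads(sample_feed)
titles = [item["title"][0] for item in items]
```

To get scalar values instead, Scrapy's TakeFirst output processor is the usual fix.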

Answered by Almog Cohen

It is simple and straightforward :)

Just check the official documentation. I would make a small change there so the spider runs only when you do python myscript.py, and not every time you just import from it. Just add an if __name__ == "__main__": guard:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    pass

if __name__ == "__main__":
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })

    process.crawl(MySpider)
    process.start() # the script will block here until the crawling is finished

Now save the file as myscript.py and run python myscript.py.
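The effect of that guard can be seen with plain Python, no Scrapy needed. This sketch writes a tiny module to a temporary file, imports it (guard is False), then runs it the way python myscript.py would (guard is True):

```python
import importlib.util
import os
import runpy
import tempfile
import textwrap

# A tiny stand-in for myscript.py: one line runs at import time,
# one line runs only under the __main__ guard.
source = textwrap.dedent("""
    calls = []
    calls.append("import-time")
    if __name__ == "__main__":
        calls.append("script-time")
""")

with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(source)
    path = f.name

# Importing the file: __name__ is the module name, so the guard is False.
spec = importlib.util.spec_from_file_location("myscript", path)
mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mod)

# Running it as a script (what `python myscript.py` does): the guard is True.
ns = runpy.run_path(path, run_name="__main__")

os.unlink(path)
```

This is exactly why the guarded version above won't accidentally start a crawl when another module imports your spider.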

Enjoy!

Answered by Aminah Nuraini

Why don't you just do this?

from scrapy import cmdline

cmdline.execute("scrapy crawl myspider".split())

Put that script in the same directory as your scrapy.cfg file.
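One caveat with this approach: cmdline.execute calls sys.exit when the crawl finishes, so any code after it will not run. If you need to regain control afterwards, a sketch of an alternative (myspider is the example spider name from the question; run it from the directory containing scrapy.cfg, just like the cmdline version) is to launch Scrapy as a subprocess:

```python
import subprocess

# Build the same command that cmdline.execute would run in-process.
spider_name = "myspider"  # spider name from the question
cmd = ["scrapy", "crawl", spider_name]

# Unlike cmdline.execute(), which exits the interpreter when done,
# a subprocess returns control to this script afterwards:
# result = subprocess.run(cmd, check=True)  # uncomment inside a project
```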