Python: run a Scrapy spider from a script
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, released under the CC BY-SA 4.0 license. If you use it, you must follow the same CC BY-SA license, cite the original URL, and attribute it to the original author (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/21662689/
scrapy run spider from script
Asked by Marco Dinatsoli
I want to run my spider from a script rather than via the scrapy crawl command.
I found this page:
http://doc.scrapy.org/en/latest/topics/practices.html
but actually it doesn't say where to put that script.
Any help, please?
Answered by Guy Gavriely
Luckily the Scrapy source is open, so you can follow the way the crawl command works and do the same in your code:
...
crawler = self.crawler_process.create_crawler()          # build a crawler from the current settings
spider = crawler.spiders.create(spname, **opts.spargs)   # instantiate the spider by name
crawler.crawl(spider)                                    # schedule the spider
self.crawler_process.start()                             # run until all crawls finish
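Note that the excerpt above relies on command internals (self.crawler_process only exists inside a Scrapy command object). As a hedged sketch of the equivalent in a standalone script, assuming Scrapy 1.0 or later and a helper name (run_spider) invented here for illustration:

```python
def run_spider(spider_cls, settings=None):
    """Start a crawl for spider_cls, mirroring what the crawl command
    does internally. A sketch, not the answer's exact code: imports are
    deferred so the helper can be defined without Scrapy installed."""
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    process = CrawlerProcess(settings or get_project_settings())
    process.crawl(spider_cls)  # schedule the spider class
    process.start()            # blocks until crawling is finished
```

Call it as run_spider(MySpider) from the directory containing scrapy.cfg so the project settings are found.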
Answered by Elias Dorneles
You can just create a normal Python script, and then use Scrapy's command line option runspider, which allows you to run a spider without having to create a project.
For example, you can create a single file stackoverflow_spider.py with something like this:
import scrapy
import scrapy.contrib.loader   # ItemLoader lives here in old Scrapy versions
import scrapy.selector

class QuestionItem(scrapy.item.Item):
    idx = scrapy.item.Field()
    title = scrapy.item.Field()

class StackoverflowSpider(scrapy.spider.Spider):
    name = 'SO'
    start_urls = ['http://stackoverflow.com']

    def parse(self, response):
        sel = scrapy.selector.Selector(response)
        questions = sel.css('#question-mini-list .question-summary')
        for i, elem in enumerate(questions):
            l = scrapy.contrib.loader.ItemLoader(QuestionItem(), elem)
            l.add_value('idx', i)
            l.add_xpath('title', ".//h3/a/text()")
            yield l.load_item()
Then, provided you have Scrapy properly installed, you can run it using:
scrapy runspider stackoverflow_spider.py -t json -o questions-items.json
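If you want to trigger that same command from another Python program rather than from a shell, one hedged option (assuming the scrapy executable is on your PATH; the helper names here are invented for illustration) is to build the argv list and hand it to subprocess:

```python
import subprocess

def runspider_cmd(spider_file, out_file):
    # Mirror the shell command above as an argv list.
    return ["scrapy", "runspider", spider_file, "-t", "json", "-o", out_file]

def run_standalone_spider(spider_file, out_file):
    """Run `scrapy runspider` as a subprocess and return its exit code."""
    return subprocess.run(runspider_cmd(spider_file, out_file)).returncode
```

Using an argv list rather than a single string avoids shell-quoting issues with filenames containing spaces.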
Answered by Almog Cohen
It is simple and straightforward :)
Just check the official documentation. I would make a small change there so that the spider runs only when you execute python myscript.py, and not every time you import from it. Just add an if __name__ == "__main__": guard:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    pass

if __name__ == "__main__":
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })
    process.crawl(MySpider)
    process.start()  # the script will block here until the crawling is finished
Now save the file as myscript.py and run python myscript.py.
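A hedged side note, not part of the answer: a single CrawlerProcess can also queue several spiders before start() is called. A sketch (the helper crawl_many and any spider classes you pass it are hypothetical names):

```python
def crawl_many(spider_classes, settings=None):
    """Schedule each spider class on one CrawlerProcess, then run them
    all; start() blocks until every crawl has finished. Imports are
    deferred so the helper can be defined without Scrapy installed."""
    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess(settings or {})
    for cls in spider_classes:
        process.crawl(cls)  # queue each spider
    process.start()
```

For example, crawl_many([SpiderA, SpiderB]) where SpiderA and SpiderB are your own spider classes.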
Enjoy!
Answered by Aminah Nuraini
Why don't you just do this?
from scrapy import cmdline
cmdline.execute("scrapy crawl myspider".split())
Put that script in the same directory where you put scrapy.cfg.
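A hedged caveat, not from the answer: splitting the command with str.split as above breaks if any argument contains spaces (for example a quoted output filename). Python's standard shlex module splits shell-style strings correctly; the helper name scrapy_argv is invented here:

```python
from shlex import split as shell_split

def scrapy_argv(command):
    """Turn a scrapy command string into an argv list, honoring shell
    quoting, before passing it to cmdline.execute."""
    return shell_split(command)
```

For example, cmdline.execute(scrapy_argv('scrapy crawl myspider -o "my items.json"')). Also note that in many Scrapy versions cmdline.execute hands control to Scrapy and does not return, so put it last in your script.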

