Python: run a Scrapy spider from a script
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, released under the CC BY-SA 4.0 license. If you use it, you must follow the same CC BY-SA license, cite the original URL, and attribute it to the original author (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/21662689/
scrapy run spider from script
Asked by Marco Dinatsoli
I want to run my spider from a script rather than via the scrapy crawl command.
I found this page:
http://doc.scrapy.org/en/latest/topics/practices.html
but actually it doesn't say where to put that script.
Any help, please?
Answered by Guy Gavriely
Luckily the Scrapy source is open, so you can follow the way the crawl command works and do the same in your code:
...
crawler = self.crawler_process.create_crawler()          # build a crawler from the current settings
spider = crawler.spiders.create(spname, **opts.spargs)   # instantiate the spider by name
crawler.crawl(spider)                                    # schedule the spider
self.crawler_process.start()                             # run until all crawls finish
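Note that the excerpt above relies on command internals (self.crawler_process only exists inside a Scrapy command object). As a hedged sketch of the equivalent in a standalone script, assuming Scrapy 1.0 or later and a helper name (run_spider) invented here for illustration:

```python
def run_spider(spider_cls, settings=None):
    """Start a crawl for spider_cls, mirroring what the crawl command
    does internally. A sketch, not the answer's exact code: imports are
    deferred so the helper can be defined without Scrapy installed."""
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    process = CrawlerProcess(settings or get_project_settings())
    process.crawl(spider_cls)  # schedule the spider class
    process.start()            # blocks until crawling is finished
```

Call it as run_spider(MySpider) from the directory containing scrapy.cfg so the project settings are found.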
Answered by Elias Dorneles
You can just create a normal Python script, and then use Scrapy's command line option runspider, which allows you to run a spider without having to create a project.
For example, you can create a single file stackoverflow_spider.py with something like this:
import scrapy
import scrapy.contrib.loader   # ItemLoader lives here in old Scrapy versions
import scrapy.selector

class QuestionItem(scrapy.item.Item):
    idx = scrapy.item.Field()
    title = scrapy.item.Field()

class StackoverflowSpider(scrapy.spider.Spider):
    name = 'SO'
    start_urls = ['http://stackoverflow.com']

    def parse(self, response):
        sel = scrapy.selector.Selector(response)
        questions = sel.css('#question-mini-list .question-summary')
        for i, elem in enumerate(questions):
            l = scrapy.contrib.loader.ItemLoader(QuestionItem(), elem)
            l.add_value('idx', i)
            l.add_xpath('title', ".//h3/a/text()")
            yield l.load_item()
Then, provided you have Scrapy properly installed, you can run it using:
scrapy runspider stackoverflow_spider.py -t json -o questions-items.json
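If you want to trigger that same command from another Python program rather than from a shell, one hedged option (assuming the scrapy executable is on your PATH; the helper names here are invented for illustration) is to build the argv list and hand it to subprocess:

```python
import subprocess

def runspider_cmd(spider_file, out_file):
    # Mirror the shell command above as an argv list.
    return ["scrapy", "runspider", spider_file, "-t", "json", "-o", out_file]

def run_standalone_spider(spider_file, out_file):
    """Run `scrapy runspider` as a subprocess and return its exit code."""
    return subprocess.run(runspider_cmd(spider_file, out_file)).returncode
```

Using an argv list rather than a single string avoids shell-quoting issues with filenames containing spaces.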
Answered by Almog Cohen
It is simple and straightforward :)
Just check the official documentation. I would make a small change there so that the spider runs only when you execute python myscript.py, and not every time you import from it. Just add an if __name__ == "__main__": guard:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    pass

if __name__ == "__main__":
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })
    process.crawl(MySpider)
    process.start()  # the script will block here until the crawling is finished
Now save the file as myscript.py and run python myscript.py.
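A hedged side note, not part of the answer: a single CrawlerProcess can also queue several spiders before start() is called. A sketch (the helper crawl_many and any spider classes you pass it are hypothetical names):

```python
def crawl_many(spider_classes, settings=None):
    """Schedule each spider class on one CrawlerProcess, then run them
    all; start() blocks until every crawl has finished. Imports are
    deferred so the helper can be defined without Scrapy installed."""
    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess(settings or {})
    for cls in spider_classes:
        process.crawl(cls)  # queue each spider
    process.start()
```

For example, crawl_many([SpiderA, SpiderB]) where SpiderA and SpiderB are your own spider classes.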
Enjoy!
Answered by Aminah Nuraini
Why don't you just do this?
from scrapy import cmdline
cmdline.execute("scrapy crawl myspider".split())
Put that script in the same directory where you put scrapy.cfg.
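A hedged caveat, not from the answer: splitting the command with str.split as above breaks if any argument contains spaces (for example a quoted output filename). Python's standard shlex module splits shell-style strings correctly; the helper name scrapy_argv is invented here:

```python
from shlex import split as shell_split

def scrapy_argv(command):
    """Turn a scrapy command string into an argv list, honoring shell
    quoting, before passing it to cmdline.execute."""
    return shell_split(command)
```

For example, cmdline.execute(scrapy_argv('scrapy crawl myspider -o "my items.json"')). Also note that in many Scrapy versions cmdline.execute hands control to Scrapy and does not return, so put it last in your script.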

