Python: How to pass a user-defined argument to a Scrapy spider
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must follow the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/15611605/
How to pass a user defined argument in scrapy spider
Asked by L Lawliet
I am trying to pass a user-defined argument to a Scrapy spider. Can anyone suggest how to do that?
I read about a parameter -a somewhere but have no idea how to use it.
Accepted answer by Steven Almeroth
Spider arguments are passed in the crawl command using the -a option. For example:
scrapy crawl myspider -a category=electronics -a domain=system
Spiders can access arguments as attributes:
class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category='', **kwargs):
        self.start_urls = [f'http://www.example.com/{category}']  # py36
        super().__init__(**kwargs)  # python3

    def parse(self, response):
        self.log(self.domain)  # system
Taken from the Scrapy doc: http://doc.scrapy.org/en/latest/topics/spiders.html#spider-arguments
Update 2013: Add second argument
Update 2015: Adjust wording
Update 2016: Use newer base class and add super, thanks @Birla
Update 2017: Use Python3 super
# previously
super(MySpider, self).__init__(**kwargs) # python2
Update 2018: As @eLRuLL points out, spiders can access arguments as attributes
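As a side note that goes beyond the original answer: the same arguments can also be supplied when a spider is started from a script instead of the command line, because CrawlerProcess.crawl forwards keyword arguments to the spider constructor. A minimal sketch, reusing the MySpider class defined above:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
# Keyword arguments here become spider attributes, just like -a on the command line.
process.crawl(MySpider, category='electronics', domain='system')
process.start()  # blocks until the crawl finishes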
Answer by Hassan Raza
To pass arguments with the crawl command:
scrapy crawl myspider -a category='mycategory' -a domain='example.com'
To pass arguments to run on scrapyd, replace -a with -d:
curl http://your.ip.address.here:port/schedule.json -d spider=myspider -d category='mycategory' -d domain='example.com'
The spider will receive arguments in its constructor.
class MySpider(Spider):
    name = "myspider"

    def __init__(self, category='', domain='', *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.category = category
        self.domain = domain
Scrapy sets all the arguments as spider attributes, so you can skip the __init__ method completely. Be careful to use the getattr method for getting those attributes, so your code does not break.
class MySpider(Spider):
    name = "myspider"
    start_urls = ('https://httpbin.org/ip',)

    def parse(self, response):
        print(getattr(self, 'category', ''))
        print(getattr(self, 'domain', ''))
Answer by Siyaram Malav
Spider arguments are passed while running the crawl command using the -a option. For example, if I want to pass a domain name as an argument to my spider, then I will do this:
scrapy crawl myspider -a domain="http://www.example.com"
And receive the arguments in the spider's constructor:
class MySpider(BaseSpider):
    name = 'myspider'

    def __init__(self, domain='', *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = [domain]
        # ...
It will work :)
Answer by eLRuLL
Previous answers were correct, but you don't have to declare the constructor (__init__) every time you want to code a Scrapy spider; you could just specify the parameters as before:
scrapy crawl myspider -a parameter1=value1 -a parameter2=value2
and in your spider code you can just use them as spider attributes:
class MySpider(Spider):
    name = 'myspider'
    ...

    def parse(self, response):
        ...
        if self.parameter1 == value1:
            # this is True

        # or also
        if getattr(self, 'parameter2') == value2:
            # this is also True
And it just works.
Answer by Nagendran
Alternatively, we can use ScrapyD, which exposes an API through which we can pass the start_urls and the spider name. ScrapyD has APIs to stop/start/status/list the spiders.
pip install scrapyd scrapyd-deploy
scrapyd
scrapyd-deploy local -p default
scrapyd-deploy will deploy the spider into the daemon in the form of an egg, and it even maintains versions of the spider. While starting the spider you can mention which version of the spider to use.
class MySpider(CrawlSpider):
    name = 'testspider'

    def __init__(self, start_urls, *args, **kwargs):
        self.start_urls = start_urls.split('|')
        super().__init__(*args, **kwargs)
curl http://localhost:6800/schedule.json -d project=default -d spider=testspider -d start_urls="https://www.anyurl...|https://www.anyurl2"
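One detail implied by the version remark above but not shown in the original answer: scrapyd's schedule.json also accepts an optional _version parameter to select a specific deployed version of the project (the version string below is only a placeholder):

curl http://localhost:6800/schedule.json -d project=default -d spider=testspider -d _version=r1 -d start_urls="https://www.anyurl...|https://www.anyurl2"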
An added advantage is that you can build your own UI to accept the URL and other parameters from the user and schedule a task using the above scrapyd schedule API.
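As an illustrative sketch of that idea (not from the original answer), the backend of such a UI could schedule a crawl by POSTing to schedule.json, for example with the Python requests library; the host, project, and spider names below are placeholders:

import requests

# Placeholder values; point these at your own scrapyd daemon, project and spider.
payload = {
    'project': 'default',
    'spider': 'testspider',
    'start_urls': 'https://www.anyurl...|https://www.anyurl2',
}
response = requests.post('http://localhost:6800/schedule.json', data=payload)
print(response.json())  # e.g. {"status": "ok", "jobid": "..."}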
Refer to the scrapyd API documentation for more details.
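For instance, the stop/start/status/list operations mentioned at the beginning of this answer map onto endpoints such as the ones below; a sketch assuming the default port 6800 and the project name used above, where <jobid> stands for a job id returned by schedule.json:

curl http://localhost:6800/daemonstatus.json                              # overall daemon status
curl "http://localhost:6800/listspiders.json?project=default"             # spiders in a project
curl "http://localhost:6800/listjobs.json?project=default"                # pending/running/finished jobs
curl http://localhost:6800/cancel.json -d project=default -d job=<jobid>  # stop a running job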

