Python: How to pass a user-defined argument to a Scrapy spider
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must follow the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/15611605/
How to pass a user defined argument in scrapy spider
Asked by L Lawliet
I am trying to pass a user-defined argument to a Scrapy spider. Can anyone suggest how to do that?
I read about a parameter -a somewhere but have no idea how to use it.
Accepted answer by Steven Almeroth
Spider arguments are passed in the crawl command using the -a option. For example:
scrapy crawl myspider -a category=electronics -a domain=system
Spiders can access arguments as attributes:
class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category='', **kwargs):
        self.start_urls = [f'http://www.example.com/{category}']  # py36
        super().__init__(**kwargs)  # python3

    def parse(self, response):
        self.log(self.domain)  # system
Taken from the Scrapy doc: http://doc.scrapy.org/en/latest/topics/spiders.html#spider-arguments
Update 2013: Add second argument
Update 2015: Adjust wording
Update 2016: Use newer base class and add super, thanks @Birla
Update 2017: Use Python3 super
# previously
super(MySpider, self).__init__(**kwargs) # python2
Update 2018: As @eLRuLL points out, spiders can access arguments as attributes
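As a side note that goes beyond the original answer: the same arguments can also be supplied when a spider is started from a script instead of the command line, because CrawlerProcess.crawl forwards keyword arguments to the spider constructor. A minimal sketch, reusing the MySpider class defined above:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
# Keyword arguments here become spider attributes, just like -a on the command line.
process.crawl(MySpider, category='electronics', domain='system')
process.start()  # blocks until the crawl finishes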
Answer by Hassan Raza
To pass arguments with the crawl command:
scrapy crawl myspider -a category='mycategory' -a domain='example.com'
To pass arguments to run on scrapyd, replace -a with -d:
curl http://your.ip.address.here:port/schedule.json -d spider=myspider -d category='mycategory' -d domain='example.com'
The spider will receive arguments in its constructor.
class MySpider(Spider):
    name = "myspider"

    def __init__(self, category='', domain='', *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.category = category
        self.domain = domain
Scrapy sets all the arguments as spider attributes, so you can skip the __init__ method completely. Be careful to use the getattr method for getting those attributes, so your code does not break.
class MySpider(Spider):
    name = "myspider"
    start_urls = ('https://httpbin.org/ip',)

    def parse(self, response):
        print(getattr(self, 'category', ''))
        print(getattr(self, 'domain', ''))
Answer by Siyaram Malav
Spider arguments are passed while running the crawl command using the -a option. For example, if I want to pass a domain name as an argument to my spider, then I will do this:
scrapy crawl myspider -a domain="http://www.example.com"
And receive the arguments in the spider's constructor:
class MySpider(BaseSpider):
    name = 'myspider'

    def __init__(self, domain='', *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = [domain]
        # ...
It will work :)
Answer by eLRuLL
Previous answers were correct, but you don't have to declare the constructor (__init__) every time you want to code a Scrapy spider; you could just specify the parameters as before:
scrapy crawl myspider -a parameter1=value1 -a parameter2=value2
and in your spider code you can just use them as spider attributes:
class MySpider(Spider):
    name = 'myspider'
    ...

    def parse(self, response):
        ...
        if self.parameter1 == value1:
            # this is True

        # or also
        if getattr(self, 'parameter2') == value2:
            # this is also True
And it just works.
Answer by Nagendran
Alternatively, we can use ScrapyD, which exposes an API through which we can pass the start_urls and the spider name. ScrapyD has APIs to stop/start/status/list the spiders.
pip install scrapyd scrapyd-deploy
scrapyd
scrapyd-deploy local -p default
scrapyd-deploy will deploy the spider into the daemon in the form of an egg, and it even maintains versions of the spider. While starting the spider you can mention which version of the spider to use.
class MySpider(CrawlSpider):
    name = 'testspider'

    def __init__(self, start_urls, *args, **kwargs):
        self.start_urls = start_urls.split('|')
        super().__init__(*args, **kwargs)
curl http://localhost:6800/schedule.json -d project=default -d spider=testspider -d start_urls="https://www.anyurl...|https://www.anyurl2"
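One detail implied by the version remark above but not shown in the original answer: scrapyd's schedule.json also accepts an optional _version parameter to select a specific deployed version of the project (the version string below is only a placeholder):

curl http://localhost:6800/schedule.json -d project=default -d spider=testspider -d _version=r1 -d start_urls="https://www.anyurl...|https://www.anyurl2"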
An added advantage is that you can build your own UI to accept the URL and other parameters from the user and schedule a task using the above scrapyd schedule API.
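As an illustrative sketch of that idea (not from the original answer), the backend of such a UI could schedule a crawl by POSTing to schedule.json, for example with the Python requests library; the host, project, and spider names below are placeholders:

import requests

# Placeholder values; point these at your own scrapyd daemon, project and spider.
payload = {
    'project': 'default',
    'spider': 'testspider',
    'start_urls': 'https://www.anyurl...|https://www.anyurl2',
}
response = requests.post('http://localhost:6800/schedule.json', data=payload)
print(response.json())  # e.g. {"status": "ok", "jobid": "..."}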
Refer to the scrapyd API documentation for more details.
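For instance, the stop/start/status/list operations mentioned at the beginning of this answer map onto endpoints such as the ones below; a sketch assuming the default port 6800 and the project name used above, where <jobid> stands for a job id returned by schedule.json:

curl http://localhost:6800/daemonstatus.json                              # overall daemon status
curl "http://localhost:6800/listspiders.json?project=default"             # spiders in a project
curl "http://localhost:6800/listjobs.json?project=default"                # pending/running/finished jobs
curl http://localhost:6800/cancel.json -d project=default -d job=<jobid>  # stop a running job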

