How can I use multiple requests and pass items in between them in Scrapy (Python)?

Note: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/13910357/
Asked by user1858027

I have an item object and I need to pass it along many pages so that data from each page is stored in a single item.

My item looks like this:
from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()
    description1 = Field()
    description2 = Field()
    description3 = Field()
Now those three descriptions are on three separate pages, so I want to fill them in from several chained requests.

This currently works fine for parseDescription1:
def page_parser(self, response):
    sites = hxs.select('//div[@class="row"]')
    items = []

    request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription1)
    request.meta['item'] = item
    return request

def parseDescription1(self, response):
    item = response.meta['item']
    item['desc1'] = "test"
    return item
But I want something like this:
def page_parser(self, response):
    sites = hxs.select('//div[@class="row"]')
    items = []

    request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription1)
    request.meta['item'] = item

    request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription2)
    request.meta['item'] = item

    request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription3)
    request.meta['item'] = item
    return request

def parseDescription1(self, response):
    item = response.meta['item']
    item['desc1'] = "test"
    return item

def parseDescription2(self, response):
    item = response.meta['item']
    item['desc2'] = "test2"
    return item

def parseDescription3(self, response):
    item = response.meta['item']
    item['desc3'] = "test3"
    return item
Accepted answer by warvariuc

No problem. The following is a corrected version of your code:
def page_parser(self, response):
    sites = hxs.select('//div[@class="row"]')
    items = []

    request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription1)
    request.meta['item'] = item
    yield request

    request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription2, meta={'item': item})
    yield request

    yield Request("http://www.example.com/lin1.cpp", callback=self.parseDescription3, meta={'item': item})

def parseDescription1(self, response):
    item = response.meta['item']
    item['desc1'] = "test"
    return item

def parseDescription2(self, response):
    item = response.meta['item']
    item['desc2'] = "test2"
    return item

def parseDescription3(self, response):
    item = response.meta['item']
    item['desc3'] = "test3"
    return item
Answer by Dave McLain
In order to guarantee the ordering of the requests/callbacks, and that only one item is ultimately returned, you need to chain your requests using a form like this:
def page_parser(self, response):
    sites = hxs.select('//div[@class="row"]')
    items = []

    request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription1)
    request.meta['item'] = Item()
    return [request]

def parseDescription1(self, response):
    item = response.meta['item']
    item['desc1'] = "test"
    return [Request("http://www.example.com/lin2.cpp", callback=self.parseDescription2, meta={'item': item})]

def parseDescription2(self, response):
    item = response.meta['item']
    item['desc2'] = "test2"
    return [Request("http://www.example.com/lin3.cpp", callback=self.parseDescription3, meta={'item': item})]

def parseDescription3(self, response):
    item = response.meta['item']
    item['desc3'] = "test3"
    return [item]
Each callback function returns an iterable of items or requests; the requests are scheduled and the items are run through your item pipeline.

If you return an item from each of the callbacks, you'll end up with 4 items in various states of completeness in your pipeline, but if you return the next request instead, then you can guarantee the order of the requests and that you will have exactly one item at the end of execution.
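(As an aside, not part of the original answer: a minimal, hypothetical pipeline sketch illustrates where that single chained item ends up. With the chained form above, process_item runs exactly once, after parseDescription3 returns the completed item.)

class DescriptionsPipeline:
    # Hypothetical pipeline (would need to be enabled in ITEM_PIPELINES);
    # it just logs the completed item and passes it on unchanged.
    def process_item(self, item, spider):
        spider.logger.info("Completed item: %r", item)
        return item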
Answer by oliverguenther
The accepted answer returns a total of three items [with desc(i) set for i=1,2,3].
If you want to return a single item, Dave McLain's approach does work; however, it requires parseDescription1, parseDescription2 and parseDescription3 to succeed and run without errors in order to return the item.
For my use case, some of the sub-requests MAY return HTTP 403/404 errors at random, so I lost some of the items, even though I could have scraped them partially.
Workaround
Thus, I currently employ the following workaround: instead of only passing the item around in the request.meta dict, pass around a call stack that knows which request to call next. It will call the next entry on the stack (as long as the stack isn't empty), and return the item once the stack is empty.

The errback request parameter is used to return to the dispatcher method upon errors and simply continue with the next stack item.
def callnext(self, response):
    ''' Call next target for the item loader, or yield it if completed. '''
    # Get the meta object from the request, as the response
    # does not contain it.
    meta = response.request.meta

    # Items remaining in the stack? Execute them
    if len(meta['callstack']) > 0:
        target = meta['callstack'].pop(0)
        yield Request(target['url'], meta=meta, callback=target['callback'], errback=self.callnext)
    else:
        yield meta['loader'].load_item()

def parseDescription1(self, response):
    # Recover item (loader)
    l = response.meta['loader']
    # Use just as before
    l.add_css(...)

    # Build the call stack
    callstack = [
        {'url': "http://www.example.com/lin2.cpp",
         'callback': self.parseDescription2},
        {'url': "http://www.example.com/lin3.cpp",
         'callback': self.parseDescription3}
    ]
    # Store the stack in the request meta so that callnext can find it
    # (response.meta is an alias for response.request.meta).
    response.meta['callstack'] = callstack

    return self.callnext(response)

def parseDescription2(self, response):
    # Recover item (loader)
    l = response.meta['loader']
    # Use just as before
    l.add_css(...)

    return self.callnext(response)

def parseDescription3(self, response):
    # ...
    return self.callnext(response)
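(The snippet above never shows how loader and callstack first get into the request meta. A minimal, hypothetical sketch of the initial request, assuming the DmozItem class from the question and a plain ItemLoader, might look like the following; the exact loader setup will depend on how the callbacks extract their data.)

from scrapy import Request
from scrapy.loader import ItemLoader

def start_requests(self):
    # Hypothetical entry point, inside the same spider class as callnext:
    # seed meta with the item loader and an (initially empty) call stack
    # before the first description callback runs.
    loader = ItemLoader(item=DmozItem())
    meta = {'loader': loader, 'callstack': []}
    yield Request("http://www.example.com/lin1.cpp", meta=meta,
                  callback=self.parseDescription1, errback=self.callnext)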
Warning
This solution is still synchronous, and will still fail if you have any exceptions within the callbacks.
For more information, check the blog post I wrote about that solution.
Answer by RockJake28
All of the answers provided do have their pros and cons. I'm just adding an extra one to demonstrate how this has been simplified by changes in the codebase (both Python and Scrapy). We no longer need to use meta and can instead use cb_kwargs (i.e. keyword arguments to pass to the callback function).
So instead of doing this:
def page_parser(self, response):
    sites = hxs.select('//div[@class="row"]')
    items = []

    request = Request("http://www.example.com/lin1.cpp",
                      callback=self.parseDescription1)
    request.meta['item'] = Item()
    return [request]

def parseDescription1(self, response):
    item = response.meta['item']
    item['desc1'] = "test"
    return [Request("http://www.example.com/lin2.cpp",
                    callback=self.parseDescription2, meta={'item': item})]
...
We can do this:
def page_parser(self, response):
    sites = hxs.select('//div[@class="row"]')
    items = []

    yield response.follow("http://www.example.com/lin1.cpp",
                          callback=self.parseDescription1,
                          cb_kwargs={"item": Item()})

def parseDescription1(self, response, item):
    item['desc1'] = "More data from this new response"
    yield response.follow("http://www.example.com/lin2.cpp",
                          callback=self.parseDescription2,
                          cb_kwargs={'item': item})
...
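(The snippet trails off before the end of the chain; presumably the last callback simply yields the completed item, along the lines of this hypothetical sketch:)

def parseDescription3(self, response, item):
    # Hypothetical final step: after the last page has been parsed,
    # yield the finished item so it is sent through the item pipelines.
    item['desc3'] = "data from the third response"
    yield item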
And if for some reason you have multiple links that you want to process with the same function, we can swap
yield response.follow(a_single_url,
                      callback=some_function,
                      cb_kwargs={"data": to_pass_to_callback})
with
yield from response.follow_all([many, urls, to, parse],
                               callback=some_function,
                               cb_kwargs={"data": to_pass_to_callback})

