How can I use multiple requests and pass items in between them in Scrapy (Python)?

Note: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/13910357/
Asked by user1858027

I have an item object and I need to pass it along many pages so that data from each page is stored in a single item.

My item looks like this:
from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()
    description1 = Field()
    description2 = Field()
    description3 = Field()
Now those three descriptions are on three separate pages, so I want to fill them in from several chained requests.

This currently works fine for parseDescription1:
def page_parser(self, response):
    sites = hxs.select('//div[@class="row"]')
    items = []

    request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription1)
    request.meta['item'] = item
    return request

def parseDescription1(self, response):
    item = response.meta['item']
    item['desc1'] = "test"
    return item
But I want something like this:
def page_parser(self, response):
    sites = hxs.select('//div[@class="row"]')
    items = []

    request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription1)
    request.meta['item'] = item

    request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription2)
    request.meta['item'] = item

    request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription3)
    request.meta['item'] = item
    return request

def parseDescription1(self, response):
    item = response.meta['item']
    item['desc1'] = "test"
    return item

def parseDescription2(self, response):
    item = response.meta['item']
    item['desc2'] = "test2"
    return item

def parseDescription3(self, response):
    item = response.meta['item']
    item['desc3'] = "test3"
    return item
Accepted answer by warvariuc

No problem. The following is a corrected version of your code:
def page_parser(self, response):
    sites = hxs.select('//div[@class="row"]')
    items = []

    request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription1)
    request.meta['item'] = item
    yield request

    request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription2, meta={'item': item})
    yield request

    yield Request("http://www.example.com/lin1.cpp", callback=self.parseDescription3, meta={'item': item})

def parseDescription1(self, response):
    item = response.meta['item']
    item['desc1'] = "test"
    return item

def parseDescription2(self, response):
    item = response.meta['item']
    item['desc2'] = "test2"
    return item

def parseDescription3(self, response):
    item = response.meta['item']
    item['desc3'] = "test3"
    return item
Answer by Dave McLain
In order to guarantee the ordering of the requests/callbacks, and that only one item is ultimately returned, you need to chain your requests using a form like this:
def page_parser(self, response):
    sites = hxs.select('//div[@class="row"]')
    items = []

    request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription1)
    request.meta['item'] = Item()
    return [request]

def parseDescription1(self, response):
    item = response.meta['item']
    item['desc1'] = "test"
    return [Request("http://www.example.com/lin2.cpp", callback=self.parseDescription2, meta={'item': item})]

def parseDescription2(self, response):
    item = response.meta['item']
    item['desc2'] = "test2"
    return [Request("http://www.example.com/lin3.cpp", callback=self.parseDescription3, meta={'item': item})]

def parseDescription3(self, response):
    item = response.meta['item']
    item['desc3'] = "test3"
    return [item]
Each callback function returns an iterable of items or requests; the requests are scheduled and the items are run through your item pipeline.

If you return an item from each of the callbacks, you'll end up with 4 items in various states of completeness in your pipeline, but if you return the next request instead, then you can guarantee the order of the requests and that you will have exactly one item at the end of execution.
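(As an aside, not part of the original answer: a minimal, hypothetical pipeline sketch illustrates where that single chained item ends up. With the chained form above, process_item runs exactly once, after parseDescription3 returns the completed item.)

class DescriptionsPipeline:
    # Hypothetical pipeline (would need to be enabled in ITEM_PIPELINES);
    # it just logs the completed item and passes it on unchanged.
    def process_item(self, item, spider):
        spider.logger.info("Completed item: %r", item)
        return item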
Answer by oliverguenther
The accepted answer returns a total of three items [with desc(i) set for i=1,2,3].
If you want to return a single item, Dave McLain's approach does work; however, it requires parseDescription1, parseDescription2 and parseDescription3 to succeed and run without errors in order to return the item.
For my use case, some of the sub-requests MAY return HTTP 403/404 errors at random, so I lost some of the items, even though I could have scraped them partially.
Workaround
Thus, I currently employ the following workaround: instead of only passing the item around in the request.meta dict, pass around a call stack that knows which request to call next. It will call the next entry on the stack (as long as the stack isn't empty), and return the item once the stack is empty.

The errback request parameter is used to return to the dispatcher method upon errors and simply continue with the next stack item.
def callnext(self, response):
    ''' Call next target for the item loader, or yield it if completed. '''
    # Get the meta object from the request, as the response
    # does not contain it.
    meta = response.request.meta

    # Items remaining in the stack? Execute them
    if len(meta['callstack']) > 0:
        target = meta['callstack'].pop(0)
        yield Request(target['url'], meta=meta, callback=target['callback'], errback=self.callnext)
    else:
        yield meta['loader'].load_item()

def parseDescription1(self, response):
    # Recover item (loader)
    l = response.meta['loader']
    # Use just as before
    l.add_css(...)

    # Build the call stack
    callstack = [
        {'url': "http://www.example.com/lin2.cpp",
         'callback': self.parseDescription2},
        {'url': "http://www.example.com/lin3.cpp",
         'callback': self.parseDescription3}
    ]
    # Store the stack in the request meta so that callnext can find it
    # (response.meta is an alias for response.request.meta).
    response.meta['callstack'] = callstack

    return self.callnext(response)

def parseDescription2(self, response):
    # Recover item (loader)
    l = response.meta['loader']
    # Use just as before
    l.add_css(...)

    return self.callnext(response)

def parseDescription3(self, response):
    # ...
    return self.callnext(response)
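(The snippet above never shows how loader and callstack first get into the request meta. A minimal, hypothetical sketch of the initial request, assuming the DmozItem class from the question and a plain ItemLoader, might look like the following; the exact loader setup will depend on how the callbacks extract their data.)

from scrapy import Request
from scrapy.loader import ItemLoader

def start_requests(self):
    # Hypothetical entry point, inside the same spider class as callnext:
    # seed meta with the item loader and an (initially empty) call stack
    # before the first description callback runs.
    loader = ItemLoader(item=DmozItem())
    meta = {'loader': loader, 'callstack': []}
    yield Request("http://www.example.com/lin1.cpp", meta=meta,
                  callback=self.parseDescription1, errback=self.callnext)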
Warning
This solution is still synchronous, and will still fail if you have any exceptions within the callbacks.
For more information, check the blog post I wrote about that solution.
Answer by RockJake28
All of the answers provided do have their pros and cons. I'm just adding an extra one to demonstrate how this has been simplified by changes in the codebase (both Python and Scrapy). We no longer need to use meta and can instead use cb_kwargs (i.e. keyword arguments to pass to the callback function).
So instead of doing this:
def page_parser(self, response):
    sites = hxs.select('//div[@class="row"]')
    items = []

    request = Request("http://www.example.com/lin1.cpp",
                      callback=self.parseDescription1)
    request.meta['item'] = Item()
    return [request]

def parseDescription1(self, response):
    item = response.meta['item']
    item['desc1'] = "test"
    return [Request("http://www.example.com/lin2.cpp",
                    callback=self.parseDescription2, meta={'item': item})]
...
We can do this:
def page_parser(self, response):
    sites = hxs.select('//div[@class="row"]')
    items = []

    yield response.follow("http://www.example.com/lin1.cpp",
                          callback=self.parseDescription1,
                          cb_kwargs={"item": Item()})

def parseDescription1(self, response, item):
    item['desc1'] = "More data from this new response"
    yield response.follow("http://www.example.com/lin2.cpp",
                          callback=self.parseDescription2,
                          cb_kwargs={'item': item})
...
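(The snippet trails off before the end of the chain; presumably the last callback simply yields the completed item, along the lines of this hypothetical sketch:)

def parseDescription3(self, response, item):
    # Hypothetical final step: after the last page has been parsed,
    # yield the finished item so it is sent through the item pipelines.
    item['desc3'] = "data from the third response"
    yield item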
And if for some reason you have multiple links that you want to process with the same function, we can swap
yield response.follow(a_single_url,
                      callback=some_function,
                      cb_kwargs={"data": to_pass_to_callback})
with
yield from response.follow_all([many, urls, to, parse],
                               callback=some_function,
                               cb_kwargs={"data": to_pass_to_callback})

