Python Scrapy - 如何管理 cookie/会话
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/4981440/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me):
StackOverFlow
Scrapy - how to manage cookies/sessions
提问by Acorn
I'm a bit confused as to how cookies work with Scrapy, and how you manage those cookies.
我对 cookie 如何与 Scrapy 一起工作以及如何管理这些 cookie 感到有些困惑。
This is basically a simplified version of what I'm trying to do:

这基本上是我正在尝试做的事情的简化版本:

The way the website works:
网站的运作方式:
When you visit the website you get a session cookie.
当您访问该网站时,您会获得一个会话 cookie。
When you make a search, the website remembers what you searched for, so when you do something like going to the next page of results, it knows the search it is dealing with.
当您进行搜索时,该网站会记住您搜索的内容,因此当您执行诸如转到下一页结果之类的操作时,它知道它正在处理的搜索。
My script:
我的脚本:
My spider has a start url of searchpage_url
我的蜘蛛有一个 searchpage_url 的起始网址
The searchpage is requested by parse() and the search form response gets passed to search_generator()
搜索页面由 parse() 请求,搜索表单响应被传递给 search_generator()
search_generator() then yields lots of search requests using FormRequest and the search form response.
search_generator() 然后使用 FormRequest 和搜索表单响应 yield 大量搜索请求。
Each of those FormRequests, and the subsequent child requests, needs to have its own session, so needs to have its own individual cookiejar and its own session cookie.
每个 FormRequests 和后续的子请求都需要有自己的会话,所以需要有自己的单独的 cookiejar 和自己的会话 cookie。
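To make that flow concrete, here is a minimal sketch of the structure described above (the start URL, the form field name "query", and the example search terms are assumptions for illustration, not details from the question):

import scrapy

class SearchSpider(scrapy.Spider):
    name = "search"
    start_urls = ["http://www.example.com/search"]  # stands in for searchpage_url
    queries = ["first search", "second search"]     # assumed search terms

    def parse(self, response):
        # The search page response is handed to the generator of search requests.
        return self.search_generator(response)

    def search_generator(self, response):
        # One FormRequest per search; giving each its own 'cookiejar' meta value
        # (see the accepted answer below) keeps the sessions separate.
        for i, query in enumerate(self.queries):
            yield scrapy.FormRequest.from_response(
                response,
                formdata={"query": query},   # the real form field name is assumed
                meta={"cookiejar": i},
                callback=self.parse_results,
            )

    def parse_results(self, response):
        pass  # handle one search's results within its own session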
I've seen the section of the docs that talks about a meta option that stops cookies from being merged. What does that actually mean? Does it mean the spider that makes the request will have its own cookiejar for the rest of its life?
我已经看到文档中讨论阻止合并 cookie 的元选项的部分。这实际上意味着什么?这是否意味着发出请求的蜘蛛将在其余生中拥有自己的 cookiejar?
If the cookies are then on a per Spider level, then how does it work when multiple spiders are spawned? Is it possible to make only the first request generator spawn new spiders and make sure that from then on only that spider deals with future requests?
如果 cookie 是在每个蜘蛛级别上的,那么当生成多个蜘蛛时它是如何工作的?是否可以只让第一个请求生成器产生新的蜘蛛,并确保从那时起只有那个蜘蛛处理未来的请求?
I assume I have to disable multiple concurrent requests... otherwise one spider would be making multiple searches under the same session cookie, and future requests would only relate to the most recent search made?
我假设我必须禁用多个并发请求。否则一个蜘蛛会在同一个会话 cookie 下进行多次搜索,而未来的请求只会与最近的搜索有关?
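For reference, serializing requests is a one-line settings change; a hedged sketch follows, though with per-request cookiejars (see the accepted answer below) it should not be necessary:

# settings.py -- only if you really do want a single request in flight at a time
CONCURRENT_REQUESTS = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 1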
I'm confused, any clarification would be greatly received!
我很困惑,任何澄清都会很受欢迎!
EDIT:
编辑:
Another option I've just thought of is managing the session cookie completely manually, and passing it from one request to the other.
我刚刚想到的另一种选择是完全手动管理会话 cookie,并将其从一个请求传递到另一个请求。
I suppose that would mean disabling cookies... and then grabbing the session cookie from the search response, and passing it along to each subsequent request.
我想这意味着禁用 cookie.. 然后从搜索响应中获取会话 cookie,并将其传递给每个后续请求。
Is this what you should do in this situation?
这是你在这种情况下应该做的吗?
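A rough sketch of that manual approach, assuming cookies are disabled via COOKIES_ENABLED and that the first response sets a PHPSESSID-style session cookie; the URLs and the follow-up page are placeholders, not from the question:

import scrapy

class ManualSessionSpider(scrapy.Spider):
    name = "manual_session"
    custom_settings = {"COOKIES_ENABLED": False}   # handle cookies ourselves
    start_urls = ["http://www.example.com/search"]

    def parse(self, response):
        # Grab the session cookie the site just set (first Set-Cookie header only).
        set_cookie_headers = response.headers.getlist("Set-Cookie")
        if not set_cookie_headers:
            return
        session_cookie = set_cookie_headers[0].decode("latin-1").split(";")[0]
        yield scrapy.Request(
            "http://www.example.com/search?page=2",   # assumed follow-up request
            headers={"Cookie": session_cookie},       # pass the session along by hand
            callback=self.parse_results,
        )

    def parse_results(self, response):
        pass  # every request in this chain must keep forwarding the Cookie header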
采纳答案by Noah_S
Three years later, I think this is exactly what you were looking for: http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#std:reqmeta-cookiejar
三年后,我认为这正是你要找的:http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#std:reqmeta-cookiejar
Just use something like this in your spider's start_requests method:
只需在蜘蛛的 start_requests 方法中使用类似的东西:
for i, url in enumerate(urls):
    yield scrapy.Request("http://www.example.com", meta={'cookiejar': i},
                         callback=self.parse_page)
And remember that for subsequent requests, you need to explicitly reattach the cookiejar each time:
请记住,对于后续请求,您每次都需要明确地重新附加 cookiejar:
def parse_page(self, response):
    # do some processing
    return scrapy.Request("http://www.example.com/otherpage",
                          meta={'cookiejar': response.meta['cookiejar']},
                          callback=self.parse_other_page)
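Putting the two snippets together, a self-contained version might look like the sketch below; the URLs, the number of sessions, and parse_other_page are placeholders rather than part of the answer:

import scrapy

class CookiejarSpider(scrapy.Spider):
    name = "cookiejar_demo"

    def start_requests(self):
        urls = ["http://www.example.com"] * 3   # placeholder: one URL per session
        for i, url in enumerate(urls):
            # Each distinct 'cookiejar' meta value gets its own cookie jar.
            yield scrapy.Request(url, meta={'cookiejar': i},
                                 callback=self.parse_page)

    def parse_page(self, response):
        # Re-attach the same jar so this session's cookies keep being sent.
        yield scrapy.Request("http://www.example.com/otherpage",
                             meta={'cookiejar': response.meta['cookiejar']},
                             callback=self.parse_other_page)

    def parse_other_page(self, response):
        pass  # processed within the same session as the request that led here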
回答by Pablo Hoffman
I think the simplest approach would be to run multiple instances of the same spider using the search query as a spider argument (that would be received in the constructor), in order to reuse the cookies management feature of Scrapy. So you'll have multiple spider instances, each one crawling one specific search query and its results. But you need to run the spiders yourself with:
我认为最简单的方法是使用搜索查询作为蜘蛛参数(将在构造函数中接收)运行同一蜘蛛的多个实例,以便重用 Scrapy 的 cookie 管理功能。因此,您将拥有多个蜘蛛实例,每一个都抓取一个特定的搜索查询及其结果。但是你需要自己运行蜘蛛:
scrapy crawl myspider -a search_query=something
Or you can use Scrapyd for running all the spiders through the JSON API.
或者您可以使用 Scrapyd 通过 JSON API 运行所有蜘蛛。
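A minimal sketch of a spider that takes the search query as an argument (the form URL and the field name "q" are assumptions):

import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"

    def __init__(self, search_query=None, *args, **kwargs):
        # The value from -a search_query=something arrives here as a constructor argument.
        super(MySpider, self).__init__(*args, **kwargs)
        self.search_query = search_query

    def start_requests(self):
        yield scrapy.FormRequest(
            "http://www.example.com/search",          # assumed search form URL
            formdata={"q": self.search_query or ""},  # assumed form field name
            callback=self.parse_results,
        )

    def parse_results(self, response):
        pass  # one spider instance == one search == one cookie session

With Scrapyd, the same argument can be passed through the JSON API, e.g. curl http://localhost:6800/schedule.json -d project=myproject -d spider=myspider -d search_query=something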
回答by warvariuc
from scrapy.http.cookies import CookieJar
# Other imports implied by the original snippet (old Scrapy 0.x API):
import urlparse
from scrapy import log
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
...

class Spider(BaseSpider):

    def parse(self, response):
        '''Parse category page, extract subcategories links.'''
        hxs = HtmlXPathSelector(response)
        subcategories = hxs.select(".../@href")
        for subcategorySearchLink in subcategories:
            subcategorySearchLink = urlparse.urljoin(response.url, subcategorySearchLink)
            self.log('Found subcategory link: ' + subcategorySearchLink, log.DEBUG)
            yield Request(subcategorySearchLink, callback=self.extractItemLinks,
                          meta={'dont_merge_cookies': True})
            '''Use dont_merge_cookies to force the site to generate a new PHPSESSID cookie.
            This is needed because the site uses sessions to remember the search parameters.'''

    def extractItemLinks(self, response):
        '''Extract item links from subcategory page and go to next page.'''
        hxs = HtmlXPathSelector(response)
        for itemLink in hxs.select(".../a/@href"):
            itemLink = urlparse.urljoin(response.url, itemLink)
            print 'Requesting item page %s' % itemLink
            yield Request(...)

        nextPageLink = self.getFirst(".../@href", hxs)  # getFirst() is a helper defined elsewhere
        if nextPageLink:
            nextPageLink = urlparse.urljoin(response.url, nextPageLink)
            self.log('\nGoing to next search page: ' + nextPageLink + '\n', log.DEBUG)
            # Keep one CookieJar per crawl branch in response.meta and carry it forward.
            cookieJar = response.meta.setdefault('cookie_jar', CookieJar())
            cookieJar.extract_cookies(response, response.request)
            request = Request(nextPageLink, callback=self.extractItemLinks,
                              meta={'dont_merge_cookies': True, 'cookie_jar': cookieJar})
            cookieJar.add_cookie_header(request)  # apply Set-Cookie ourselves
            yield request
        else:
            self.log('Whole subcategory scraped.', log.DEBUG)
回答by MKatleast3
def parse(self, response):
    # do something
    yield scrapy.Request(
        url="http://new-page-to-parse.com/page/4/",
        cookies={
            'h0': 'blah',
            'taeyeon': 'pretty'
        },
        callback=self.parse
    )

