Python scrapy - parsing items that are paginated
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/12847965/
scrapy - parsing items that are paginated
Asked by AlexBrand
I have a URL of the form:
example.com/foo/bar/page_1.html
There are 53 pages in total, and each of them has ~20 rows.
I basically want to get all the rows from all the pages, i.e. ~53*20 items.
I have working code in my parse method that parses a single page and also goes one page deeper per item to get more info about it:
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.utils.response import get_base_url
from scrapy.utils.url import urljoin_rfc
# DegustaItem is defined in the project's items module

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    restaurants = hxs.select('//*[@id="contenido-resbus"]/table/tr[position()>1]')
    for rest in restaurants:
        item = DegustaItem()
        item['name'] = rest.select('td[2]/a/b/text()').extract()[0]
        # some items don't have a category associated with them
        try:
            item['category'] = rest.select('td[3]/a/text()').extract()[0]
        except IndexError:
            item['category'] = ''
        item['urbanization'] = rest.select('td[4]/a/text()').extract()[0]
        # get the profile url
        rel_url = rest.select('td[2]/a/@href').extract()[0]
        # join with the base url since the profile url is relative
        base_url = get_base_url(response)
        follow = urljoin_rfc(base_url, rel_url)
        request = Request(follow, callback=self.parse_profile)
        request.meta['item'] = item
        return request

def parse_profile(self, response):
    item = response.meta['item']
    # item['address'] = figure out xpath
    return item
The question is, how do I crawl each page?
example.com/foo/bar/page_1.html
example.com/foo/bar/page_2.html
example.com/foo/bar/page_3.html
...
...
...
example.com/foo/bar/page_53.html
Accepted answer by Achim
You have two options to solve your problem. The general one is to use yield to generate new requests instead of return. That way you can issue more than one new request from a single callback. Check the second example at http://doc.scrapy.org/en/latest/topics/spiders.html#basespider-example.
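For instance, a minimal sketch of the yield option applied to the question's parse method, trimmed to the pagination-relevant part (the selectors, helpers, and DegustaItem are taken from the question's code above; only the final yield is the change being suggested):

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    restaurants = hxs.select('//*[@id="contenido-resbus"]/table/tr[position()>1]')
    for rest in restaurants:
        item = DegustaItem()
        item['name'] = rest.select('td[2]/a/b/text()').extract()[0]
        rel_url = rest.select('td[2]/a/@href').extract()[0]
        follow = urljoin_rfc(get_base_url(response), rel_url)
        request = Request(follow, callback=self.parse_profile)
        request.meta['item'] = item
        yield request  # yield instead of return, so every row produces a request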
In your case there is probably a simpler solution: just generate the list of start URLs from a pattern like this:
class MySpider(BaseSpider):
    start_urls = ['http://example.com/foo/bar/page_%s.html' % page
                  for page in xrange(1, 54)]
Answered by bslima
You could use the CrawlSpider instead of the BaseSpider and use SgmlLinkExtractor to extract the pages in the pagination.
For instance:
start_urls = ["www.example.com/page1"]
rules = ( Rule (SgmlLinkExtractor(restrict_xpaths=('//a[@class="next_page"]',))
, follow= True),
Rule (SgmlLinkExtractor(restrict_xpaths=('//div[@class="foto_imovel"]',))
, callback='parse_call')
)
The first rule tells scrapy to follow the links matched by its XPath expression; the second rule tells scrapy to call parse_call for the links matched by its XPath expression, in case you want to parse something on each page.
For more info please see the docs: http://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider
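Putting it together, a hedged sketch of a complete CrawlSpider using these rules (the spider name, start URL, and parse_call body are illustrative assumptions; the imports are the old scrapy.contrib paths that match SgmlLinkExtractor):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class PaginatedSpider(CrawlSpider):  # hypothetical name, for illustration only
    name = 'paginated'
    start_urls = ['http://www.example.com/page1']

    rules = (
        # keep following the 'next page' link
        Rule(SgmlLinkExtractor(restrict_xpaths=('//a[@class="next_page"]',)),
             follow=True),
        # call parse_call for each item link found on a page
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="foto_imovel"]',)),
             callback='parse_call'),
    )

    def parse_call(self, response):
        # extract whatever you need from the followed page here
        pass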
Answered by Santosh Pillai
There can be two use cases for 'scrapy - parsing items that are paginated'.
A) We just want to move across the table and fetch data. This is relatively straightforward.
import scrapy

class TrainSpider(scrapy.Spider):
    name = "trip"
    start_urls = ['somewebsite']

    def parse(self, response):
        '''do something with this parser'''
        next_page = response.xpath("//a[@class='next_page']/@href").extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
Observe the last 4 lines. Here:

- We get the next-page link from the 'Next' pagination button's XPath.
- The if condition checks that we are not at the end of the pagination.
- We join this link (that we got in step 1) with the main URL using response.urljoin.
- We make a recursive call to the parse callback method.
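For instance, a hedged sketch of how the '''do something with this parser''' part might look if each table row is emitted as a dict before following the next page (the row XPath and field names are illustrative assumptions):

import scrapy

class TrainSpider(scrapy.Spider):
    name = "trip"
    start_urls = ['somewebsite']  # placeholder, as in the answer

    def parse(self, response):
        # illustrative row extraction: the XPath and field names are assumptions
        for row in response.xpath("//table//tr[position()>1]"):
            yield {
                'name': row.xpath("td[1]//text()").extract_first(),
                'detail': row.xpath("td[2]//text()").extract_first(),
            }
        # pagination, as shown in the answer
        next_page = response.xpath("//a[@class='next_page']/@href").extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)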
B) Not only do we want to move across pages, but we also want to extract data from one or more links on each page.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class StationDetailSpider(CrawlSpider):
    name = 'train'
    start_urls = [someOtherWebsite]

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//a[@class='next_page']"), follow=True),
        Rule(LinkExtractor(allow=r"/trains/\d+$"), callback='parse_trains'),
    )

    def parse_trains(self, response):
        '''do your parsing here'''
Over here, observe that:

- We are using the CrawlSpider subclass of the scrapy.Spider parent class.
- We have set 'rules':
  a) The first rule just checks whether a 'next_page' link is available and follows it.
  b) The second rule requests all the links on a page that match the format, say /trains/12343, and then calls parse_trains to perform the parsing operation.
- Important: note that we don't want to use the regular parse method here, as we are using the CrawlSpider subclass. This class already has a parse method, so we don't want to override it. Just remember to name your callback method something other than parse.
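As an illustration, parse_trains might be filled in along these lines (the XPath and field names are assumptions, not part of the answer):

def parse_trains(self, response):
    # illustrative only: the XPath and field names are assumptions
    yield {
        'train_id': response.url.rsplit('/', 1)[-1],  # e.g. 12343 from /trains/12343
        'name': response.xpath('//h1/text()').extract_first(),
    }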

