Python scrapy - parsing items that are paginated
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/12847965/
scrapy - parsing items that are paginated
Asked by AlexBrand
I have a URL of the form:
example.com/foo/bar/page_1.html
There are 53 pages in total, and each of them has ~20 rows.
I basically want to get all the rows from all the pages, i.e. ~53*20 items.
I have working code in my parse method that parses a single page and also goes one page deeper per item to get more info about it:
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.utils.response import get_base_url
from scrapy.utils.url import urljoin_rfc
# DegustaItem is defined in the project's items module

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    restaurants = hxs.select('//*[@id="contenido-resbus"]/table/tr[position()>1]')
    for rest in restaurants:
        item = DegustaItem()
        item['name'] = rest.select('td[2]/a/b/text()').extract()[0]
        # some items don't have a category associated with them
        try:
            item['category'] = rest.select('td[3]/a/text()').extract()[0]
        except IndexError:
            item['category'] = ''
        item['urbanization'] = rest.select('td[4]/a/text()').extract()[0]
        # get the profile url
        rel_url = rest.select('td[2]/a/@href').extract()[0]
        # join with the base url since the profile url is relative
        base_url = get_base_url(response)
        follow = urljoin_rfc(base_url, rel_url)
        request = Request(follow, callback=self.parse_profile)
        request.meta['item'] = item
        return request

def parse_profile(self, response):
    item = response.meta['item']
    # item['address'] = figure out xpath
    return item
The question is, how do I crawl each page?
example.com/foo/bar/page_1.html
example.com/foo/bar/page_2.html
example.com/foo/bar/page_3.html
...
...
...
example.com/foo/bar/page_53.html
Accepted answer by Achim
You have two options to solve your problem. The general one is to use yield to generate new requests instead of return. That way you can issue more than one new request from a single callback. Check the second example at http://doc.scrapy.org/en/latest/topics/spiders.html#basespider-example.
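For instance, a minimal sketch of the yield option applied to the question's parse method, trimmed to the pagination-relevant part (the selectors, helpers, and DegustaItem are taken from the question's code above; only the final yield is the change being suggested):

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    restaurants = hxs.select('//*[@id="contenido-resbus"]/table/tr[position()>1]')
    for rest in restaurants:
        item = DegustaItem()
        item['name'] = rest.select('td[2]/a/b/text()').extract()[0]
        rel_url = rest.select('td[2]/a/@href').extract()[0]
        follow = urljoin_rfc(get_base_url(response), rel_url)
        request = Request(follow, callback=self.parse_profile)
        request.meta['item'] = item
        yield request  # yield instead of return, so every row produces a request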
In your case there is probably a simpler solution: just generate the list of start URLs from a pattern like this:
class MySpider(BaseSpider):
    start_urls = ['http://example.com/foo/bar/page_%s.html' % page
                  for page in xrange(1, 54)]
Answered by bslima
You could use the CrawlSpider instead of the BaseSpider and use SgmlLinkExtractor to extract the pages in the pagination.
For instance:
start_urls = ["www.example.com/page1"]
rules = ( Rule (SgmlLinkExtractor(restrict_xpaths=('//a[@class="next_page"]',))
, follow= True),
Rule (SgmlLinkExtractor(restrict_xpaths=('//div[@class="foto_imovel"]',))
, callback='parse_call')
)
The first rule tells scrapy to follow the links matched by its XPath expression; the second rule tells scrapy to call parse_call for the links matched by its XPath expression, in case you want to parse something on each page.
For more info please see the docs: http://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider
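Putting it together, a hedged sketch of a complete CrawlSpider using these rules (the spider name, start URL, and parse_call body are illustrative assumptions; the imports are the old scrapy.contrib paths that match SgmlLinkExtractor):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class PaginatedSpider(CrawlSpider):  # hypothetical name, for illustration only
    name = 'paginated'
    start_urls = ['http://www.example.com/page1']

    rules = (
        # keep following the 'next page' link
        Rule(SgmlLinkExtractor(restrict_xpaths=('//a[@class="next_page"]',)),
             follow=True),
        # call parse_call for each item link found on a page
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="foto_imovel"]',)),
             callback='parse_call'),
    )

    def parse_call(self, response):
        # extract whatever you need from the followed page here
        pass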
Answered by Santosh Pillai
There can be two use cases for 'scrapy - parsing items that are paginated'.
A) We just want to move across the table and fetch data. This is relatively straightforward.
import scrapy

class TrainSpider(scrapy.Spider):
    name = "trip"
    start_urls = ['somewebsite']

    def parse(self, response):
        '''do something with this parser'''
        next_page = response.xpath("//a[@class='next_page']/@href").extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
Observe the last 4 lines. Here:

- We get the next-page link from the 'Next' pagination button's XPath.
- The if condition checks that we are not at the end of the pagination.
- We join this link (that we got in step 1) with the main URL using response.urljoin.
- We make a recursive call to the parse callback method.
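For instance, a hedged sketch of how the '''do something with this parser''' part might look if each table row is emitted as a dict before following the next page (the row XPath and field names are illustrative assumptions):

import scrapy

class TrainSpider(scrapy.Spider):
    name = "trip"
    start_urls = ['somewebsite']  # placeholder, as in the answer

    def parse(self, response):
        # illustrative row extraction: the XPath and field names are assumptions
        for row in response.xpath("//table//tr[position()>1]"):
            yield {
                'name': row.xpath("td[1]//text()").extract_first(),
                'detail': row.xpath("td[2]//text()").extract_first(),
            }
        # pagination, as shown in the answer
        next_page = response.xpath("//a[@class='next_page']/@href").extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)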
B) Not only do we want to move across pages, but we also want to extract data from one or more links on each page.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class StationDetailSpider(CrawlSpider):
    name = 'train'
    start_urls = [someOtherWebsite]

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//a[@class='next_page']"), follow=True),
        Rule(LinkExtractor(allow=r"/trains/\d+$"), callback='parse_trains'),
    )

    def parse_trains(self, response):
        '''do your parsing here'''
Over here, observe that:

- We are using the CrawlSpider subclass of the scrapy.Spider parent class.
- We have set 'rules':
  a) The first rule just checks whether a 'next_page' link is available and follows it.
  b) The second rule requests all the links on a page that match the format, say /trains/12343, and then calls parse_trains to perform the parsing operation.
- Important: note that we don't want to use the regular parse method here, as we are using the CrawlSpider subclass. This class already has a parse method, so we don't want to override it. Just remember to name your callback method something other than parse.
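As an illustration, parse_trains might be filled in along these lines (the XPath and field names are assumptions, not part of the answer):

def parse_trains(self, response):
    # illustrative only: the XPath and field names are assumptions
    yield {
        'train_id': response.url.rsplit('/', 1)[-1],  # e.g. 12343 from /trains/12343
        'name': response.xpath('//h1/text()').extract_first(),
    }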

