Python: selenium with scrapy for dynamic pages
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA license, credit the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/17975471/
selenium with scrapy for dynamic page
Asked by Z. Lin
I'm trying to scrape product information from a webpage, using scrapy. My to-be-scraped webpage looks like this:
- starts with a product_list page with 10 products
- a click on "next" button loads the next 10 products (url doesn't change between the two pages)
- I use LinkExtractor to follow each product link into the product page and get all the information I need
I tried to replicate the next-button AJAX call but couldn't get it working, so I'm giving Selenium a try. I can run Selenium's webdriver in a separate script, but I don't know how to integrate it with Scrapy. Where should I put the Selenium part in my Scrapy spider?
My spider is pretty standard, like the following:
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.log import INFO
from scrapy.selector import HtmlXPathSelector


class ProductSpider(CrawlSpider):
    name = "product_spider"
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/shanghai']
    rules = [
        Rule(SgmlLinkExtractor(restrict_xpaths='//div[@id="productList"]//dl[@class="t2"]//dt'),
             callback='parse_product'),
    ]

    def parse_product(self, response):
        self.log("parsing product %s" % response.url, level=INFO)
        hxs = HtmlXPathSelector(response)
        # actual data follows
Any ideas are appreciated. Thank you!
Accepted answer by alecxe
It really depends on how you need to scrape the site and what data you want to get, and how.
Here's an example of how you can follow pagination on eBay using Scrapy + Selenium:
import scrapy
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException


class ProductSpider(scrapy.Spider):
    name = "product_spider"
    allowed_domains = ['ebay.com']
    start_urls = ['http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython&_nkw=python&_sacat=0&_from=R40']

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)

        while True:
            try:
                # locate the browser-rendered "next page" link
                next_button = self.driver.find_element_by_xpath('//td[@class="pagn-next"]/a')
                next_button.click()

                # get the data and write it to scrapy items
            except NoSuchElementException:
                # no "next" link left -- we are on the last page
                break

        self.driver.close()
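At the "# get the data" comment you still need to turn the browser-rendered HTML into items. A minimal sketch of one way to do that (the XPaths and field names are hypothetical, not from the answer, and it assumes a Scrapy version that provides Selector(text=...) and extract_first()) is to feed the page source back into a Scrapy selector:

from scrapy.selector import Selector


def items_from_page_source(page_source):
    """Turn Selenium-rendered HTML into plain item dicts."""
    sel = Selector(text=page_source)
    # hypothetical XPaths for an eBay-like result list
    for product in sel.xpath('//li[contains(@class, "sresult")]'):
        yield {
            'title': product.xpath('.//h3/a/text()').extract_first(),
            'price': product.xpath('.//span[contains(@class, "prc")]/text()').extract_first(),
        }

Inside the while loop the spider could then do something like: for item in items_from_page_source(self.driver.page_source): yield item.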
Here are some examples of "selenium spiders":
- Executing Javascript Submit form functions using scrapy in python
- https://gist.github.com/cheekybastard/4944914
- https://gist.github.com/irfani/1045108
- http://snipplr.com/view/66998/
There is also an alternative to having to use Selenium with Scrapy. In some cases, the ScrapyJS middleware is enough to handle the dynamic parts of a page. Sample real-world usage:
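As a rough sketch of what that setup might look like (assuming the scrapyjs package and a Splash service listening on localhost:8050; check the project's README for the exact setting names and values), the middleware is enabled in settings.py and requests then carry a 'splash' key in their meta to get a rendered page back:

# settings.py -- a sketch, assuming scrapyjs + a local Splash instance
DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}
SPLASH_URL = 'http://localhost:8050'
DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'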
Answered by Expired Brain
If the url doesn't change between the two pages, then you should add dont_filter=True to your scrapy.Request(), or Scrapy will treat this url as a duplicate after processing the first page.
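For example, a minimal sketch (the spider name and URL are placeholders, not from the question):

import scrapy


class NextPageSpider(scrapy.Spider):
    name = 'next_page'
    start_urls = ['http://example.com/shanghai']

    def parse(self, response):
        # ... extract the products currently shown ...

        # the "next page" request hits the same URL, so tell Scrapy's
        # duplicate filter not to drop it (stop once no new products appear)
        yield scrapy.Request(response.url, callback=self.parse, dont_filter=True)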
If you need to render pages with javascript you should use scrapy-splash; you can also check this scrapy middleware, which can handle javascript pages using Selenium, or you can do that by launching any headless browser.
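With scrapy-splash, for instance, the spider asks Splash to render the javascript before the response reaches the callback. A minimal sketch (assuming a Splash instance is running and the scrapy_splash middlewares are enabled in settings.py; the URL is a placeholder):

import scrapy
from scrapy_splash import SplashRequest


class JsPageSpider(scrapy.Spider):
    name = 'js_page'

    def start_requests(self):
        # 'wait' gives the page's javascript a moment to run before the HTML is returned
        yield SplashRequest('http://example.com/shanghai', self.parse, args={'wait': 2})

    def parse(self, response):
        # response.body now contains the rendered HTML
        self.logger.info('rendered page length: %d', len(response.body))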
But a more effective and faster solution is to inspect your browser and see what requests are made while submitting a form or triggering a certain event. Try to simulate the same requests your browser sends. If you can replicate the request(s) correctly, you will get the data you need.
Here is an example:
import json

from scrapy import Request, Spider

from myproject.items import QuoteItem  # hypothetical items module; a sketch follows below


class ScrollScraper(Spider):
    name = "scrollingscraper"
    quote_url = "http://quotes.toscrape.com/api/quotes?page="
    start_urls = [quote_url + "1"]

    def parse(self, response):
        print response.body  # debug: dump the raw JSON payload
        data = json.loads(response.body)

        for item in data.get('quotes', []):
            # build a fresh item for every quote instead of mutating one shared instance
            quote_item = QuoteItem()
            quote_item['author'] = item.get('author', {}).get('name')
            quote_item['quote'] = item.get('text')
            quote_item['tags'] = item.get('tags')
            yield quote_item

        # the JSON payload tells us whether another page of quotes exists
        if data['has_next']:
            next_page = data['page'] + 1
            yield Request(self.quote_url + str(next_page))
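The spider above assumes a QuoteItem defined elsewhere in the project; a minimal sketch of what that item could look like (field names taken from the parse() method above):

import scrapy


class QuoteItem(scrapy.Item):
    # fields referenced by ScrollScraper
    author = scrapy.Field()
    quote = scrapy.Field()
    tags = scrapy.Field()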
When the pagination url is the same for every page and uses a POST request, you can use scrapy.FormRequest() instead of scrapy.Request(); both are the same, but FormRequest adds a new argument (formdata=) to the constructor.
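A minimal sketch of the difference (hypothetical URL and form fields):

import scrapy

# plain GET request
req = scrapy.Request('http://example.com/ajax?page=2')

# POST request: FormRequest url-encodes the formdata dict into the request body
form_req = scrapy.FormRequest('http://example.com/ajax',
                              formdata={'page': '2', 'sorter': 'recent'})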
Here is another spider example from this post:
import json

import scrapy
from scrapy.http import FormRequest
from scrapy.selector import Selector


class SpiderClass(scrapy.Spider):
    # spider name and all
    name = 'ajax'
    page_incr = 1
    start_urls = ['http://www.pcguia.pt/category/reviews/#paginated=1']
    pagination_url = 'http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php'

    def parse(self, response):
        sel = Selector(response)
        if self.page_incr > 1:
            # responses after the first page are JSON; the rendered HTML sits in 'content'
            json_data = json.loads(response.body)
            sel = Selector(text=json_data.get('content', ''))

        # your code here

        # pagination code starts here
        if sel.xpath('//div[@class="panel-wrapper"]'):
            self.page_incr += 1
            formdata = {
                'sorter': 'recent',
                'location': 'main loop',
                'loop': 'main loop',
                'action': 'sort',
                'view': 'grid',
                'columns': '3',
                'paginated': str(self.page_incr),
                'currentquery[category_name]': 'reviews'
            }
            yield FormRequest(url=self.pagination_url, formdata=formdata, callback=self.parse)
        else:
            return