Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not the translator). Original question: http://stackoverflow.com/questions/30345623/

Scraping dynamic content using python-Scrapy

Tags: python, web-scraping, scrapy

Asked by Pravesh Jain

Disclaimer: I've seen numerous other similar posts on StackOverflow and tried to do it the same way, but they don't seem to work on this website.

I'm using Python-Scrapy for getting data from koovs.com.

However, I'm not able to get the product size, which is generated dynamically. Specifically, if someone could guide me a little on getting the 'Not available' size tag from the drop-down menu on this link, I'd be grateful.

I am able to get the size list statically, but that only gives me the list of sizes, not which of them are available.

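For illustration, here is a minimal sketch of that static approach (my reconstruction, not code from the original question). The size labels are present in the raw HTML, but the "-- Not Available" suffix is only added by JavaScript at render time, so a plain Scrapy response never contains it:

import scrapy


class StaticSizeSpider(scrapy.Spider):
    # Hypothetical spider: extracts the size labels from the raw HTML,
    # which does not include the availability information.
    name = "static_sizes"
    allowed_domains = ["koovs.com"]
    start_urls = (
        'http://www.koovs.com/only-onlall-stripe-ls-shirt-59554.html?from=category-651&skuid=236376',
    )

    def parse(self, response):
        # skip the first placeholder <option>
        for option in response.css("div.select-size select.sizeOptions option")[1:]:
            yield {"size": option.xpath("text()").extract_first()}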

Accepted answer by alecxe

You can also solve it with ScrapyJS (no need for selenium and a real browser):

This library provides Scrapy+JavaScript integration using Splash.

Follow the installation instructions for Splash and ScrapyJS, then start the Splash docker container:

$ docker run -p 8050:8050 scrapinghub/splash
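
If the container starts correctly, the Splash UI should be reachable in a browser on port 8050 of the Docker host (typically http://localhost:8050 for a local docker; with docker-machine, use the machine's IP as shown in the answer below).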

Put the following settings into settings.py:

SPLASH_URL = 'http://192.168.59.103:8050' 

DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}

DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'

And here is your sample spider that is able to see the size availability information:

# -*- coding: utf-8 -*-
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["koovs.com"]
    start_urls = (
        'http://www.koovs.com/only-onlall-stripe-ls-shirt-59554.html?from=category-651&skuid=236376',
    )

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            })

    def parse(self, response):
        for option in response.css("div.select-size select.sizeOptions option")[1:]:
            print(option.xpath("text()").extract())

Here is what is printed on the console:

[u'S / 34 -- Not Available']
[u'L / 40 -- Not Available']
[u'L / 42']

Answer by alecxe

From what I understand, the size availability is determined dynamically by javascript executed in the browser. Scrapy is not a browser and cannot execute javascript.

If you are okay with switching to the selenium browser automation tool, here is some sample code:

from selenium import webdriver
from selenium.webdriver.support.select import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Firefox()  # can be webdriver.PhantomJS()
browser.get('http://www.koovs.com/only-onlall-stripe-ls-shirt-59554.html?from=category-651&skuid=236376')

# wait for the select element to become visible
select_element = WebDriverWait(browser, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.select-size select.sizeOptions")))

select = Select(select_element)
for option in select.options[1:]:
    print(option.text)

browser.quit()

It prints:

S / 34 -- Not Available
L / 40 -- Not Available
L / 42

Note that in place of Firefox you can use other webdrivers like Chrome or Safari. There is also an option to use a headless PhantomJS browser.

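For example, here is a minimal sketch of a headless setup using a modern Selenium with Chrome (my assumption; PhantomJS has since been deprecated and removed from newer Selenium releases):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # run Chrome without opening a window
browser = webdriver.Chrome(options=options)
# ... the same WebDriverWait / Select logic as above ...
browser.quit()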

You can also combine Scrapy with Selenium if needed; a minimal sketch of one common pattern follows.

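This is my own illustration, not code from the original answer: Scrapy handles scheduling and item output, while one shared Selenium browser renders each JavaScript-heavy page.

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By


class SeleniumSizeSpider(scrapy.Spider):
    # Hypothetical spider combining the two tools: Scrapy schedules the
    # requests, Selenium re-fetches each page so its javascript runs.
    name = "selenium_sizes"
    start_urls = (
        'http://www.koovs.com/only-onlall-stripe-ls-shirt-59554.html?from=category-651&skuid=236376',
    )

    def __init__(self, *args, **kwargs):
        super(SeleniumSizeSpider, self).__init__(*args, **kwargs)
        self.browser = webdriver.Firefox()

    def parse(self, response):
        self.browser.get(response.url)
        options = self.browser.find_elements(
            By.CSS_SELECTOR, "div.select-size select.sizeOptions option")
        for option in options[1:]:  # skip the placeholder entry
            yield {"size": option.text}

    def closed(self, reason):
        # called by Scrapy when the spider finishes
        self.browser.quit()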

Answer by Srivardhan Cholkar

I faced that problem and solved it easily by following these steps:

pip install splash
pip install scrapy-splash
pip install scrapyjs

Download and install docker-toolbox.

Open the Docker Quickstart Terminal and enter:

$ docker run -p 8050:8050 scrapinghub/splash

To set SPLASH_URL, check the default IP configured for the docker machine by entering:

$ docker-machine ip default

(my IP was 192.168.99.100), then use it in settings.py:

SPLASH_URL = 'http://192.168.99.100:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}

DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'

That's it!

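With those settings in place, a spider opts in to Splash rendering through the splash request meta key, exactly as in the accepted answer above:

yield scrapy.Request(url, self.parse, meta={
    'splash': {
        'endpoint': 'render.html',
        'args': {'wait': 0.5}
    }
})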

Answer by Alexis Mejía

You have to interpret the JSON of the website; for examples, see scrapy.readthedocs and testingcan.github.io.

import json

import scrapy


class QuoteSpider(scrapy.Spider):
    name = 'quote'
    allowed_domains = ['quotes.toscrape.com']
    page = 1
    start_urls = ['http://quotes.toscrape.com/api/quotes?page=1']

    def parse(self, response):
        data = json.loads(response.text)
        for quote in data["quotes"]:
            yield {"quote": quote["text"]}
        # follow the paginated API until it reports there is no next page
        if data["has_next"]:
            self.page += 1
            url = "http://quotes.toscrape.com/api/quotes?page={}".format(self.page)
            yield scrapy.Request(url=url, callback=self.parse)
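
Assuming the spider is saved as quote_spider.py (a filename of my choosing), it can be run without a full project via Scrapy's runspider command, dumping the quotes to a JSON file:

$ scrapy runspider quote_spider.py -o quotes.json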