Python Scrapy Very Basic Example

Disclaimer: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. If you use or share it, you must attribute it to the original authors (not me) and link to the original: http://stackoverflow.com/questions/18838494/


Scrapy Very Basic Example

Tags: python, web-scraping, scrapy

Asked by B.Mr.W.

Hi, I have Python Scrapy installed on my Mac and I was trying to follow the very first example on their website.

They were trying to run the command:

scrapy crawl mininova.org -o scraped_data.json -t json

I don't quite understand what this means. It looks like scrapy turns out to be a separate program, and I don't think they have a command called crawl. In the example, they have a block of code, which is the definition of the classes MininovaSpider and TorrentItem. I don't know where these two classes should go: do they go in the same file, and what is the name of this Python file?

Accepted answer by Michael0x2a

You may have better luck looking through the tutorial first, as opposed to the "Scrapy at a glance" webpage.

The tutorial implies that Scrapy is, in fact, a separate program.

Running the command scrapy startproject tutorial will create a folder called tutorial with several files already set up for you.

For example, in my case, the modules/packages items, pipelines, settings and spiders have been added to the root package tutorial.

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

The TorrentItem class would be placed inside items.py, and the MininovaSpider class would go inside the spiders folder.
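
For reference, here is a rough sketch of how those two classes might be laid out in the generated project. The actual class bodies are on the "Scrapy at a glance" page; the item fields, URLs and XPath expressions below are only illustrative placeholders:

# tutorial/items.py (sketch) - item definitions live here
from scrapy.item import Item, Field

class TorrentItem(Item):
    url = Field()
    name = Field()
    description = Field()

# tutorial/spiders/mininova_spider.py (sketch) - spiders live in the spiders package;
# the file name is arbitrary, Scrapy finds the class by its "name" attribute
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from tutorial.items import TorrentItem

class MininovaSpider(CrawlSpider):
    name = 'mininova.org'  # the name used in "scrapy crawl mininova.org"
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']
    rules = [Rule(SgmlLinkExtractor(allow=[r'/tor/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        torrent = TorrentItem()
        torrent['url'] = response.url
        torrent['name'] = response.xpath('//h1/text()').extract()  # placeholder XPath
        torrent['description'] = response.xpath('//div[@id="description"]').extract()  # placeholder XPath
        return torrent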

Once the project is set up, the command-line parameters for Scrapy appear to be fairly straightforward. They take the form:

scrapy crawl <website-name> -o <output-file> -t <output-type>

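For example, the command from the question fits this form, where mininova.org is the spider's name attribute (not a URL to visit):

scrapy crawl mininova.org -o scraped_data.json -t json
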
Alternatively, if you want to run scrapy without the overhead of creating a project directory, you can use the runspider command:

scrapy runspider my_spider.py
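
For illustration, a self-contained spider file that runspider could pick up might look something like the sketch below (the file name, item, start URL and XPath here are just placeholders, not part of the official example):

# my_spider.py - a minimal sketch of a spider for "scrapy runspider my_spider.py"
from scrapy import Spider, Item, Field


class TitleItem(Item):
    title = Field()


class MySpider(Spider):
    name = 'my_spider'
    start_urls = ['http://www.example.com']  # placeholder URL

    def parse(self, response):
        # extract the page title; the XPath is a placeholder
        item = TitleItem()
        item['title'] = response.xpath('//title/text()').extract()
        yield item

The -o and -t output options shown above should work with runspider as well.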

Answer by alecxe

TL;DR: see the self-contained minimum example script to run Scrapy below.

First of all, having a normal Scrapy project with a separate .cfg, settings.py, pipelines.py, items.py, spiders package etc. is the recommended way to keep and handle your web-scraping logic. It provides modularity and separation of concerns that keep things organized, clear and testable.

If you are following the official Scrapy tutorial to create a project, you are running web scraping via a special scrapy command-line tool:

scrapy crawl myspider


But Scrapy also provides an API to run crawling from a script.

There are several key concepts that should be mentioned:

  • Settings class - basically a key-value "container" which is initialized with default built-in values
  • Crawler class - the main class that acts as the glue for all the different components involved in web scraping with Scrapy
  • Twisted reactor - since Scrapy is built on top of the twisted asynchronous networking library, to start a crawler we need to put it inside the Twisted reactor, which is, in simple words, an event loop:

The reactor is the core of the event loop within Twisted – the loop which drives applications using Twisted. The event loop is a programming construct that waits for and dispatches events or messages in a program. It works by calling some internal or external “event provider”, which generally blocks until an event has arrived, and then calls the relevant event handler (“dispatches the event”). The reactor provides basic interfaces to a number of services, including network communications, threading, and event dispatching.

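As a tiny illustration of that idea (standalone and not Scrapy-specific; the function and delay are hypothetical), the reactor can be started, given an event to dispatch, and stopped like this:

# a minimal, hypothetical illustration of the Twisted reactor as an event loop
from twisted.internet import reactor

def say_hello():
    print('hello from inside the event loop')
    reactor.stop()  # without this, reactor.run() would keep blocking

reactor.callLater(1, say_hello)  # schedule an event one second from now
reactor.run()                    # blocks and dispatches events until stopped
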
Here is a basic and simplified process of running Scrapy from a script:

  • create a Settings instance (or use get_project_settings() to use existing settings):

    settings = Settings()  # or settings = get_project_settings()
    
  • instantiate Crawler with the settings instance passed in:

    crawler = Crawler(settings)
    
  • instantiate a spider (this is what it is all about eventually, right?):

    spider = MySpider()
    
  • configure signals. This is an important step if you want to have post-processing logic, collect stats or, at least, to ever finish crawling, since the Twisted reactor needs to be stopped manually. The Scrapy docs suggest stopping the reactor in the spider_closed signal handler:

Note that you will also have to shutdown the Twisted reactor yourself after the spider is finished. This can be achieved by connecting a handler to the signals.spider_closed signal.

def callback(spider, reason):
    stats = spider.crawler.stats.get_stats()
    # stats here is a dictionary of crawling stats that you usually see on the console        

    # here we need to stop the reactor
    reactor.stop()

crawler.signals.connect(callback, signal=signals.spider_closed)

  • configure and start the crawler instance with the spider passed in:

    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    
  • optionally start logging:

    log.start()
    
  • start the reactor - this would block the script execution:

    reactor.run()
    

Here is an example self-contained script that uses the DmozSpider spider and involves item loaders with input and output processors and item pipelines:

import json

from scrapy.crawler import Crawler
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose, TakeFirst
from scrapy import log, signals, Spider, Item, Field
from scrapy.settings import Settings
from twisted.internet import reactor


# define an item class
class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()


# define an item loader with input and output processors
class DmozItemLoader(ItemLoader):
    default_input_processor = MapCompose(unicode.strip)
    default_output_processor = TakeFirst()

    desc_out = Join()


# define a pipeline
class JsonWriterPipeline(object):
    def __init__(self):
        self.file = open('items.jl', 'wb')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item


# define a spider
class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            loader = DmozItemLoader(DmozItem(), selector=sel, response=response)
            loader.add_xpath('title', 'a/text()')
            loader.add_xpath('link', 'a/@href')
            loader.add_xpath('desc', 'text()')
            yield loader.load_item()


# callback fired when the spider is closed
def callback(spider, reason):
    stats = spider.crawler.stats.get_stats()  # collect/log stats?

    # stop the reactor
    reactor.stop()


# instantiate settings and provide a custom configuration
settings = Settings()
settings.set('ITEM_PIPELINES', {
    '__main__.JsonWriterPipeline': 100
})

# instantiate a crawler passing in settings
crawler = Crawler(settings)

# instantiate a spider
spider = DmozSpider()

# configure signals
crawler.signals.connect(callback, signal=signals.spider_closed)

# configure and start the crawler
crawler.configure()
crawler.crawl(spider)
crawler.start()

# start logging
log.start()

# start the reactor (blocks execution)
reactor.run()

Run it in the usual way:

python runner.py

and observe items exported to items.jl with the help of the pipeline:

{"desc": "", "link": "/", "title": "Top"}
{"link": "/Computers/", "title": "Computers"}
{"link": "/Computers/Programming/", "title": "Programming"}
{"link": "/Computers/Programming/Languages/", "title": "Languages"}
{"link": "/Computers/Programming/Languages/Python/", "title": "Python"}
...
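
Since items.jl is in the JSON Lines format (one JSON object per line), a quick, hypothetical way to load the results back for inspection is:

import json

# read the JSON Lines output back: one JSON object per line
with open('items.jl') as f:
    items = [json.loads(line) for line in f]

print('%d items scraped' % len(items))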

Gist is available here (feel free to improve):

Notes:

If you define settings by instantiating a Settings() object, you'll get all the default Scrapy settings. But if you want to, for example, configure an existing pipeline, or configure a DEPTH_LIMIT or tweak any other setting, you need to either set it in the script via settings.set() (as demonstrated in the example):

pipelines = {
    'mypackage.pipelines.FilterPipeline': 100,
    'mypackage.pipelines.MySQLPipeline': 200
}
settings.set('ITEM_PIPELINES', pipelines, priority='cmdline')

or use an existing settings.py with all the custom settings preconfigured:

from scrapy.utils.project import get_project_settings

settings = get_project_settings()


Other useful links on the subject:
