Python: Using Scrapy to find and download PDF files from a website
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/36135809/
Using Scrapy to find and download PDF files from a website
Asked by Murface
I've been tasked with pulling PDF files from websites using Scrapy. I'm not new to Python, but Scrapy is very new to me. I've been experimenting with the console and a few rudimentary spiders. I've found and modified this code:
import urlparse
import scrapy
from scrapy.http import Request


class pwc_tax(scrapy.Spider):
    name = "pwc_tax"

    allowed_domains = ["www.pwc.com"]
    start_urls = ["http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html"]

    def parse(self, response):
        base_url = "http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html"
        for a in response.xpath('//a[@href]/@href'):
            link = a.extract()
            if link.endswith('.pdf'):
                link = urlparse.urljoin(base_url, link)
                yield Request(link, callback=self.save_pdf)

    def save_pdf(self, response):
        path = response.url.split('/')[-1]
        with open(path, 'wb') as f:
            f.write(response.body)
I run this code at the command line with
scrapy crawl mySpider
and I get nothing back. I didn't create a Scrapy Item because I only want to crawl and download the files, not collect any metadata. I would appreciate any help on this.
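Note that the scrapy crawl command selects a spider by its name attribute, which is "pwc_tax" in the code above, so the matching invocation would be:

scrapy crawl pwc_tax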
Answered by starrify
The spider logic seems incorrect.
I had a quick look at your website, and it seems there are several types of pages:
- The initial page: http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html
- Webpages for specific articles, e.g. http://www.pwc.com/us/en/tax-services/publications/insights/australia-introduces-new-foreign-resident-cgt-withholding-regime.html, which can be reached from page #1
- Actual PDF locations, e.g. http://www.pwc.com/us/en/state-local-tax/newsletters/salt-insights/assets/pwc-wotc-precertification-period-extended-to-june-29.pdf, which can be reached from page #2
Thus the correct logic is: fetch the #1 page first, then fetch the #2 pages linked from it, and download the #3 PDF files from those.
However, your spider tries to extract links to the #3 files directly from the #1 page.
EDITED:
I have updated your code, and here's something that actually works:
import urlparse
import scrapy
from scrapy.http import Request


class pwc_tax(scrapy.Spider):
    name = "pwc_tax"

    allowed_domains = ["www.pwc.com"]
    start_urls = ["http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html"]

    def parse(self, response):
        # Page #1: follow the links to the individual article pages (#2).
        for href in response.css('div#all_results h3 a::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.parse_article
            )

    def parse_article(self, response):
        # Page #2: follow the links to the actual PDF files (#3).
        for href in response.css('div.download_wrapper a[href$=".pdf"]::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.save_pdf
            )

    def save_pdf(self, response):
        # Page #3: write the PDF body to disk, named after the last URL segment.
        path = response.url.split('/')[-1]
        self.logger.info('Saving PDF %s', path)
        with open(path, 'wb') as f:
            f.write(response.body)
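As a side note, Scrapy also ships with a built-in FilesPipeline that can handle the downloading (and de-duplication) of files for you. Below is a minimal sketch, assuming the same CSS selector as above; ITEM_PIPELINES, FILES_STORE and the file_urls field are standard Scrapy, while the directory name downloaded_pdfs is just an example:

# In settings.py: enable the built-in files pipeline and pick a storage directory.
# ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
# FILES_STORE = 'downloaded_pdfs'  # example path

def parse_article(self, response):
    # Yield items whose 'file_urls' field lists the PDFs; FilesPipeline
    # fetches and stores them, so no save_pdf callback is needed.
    for href in response.css('div.download_wrapper a[href$=".pdf"]::attr(href)').extract():
        yield {'file_urls': [response.urljoin(href)]}

With that pipeline enabled, the downloaded files end up under FILES_STORE, named by a hash of their URL rather than by the original filename, which is the main trade-off compared to the hand-rolled save_pdf above.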