Python: Using Scrapy to find and download PDF files from a website
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/36135809/
Using Scrapy to find and download PDF files from a website
Asked by Murface
I've been tasked with pulling PDF files from websites using Scrapy. I'm not new to Python, but Scrapy is very new to me. I've been experimenting with the console and a few rudimentary spiders. I've found and modified this code:
import urlparse
import scrapy
from scrapy.http import Request


class pwc_tax(scrapy.Spider):
    name = "pwc_tax"

    allowed_domains = ["www.pwc.com"]
    start_urls = ["http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html"]

    def parse(self, response):
        base_url = "http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html"
        for a in response.xpath('//a[@href]/@href'):
            link = a.extract()
            if link.endswith('.pdf'):
                link = urlparse.urljoin(base_url, link)
                yield Request(link, callback=self.save_pdf)

    def save_pdf(self, response):
        path = response.url.split('/')[-1]
        with open(path, 'wb') as f:
            f.write(response.body)
I run this code at the command line with
scrapy crawl mySpider
and I get nothing back. I didn't create a Scrapy Item because I only want to crawl and download the files, not collect any metadata. I would appreciate any help on this.
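Note that the scrapy crawl command selects a spider by its name attribute, which is "pwc_tax" in the code above, so the matching invocation would be:

scrapy crawl pwc_tax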
Answered by starrify
The spider logic seems incorrect.
I had a quick look at your website, and it seems there are several types of pages:
- The initial page: http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html
- Webpages for specific articles, e.g. http://www.pwc.com/us/en/tax-services/publications/insights/australia-introduces-new-foreign-resident-cgt-withholding-regime.html, which can be reached from page #1
- Actual PDF locations, e.g. http://www.pwc.com/us/en/state-local-tax/newsletters/salt-insights/assets/pwc-wotc-precertification-period-extended-to-june-29.pdf, which can be reached from page #2
Thus the correct logic is: fetch the #1 page first, then fetch the #2 pages linked from it, and download the #3 PDF files from those.
However, your spider tries to extract links to the #3 files directly from the #1 page.
EDITED:
I have updated your code, and here's something that actually works:
import urlparse
import scrapy
from scrapy.http import Request


class pwc_tax(scrapy.Spider):
    name = "pwc_tax"

    allowed_domains = ["www.pwc.com"]
    start_urls = ["http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html"]

    def parse(self, response):
        # Page #1: follow the links to the individual article pages (#2).
        for href in response.css('div#all_results h3 a::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.parse_article
            )

    def parse_article(self, response):
        # Page #2: follow the links to the actual PDF files (#3).
        for href in response.css('div.download_wrapper a[href$=".pdf"]::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.save_pdf
            )

    def save_pdf(self, response):
        # Page #3: write the PDF body to disk, named after the last URL segment.
        path = response.url.split('/')[-1]
        self.logger.info('Saving PDF %s', path)
        with open(path, 'wb') as f:
            f.write(response.body)
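As a side note, Scrapy also ships with a built-in FilesPipeline that can handle the downloading (and de-duplication) of files for you. Below is a minimal sketch, assuming the same CSS selector as above; ITEM_PIPELINES, FILES_STORE and the file_urls field are standard Scrapy, while the directory name downloaded_pdfs is just an example:

# In settings.py: enable the built-in files pipeline and pick a storage directory.
# ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
# FILES_STORE = 'downloaded_pdfs'  # example path

def parse_article(self, response):
    # Yield items whose 'file_urls' field lists the PDFs; FilesPipeline
    # fetches and stores them, so no save_pdf callback is needed.
    for href in response.css('div.download_wrapper a[href$=".pdf"]::attr(href)').extract():
        yield {'file_urls': [response.urljoin(href)]}

With that pipeline enabled, the downloaded files end up under FILES_STORE, named by a hash of their URL rather than by the original filename, which is the main trade-off compared to the hand-rolled save_pdf above.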