Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/21798405/
Python/Java script to download all .pdf files from a website
Asked by sudobangbang
I was wondering if it is possible to write a script that could programmatically go through a webpage and automatically download all linked .pdf files. Before I start attempting this on my own, I want to know whether it is possible.
Regards
Accepted answer by kender99
Yes, it's possible. For downloading PDF files you don't even need to use Beautiful Soup or Scrapy.
Downloading from Python is very straightforward: build a list of all PDF links and download them.
Reference on how to build a list of links: http://www.pythonforbeginners.com/code/regular-expression-re-findall
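Following the re.findall approach from that reference, a minimal sketch (Python 3 stdlib; the function names are illustrative, not from the answer) might look like:

```python
import re
import urllib.request

def find_pdf_links(html):
    # Return every href value that ends in .pdf (case-insensitive).
    return re.findall(r'href=[\'"]([^\'"]+\.pdf)[\'"]', html, re.IGNORECASE)

def download_all(page_url):
    # Fetch the page, then download each linked PDF into the current directory.
    html = urllib.request.urlopen(page_url).read().decode("utf-8", errors="ignore")
    for link in find_pdf_links(html):
        urllib.request.urlretrieve(link, link.rsplit("/", 1)[-1])
```

Note that find_pdf_links only catches hrefs ending exactly in .pdf; relative links would still need to be resolved against the page URL before downloading.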
If you need to crawl through several linked pages, then one of the frameworks might help. If you are willing to build your own crawler, here is a great tutorial, which, by the way, is also a good intro to Python: https://www.udacity.com/course/viewer#!/c-cs101
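If a full framework feels heavy for a small site, the multi-page case can also be sketched as a breadth-first crawler using only the stdlib (the function names and the max_pages limit here are illustrative, not from the answer):

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse
import urllib.request

def extract_links(html, base_url):
    # Resolve every href on the page against the page's own URL.
    return [urljoin(base_url, h) for h in re.findall(r'href=[\'"]([^\'"]+)[\'"]', html)]

def crawl_for_pdfs(start_url, max_pages=10):
    # Breadth-first walk of same-host pages, collecting .pdf links along the way.
    host = urlparse(start_url).netloc
    seen, queue, pdfs = {start_url}, deque([start_url]), []
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            html = urllib.request.urlopen(url).read().decode("utf-8", errors="ignore")
        except OSError:
            continue
        for link in extract_links(html, url):
            if link.lower().endswith(".pdf"):
                pdfs.append(link)
            elif urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pdfs
```

The same-host check keeps the crawl from wandering off-site; a real crawler would also want politeness delays and robots.txt handling, which the frameworks give you for free.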
Answer by Will
Answer by aovbros
Yes, it's possible.
In Python it is simple; urllib will help you download files from the net. For example:
import urllib
urllib.urlretrieve("http://example.com/helo.pdf", "helo.pdf")  # Python 2; in Python 3 use urllib.request.urlretrieve
Now you need to make a script that finds links ending with .pdf.
Example HTML page: a link like "Here's a link", where the anchor's href points at a .pdf file.
You need to download the HTML page and use an HTML parser, or use a regular expression.
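For the parser route, the stdlib's html.parser is enough. A sketch (Python 3; the class name is made up for illustration) that collects .pdf hrefs:

```python
from html.parser import HTMLParser

class PdfLinkParser(HTMLParser):
    # Collects the href values of <a> tags that end in .pdf.
    def __init__(self):
        super().__init__()
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.lower().endswith(".pdf"):
                    self.pdf_links.append(value)

parser = PdfLinkParser()
parser.feed('<a href="report.pdf">Here\'s a link</a><a href="index.html">home</a>')
```

Unlike a regex, the parser is unaffected by attribute order, extra whitespace, or unquoted attributes in the markup.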
Answer by Laxman
Use urllib to download files. For example:
import urllib
urllib.urlretrieve("http://...", "file_name.pdf")
Sample script to find links ending with .pdf: https://github.com/laxmanverma/Scripts/blob/master/samplePaperParser/DownloadSamplePapers.py
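One wrinkle such a script has to handle: hrefs scraped from a page are often relative, so they must be resolved against the page's URL before being passed to urlretrieve. A minimal sketch with urllib.parse.urljoin (the URLs below are placeholders):

```python
from urllib.parse import urljoin

page_url = "http://example.com/papers/index.html"
href = "files/helo.pdf"  # a relative link as it appears in the page source
absolute = urljoin(page_url, href)
# absolute is now "http://example.com/papers/files/helo.pdf"
```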