Python/Java script to download all .pdf files from a website

Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/21798405/

Tags: java, python, html, download

Asked by sudobangbang

I was wondering if it was possible to write a script that could programmatically go throughout a webpage and download all .pdf file links automatically. Before I start attempting on my own, I want to know whether or not this is possible.

Regards

Accepted answer by kender99

Yes, it's possible. For downloading PDF files you don't even need to use Beautiful Soup or Scrapy.

Downloading from Python is very straightforward: build a list of all the PDF links and download them.

A reference on how to build a list of links with re.findall: http://www.pythonforbeginners.com/code/regular-expression-re-findall
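
A minimal sketch of that approach, assuming Python 3 (the code in the answers below uses Python 2's urllib) and a hypothetical page URL; the regex follows the re.findall reference above:

import re
import urllib.request
from urllib.parse import urljoin

page_url = "http://example.com/papers/"  # hypothetical starting page

# Fetch the page and decode it to text
html = urllib.request.urlopen(page_url).read().decode("utf-8", errors="ignore")

# Collect every href that ends in .pdf, as in the re.findall reference
pdf_links = re.findall(r'href=["\'](.*?\.pdf)["\']', html, re.IGNORECASE)

for link in pdf_links:
    full_url = urljoin(page_url, link)      # resolve relative links
    filename = full_url.rsplit("/", 1)[-1]  # last path segment as the file name
    urllib.request.urlretrieve(full_url, filename)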

If you need to crawl through several linked pages, then maybe one of the frameworks might help, as in the sketch below. If you are willing to build your own crawler, here's a great tutorial, which, by the way, is also a good intro to Python: https://www.udacity.com/course/viewer#!/c-cs101
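
If you do roll your own, a breadth-first crawl is the core idea. Below is a rough sketch; the start URL and depth limit are illustrative assumptions, and it omits politeness delays and robots.txt handling:

import re
import urllib.request
from urllib.parse import urljoin
from collections import deque

start_url = "http://example.com/"  # hypothetical site
seen, queue, pdfs = {start_url}, deque([(start_url, 0)]), []

while queue:
    url, depth = queue.popleft()
    try:
        html = urllib.request.urlopen(url).read().decode("utf-8", errors="ignore")
    except Exception:
        continue  # skip pages that fail to load
    for link in re.findall(r'href=["\'](.*?)["\']', html):
        full = urljoin(url, link)
        if full.lower().endswith(".pdf"):
            pdfs.append(full)  # found a PDF link
        elif depth < 2 and full.startswith(start_url) and full not in seen:
            seen.add(full)     # follow same-site pages only, up to depth 2
            queue.append((full, depth + 1))

print(pdfs)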

Answered by Will

Yes, this is possible. This is called web scraping. For Python, there are various packages to help with this, including scrapy, beautifulsoup, and mechanize, as well as many others.
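
For instance, with beautifulsoup: a sketch assuming Python 3, that beautifulsoup4 is installed, and a hypothetical page URL:

import urllib.request
from urllib.parse import urljoin
from bs4 import BeautifulSoup

page_url = "http://example.com/papers/"  # hypothetical page
soup = BeautifulSoup(urllib.request.urlopen(page_url).read(), "html.parser")

# find_all("a", href=True) yields every anchor tag that has an href;
# keep only those pointing at .pdf files
for a in soup.find_all("a", href=True):
    if a["href"].lower().endswith(".pdf"):
        pdf_url = urljoin(page_url, a["href"])
        urllib.request.urlretrieve(pdf_url, pdf_url.rsplit("/", 1)[-1])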

Answered by aovbros

Yes, it's possible.

In Python it is simple; urllib will help you download files from the net. For example:

import urllib
# Python 2; the second argument is a destination file path, not a directory
urllib.urlretrieve("http://example.com/helo.pdf", "helo.pdf")

Now you need to make a script that will find links ending with .pdf.

Example HTML page: <a href="http://example.com/helo.pdf">Here's a link</a>

You need to download the HTML page and extract the links using an HTML parser or a regular expression.
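
A sketch of the HTML-parser route, using Python 3's built-in html.parser and fed the example link above; the class and variable names are illustrative:

from html.parser import HTMLParser

class PdfLinkParser(HTMLParser):
    # Collects href values of <a> tags that end in .pdf
    def __init__(self):
        super().__init__()
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.lower().endswith(".pdf"):
                    self.pdf_links.append(value)

parser = PdfLinkParser()
parser.feed('<a href="http://example.com/helo.pdf">Here\'s a link</a>')
print(parser.pdf_links)  # ['http://example.com/helo.pdf']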

Answered by Laxman

Use urllib to download files. For example:

import urllib

# Python 2; in Python 3 the same function is urllib.request.urlretrieve
urllib.urlretrieve("http://...", "file_name.pdf")

Sample script to find links ending with .pdf: https://github.com/laxmanverma/Scripts/blob/master/samplePaperParser/DownloadSamplePapers.py
