Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/21798405/
Python/Java script to download all .pdf files from a website
Asked by sudobangbang
I was wondering if it is possible to write a script that could programmatically go through a webpage and automatically download all linked .pdf files. Before I start attempting this on my own, I want to know whether it is possible.
Regards
Accepted answer by kender99
Yes, it's possible. For downloading PDF files you don't even need to use Beautiful Soup or Scrapy.
Downloading from Python is very straightforward: build a list of all PDF links and download them.
Reference on how to build a list of links: http://www.pythonforbeginners.com/code/regular-expression-re-findall
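Following the re.findall approach from that reference, a minimal sketch (Python 3 stdlib; the function names are illustrative, not from the answer) might look like:

```python
import re
import urllib.request

def find_pdf_links(html):
    # Return every href value that ends in .pdf (case-insensitive).
    return re.findall(r'href=[\'"]([^\'"]+\.pdf)[\'"]', html, re.IGNORECASE)

def download_all(page_url):
    # Fetch the page, then download each linked PDF into the current directory.
    html = urllib.request.urlopen(page_url).read().decode("utf-8", errors="ignore")
    for link in find_pdf_links(html):
        urllib.request.urlretrieve(link, link.rsplit("/", 1)[-1])
```

Note that find_pdf_links only catches hrefs ending exactly in .pdf; relative links would still need to be resolved against the page URL before downloading.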
If you need to crawl through several linked pages, then one of the frameworks might help. If you are willing to build your own crawler, here is a great tutorial, which, by the way, is also a good intro to Python: https://www.udacity.com/course/viewer#!/c-cs101
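If a full framework feels heavy for a small site, the multi-page case can also be sketched as a breadth-first crawler using only the stdlib (the function names and the max_pages limit here are illustrative, not from the answer):

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse
import urllib.request

def extract_links(html, base_url):
    # Resolve every href on the page against the page's own URL.
    return [urljoin(base_url, h) for h in re.findall(r'href=[\'"]([^\'"]+)[\'"]', html)]

def crawl_for_pdfs(start_url, max_pages=10):
    # Breadth-first walk of same-host pages, collecting .pdf links along the way.
    host = urlparse(start_url).netloc
    seen, queue, pdfs = {start_url}, deque([start_url]), []
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            html = urllib.request.urlopen(url).read().decode("utf-8", errors="ignore")
        except OSError:
            continue
        for link in extract_links(html, url):
            if link.lower().endswith(".pdf"):
                pdfs.append(link)
            elif urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pdfs
```

The same-host check keeps the crawl from wandering off-site; a real crawler would also want politeness delays and robots.txt handling, which the frameworks give you for free.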
Answer by Will
Answer by aovbros
Yes, it's possible.
In Python it is simple; urllib will help you download files from the net. For example:
import urllib
urllib.urlretrieve("http://example.com/helo.pdf", "helo.pdf")  # Python 2; in Python 3 use urllib.request.urlretrieve
Now you need to make a script that finds links ending with .pdf.
Example HTML page: a link like "Here's a link", where the anchor's href points at a .pdf file.
You need to download the HTML page and use an HTML parser, or use a regular expression.
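For the parser route, the stdlib's html.parser is enough. A sketch (Python 3; the class name is made up for illustration) that collects .pdf hrefs:

```python
from html.parser import HTMLParser

class PdfLinkParser(HTMLParser):
    # Collects the href values of <a> tags that end in .pdf.
    def __init__(self):
        super().__init__()
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.lower().endswith(".pdf"):
                    self.pdf_links.append(value)

parser = PdfLinkParser()
parser.feed('<a href="report.pdf">Here\'s a link</a><a href="index.html">home</a>')
```

Unlike a regex, the parser is unaffected by attribute order, extra whitespace, or unquoted attributes in the markup.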
Answer by Laxman
Use urllib to download files. For example:
import urllib
urllib.urlretrieve("http://...", "file_name.pdf")
Sample script to find links ending with .pdf: https://github.com/laxmanverma/Scripts/blob/master/samplePaperParser/DownloadSamplePapers.py
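One wrinkle such a script has to handle: hrefs scraped from a page are often relative, so they must be resolved against the page's URL before being passed to urlretrieve. A minimal sketch with urllib.parse.urljoin (the URLs below are placeholders):

```python
from urllib.parse import urljoin

page_url = "http://example.com/papers/index.html"
href = "files/helo.pdf"  # a relative link as it appears in the page source
absolute = urljoin(page_url, href)
# absolute is now "http://example.com/papers/files/helo.pdf"
```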