Download pdf using urllib?

Note: this page is a translation of a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/24844729/

Tags: python, pdf, urllib

Asked by user3774185

I am trying to download a pdf file from a website using urllib. This is what I've got so far:

import urllib

def download_file(download_url):
    web_file = urllib.urlopen(download_url)
    local_file = open('some_file.pdf', 'w')
    local_file.write(web_file.read())
    web_file.close()
    local_file.close()

if __name__ == 'main':
    download_file('http://www.example.com/some_file.pdf')

When I run this code, all I get is an empty pdf file. What am I doing wrong?

Answered by shockburner

Change open('some_file.pdf', 'w') to open('some_file.pdf', 'wb'); pdf files are binary files, so you need the 'b'. This is true for pretty much any file that you can't open in a text editor.

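Applied to the Python 2 code in the question, the fix looks like this (note that the guard also has to be '__main__', not 'main', for the script to run when executed directly):

import urllib

def download_file(download_url):
    web_file = urllib.urlopen(download_url)
    # 'wb' writes the downloaded bytes untouched instead of as text
    local_file = open('some_file.pdf', 'wb')
    local_file.write(web_file.read())
    web_file.close()
    local_file.close()

if __name__ == '__main__':
    download_file('http://www.example.com/some_file.pdf')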

Answered by jamiemcg

Here is an example that works (Python 2, using urllib2):

import urllib2

def main():
    download_file("http://mensenhandel.nl/files/pdftest2.pdf")

def download_file(download_url):
    response = urllib2.urlopen(download_url)
    # Binary mode ('wb') keeps the PDF bytes intact
    local_file = open("document.pdf", 'wb')
    local_file.write(response.read())
    local_file.close()
    print("Completed")

if __name__ == "__main__":
    main()
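
urllib2 only exists on Python 2. On Python 3, the same approach can be sketched with urllib.request (same example URL and output filename as above):

import urllib.request

def download_file(download_url):
    # urllib.request.urlopen is the Python 3 counterpart of urllib2.urlopen
    response = urllib.request.urlopen(download_url)
    with open("document.pdf", 'wb') as local_file:
        local_file.write(response.read())
    print("Completed")

if __name__ == "__main__":
    download_file("http://mensenhandel.nl/files/pdftest2.pdf")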

Answered by romulomadu

Try to use urlretrieve from urllib.request (Python 3) and just do that:

from urllib.request import urlretrieve

def download_file(download_url):
    urlretrieve(download_url, 'path_to_save_plus_some_file.pdf')

if __name__ == '__main__':
    download_file('http://www.example.com/some_file.pdf')
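
urlretrieve also returns the local path and the response headers, and it accepts an optional reporthook callback for progress reporting. A small sketch, reusing the placeholder URL and filename from above:

from urllib.request import urlretrieve

def report(block_num, block_size, total_size):
    # Called repeatedly during the download; total_size is -1 if unknown
    print("downloaded about %d of %d bytes" % (block_num * block_size, total_size))

path, headers = urlretrieve('http://www.example.com/some_file.pdf',
                            'path_to_save_plus_some_file.pdf',
                            reporthook=report)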

Answered by Piyush Rumao

I would suggest using the following lines of code:

import urllib.request
import shutil

url = "link to your website for pdf file to download"
output_file = "name.pdf"  # local path/filename to save the downloaded PDF to
with urllib.request.urlopen(url) as response, open(output_file, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)
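
shutil.copyfileobj streams the response to the output file in fixed-size chunks, so a large PDF does not have to be held in memory all at once the way response.read() in the earlier answers does.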

Answered by Piyush Rumao

I tried the above code and it works fine in some cases, but for some websites with a PDF embedded in them you might get an error like HTTPError: HTTP Error 403: Forbidden. Such websites have server-side security features that block known bots. In urllib's case, the default User-Agent header looks something like Python-urllib/3.3, which some servers reject. So I would suggest adding a custom header via urllib's Request class, as shown below.

from urllib.request import Request, urlopen

url = "https://realpython.com/python-tricks-sample-pdf"

# Send a browser-like User-Agent so the server does not reject the
# default Python-urllib one with a 403
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})

with open("<location to dump pdf>/<name of file>.pdf", "wb") as code:
    code.write(urlopen(req).read())
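
The original snippet also imported the requests library; the same User-Agent trick with requests would look roughly like this (the output filename here is just an illustration):

import requests

url = "https://realpython.com/python-tricks-sample-pdf"

# Pass the browser-like User-Agent directly to requests
r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
r.raise_for_status()  # fail loudly on errors such as 403 instead of writing a bad file

with open("sample.pdf", "wb") as f:
    f.write(r.content)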