Download pdf using urllib?

Note: this page is a translation of a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/24844729/

Tags: python, pdf, urllib

Asked by user3774185

I am trying to download a pdf file from a website using urllib. This is what I've got so far:

import urllib

def download_file(download_url):
    web_file = urllib.urlopen(download_url)
    local_file = open('some_file.pdf', 'w')
    local_file.write(web_file.read())
    web_file.close()
    local_file.close()

if __name__ == 'main':
    download_file('http://www.example.com/some_file.pdf')

When I run this code, all I get is an empty pdf file. What am I doing wrong?

Answered by shockburner

Change open('some_file.pdf', 'w') to open('some_file.pdf', 'wb'); pdf files are binary files, so you need the 'b'. This is true for pretty much any file that you can't open in a text editor.

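Applied to the Python 2 code in the question, the fix looks like this (note that the guard also has to be '__main__', not 'main', for the script to run when executed directly):

import urllib

def download_file(download_url):
    web_file = urllib.urlopen(download_url)
    # 'wb' writes the downloaded bytes untouched instead of as text
    local_file = open('some_file.pdf', 'wb')
    local_file.write(web_file.read())
    web_file.close()
    local_file.close()

if __name__ == '__main__':
    download_file('http://www.example.com/some_file.pdf')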

Answered by jamiemcg

Here is an example that works (Python 2, using urllib2):

import urllib2

def main():
    download_file("http://mensenhandel.nl/files/pdftest2.pdf")

def download_file(download_url):
    response = urllib2.urlopen(download_url)
    # Binary mode ('wb') keeps the PDF bytes intact
    local_file = open("document.pdf", 'wb')
    local_file.write(response.read())
    local_file.close()
    print("Completed")

if __name__ == "__main__":
    main()
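
urllib2 only exists on Python 2. On Python 3, the same approach can be sketched with urllib.request (same example URL and output filename as above):

import urllib.request

def download_file(download_url):
    # urllib.request.urlopen is the Python 3 counterpart of urllib2.urlopen
    response = urllib.request.urlopen(download_url)
    with open("document.pdf", 'wb') as local_file:
        local_file.write(response.read())
    print("Completed")

if __name__ == "__main__":
    download_file("http://mensenhandel.nl/files/pdftest2.pdf")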

Answered by romulomadu

Try to use urlretrieve from urllib.request (Python 3) and just do that:

from urllib.request import urlretrieve

def download_file(download_url):
    urlretrieve(download_url, 'path_to_save_plus_some_file.pdf')

if __name__ == '__main__':
    download_file('http://www.example.com/some_file.pdf')
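
urlretrieve also returns the local path and the response headers, and it accepts an optional reporthook callback for progress reporting. A small sketch, reusing the placeholder URL and filename from above:

from urllib.request import urlretrieve

def report(block_num, block_size, total_size):
    # Called repeatedly during the download; total_size is -1 if unknown
    print("downloaded about %d of %d bytes" % (block_num * block_size, total_size))

path, headers = urlretrieve('http://www.example.com/some_file.pdf',
                            'path_to_save_plus_some_file.pdf',
                            reporthook=report)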

Answered by Piyush Rumao

I would suggest using the following lines of code:

import urllib.request
import shutil

url = "link to your website for pdf file to download"
output_file = "name.pdf"  # local path/filename to save the downloaded PDF to
with urllib.request.urlopen(url) as response, open(output_file, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)
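
shutil.copyfileobj streams the response to the output file in fixed-size chunks, so a large PDF does not have to be held in memory all at once the way response.read() in the earlier answers does.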

Answered by Piyush Rumao

I tried the above code and it works fine in some cases, but for some websites with a PDF embedded in them you might get an error like HTTPError: HTTP Error 403: Forbidden. Such websites have server-side security features that block known bots. In urllib's case, the default User-Agent header looks something like Python-urllib/3.3, which some servers reject. So I would suggest adding a custom header via urllib's Request class, as shown below.

from urllib.request import Request, urlopen

url = "https://realpython.com/python-tricks-sample-pdf"

# Send a browser-like User-Agent so the server does not reject the
# default Python-urllib one with a 403
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})

with open("<location to dump pdf>/<name of file>.pdf", "wb") as code:
    code.write(urlopen(req).read())
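
The original snippet also imported the requests library; the same User-Agent trick with requests would look roughly like this (the output filename here is just an illustration):

import requests

url = "https://realpython.com/python-tricks-sample-pdf"

# Pass the browser-like User-Agent directly to requests
r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
r.raise_for_status()  # fail loudly on errors such as 403 instead of writing a bad file

with open("sample.pdf", "wb") as f:
    f.write(r.content)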