Download and save a PDF file with the Python requests module

Question

Asked by Jim

I am trying to download a PDF file from a website and save it to disk. My attempts either fail with encoding errors or result in blank PDFs.

In [1]: import requests

In [2]: url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'

In [3]: response = requests.get(url)

In [4]: with open('/tmp/metadata.pdf', 'wb') as f:
   ...:     f.write(response.text)
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-4-4be915a4f032> in <module>()
      1 with open('/tmp/metadata.pdf', 'wb') as f:
----> 2     f.write(response.text)
      3 

UnicodeEncodeError: 'ascii' codec can't encode characters in position 11-14: ordinal not in range(128)

In [5]: import codecs

In [6]: with codecs.open('/tmp/metadata.pdf', 'wb', encoding='utf8') as f:
   ...:     f.write(response.text)
   ...:

I know it is a codec problem of some kind but I can't seem to get it to work.

Answer 1

Accepted answer by Kevin Guan

You should use response.content in this case:

with open('/tmp/metadata.pdf', 'wb') as f:
    f.write(response.content)

From the documentation:

You can also access the response body as bytes, for non-text requests:
>>> r.content
b'[{"repository":{"open_issues":0,"url":"https://github.com/...

So that means: response.text returns the body as a string object. Use it when you're downloading text, such as an HTML file.

And response.content returns the body as a bytes object. Use it when you're downloading a binary file, such as a PDF, an audio file, or an image.
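
The difference matters because binary data does not survive a trip through text. The following is a minimal sketch of that principle, not a reproduction of what requests does internally: pdf_bytes is a made-up stand-in for the start of a PDF, and the sketch assumes that undecodable bytes are replaced during decoding.

```python
# Hypothetical stand-in for the first bytes of a PDF: an ASCII header
# followed by a few arbitrary non-ASCII bytes.
pdf_bytes = b"%PDF-1.4\n\xe2\xe3\xcf\xd3\n"

# Decoding binary data as text replaces each invalid byte with U+FFFD,
# so information is lost at this step.
as_text = pdf_bytes.decode("utf-8", errors="replace")

# Encoding the text again does not bring the original bytes back.
round_tripped = as_text.encode("utf-8")

print(pdf_bytes == round_tripped)  # False
```

Once the bytes have been altered like this, a PDF viewer can no longer interpret the file, which is consistent with the blank PDFs described in the question.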

For large files you can stream the download instead of holding the whole body in memory, either through response.raw or, more conveniently, through response.iter_content. Below is a basic example, which you can also find in the documentation:

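import requests

url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
chunk_size = 2000  # bytes per chunk; adjust as needed
r = requests.get(url, stream=True)

with open('/tmp/metadata.pdf', 'wb') as fd:
    # Write the body to disk one chunk at a time instead of loading it all into memory
    for chunk in r.iter_content(chunk_size):
        fd.write(chunk)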

chunk_size is the chunk size you want to use. If you set it to 2000, requests downloads the first 2000 bytes of the file, writes them to disk, and repeats until the download is finished.
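
The chunking behaviour can be sketched without a network connection. In the sketch below, io.BytesIO is a hypothetical stand-in for the response body and iter_chunks is an illustrative helper, not part of requests. A real iter_content call may yield chunks of other sizes, for example when the body is compressed.

```python
import io


def iter_chunks(stream, chunk_size):
    """Yield successive blocks of at most chunk_size bytes from a binary stream."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty read means the stream is exhausted
            break
        yield chunk


# Hypothetical 5000-byte body, read 2000 bytes at a time.
body = io.BytesIO(b"x" * 5000)
sizes = [len(chunk) for chunk in iter_chunks(body, 2000)]
print(sizes)  # [2000, 2000, 1000]
```

Only one chunk is held in memory at a time, which is where the memory saving for large files comes from.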

So this can save RAM. But I'd prefer to use response.content in this case, since your file is small. As you can see, the streaming approach is more involved.

Answer 2

Answered by Nima Sajedi

Regarding Kevin's answer: to write into a tmp folder inside the current working directory (rather than the absolute path /tmp), it should look like this:

with open('./tmp/metadata.pdf', 'wb') as f:
    f.write(response.content)

Note the leading . before the path. Of course, the tmp folder must already exist, because open() does not create it.
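
If the folder might be missing, it can be created before the file is opened. This is a sketch under stated assumptions: payload is a placeholder for response.content, and a temporary directory stands in for the current working directory so the example leaves nothing behind.

```python
import tempfile
from pathlib import Path

payload = b"%PDF-1.4\n"  # placeholder for response.content

with tempfile.TemporaryDirectory() as base:
    target = Path(base) / "tmp" / "metadata.pdf"
    # Create the tmp folder (and any missing parents); no error if it already exists.
    target.parent.mkdir(parents=True, exist_ok=True)
    with open(target, "wb") as f:
        written = f.write(payload)
    exists = target.exists()

print(written == len(payload), exists)  # True True
```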

Answer 3

Answered by user6481870

In Python 3, I find pathlib is the easiest way to do this. Requests' response.content pairs nicely with pathlib's write_bytes.

from pathlib import Path
import requests
filename = Path('metadata.pdf')
url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
response = requests.get(url)
filename.write_bytes(response.content)
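
Since the question mentions blank PDFs, a quick sanity check after saving can help. The sketch below only checks that the file starts with the %PDF- header and does not validate the rest of the document. looks_like_pdf is a hypothetical helper name, and the byte strings are placeholders for real downloads.

```python
import tempfile
from pathlib import Path


def looks_like_pdf(path):
    """Return True if the file at path starts with the PDF header bytes."""
    return Path(path).read_bytes().startswith(b"%PDF-")


with tempfile.TemporaryDirectory() as folder:
    good = Path(folder) / "good.pdf"
    good.write_bytes(b"%PDF-1.4\nplaceholder body\n")
    bad = Path(folder) / "bad.pdf"
    bad.write_bytes(b"<html>Not found</html>")  # e.g. an error page saved by mistake
    good_ok = looks_like_pdf(good)
    bad_ok = looks_like_pdf(bad)

print(good_ok, bad_ok)  # True False
```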

Answer 4

Answered by Duck Ling

Please note I'm a beginner. If my solution is wrong, please feel free to correct it and/or let me know. I may learn something new too.

My solution:

Change downloadPath to wherever you want your file to be saved. Feel free to use an absolute path instead.

Save the below as downloadFile.py.

Usage: python downloadFile.py url-of-the-file-to-download new-file-name.extension

Remember to add an extension!

Example usage: python downloadFile.py http://www.google.co.uk google.html

import requests
import sys
import os

def downloadFile(url, filePath):
    # Fetch first, so a failed request does not leave an empty file behind
    response = requests.get(url)
    response.raise_for_status()
    with open(filePath, "wb") as file:
        file.write(response.content)


scriptPath = sys.path[0]
downloadPath = os.path.join(scriptPath, '../Downloads/')
url = sys.argv[1]
fileName = sys.argv[2]
print('path of the script: ' + scriptPath)
print('downloading file to: ' + downloadPath)
downloadFile(url, os.path.join(downloadPath, fileName))
print('file downloaded...')
print('exiting program...')
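
If you would rather not type the file name by hand, it can often be derived from the URL. This is a minimal sketch: filename_from_url and the default name download.bin are illustrative choices, and the approach assumes the last path segment of the URL is a sensible file name, which is not true for every site.

```python
import posixpath
from urllib.parse import unquote, urlparse


def filename_from_url(url, default="download.bin"):
    """Return the last path segment of url, or default if there is none."""
    path = urlparse(url).path
    name = posixpath.basename(unquote(path))
    return name or default


print(filename_from_url("http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf"))
# HRTVBSH.Metadata.pdf
print(filename_from_url("http://www.google.co.uk"))
# download.bin
```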

Answer 5

Answered by jugi

You can use urllib:

import urllib.request

url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
urllib.request.urlretrieve(url, "filename.pdf")
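
To try urlretrieve without touching the network, it can be pointed at a file:// URL. This is only a sketch: the source file and its contents are placeholders created for the demonstration, and the call works the same way with an http:// URL.

```python
import tempfile
import urllib.request
from pathlib import Path

with tempfile.TemporaryDirectory() as folder:
    source = Path(folder) / "source.pdf"
    source.write_bytes(b"%PDF-1.4\nplaceholder bytes, not a real document\n")
    target = Path(folder) / "copy.pdf"

    # urlretrieve returns the local file name and the response headers.
    saved_path, headers = urllib.request.urlretrieve(source.as_uri(), str(target))
    copied = target.read_bytes()

print(copied.startswith(b"%PDF-"))  # True
```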
