How do I get a PDF filename using Python requests?

Question

Asked by kramer65

I'm using the Python requests library to get a PDF file from the web. This works fine, but I now also want the original filename. If I go to a PDF file in Firefox and click download, it already has a filename defined for saving the PDF. How do I get this filename?

For example:

import requests
r = requests.get('http://www.researchgate.net/profile/M_Gotic/publication/260197848_Mater_Sci_Eng_B47_%281997%29_33/links/0c9605301e48beda0f000000.pdf')
print(r.headers['content-type'])  # prints 'application/pdf'

I checked r.headers for anything interesting, but there's no filename in there. I was actually hoping for something like r.filename.

Does anybody know how I can get the filename of a downloaded PDF file with the requests library?

Answer 1

Accepted answer by user3255354

It is specified in the HTTP header Content-Disposition. To extract the name you would do:

import re
d = r.headers['content-disposition']
fname = re.findall("filename=(.+)", d)[0]

The name is extracted from the header string with a regular expression (the re module).
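
The regular expression above captures everything after filename=, so surrounding quotes or trailing parameters may end up in the result. As a possible alternative, here is a minimal sketch that lets the standard library's email.message.Message parse the header value instead. The helper name filename_from_content_disposition and the sample header strings are made up for illustration; they are not part of requests or of the original answer.

```python
from email.message import Message


def filename_from_content_disposition(header_value):
    """Return the filename parameter of a Content-Disposition value, or None.

    Hypothetical helper: the header value is handed to the standard library's
    MIME header parsing instead of a hand-written regular expression.
    """
    msg = Message()
    msg["Content-Disposition"] = header_value
    # get_filename() reads the "filename" parameter and strips any quotes.
    return msg.get_filename()


# Sample header values (assumed for illustration, not taken from a real server).
print(filename_from_content_disposition('attachment; filename="report.pdf"'))
print(filename_from_content_disposition("attachment; filename=report.pdf; size=1024"))
print(filename_from_content_disposition("inline"))
```

With a real response you would pass r.headers.get('content-disposition', '') to the helper and treat a None result as "no filename supplied".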

Answer 2

Answered by Maksim Solovjov

Apparently, for this particular resource it is in:

r.headers['content-disposition']

Don't know if it is always the case, though.

Answer 3

Answered by Nilpo

Building on some of the other answers, here's how I do it. If there isn't a Content-Disposition header, I parse the filename from the download URL:

import re
import requests
from requests.exceptions import RequestException


url = 'http://www.example.com/downloads/sample.pdf'

try:
    with requests.get(url) as r:
        # r.headers is case-insensitive, so membership can be tested directly.
        if "Content-Disposition" in r.headers:
            fname = re.findall("filename=(.+)", r.headers["Content-Disposition"])[0]
        else:
            # No header: fall back to the last path segment of the URL.
            fname = url.split("/")[-1]

        print(fname)
except RequestException as e:
    print(e)

There are arguably better ways of parsing the URL string, but for simplicity I didn't want to involve any more libraries.
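
One of those alternatives stays within the standard library. The sketch below uses urllib.parse to drop any query string or fragment and to decode percent-escapes before taking the last path segment. The helper name filename_from_url and the example URLs are assumptions for illustration only.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url):
    """Return the decoded last path segment of a URL.

    Hypothetical helper: returns an empty string when the path ends in '/'.
    """
    # urlsplit() separates the path from the query string and the fragment.
    path = urlsplit(url).path
    # Take the last segment first, then decode escapes such as %20.
    return unquote(posixpath.basename(path))


# Example URLs (assumed for illustration).
print(filename_from_url("http://www.example.com/downloads/sample.pdf?download=1#page=2"))
print(filename_from_url("http://www.example.com/files/My%20Report.pdf"))
```

Unlike url.split("/")[-1], this does not carry a trailing ?download=1 into the filename.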

Answer 4

Answered by myildirim

You can use werkzeug to parse option headers: https://werkzeug.palletsprojects.com/en/0.15.x/http/#werkzeug.http.parse_options_header

>>> from werkzeug.http import parse_options_header
>>> parse_options_header('text/html; charset=utf8')
('text/html', {'charset': 'utf8'})
