Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/31804799/

Date: 2020-08-19 10:33:37  Source: igfitidea

How to get pdf filename with Python requests?

python pdf python-requests filenames

Asked by kramer65

I'm using the Python requests lib to get a PDF file from the web. This works fine, but I now also want the original filename. If I go to a PDF file in Firefox and click download, it already has a filename defined to save the PDF. How do I get this filename?

For example:

import requests
r = requests.get('http://www.researchgate.net/profile/M_Gotic/publication/260197848_Mater_Sci_Eng_B47_%281997%29_33/links/0c9605301e48beda0f000000.pdf')
print(r.headers['content-type'])  # prints 'application/pdf'

I checked r.headers for anything interesting, but there's no filename in there. I was actually hoping for something like r.filename.

Does anybody know how I can get the filename of a downloaded PDF file with the requests library?

Accepted answer by user3255354

It is specified in the HTTP header Content-Disposition. So to extract the name you would do:

import re
d = r.headers['content-disposition']
fname = re.findall("filename=(.+)", d)[0]

The name is extracted from the header string with a regular expression (the re module).
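
One caveat with the pattern above: when the server quotes the value, as in `attachment; filename="report.pdf"`, the captured group keeps the surrounding quotes. As a possible alternative, here is a minimal sketch that lets the standard library's `email.message.Message` parse the header instead. The helper name `filename_from_content_disposition` is made up for this illustration, and the sketch assumes you pass it the raw header string (for example `r.headers.get('content-disposition')`).

```python
from email.message import Message


def filename_from_content_disposition(header_value):
    """Hypothetical helper: return the filename parameter, or None if absent."""
    if not header_value:
        return None
    msg = Message()
    # Let the stdlib parse the header's parameters, including quoted values.
    msg['Content-Disposition'] = header_value
    return msg.get_filename()


# Example header values (not taken from a real response):
print(filename_from_content_disposition('attachment; filename="report.pdf"'))
print(filename_from_content_disposition('inline'))
```

With requests, this would be called as `filename_from_content_disposition(r.headers.get('content-disposition'))`, which also copes with responses that omit the header.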

Answer by Maksim Solovjov

Apparently, for this particular resource it is in:

r.headers['content-disposition']

I don't know whether that is always the case, though.

Answer by Nilpo

Building on some of the other answers, here's how I do it. If there isn't a Content-Disposition header, I parse the filename from the download URL:

import re
import requests
from requests.exceptions import RequestException


url = 'http://www.example.com/downloads/sample.pdf'

try:
    with requests.get(url) as r:
        # Prefer the server-supplied name from the Content-Disposition header.
        if "Content-Disposition" in r.headers:
            fname = re.findall("filename=(.+)", r.headers["Content-Disposition"])[0]
        else:
            # Otherwise fall back to the last segment of the URL.
            fname = url.split("/")[-1]

        print(fname)
except RequestException as e:
    print(e)

There are arguably better ways of parsing the URL string, but for simplicity I didn't want to involve any more libraries.
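
For what it's worth, one such way stays within the standard library. The sketch below uses `urllib.parse` so that a query string or fragment does not end up in the name, which a plain `url.split("/")[-1]` would include. The helper name `filename_from_url` and the example URLs are assumptions for illustration, not part of the answer above.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url):
    """Hypothetical helper: return the last path segment of a URL, percent-decoded."""
    # urlsplit separates the path from the query string and fragment.
    path = urlsplit(url).path
    # URL paths always use forward slashes, hence posixpath rather than os.path.
    return unquote(posixpath.basename(path))


print(filename_from_url('http://www.example.com/downloads/sample.pdf?download=1'))
print(filename_from_url('http://www.example.com/downloads/my%20file.pdf'))
```

Note that the result is an empty string when the URL path ends in a slash, so callers may still want a default name for that case.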

Answer by myildirim

You can use werkzeug to parse options headers: https://werkzeug.palletsprojects.com/en/0.15.x/http/#werkzeug.http.parse_options_header

>>> from werkzeug.http import parse_options_header
>>> parse_options_header('text/html; charset=utf8')
('text/html', {'charset': 'utf8'})