How to safely get the file extension from a URL?

Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/4776924/

Time: 2020-08-18 17:27:45  Source: igfitidea

python file

Asked by frigg

Consider the following URLs:

http://m3u.com/tunein.m3u
http://asxsomeurl.com/listen.asx:8024
http://www.plssomeotherurl.com/station.pls?id=111
http://22.198.133.16:8024

What's the proper way to determine the file extensions (.m3u/.asx/.pls)? Obviously the last one doesn't have a file extension.

EDIT: I forgot to mention that m3u/asx/pls are playlists (text files) for audio streams and must be parsed differently. The goal is to determine the extension and then send the URL to the proper parsing function, e.g.:

url = argv[1]
ext = GetExtension(url)
if ext == "pls":
  realurl = ParsePLS(url)
elif ext == "asx":
  realurl = ParseASX(url)
# (etc.)
else:
  realurl = url
Play(realurl)

GetExtension() should return the file extension (if any), preferably without connecting to the URL.

Accepted answer by Greg Hewgill

The real proper way is to not use file extensions at all. Do a GET (or HEAD) request to the URL in question, and use the returned "Content-Type" HTTP header to get the content type. File extensions are unreliable.

See MIME types (IANA media types) for more information and a list of useful MIME types.
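As a rough illustration of this idea, the sketch below chooses a playlist kind from a Content-Type header value rather than from a file extension. The media type strings in the table are assumptions for illustration only: servers label playlists inconsistently, so the table would need adjusting to whatever your streams actually send. The function names are hypothetical as well.

```python
# Media types sometimes seen for playlists. These strings are assumptions
# for illustration; real servers may send different (or wrong) values.
PLAYLIST_KINDS = {
    "audio/x-mpegurl": "m3u",
    "audio/mpegurl": "m3u",
    "audio/x-scpls": "pls",
    "video/x-ms-asf": "asx",
}


def media_type(content_type_header):
    """Return the bare, lowercased media type, dropping parameters such as charset."""
    return content_type_header.split(";", 1)[0].strip().lower()


def playlist_kind(content_type_header):
    """Return 'm3u', 'pls' or 'asx', or None if the type is not a known playlist."""
    if not content_type_header:
        return None
    return PLAYLIST_KINDS.get(media_type(content_type_header))
```

The returned kind could then drive the same if/elif dispatch as in the question, with None meaning "play the URL directly".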

Answer by Spacedman

Use urlparse; that will get most of the above sorted:

http://docs.python.org/library/urlparse.html

Then split the "path" up. You might be able to split it using os.path.split, but your second example, with the :8024 on the end, needs manual handling. Are your file extensions always three letters? Or always letters and numbers? If so, use a regular expression.
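One way this could look in Python 3 (where urlparse lives in urllib.parse) is sketched below. It assumes extensions consist only of letters and digits and that a trailing ":digits" on the path is a misplaced port, as in the second example URL. The function name and both regular expressions are illustrative choices, not part of the original answer.

```python
import posixpath
import re
from urllib.parse import urlparse

# Assumption: a trailing ":1234" on the path is a misplaced port number.
_TRAILING_PORT = re.compile(r":\d+$")
# Assumption: extensions are made of ASCII letters and digits only.
_EXTENSION = re.compile(r"\.([A-Za-z0-9]+)$")


def guess_extension_from_url(url):
    """Return the lowercased extension without the dot, or None if there is none."""
    path = urlparse(url).path             # drops scheme, host and query string
    path = _TRAILING_PORT.sub("", path)   # manual handling for ".asx:8024"
    name = posixpath.basename(path)       # last path segment only
    match = _EXTENSION.search(name)
    return match.group(1).lower() if match else None
```

Because only the path component is inspected, the dots in a host name or IP address are never mistaken for an extension.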

Answer by payne

Use urlparse to parse the path out of the URL, then os.path.splitext to get the extension.

import os
import urlparse  # Python 2 module; in Python 3 use: from urllib import parse as urlparse

url = 'http://www.plssomeotherurl.com/station.pls?id=111'
path = urlparse.urlparse(url).path
ext = os.path.splitext(path)[1]

Note that the extension may not be a reliable indicator of the type of the file. The HTTP Content-Type header may be better.

Answer by Laurence Gonsalves

File extensions are basically meaningless in URLs. For example, if you go to http://code.google.com/p/unladen-swallow/source/browse/branches/release-2009Q1-maint/Lib/psyco/support.py?r=292 do you want the extension to be ".py" despite the fact that the page is HTML, not Python?

Use the Content-Type header to determine the "type" of a URL.

Answer by Corey Goldberg

$ python3
Python 3.1.2 (release31-maint, Sep 17 2010, 20:27:33) 
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from os.path import splitext
>>> from urllib.parse import urlparse 
>>> 
>>> urls = [
...     'http://m3u.com/tunein.m3u',
...     'http://asxsomeurl.com/listen.asx:8024',
...     'http://www.plssomeotherurl.com/station.pls?id=111',
...     'http://22.198.133.16:8024',
... ]
>>> 
>>> for url in urls:
...     path = urlparse(url).path
...     ext = splitext(path)[1]
...     print(ext)
... 
.m3u
.asx:8024
.pls

>>> 

Answer by DDC

To get the content type, you can write a function like the one I have written using urllib2. If you need to use the page content anyway, you will likely be using urllib2 already, so there is no need to import os.

import urllib2

def getContentType(pageUrl):
    page = urllib2.urlopen(pageUrl)
    pageHeaders = page.headers
    contentType = pageHeaders.getheader('content-type')
    return contentType
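urllib2 exists only in Python 2. A minimal Python 3 sketch of the same idea, assuming the standard urllib.request module is acceptable, might look like this. The function name, parameter name and 10-second timeout are arbitrary choices for illustration.

```python
from urllib.request import urlopen


def get_content_type(page_url):
    """Open the URL and return the media type from its Content-Type header."""
    with urlopen(page_url, timeout=10) as response:
        # response.headers is an email.message.Message; get_content_type()
        # returns the lowercased "type/subtype" without parameters, and
        # falls back to "text/plain" when the header is missing.
        return response.headers.get_content_type()
```

Calling get_content_type() on one of the playlist URLs from the question would perform a real GET request, so it needs network access and the usual error handling around it.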

Answer by Seth

This is easiest with requests and mimetypes:

import requests
import mimetypes

response = requests.get(url)
content_type = response.headers['content-type']
extension = mimetypes.guess_extension(content_type)

The extension includes a dot prefix. For example, extension is '.png' for content type 'image/png'.
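One caveat worth sketching: Content-Type headers often carry parameters, for example 'text/html; charset=utf-8', and as far as I know mimetypes.guess_extension looks the string up as given, so such a value would yield None. A small helper (the name is hypothetical) that strips the parameters first could look like this:

```python
import mimetypes


def extension_for_content_type(content_type_header):
    """Guess a dotted extension (e.g. '.png') from a Content-Type header value."""
    # Drop parameters such as "; charset=utf-8" before the lookup.
    bare_type = content_type_header.split(";", 1)[0].strip().lower()
    # guess_extension returns None for media types it does not know.
    return mimetypes.guess_extension(bare_type)
```

It would be called with the header value, e.g. extension_for_content_type(response.headers['content-type']).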

Answer by tom mike

You can try the rfc6266 module, like:

import requests
import rfc6266

req = requests.head(downloadLink)
headersContent = req.headers['Content-Disposition']
rfcFilename = rfc6266.parse_headers(headersContent, relaxed=True).filename_unsafe
filename = requests.utils.unquote(rfcFilename)
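If installing rfc6266 is not an option, a rough standard-library alternative is sketched below. It does not use the rfc6266 API at all: it leans on email.message.Message to parse the Content-Disposition value, and it probably handles fewer malformed headers than a dedicated parser. The function name is hypothetical.

```python
import os.path
from email.message import Message


def extension_from_content_disposition(header_value):
    """Return the dotted extension of the filename in a Content-Disposition value, or None."""
    msg = Message()
    msg["Content-Disposition"] = header_value
    filename = msg.get_filename()  # None when no filename parameter is present
    if not filename:
        return None
    # splitext returns '' when the name has no extension; map that to None.
    return os.path.splitext(filename)[1] or None
```

As with the snippet above, the header value would come from a HEAD or GET response, and the server-supplied filename should be treated as untrusted input.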

Answer by Supergnaw

A different approach that takes nothing into account except the actual file extension in the URL:

import re

def fileExt( url ):
    # compile regular expressions
    reQuery = re.compile( r'\?.*$', re.IGNORECASE )
    rePort = re.compile( r':[0-9]+', re.IGNORECASE )
    reExt = re.compile( r'(\.[A-Za-z0-9]+$)', re.IGNORECASE )

    # remove query string
    url = reQuery.sub( "", url )

    # remove port
    url = rePort.sub( "", url )

    # extract extension
    matches = reExt.search( url )
    if matches is not None:
        return matches.group( 1 )
    return None

Edit: added handling of explicit ports such as :1234