Python: how to check whether a URL is valid using `urlparse`?

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/25259134/

Time: 2020-08-18 19:58:09  Source: igfitidea

How can I check whether a URL is valid using `urlparse`?

python urllib2 url-parsing urlparse

Asked by Ziva

I want to check whether a URL is valid, before I open it to read data.

I was using the `urlparse` function from the `urlparse` package:

if not bool(urlparse.urlparse(url).netloc):
    # do something like: open and read using urllib2

However, I noticed that some valid URLs are treated as broken, for example:

url = "upload.wikimedia.org/math/8/8/d/88d27d47cea8c88adf93b1881eda318d.png"

This URL is valid (I can open it using my browser).

Is there a better way to check if the URL is valid?

Answered by vil

A URL without a scheme is actually invalid; your browser is just clever enough to suggest http:// as the scheme for it. A reasonable solution is to check whether the URL lacks a scheme (`not re.match(r'^[a-zA-Z]+://', url)`) and, if so, prepend http:// to it.
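A minimal sketch of that check, assuming Python 3. The helper name `ensure_scheme` and its `default_scheme` parameter are hypothetical; the regular expression is the one quoted in the answer.

```python
import re

def ensure_scheme(url, default_scheme="http"):
    """Prepend a default scheme when the string has no 'scheme://' prefix."""
    # Same pattern as in the answer: letters followed by '://'.
    if not re.match(r'^[a-zA-Z]+://', url):
        return default_scheme + "://" + url
    return url

print(ensure_scheme("upload.wikimedia.org/math/8/8/d/88d27d47cea8c88adf93b1881eda318d.png"))
print(ensure_scheme("https://example.com/page"))
```

Note that this only normalizes the string. It does not show that the host exists or that the resource can be opened.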

Answered by xbello

You can check whether the URL has a scheme:

>>> import urlparse  # Python 2 module; in Python 3 use urllib.parse
>>> url = "no.scheme.com/math/12345.png"
>>> parsed_url = urlparse.urlparse(url)
>>> bool(parsed_url.scheme)
False

If that is the case, you can replace the scheme and get a real, valid URL:

>>> parsed_url.geturl()
'no.scheme.com/math/12345.png'
>>> parsed_url = parsed_url._replace(**{"scheme": "http"})
>>> parsed_url.geturl()
'http:///no.scheme.com/math/12345.png'
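In the session above the host was parsed as part of the path, so the rebuilt URL has an empty netloc (note the three slashes). One possible workaround, sketched here for Python 3 with a hypothetical helper name `with_default_scheme`, is to re-parse the string with a leading `//` so that the host lands in `netloc` before the scheme is filled in.

```python
from urllib.parse import urlparse

def with_default_scheme(url, scheme="http"):
    """Return url unchanged if it has a scheme, else rebuild it with one."""
    if urlparse(url).scheme:
        return url
    # A leading '//' makes urlparse treat the first segment as the netloc.
    parsed = urlparse("//" + url)
    return parsed._replace(scheme=scheme).geturl()

print(with_default_scheme("no.scheme.com/math/12345.png"))
```

This is only a sketch: strings such as `host:8080/path` may be read as having a scheme, depending on the Python version, and would need separate handling.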

Answered by abdullahselek

You can try the function below, which checks the `scheme`, `netloc` and `path` attributes that result from parsing the URL. It supports both Python 2 and 3.

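try:
    # python 3
    from urllib.parse import urlparse
except ImportError:
    # python 2
    from urlparse import urlparse

def url_validator(url):
    try:
        result = urlparse(url)
        # valid only if scheme, netloc and path are all non-empty
        return all([result.scheme, result.netloc, result.path])
    except (ValueError, AttributeError):
        # malformed URL (e.g. a bad IPv6 literal) or a non-string argument
        return False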

Answered by John Paraskevopoulos

TL;DR: You can't actually. Every answer given already misses 1 or more cases.

  1. String is `google.com` (invalid since there is no scheme, even though a browser assumes http by default). `urlparse` will be missing scheme and netloc, so `all([result.scheme, result.netloc, result.path])` seems to work for this case.
  2. String is `http://google` (invalid since .com is missing). `urlparse` will be missing only the path. Again, `all([result.scheme, result.netloc, result.path])` seems to catch this case.
  3. String is `http://google.com/` (correct). `urlparse` will populate scheme, netloc and path, so for this case `all([result.scheme, result.netloc, result.path])` works fine.
  4. String is `http://google.com` (correct). `urlparse` will be missing only the path, so for this case `all([result.scheme, result.netloc, result.path])` gives a false negative.
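The four cases can be reproduced with a short script. This is an illustrative sketch for Python 3; the helper name `all_parts_present` is hypothetical and simply wraps the `all([...])` expression discussed above.

```python
from urllib.parse import urlparse

def all_parts_present(url):
    """The all([...]) check from the list above, wrapped in a function."""
    result = urlparse(url)
    return all([result.scheme, result.netloc, result.path])

# Print the parsed parts next to the verdict for each of the four cases.
for candidate in ["google.com", "http://google",
                  "http://google.com/", "http://google.com"]:
    parts = urlparse(candidate)
    print(candidate, (parts.scheme, parts.netloc, parts.path),
          all_parts_present(candidate))
```

Only `http://google.com/` passes; `http://google.com` is rejected solely because its path is empty.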

So from the above cases you see that the one that comes closest to a solution is all([result.scheme, result.netloc, result.path]). But this works only in cases where the url contains a path (even if that is the / path).

Even if you try to enforce a path (i.e. `urlparse(urljoin(your_url, "/"))`), you will still get a false positive in case 2.

Maybe something more complicated like

from urllib.parse import urljoin, urlparse  # Python 2: from urlparse import urljoin, urlparse

final_url = urlparse(urljoin(your_url, "/"))
is_correct = (all([final_url.scheme, final_url.netloc, final_url.path])
              and len(final_url.netloc.split(".")) > 1)

Maybe you also want to skip scheme checking and assume http if there is no scheme. But even this only gets you so far. Although it covers the above cases, it doesn't fully cover cases where a URL contains an IP address instead of a hostname. For such cases you will have to validate that the IP is a correct IP. And there are more scenarios as well. See https://en.wikipedia.org/wiki/URL to think of even more cases.
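One way to extend the check to IP hosts, sketched for Python 3 using the standard `ipaddress` module. The function name `looks_like_valid_url` and the rule that an all-numeric host must be a valid IP address are assumptions made for illustration, not part of the answer, and the result is still a heuristic.

```python
import ipaddress
from urllib.parse import urljoin, urlparse

def looks_like_valid_url(url):
    """Heuristic check: scheme, host and path present, host is an IP or dotted name."""
    try:
        parsed = urlparse(urljoin(url, "/"))  # force a path, as in the answer
        host = parsed.hostname
    except ValueError:
        return False  # e.g. a malformed IPv6 literal
    if not all([parsed.scheme, parsed.netloc, parsed.path]) or not host:
        return False
    try:
        ipaddress.ip_address(host)
        return True  # host is a well-formed IPv4 or IPv6 address
    except ValueError:
        pass
    labels = host.split(".")
    if all(label.isdigit() for label in labels):
        return False  # looks numeric but is not a valid IP (assumed rule)
    # Require at least two non-empty labels, e.g. "google.com".
    return len(labels) > 1 and all(labels)

for candidate in ["http://google.com", "http://google",
                  "http://192.168.0.1/x", "http://999.1.1.1"]:
    print(candidate, looks_like_valid_url(candidate))
```

Because the scheme is still required here, a bare `google.com` is rejected; assuming http for scheme-less input would have to be handled before calling the function.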