Python: How to use `urlparse` to check whether a URL is valid

Question

Asked by Ziva

I want to check whether a URL is valid, before I open it to read data.

I was using the function `urlparse` from the `urlparse` package:

import urlparse  # Python 2 module; in Python 3 use urllib.parse

if urlparse.urlparse(url).netloc:
    # do something like: open and read using urllib2

However, I noticed that some valid URLs are treated as broken, for example:

url = "upload.wikimedia.org/math/8/8/d/88d27d47cea8c88adf93b1881eda318d.png"

This URL is valid (I can open it using my browser).

Is there a better way to check if the URL is valid?

Answer 1

Answered by vil

A URL without a scheme is actually invalid; your browser is just clever enough to suggest http:// as the scheme for it. A good solution may be to check whether the URL lacks a scheme (not re.match(r'^[a-zA-Z]+://', url)) and, if so, prepend http:// to it.
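
As a minimal sketch of that idea (the helper name `ensure_scheme` and its `default_scheme` parameter are made up for illustration and are not part of any library):

```python
import re


def ensure_scheme(url, default_scheme="http"):
    """Return url unchanged if it starts with '<letters>://', else prepend a scheme."""
    # Same regular expression as in the answer: letters followed by '://'.
    if not re.match(r'^[a-zA-Z]+://', url):
        return default_scheme + "://" + url
    return url


fixed = ensure_scheme("upload.wikimedia.org/math/8/8/d/88d27d47cea8c88adf93b1881eda318d.png")
```

Note that this only guesses a scheme; it does not tell you whether the resulting URL can actually be opened.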

Answer 2

Answered by xbello

You can check whether the URL has a scheme:

>>> import urlparse  # Python 2; on Python 3 use: from urllib import parse as urlparse
>>> url = "no.scheme.com/math/12345.png"
>>> parsed_url = urlparse.urlparse(url)
>>> bool(parsed_url.scheme)
False

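If that's the case, you can replace the scheme. Note that the whole string was parsed as a path, so the result has an empty netloc (three slashes after http:); prepending http:// to the string before parsing is usually closer to what you want: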

>>> parsed_url.geturl()
'no.scheme.com/math/12345.png'
>>> parsed_url = parsed_url._replace(scheme="http")
>>> parsed_url.geturl()
'http:///no.scheme.com/math/12345.png'
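
For comparison, here is a small sketch (Python 3 import; `with_default_scheme` is a hypothetical helper name, not a library function) that prepends the scheme to the string and parses it again, so the host ends up in `netloc` rather than in `path`:

```python
from urllib.parse import urlparse


def with_default_scheme(url, scheme="http"):
    """Parse url; if it has no scheme, re-parse it with '<scheme>://' prepended."""
    parsed = urlparse(url)
    if not parsed.scheme:
        # Re-parse the full string so the host is recognised as netloc.
        parsed = urlparse(scheme + "://" + url)
    return parsed


parsed_url = with_default_scheme("no.scheme.com/math/12345.png")
# parsed_url.netloc is 'no.scheme.com'
# parsed_url.geturl() is 'http://no.scheme.com/math/12345.png'
```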

Answer 3

Answered by abdullahselek

You can try the function below, which checks the scheme, netloc and path attributes of the result of parsing the URL. It supports both Python 2 and 3.

try:
    # python 3
    from urllib.parse import urlparse
except ImportError:
    from urlparse import urlparse

def url_validator(url):
    try:
        result = urlparse(url)
        return all([result.scheme, result.netloc, result.path])
    except (AttributeError, TypeError, ValueError):  # non-string input or malformed URL
        return False

Answer 4

Answered by John Paraskevopoulos

TL;DR: You can't, really. Every answer given so far misses one or more cases.

1. The string is google.com (invalid since there is no scheme, even though a browser assumes http by default). urlparse will be missing scheme and netloc, so all([result.scheme, result.netloc, result.path]) seems to work for this case.
2. The string is http://google (invalid since .com is missing). urlparse will be missing only path. Again, all([result.scheme, result.netloc, result.path]) seems to catch this case, though only because the path is empty.
3. The string is http://google.com/ (correct). urlparse will populate scheme, netloc and path, so for this case all([result.scheme, result.netloc, result.path]) works fine.
4. The string is http://google.com (correct). urlparse will be missing only path, so for this case all([result.scheme, result.netloc, result.path]) gives a false negative.
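
To see the four cases concretely, here is a small sketch for Python 3 that applies the same all([...]) check to each string. The function is a trimmed copy of `url_validator` from the earlier answer, without the try/except; the `cases` and `results` names are just for illustration.

```python
from urllib.parse import urlparse


def url_validator(url):
    """Same check as the earlier answer: scheme, netloc and path must all be non-empty."""
    result = urlparse(url)
    return all([result.scheme, result.netloc, result.path])


cases = ["google.com", "http://google", "http://google.com/", "http://google.com"]
results = {url: url_validator(url) for url in cases}

for url in cases:
    print(url, results[url])
# google.com          False  (no scheme, no netloc)
# http://google       False  (empty path, so rejected by accident)
# http://google.com/  True
# http://google.com   False  (empty path: the false negative)
```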

So from the above cases you see that the one that comes closest to a solution is all([result.scheme, result.netloc, result.path]). But this works only in cases where the url contains a path (even if that is the / path).

Even if you try to enforce a path, i.e. urlparse(urljoin(your_url, "/")), you will still get a false positive in case 2.

Maybe something more complicated, like:

from urllib.parse import urljoin, urlparse  # Python 3; Python 2: from urlparse import ...

final_url = urlparse(urljoin(your_url, "/"))
is_correct = (all([final_url.scheme, final_url.netloc, final_url.path])
              and len(final_url.netloc.split(".")) > 1)

Maybe you also want to skip the scheme check and assume http if there is no scheme. But even this only gets you so far. Although it covers the above cases, it doesn't fully cover cases where a URL contains an IP address instead of a hostname. For such cases you will have to validate that the IP address is a correct one. And there are more scenarios as well. See https://en.wikipedia.org/wiki/URL to think of even more cases.
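
One possible way to fold the IP check into the heuristic above is the standard-library `ipaddress` module. This is only a rough sketch under those assumptions, not a complete validator; the function name `looks_like_valid_url` is invented for illustration, and it still requires a scheme.

```python
import ipaddress
from urllib.parse import urljoin, urlparse


def looks_like_valid_url(url):
    """Heuristic check: scheme present and host is a valid IP or a dotted name."""
    try:
        parsed = urlparse(urljoin(url, "/"))  # enforce a path, as in the answer
        host = parsed.hostname                # netloc without port or brackets
    except ValueError:                        # e.g. malformed IPv6 literal
        return False
    if not (parsed.scheme and host):
        return False
    try:
        ipaddress.ip_address(host)            # accepts IPv4 and IPv6 literals
        return True
    except ValueError:
        pass
    labels = host.split(".")
    if all(label.isdigit() for label in labels):
        return False                          # shaped like IPv4 but not a valid address
    return len(labels) > 1                    # same "has a dot" rule as the answer
```

With this sketch, "http://google.com" and "http://127.0.0.1:8000/x" are accepted, while "google.com", "http://google" and "http://999.1.1.1/" are rejected. Hostnames such as localhost are rejected too, so adjust the last rule if you need them.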
