Validating URLs in Python
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/22238090/
Asked by mp94
I've been trying to figure out what the best way to validate a URL is (specifically in Python) but haven't really been able to find an answer. It seems like there isn't one known way to validate a URL, and it depends on what URLs you think you may need to validate. As well, I found it difficult to find an easy to read standard for URL structure. I did find the RFCs 3986 and 3987, but they contain much more than just how it is structured.
Am I missing something, or is there no one standard way to validate a URL?
Answer by bgschiller
This looks like it might be a duplicate of "How do you validate a URL with a regular expression in Python?"
You should be able to use the urlparse library described there.
>>> from urllib.parse import urlparse # python2: from urlparse import urlparse
>>> urlparse('actually not a url')
ParseResult(scheme='', netloc='', path='actually not a url', params='', query='', fragment='')
>>> urlparse('http://google.com')
ParseResult(scheme='http', netloc='google.com', path='', params='', query='', fragment='')
Call urlparse on the string you want to check, then make sure that the ParseResult has non-empty scheme and netloc attributes.
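Putting that check into a small helper might look like this (a minimal sketch; the function name is my own):

from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

def looks_like_url(candidate):
    # A usable URL parses into both a scheme (e.g. 'http') and a netloc (e.g. 'google.com')
    parts = urlparse(candidate)
    return bool(parts.scheme) and bool(parts.netloc)

print(looks_like_url('http://google.com'))   # True
print(looks_like_url('actually not a url'))  # False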
Answer by mdw7326
Assuming you are using Python 3, you could use urllib. The code would go something like this:
import urllib.request as req

def foo():
    url = 'http://bar.com'
    request = req.Request(url)
    try:
        response = req.urlopen(request)
        # response.read() returns the page's HTML, which you can search through
    except Exception:
        # The URL wasn't valid (or could not be reached)
        pass
If there is no error on the line "response = ..." then the URL is valid.
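Note that this approach actually fetches the page, so it checks that the URL is reachable, not just well formed. A variant of the same idea (my own sketch, not part of the original answer) can send a HEAD request so the response body is never downloaded:

import urllib.request as req

def url_is_reachable(url):
    # HEAD asks the server for headers only, avoiding a full page download
    request = req.Request(url, method='HEAD')
    try:
        req.urlopen(request, timeout=5)
        return True
    except Exception:
        # Covers URLError/HTTPError as well as ValueError for strings with no scheme
        return False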
Answer by Hamza
You can also try using urllib.request to validate, by passing the URL to the urlopen function and catching the URLError exception.
from urllib.request import urlopen
from urllib.error import URLError

def validate_web_url(url="http://google"):
    # Note: urlopen makes an actual network request, so this checks reachability as well as syntax
    try:
        urlopen(url)
        return True
    except URLError:
        return False
This would return False in this case.
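For example (illustrative calls, not part of the original answer):

print(validate_web_url('https://www.python.org'))  # True, assuming the host is reachable
print(validate_web_url())                          # False: 'http://google' typically fails to resolve, raising URLError
# Caveat: a string with no scheme at all, e.g. 'not a url', makes urlopen raise
# ValueError rather than URLError, so that case is not caught by this function.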
Answer by Chris Modzelewski
The original question is a bit old, but you might also want to look at the Validator-Collection library I released a few months back. It includes high-performing regex-based validation of URLs for compliance against the RFC standard. Some details:
- Tested against Python 2.7, 3.4, 3.5, 3.6, 3.7, and 3.8
- No dependencies on Python 3.x, one conditional dependency in Python 2.x (a drop-in replacement for Python 2.x's buggy re module)
- Unit tests that cover 100+ different succeeding/failing URL patterns, including non-standard characters and the like. As close to covering the whole spectrum of the RFC standard as I've been able to find.
It's also very easy to use:
from validator_collection import validators, checkers
checkers.is_url('http://www.stackoverflow.com')
# Returns True
checkers.is_url('not a valid url')
# Returns False
value = validators.url('http://www.stackoverflow.com')
# value set to 'http://www.stackoverflow.com'
value = validators.url('not a valid url')
# raises a validator_collection.errors.InvalidURLError (which is a ValueError)
value = validators.url('https://123.12.34.56:1234')
# value set to 'https://123.12.34.56:1234'
value = validators.url('http://10.0.0.1')
# raises a validator_collection.errors.InvalidURLError (which is a ValueError)
value = validators.url('http://10.0.0.1', allow_special_ips = True)
# value set to 'http://10.0.0.1'
In addition, Validator-Collection includes 60+ other validators, including IP addresses (IPv4 and IPv6), domains, and email addresses, so folks might find it useful.
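As a small illustrative sketch of handling the raised error (my own, relying on the comments above that InvalidURLError is a ValueError):

from validator_collection import validators

def clean_url(value):
    # validators.url raises validator_collection.errors.InvalidURLError, a ValueError
    # subclass, so catching ValueError covers invalid input
    try:
        return validators.url(value)
    except ValueError:
        return None

print(clean_url('http://www.stackoverflow.com'))  # 'http://www.stackoverflow.com'
print(clean_url('not a valid url'))               # None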

