Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must follow the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/16778435/

Python check if website exists

Tags: python, html, urlopen

Asked by James Hallen

I wanted to check if a certain website exists, this is what I'm doing:

user_agent = 'Mozilla/20.0.1 (compatible; MSIE 5.5; Windows NT)'
headers = { 'User-Agent':user_agent }
link = "http://www.abc.com"
req = urllib2.Request(link, headers = headers)
page = urllib2.urlopen(req).read() - ERROR 402 generated here!

If the page doesn't exist (error 402, or whatever other errors), what can I do in the page = ... line to make sure that the page I'm reading does exist?

Accepted answer by Adem Öztaş

You can use a HEAD request instead of GET. It only downloads the headers, not the content, and then you can check the response status.

import httplib  # Python 2 module; renamed to http.client in Python 3
c = httplib.HTTPConnection('www.example.com')
c.request("HEAD", '')
if c.getresponse().status == 200:
    print('web site exists')
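In Python 3 httplib was renamed to http.client. A minimal sketch of the same HEAD check for Python 3 (the function name and timeout are my own choices, and the host must actually answer HEAD requests):

```python
# Python 3 equivalent of the httplib snippet above: http.client replaces httplib.
from http.client import HTTPConnection

def site_exists(host):
    try:
        conn = HTTPConnection(host, timeout=5)
        conn.request("HEAD", "/")
        status = conn.getresponse().status
        conn.close()
        return status < 400
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False
```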

or you can use urllib2

import urllib2
try:
    urllib2.urlopen('http://www.example.com/some_page')
except urllib2.HTTPError, e:
    print(e.code)
except urllib2.URLError, e:
    print(e.args)
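Note that the `except urllib2.HTTPError, e` syntax above is Python 2 only. A sketch of the Python 3 equivalent, where urllib2 was split into urllib.request and urllib.error (the timeout is my own addition):

```python
# Python 3 version of the urllib2 snippet above.
import urllib.request
import urllib.error

def page_exists(url):
    try:
        urllib.request.urlopen(url, timeout=5)
        return True
    except urllib.error.HTTPError as e:
        print(e.code)    # the server responded, but with an error status
        return False
    except urllib.error.URLError as e:
        print(e.reason)  # network-level failure, e.g. unknown host
        return False
```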

or you can use requests

import requests
request = requests.get('http://www.example.com')
if request.status_code == 200:
    print('Web site exists')
else:
    print('Web site does not exist') 

Answered by alecxe

It's better to check that the status code is < 400, as was done here. Here is what the status codes mean (taken from Wikipedia):

  • 1xx - informational
  • 2xx - success
  • 3xx - redirection
  • 4xx - client error
  • 5xx - server error
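The rule of thumb ("exists" means a status below 400) can be captured in a tiny helper; a sketch, where the function name is my own:

```python
# Encodes the status-code ranges listed above: 1xx-3xx count as reachable,
# 4xx (client error) and 5xx (server error) do not.
def looks_ok(status_code):
    return 100 <= status_code < 400

print(looks_ok(200))  # True: success
print(looks_ok(301))  # True: a redirect still counts as existing
print(looks_ok(404))  # False: client error, treated as missing
```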

If you want to check whether a page exists without downloading the whole page, you should use a HEAD request:

import httplib2
h = httplib2.Http()
resp = h.request("http://www.google.com", 'HEAD')
assert int(resp[0]['status']) < 400

taken from this answer.

If you want to download the whole page, just make a normal request and check the status code. Example using requests:

import requests

response = requests.get('http://google.com')
assert response.status_code < 400

Hope that helps.

Answered by keas

from urllib2 import Request, urlopen, HTTPError, URLError

user_agent = 'Mozilla/20.0.1 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
link = "http://www.abc.com/"
req = Request(link, headers=headers)
try:
    page_open = urlopen(req)
except HTTPError, e:
    print e.code
except URLError, e:
    print e.reason
else:
    print 'ok'

To answer the comment of unutbu:

Because the default handlers handle redirects (codes in the 300 range), and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range. Source

Answered by Raj

code:

import urllib  # Python 2; urllib.urlopen raises IOError on failure

a = "http://www.example.com"
try:
    print urllib.urlopen(a)
except IOError:
    print a + " site does not exist"

Answered by DiegoPacheco

from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def isok(mypath):
    try:
        thepage = urlopen(mypath)
    except HTTPError as e:
        return 0
    except URLError as e:
        return 0
    else:
        return 1

Answered by Vishal

Try this one:

import urllib2  # Python 2
website = 'https://www.allyourmusic.com'
try:
    response = urllib2.urlopen(website)
    if response.code == 200:
        print("site exists!")
    else:
        print("site doesn't exist!")
except urllib2.HTTPError, e:
    print(e.code)
except urllib2.URLError, e:
    print(e.args)

Answered by Maxfield

There is an excellent answer provided by @Adem Öztaş for use with httplib and urllib2. For requests, if the question is strictly about resource existence, the answer can be improved for the case of large resources.

The previous answer for requests suggested something like the following:

import requests

def uri_exists_get(uri: str) -> bool:
    try:
        response = requests.get(uri)
        try:
            response.raise_for_status()
            return True
        except requests.exceptions.HTTPError:
            return False
    except requests.exceptions.ConnectionError:
        return False

requests.get attempts to pull the entire resource at once, so for large media files, the above snippet would attempt to pull the entire file into memory. To solve this, we can stream the response.

def uri_exists_stream(uri: str) -> bool:
    try:
        with requests.get(uri, stream=True) as response:
            try:
                response.raise_for_status()
                return True
            except requests.exceptions.HTTPError:
                return False
    except requests.exceptions.ConnectionError:
        return False

I ran the above snippets with timers attached against two web resources:

1) http://bbb3d.renderfarming.net/download.html, a very light html page

2) http://distribution.bbb3d.renderfarming.net/video/mp4/bbb_sunflower_1080p_30fps_normal.mp4, a decently sized video file

Timing results below:

uri_exists_get("http://bbb3d.renderfarming.net/download.html")
# Completed in: 0:00:00.611239

uri_exists_stream("http://bbb3d.renderfarming.net/download.html")
# Completed in: 0:00:00.000007

uri_exists_get("http://distribution.bbb3d.renderfarming.net/video/mp4/bbb_sunflower_1080p_30fps_normal.mp4")
# Completed in: 0:01:12.813224

uri_exists_stream("http://distribution.bbb3d.renderfarming.net/video/mp4/bbb_sunflower_1080p_30fps_normal.mp4")
# Completed in: 0:00:00.000007

As a last note: this function also works in the case that the resource host doesn't exist. For example, "http://abcdefghblahblah.com/test.mp4" will return False.

Answered by rusty

You can simply use the stream option to avoid downloading the full file. urllib2 is not available in the latest Python 3, so it's best to use the proven requests library. This simple function will solve your problem.

import requests

def uri_exists(uri):
    r = requests.get(uri, stream=True)
    if r.status_code == 200:
        return True
    else:
        return False
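If the server supports it, a HEAD request avoids opening the body stream entirely; a sketch using requests.head (the function name, timeout, and redirect handling are my own choices, and a few servers reject HEAD with 405):

```python
# Alternative sketch: HEAD asks for headers only, so none of the body
# is transferred. allow_redirects=True mirrors what requests.get does
# by default (requests.head does not follow redirects unless asked).
import requests

def uri_exists_head(uri):
    try:
        response = requests.head(uri, allow_redirects=True, timeout=5)
        return response.status_code < 400
    except requests.exceptions.RequestException:
        # Covers connection errors, timeouts, and invalid URLs.
        return False
```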