Python http download page source
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/3949744/
Asked by DonJuma
Hello there, I was wondering if it is possible to connect to an HTTP host (for example, google.com) and download the source of the webpage?
Thanks in advance.
Accepted answer by pyfunc
Use urllib2 to download the page.
Google will block this request because it tries to block all robots, so add a User-Agent header to the request.
import urllib2

user_agent = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3'
headers = {'User-Agent': user_agent}
req = urllib2.Request('http://www.google.com', None, headers)  # data=None means a GET request
response = urllib2.urlopen(req)
page = response.read()
response.close()  # it's always safe to close an open connection
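Note that urllib2 exists only on Python 2; on Python 3 it was merged into urllib.request. A minimal Python 3 sketch of the same request (reusing the URL and user-agent string from above):

import urllib.request

user_agent = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3'
req = urllib.request.Request('http://www.google.com', headers={'User-Agent': user_agent})
with urllib.request.urlopen(req) as response:  # the with-block closes the connection for us
    page = response.read()  # bytes; call .decode('utf-8') if you need str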
You can also use pycurl:
import pycurl

class ContentCallback:
    def __init__(self):
        self.contents = ''

    def content_callback(self, buf):
        # accumulate each chunk that libcurl hands us
        self.contents = self.contents + buf

t = ContentCallback()
curlObj = pycurl.Curl()
curlObj.setopt(curlObj.URL, 'http://www.google.com')
curlObj.setopt(curlObj.WRITEFUNCTION, t.content_callback)
curlObj.perform()
curlObj.close()
print t.contents
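On Python 3 the write callback receives bytes, so the string concatenation above would fail; a simpler variant (my sketch, not part of the original answer) lets libcurl write straight into an io.BytesIO buffer via the WRITEDATA option:

import io
import pycurl

buf = io.BytesIO()
c = pycurl.Curl()
c.setopt(c.URL, 'http://www.google.com')
c.setopt(c.WRITEDATA, buf)  # libcurl writes the response body into buf
c.perform()
c.close()
body = buf.getvalue()  # the raw bytes of the page source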
Answered by tisaconundrum
Here's another approach to this problem, using mechanize. I found it can bypass a website's robot-checking system. I commented out set_all_readonly because for some reason it wasn't recognized as an attribute in mechanize.
import mechanize
url = 'http://www.example.com'
br = mechanize.Browser()
#br.set_all_readonly(False) # allow everything to be written to
br.set_handle_robots(False) # ignore robots
br.set_handle_refresh(False) # can sometimes hang without this
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')] # [('User-agent', 'Firefox')]
response = br.open(url)
print response.read() # the text of the page
response1 = br.response() # get the response again
print response1.read() # can apply lxml.html.fromstring()
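As the last comment hints, the downloaded HTML can be handed to lxml for parsing. A minimal sketch continuing the mechanize example above (assuming lxml is installed; the title lookup is just an illustration):

import lxml.html

html = br.response().read()  # re-read the response body from the mechanize browser
doc = lxml.html.fromstring(html)  # parse the page source into an element tree
print doc.findtext('.//title')  # e.g. extract the page title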

