
Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow. Original URL: http://stackoverflow.com/questions/28610508/

Time: 2020-08-19 03:30:35  Source: igfitidea

How to get the raw HTML text of a given URL using Python

python html

Asked by aquaman

I'm using html2text in Python to get the raw text (tags included) of an HTML page from a given URL, but I'm getting an error.

My code:

import html2text
import urllib2

proxy = urllib2.ProxyHandler({'http': 'http://<proxy>:<pass>@<ip>:<port>'})
auth = urllib2.HTTPBasicAuthHandler()
opener = urllib2.build_opener(proxy, auth, urllib2.HTTPHandler)
urllib2.install_opener(opener)
html = urllib2.urlopen("http://www.ndtv.com/india-news/this-stunt-for-a-facebook-like-got-the-hyderabad-youth-arrested-740851").read()
print html2text.html2text(html)

The error:

Traceback (most recent call last):
  File "t.py", line 8, in <module>
    html = urllib2.urlopen("http://www.ndtv.com/india-news/this-stunt-for-a-facebook-like-got-the-hyderabad-youth-arrested-740851").read()
  File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 404, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 422, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1214, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1184, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error [Errno 110] Connection timed out>

Can anyone explain what I'm doing wrong?


Answered by no?????z???

If you don't require SSL, this script should work in Python 2.7.x:

import urllib
url = "http://stackoverflow.com"
f = urllib.urlopen(url)
print f.read()

In Python 3.x, use urllib.request instead of urllib.

This is because urllib2 is a Python 2 module; in Python 3 it was merged into the urllib package (as urllib.request and urllib.error).
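For Python 3, the same fetch might be sketched as below. This is only an illustration, not part of the original answer: the helper name fetch_html, the 10-second timeout, and the UTF-8 fallback are assumptions chosen for the example. It decodes the bytes returned by read() so that the result is a str with the tags left intact.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Return the body served at `url` as text, HTML tags included.

    `fetch_html`, the default timeout, and the UTF-8 fallback are
    illustrative choices, not something the original answer specifies.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()  # bytes in Python 3
        # Use the charset declared in the Content-Type header, if any.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")


# Example call (needs network access):
# print(fetch_html("http://stackoverflow.com"))
```

Passing a timeout makes an unreachable host fail after a bounded wait instead of hanging until the operating system gives up.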

The http:// prefix is required.
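To see why the scheme matters, here is a small Python 3 sketch that checks a URL without opening a connection. The helper name is_accepted_by_urllib is hypothetical. It relies on urllib.request.Request raising ValueError ("unknown url type") when the string has no scheme.

```python
import urllib.request


def is_accepted_by_urllib(url):
    """Return True if urllib.request can parse `url` (no network access).

    The function name is an illustrative assumption.
    """
    try:
        urllib.request.Request(url)  # parsing only; nothing is fetched
    except ValueError:  # e.g. "unknown url type: 'stackoverflow.com'"
        return False
    return True


print(is_accepted_by_urllib("http://stackoverflow.com"))  # True
print(is_accepted_by_urllib("stackoverflow.com"))         # False
```

This only catches a missing scheme; a string such as "localhost:8080" still parses, because the part before the colon is read as the scheme.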