Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): StackOverFlow
Original question: http://stackoverflow.com/questions/24153519/
How to read html from a url in python 3
Asked by user1067305
I looked at previous similar questions and got only more confused.
In python 3.4, I want to read an html page as a string, given the url.
In perl I do this with LWP::Simple, using get().
A matplotlib 1.3.1 example says: import urllib; u1=urllib.urlretrieve(url). python3 can't find urlretrieve.
I tried u1 = urllib.request.urlopen(url), which appears to get an HTTPResponse object, but I can't print it or get a length on it or index it.
u1.body doesn't exist. I can't find a description of the HTTPResponse in python3.
Is there an attribute in the HTTPResponse object which will give me the raw bytes of the html page?
(Irrelevant stuff from other questions includes urllib2, which doesn't exist in my python, csv parsers, etc.)
Edit:
I found something in a prior question which partially (mostly) does the job:
u2 = urllib.request.urlopen('http://finance.yahoo.com/q?s=aapl&ql=1')
for lines in u2.readlines():
    print(lines)
I say 'partially' because I don't want to read separate lines, but just one big string.
I could just concatenate the lines, but every line printed has a character 'b' prepended to it.
Where does that come from?
Again, I suppose I could delete the first character before concatenating, but that does get to be a kludge.
Answered by user1067305
urllib.request.urlopen(url).read() should return you the raw HTML page as a bytes object; call .decode() on it to get a string.
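As a minimal sketch of reading a page into one string: read() yields bytes in Python 3, so the sketch below decodes them. The helper name read_html and the UTF-8 fallback are assumptions made for illustration, not part of the original answer; the charset is taken from the response's Content-Type header when one is declared.

```python
import urllib.request


def read_html(url, fallback_encoding="utf-8"):
    """Return the body at `url` as one str (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()  # bytes, not str
        # Charset declared in the Content-Type header, or None if absent.
        charset = response.headers.get_content_charset()
    # errors="replace" avoids an exception on bytes that do not match the charset.
    return raw.decode(charset or fallback_encoding, errors="replace")
```

It could then be called as html = read_html('http://www.python.org'), giving a single str rather than a list of lines.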
Answered by agamike
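# Python 2 only: in Python 3, urllib.urlopen moved to urllib.request.urlopen
# and print is a function.
import urllib
some_url = 'https://docs.python.org/2/library/urllib.html'
filehandle = urllib.urlopen(some_url)
print filehandle.read()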
Answered by davidgh
Note that Python3 does not read the html code as a string but as a bytes object, so you need to convert it to a string with decode.
import urllib.request
fp = urllib.request.urlopen("http://www.python.org")
mybytes = fp.read()
mystr = mybytes.decode("utf8")
fp.close()
print(mystr)
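The 'b' that the question sees in front of every printed line is not part of the page: it is how Python 3 displays a bytes object. A small sketch with made-up sample bytes (the names below are illustrative only) shows that joining the pieces and decoding once gives a single string without it.

```python
# Sample data standing in for what readlines() returns: a list of bytes objects.
lines = [b"<html>\n", b"<body>caf\xc3\xa9</body>\n", b"</html>\n"]

# Printing bytes shows their repr, which starts with b' (it is not page content).
first_repr = repr(lines[0])

# Join the pieces as bytes, then decode once to get one str.
html = b"".join(lines).decode("utf-8")
print(first_repr)
print(html)
```

Decoding once after joining also avoids breaking a multi-byte character that happens to be split across two chunks.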
Answered by Aaron T.
Try the 'requests' module; it's much simpler.
# Install with: pip install requests
import requests
url = 'https://www.google.com/'
r = requests.get(url)
print(r.text)  # the page decoded to a str
More info here: http://docs.python-requests.org/en/master/
Answered by Ramandeep Singh
import requests
url = requests.get("http://yahoo.com")
htmltext = url.text
print(htmltext)
This works similarly to urllib.urlopen.
Answered by Discoveringmypath
Reading an html page with urllib is fairly simple to do. Since you want to read it as a single string, I will show you how.
Import urllib.request:
#!/usr/bin/python3.5
import urllib.request
Prepare our request:
request = urllib.request.Request('http://www.w3schools.com')
Always use a "try/except" when requesting a web page as things can easily go wrong. urlopen() requests the page.
try:
    response = urllib.request.urlopen(request)
except:
    print("something wrong")
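A bare except hides what went wrong and leaves response undefined for the code that follows. As a hedged sketch, urllib raises urllib.error.HTTPError for HTTP error statuses and urllib.error.URLError for connection-level failures; the helper name fetch_bytes and the 10-second timeout are assumptions made for illustration.

```python
import urllib.error
import urllib.request


def fetch_bytes(url, timeout=10):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        print("HTTP error:", err.code, err.reason)
    except urllib.error.URLError as err:
        # No usable answer: DNS failure, refused connection, unknown scheme, ...
        print("Could not reach the server:", err.reason)
    return None
```

HTTPError is a subclass of URLError, so it has to be caught first. The caller can then check for None before going on to decode the bytes.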
type() is a great function that will tell us what 'type' a variable is. Here, response is an http.client.HTTPResponse object.
print(type(response))
The read() method of our response object returns the html as bytes, which we store in our variable. Again, type() will verify this.
htmlBytes = response.read()
print(type(htmlBytes))
Now we use the decode function for our bytes variable to get a single string.
htmlStr = htmlBytes.decode("utf8")
print(type(htmlStr))
If you do want to split up this string into separate lines, you can do so with the split() function. In this form we can easily iterate through to print out the entire page or do any other processing.
htmlSplit = htmlStr.split('\n')
print(type(htmlSplit))
for line in htmlSplit:
    print(line)
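If the page uses Windows-style \r\n line endings, split('\n') leaves a trailing \r on each line. str.splitlines() is a standard alternative that handles both kinds of line ending; the sample string below is made up for illustration.

```python
# Made-up sample text with mixed line endings, standing in for htmlStr.
html_str = "<html>\r\n<body>hi</body>\n</html>\n"

# split('\n') keeps the '\r' and yields a trailing empty string.
by_split = html_str.split("\n")

# splitlines() strips '\r\n' and '\n' alike and adds no trailing empty item.
by_splitlines = html_str.splitlines()

for line in by_splitlines:
    print(line)
```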
Hopefully this provides a more detailed answer. The Python documentation and tutorials are great; I would use them as a reference because they will answer most questions you might have.