How to read HTML from a URL in Python 3

Question

Asked by user1067305

I looked at previous similar questions and got only more confused.

In Python 3.4, I want to read an HTML page as a string, given the URL.

In Perl I do this with LWP::Simple, using get().

A matplotlib 1.3.1 example says: import urllib; u1=urllib.urlretrieve(url). Python 3 can't find urlretrieve.

I tried u1 = urllib.request.urlopen(url), which appears to get an HTTPResponse object, but I can't print it, get a length on it, or index it.

u1.body doesn't exist. I can't find a description of HTTPResponse in Python 3.

Is there an attribute in the HTTPResponse object which will give me the raw bytes of the HTML page?

(Irrelevant material from other questions includes urllib2, which doesn't exist in my Python, csv parsers, etc.)

Edit:

I found something in a prior question which partially (mostly) does the job:

u2 = urllib.request.urlopen('http://finance.yahoo.com/q?s=aapl&ql=1')

for lines in u2.readlines():
    print (lines)

I say 'partially' because I don't want to read separate lines, but just one big string.

I could just concatenate the lines, but every line printed has a character 'b' prepended to it.

Where does that come from?

Again, I suppose I could delete the first character before concatenating, but that does get to be a kludge.

Answer 1

Answered by user1067305

urllib.request.urlopen(url).read() should return the raw HTML page as bytes; call .decode() on the result to get a string.
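The b prefix mentioned in the question is not a character in the data: it is how Python 3 prints a bytes object. A small sketch of this, using made-up sample lines instead of a real download (the list below is an assumption for illustration only), also shows that the pieces can be joined and decoded without stripping anything:

```python
# Hypothetical sample of what readlines() on a response might return:
# a list of bytes objects, one per line. Not fetched from a real page.
lines = [b"<html>\n", b"<body>caf\xc3\xa9</body>\n", b"</html>\n"]

# Printing a bytes object shows its repr, which starts with b'
print(lines[0])

# Join with a bytes separator, then decode once to get a single str.
raw = b"".join(lines)
html = raw.decode("utf-8")
print(html)
```

Calling read() instead of readlines() gives the same bytes in one piece, so the join step is only needed if the lines were read separately.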

Answer 2

Answered by agamike

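# Python 3: urlopen lives in urllib.request, and print is a function
import urllib.request
some_url = 'https://docs.python.org/2/library/urllib.html'
filehandle = urllib.request.urlopen(some_url)
print(filehandle.read())  # prints the page as bytes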

Answer 3

Answered by davidgh

Note that Python 3 does not read the HTML as a string but as a bytes object, so you need to convert it to a string with decode.

import urllib.request

fp = urllib.request.urlopen("http://www.python.org")
mybytes = fp.read()

mystr = mybytes.decode("utf8")
fp.close()

print(mystr)
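Not every page is UTF-8, so hard-coding "utf8" can fail or garble text. One possible refinement, sketched here under assumptions, is to take the charset from the response's Content-Type header and fall back to UTF-8 only when none is declared. The function name read_html and the fallback_encoding parameter are illustrative names, not part of urllib:

```python
import urllib.request


def read_html(url, fallback_encoding="utf-8"):
    """Fetch a URL and return its body as a str (illustrative helper)."""
    # The with-block closes the response even if read() raises.
    with urllib.request.urlopen(url) as response:
        raw = response.read()
        # headers is an email.message.Message; get_content_charset()
        # returns the charset from Content-Type, or None if absent.
        charset = response.headers.get_content_charset() or fallback_encoding
    # errors="replace" avoids an exception on bytes that do not decode.
    return raw.decode(charset, errors="replace")


# Example call (needs network access):
# html = read_html("http://www.python.org")
```

The header is only as reliable as the server that sends it; pages that declare their encoding solely in a meta tag would still need extra handling.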

Answer 4

Answered by Aaron T.

Try the 'requests' module; it's much simpler.

# Install with: pip install requests

import requests

url = 'https://www.google.com/'
r = requests.get(url)
print(r.text)  # r.text is the response body decoded to a str

More info here: http://docs.python-requests.org/en/master/

Answer 5

Answered by Ramandeep Singh

import requests

url = requests.get("http://yahoo.com")
htmltext = url.text
print(htmltext)

This works similarly to urllib.request.urlopen followed by read() and decode(): the .text attribute is already a decoded string.

Answer 6

Answered by Discoveringmypath

Reading an HTML page with urllib is fairly simple. Since you want to read it as a single string, I will show you how.

Import urllib.request:

#!/usr/bin/python3.5

import urllib.request

Prepare our request:

request = urllib.request.Request('http://www.w3schools.com')

Always use a "try/except" when requesting a web page, as things can easily go wrong. urlopen() requests the page.

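try:
    response = urllib.request.urlopen(request)
except urllib.error.URLError as e:
    # urllib.error is loaded as a side effect of importing urllib.request
    print("something went wrong:", e)
    raise  # without a response, the steps below cannot run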
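It can also help to tell apart the ways a request may fail. A minimal sketch, assuming a helper named fetch_bytes and a 10-second timeout (both chosen here for illustration), catches urllib.error.HTTPError for error status codes and urllib.error.URLError for connection-level problems:

```python
import urllib.error
import urllib.request


def fetch_bytes(url, timeout=10):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # The server replied, but with an error status such as 404 or 500.
        print("HTTP error:", err.code, err.reason)
    except urllib.error.URLError as err:
        # No usable reply: bad host name, refused connection, missing file...
        print("Could not open the URL:", err.reason)
    return None


# Example call (needs network access):
# data = fetch_bytes("http://www.w3schools.com")
```

HTTPError is a subclass of URLError, so it has to be caught first; otherwise the more general handler would take every case.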

type() is a handy built-in that tells us what type a value is. Here, response is an http.client.HTTPResponse object.

print(type(response))

The read() method of our response object returns the HTML as bytes, which we store in a variable. Again, type() will verify this.

htmlBytes = response.read()

print(type(htmlBytes))

Now we use the decode function for our bytes variable to get a single string.

htmlStr = htmlBytes.decode("utf8")

print(type(htmlStr))

If you do want to split up this string into separate lines, you can do so with the split() function. In this form we can easily iterate through to print out the entire page or do any other processing.

htmlSplit = htmlStr.split('\n')

print(type(htmlSplit))

for line in htmlSplit:
    print(line)
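One caveat with split('\n'): pages served with Windows-style line endings keep a trailing '\r' on each piece, and a final newline produces an empty last element. A short sketch with a made-up sample string (not real page content) compares it with str.splitlines(), which handles both:

```python
# Made-up sample text with Windows-style line endings.
sample = "<html>\r\n<body>\r\n</html>\r\n"

with_split = sample.split("\n")
with_splitlines = sample.splitlines()

print(with_split)       # pieces keep '\r', plus an empty final element
print(with_splitlines)  # clean lines, no empty final element
```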

Hopefully this provides a slightly more detailed answer. The Python documentation and tutorials are great; I would use them as a reference because they will answer most questions you might have.
