How Python handles the response encoding from urllib.request.urlopen()

Question

Asked by kryptobs2000

I'm trying to search a webpage using regular expressions, but I'm getting the following error:

TypeError: can't use a string pattern on a bytes-like object

I understand why: urllib.request.urlopen() returns a byte stream, so (at least I'm guessing) re doesn't know which encoding to use. What am I supposed to do in this situation? Is there a way to specify the encoding in the request, or will I need to decode the bytes myself? If so, I assume I should read the encoding from the header info, or from the encoding declared in the HTML if there is one, and then decode with that?

Answer 1

Accepted answer by Senthil Kumaran

You just need to decode the response using the charset given in the Content-Type header, which is typically the header's last value. There is an example in the tutorial too.

output = response.decode('utf-8')
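To tie this back to the error in the question, here is a minimal sketch of why the TypeError appears and two ways around it. The sample bytes and the pattern are made up for illustration; in practice `raw` would come from `urllib.request.urlopen(...).read()`, and UTF-8 is assumed here rather than read from the header.

```python
import re

# Hypothetical response body; in practice: raw = urllib.request.urlopen(url).read()
raw = b"<html><title>Example page</title></html>"

pattern = r"<title>(.*?)</title>"

# A str pattern cannot be applied to a bytes object.
try:
    re.search(pattern, raw)
    str_on_bytes_failed = False
except TypeError:
    str_on_bytes_failed = True

# Option 1: decode the bytes first (UTF-8 is an assumption here).
text = raw.decode("utf-8")
title_from_text = re.search(pattern, text).group(1)

# Option 2: keep the bytes and search with a bytes pattern instead.
title_from_bytes = re.search(pattern.encode("ascii"), raw).group(1)
```

Decoding first is usually more convenient, since the matches come back as str; the bytes pattern can be handy when the encoding is unknown and the pattern is plain ASCII.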

Answer 2

Answered by Jesse Cohen

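After you make a request with req = urllib.request.urlopen(...), you have to read the response by calling html_string = req.read(). That gives you the response body, which you can then parse the way you want. Note that in Python 3 read() returns bytes, so the result still has to be decoded before a string pattern will match.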

Answer 3

Answered by wynemo

# Python 2:
urllib.urlopen(url).headers.getheader('Content-Type')
# Python 3 equivalent:
urllib.request.urlopen(url).headers.get('Content-Type')

Will output something like this:

text/html; charset=utf-8
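A header value like the one above still has to be picked apart to get the charset. One way, sketched below, is to let the standard library's `email.message.Message` do the parsing. The helper name `charset_from_content_type` is hypothetical and only here for illustration.

```python
from email.message import Message

def charset_from_content_type(content_type, default=None):
    """Return the charset parameter of a Content-Type value, or default."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lowercases the name and returns the
    # fallback when the header has no charset parameter.
    return msg.get_content_charset(default)

charset = charset_from_content_type("text/html; charset=utf-8")
```

In Python 3 the `headers` attribute of the object returned by `urlopen()` is already a `Message` subclass, so `response.headers.get_content_charset()` gives the same result without a helper.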

Answer 4

Answered by Ivan Klass

For me, the solution was the following (Python 3):

resource = urllib.request.urlopen(an_url)
# Note: get_content_charset() returns None if the header declares no charset
content = resource.read().decode(resource.headers.get_content_charset())

Answer 5

Answered by pytohs

I had the same issue for the last two days and finally have a solution. I'm using the info() method of the object returned by urlopen():

req = urllib.request.urlopen(URL)
charset = req.info().get_content_charset()
content = req.read().decode(charset)
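One caveat with this approach is that `get_content_charset()` returns None when the server declares no charset, and `decode(None)` then raises a TypeError. A minimal sketch of a more defensive variant follows. The name `fetch_text`, the UTF-8 fallback, and `errors="replace"` are assumptions made for illustration, not part of the answers above.

```python
import urllib.request

def fetch_text(url, default_charset="utf-8"):
    """Fetch url and decode the body with the declared charset, if any."""
    with urllib.request.urlopen(url) as response:
        # Fall back to default_charset (an assumption) when none is declared.
        charset = response.headers.get_content_charset() or default_charset
        # errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
        return response.read().decode(charset, errors="replace")
```

Usage would look like `html = fetch_text("http://stackoverflow.com")`. Whether silently replacing bad bytes is acceptable depends on the use case; dropping the `errors` argument restores the strict behaviour.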

Answer 6

Answered by xged

With requests:

import requests

response = requests.get(URL).text

Answer 7

Answered by Asher

Here is an example of a simple HTTP request (which I tested, and it works):

import urllib.request

address = "http://stackoverflow.com"
urllib.request.urlopen(address).read().decode('utf-8')

Make sure to read the documentation.

https://docs.python.org/3/library/urllib.request.html

If you want to make a more detailed GET/POST request:

import urllib.request

# Send an HTTP request to the given address and return the body as text
def REQUEST(address):
    req = urllib.request.Request(address)
    req.add_header('User-Agent', 'NAME (Linux/MacOS; FROM, USA)')
    response = urllib.request.urlopen(req)
    html = response.read().decode('utf-8')  # decode bytes to str (assumes the page is UTF-8)
    print("REQUEST (ONLINE): " + address)
    return html
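The question also raises the case where the encoding is declared in the HTML rather than in the header. Below is a rough sketch of sniffing a `<meta>` charset declaration from the raw bytes before decoding. The helper name `sniff_meta_charset`, the 1024-byte window, and the regular expression are all assumptions for illustration; this is not a full implementation of how browsers detect encodings.

```python
import re

# Matches <meta charset="..."> as well as the charset=... part of the
# older <meta http-equiv="Content-Type" content="..."> form.
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([\w.:-]+)""", re.IGNORECASE
)

def sniff_meta_charset(raw, default="utf-8"):
    """Return the charset declared in the first 1024 bytes, or default."""
    match = META_CHARSET.search(raw[:1024])
    return match.group(1).decode("ascii").lower() if match else default
```

A possible use is `text = raw.decode(sniff_meta_charset(raw))`, applied only when the Content-Type header names no charset.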
