Python requests.get() returns improperly decoded text instead of UTF-8?

Question

Asked by arunk2

When the server's content type is 'Content-Type: text/html', requests.get() returns improperly encoded data.

However, if the content type is given explicitly as 'Content-Type: text/html; charset=utf-8', it returns properly encoded data.

Also, when we use urllib.urlopen(), it returns properly encoded data.

Has anyone noticed this before? Why does requests.get() behave like this?

Answer 1

Answered by Dekel

From the requests documentation:

When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text. You can find out what encoding Requests is using, and change it, using the r.encoding property.

>>> r.encoding
'utf-8'
>>> r.encoding = 'ISO-8859-1'

Check which encoding requests used for your page, and if it's not the right one, try forcing it to the one you need.

Regarding the differences between requests and urllib.urlopen: they probably use different ways to guess the encoding. That's all.

Answer 2

Answered by bubak

The educated guesses (mentioned above) are probably just a check of the Content-Type header as sent by the server (a quite misleading use of "educated", imho).

For the response header Content-Type: text/html the result is ISO-8859-1 (the default for HTML4), regardless of any content analysis (note that the default for HTML5 is UTF-8).

For the response header Content-Type: text/html; charset=utf-8 the result is UTF-8.
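
The two cases above can be reproduced with a small standard-library sketch. This is only an approximation of the header-only guess described in this answer, not the actual code inside requests; the function name guess_encoding_from_content_type is made up for illustration.

```python
from email.message import Message


def guess_encoding_from_content_type(content_type):
    """Approximate a header-only encoding guess (hypothetical helper, not requests' code)."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # None when the header has no charset parameter
    if charset:
        return charset
    if msg.get_content_maintype() == "text":
        # Historical default for text/* when the server declares no charset
        return "ISO-8859-1"
    return None


print(guess_encoding_from_content_type("text/html"))
print(guess_encoding_from_content_type("text/html; charset=utf-8"))
```

No part of the body is inspected here, which is why a UTF-8 page served without a charset ends up decoded as ISO-8859-1.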

Luckily for us, requests uses the chardet library (newer releases may use charset_normalizer instead), and that usually works quite well (see the attribute requests.Response.apparent_encoding), so you usually want to do:

r = requests.get("https://martin.slouf.name/")
# override encoding by real educated guess as provided by chardet
r.encoding = r.apparent_encoding
# access the data
r.text
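
If you would rather keep a charset that the server did declare, one possible refinement is to override the encoding only when the header carries no charset. The sketch below is an assumption about how you might structure that, not something requests does for you; prefer_detected_encoding is a hypothetical helper, and the SimpleNamespace object is a stand-in so the example runs without a network call. It relies only on the headers, encoding and apparent_encoding attributes that a requests response also has.

```python
from types import SimpleNamespace


def prefer_detected_encoding(response):
    """Use the detected encoding only if the server declared no charset (hypothetical helper)."""
    content_type = response.headers.get("Content-Type", "")
    if "charset=" not in content_type.lower():
        response.encoding = response.apparent_encoding
    return response


# Stand-in for a response whose server sent a bare "text/html" header
fake = SimpleNamespace(
    headers={"Content-Type": "text/html"},
    encoding="ISO-8859-1",      # the header-based guess
    apparent_encoding="utf-8",  # what content detection would report
)
prefer_detected_encoding(fake)
print(fake.encoding)
```

With a real response you would call prefer_detected_encoding(r) before reading r.text.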

Answer 3

Answered by 9000

The default assumed content encoding for text/html is ISO-8859-1, aka Latin-1 :( See RFC 2854. UTF-8 was too young to become the default: it was born in 1993, about the same time as HTML and HTTP.

Use .content to access the byte stream, or .text to access the decoded Unicode stream. If the HTTP server does not care about the correct encoding, the value of .text may be off.
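
To see what "off" looks like, here is a small sketch that needs no network access. The variable raw plays the role of .content, and the two decodes play the role of .text under a wrong and a right encoding guess; the sample string is just an illustration.

```python
original = "café"

raw = original.encode("utf-8")          # bytes a UTF-8 server would send (like .content)
as_latin1 = raw.decode("iso-8859-1")    # text under a wrong Latin-1 guess
as_utf8 = raw.decode("utf-8")           # text under the correct guess

print(repr(as_latin1))  # 'cafÃ©'
print(repr(as_utf8))    # 'café'

# A wrong Latin-1 decode is reversible, because Latin-1 maps every byte to a character
repaired = as_latin1.encode("iso-8859-1").decode("utf-8")
print(repaired == original)  # True
```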

Answer 4

Answered by Harry_pb

After getting the response, take response.content instead of response.text. It holds the raw bytes of the body, so it is still UTF-8 encoded if that is what the server sent.

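import requests

# download_link, myUsername and myPassword must be defined by the caller
response = requests.get(download_link, auth=(myUsername, myPassword), headers={'User-Agent': 'Mozilla'})
print(response.encoding)  # encoding guessed from the HTTP headers
if response.status_code == 200:  # compare integers with ==, not "is"
    body = response.content  # raw bytes, not decoded text
else:
    print("Unable to get response with Code : %d" % response.status_code)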
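
Since response.content is bytes, you still have to decode it to get text. A minimal sketch, assuming the body is most likely UTF-8 and that Latin-1 is an acceptable fallback; decode_body and its fallback parameter are hypothetical names, not part of requests.

```python
def decode_body(raw, fallback="iso-8859-1"):
    """Decode body bytes as UTF-8, falling back to another encoding (hypothetical helper)."""
    try:
        return raw.decode("utf-8")  # strict: raises on bytes that are not valid UTF-8
    except UnicodeDecodeError:
        # Replace undecodable bytes instead of raising a second time
        return raw.decode(fallback, errors="replace")


print(decode_body("žluťoučký".encode("utf-8")))
print(decode_body("é".encode("iso-8859-1")))
```

With the snippet above you would call decode_body(body) after the status check.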
