python requests.get() returns improperly decoded text instead of UTF-8?

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/44203397/

Date: 2020-08-19 23:49:24  Source: igfitidea

Tags: python, utf-8, python-requests

Asked by arunk2

When the server's content type is 'Content-Type:text/html', requests.get() returns improperly decoded data.

However, if we have the content type explicitly as 'Content-Type:text/html; charset=utf-8', it returns properly encoded data.

Also, when we use urllib.urlopen(), it returns properly encoded data.

Has anyone noticed this before? Why does requests.get() behave like this?

Answered by Dekel

From the requests documentation:

When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text. You can find out what encoding Requests is using, and change it, using the r.encoding property.

>>> r.encoding
'utf-8'
>>> r.encoding = 'ISO-8859-1'

Check which encoding requests used for your page, and if it is not the right one, force it to the one you need.
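
To see why forcing the encoding matters, here is a minimal offline sketch of what happens to the same bytes under two different encodings. It uses plain bytes.decode() as a stand-in for what r.text does with r.encoding; the sample string is only an assumption for illustration.

```python
# Raw body as a server might send it: UTF-8 bytes, no charset declared.
raw_body = "café".encode("utf-8")

# Decoded with the ISO-8859-1 fallback, each UTF-8 byte becomes its own character.
wrong_text = raw_body.decode("iso-8859-1")   # 'cafÃ©'

# Decoded with the encoding the bytes were actually written in.
right_text = raw_body.decode("utf-8")        # 'café'
```

ISO-8859-1 maps every byte to a character, so the wrong decoding never raises an error; it just yields garbled text.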

Regarding the differences between requests and urllib.urlopen: they probably use different ways to guess the encoding. That's all.
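
One concrete difference, at least in Python 3's urllib.request, is that urlopen() does not decode anything: it hands back bytes and leaves the choice of encoding to the caller. The sketch below assumes Python 3 and uses a data: URL as a stand-in for a real server so that it runs offline.

```python
from urllib.request import urlopen

# Stand-in for a real page: the UTF-8 bytes of "café", declared as charset=utf-8.
url = "data:text/html;charset=utf-8,caf%C3%A9"

with urlopen(url) as resp:
    raw = resp.read()                              # bytes; urllib does no decoding
    declared = resp.headers.get_content_charset()  # None if the header has no charset

# The caller decides how to decode; here UTF-8 is the assumed fallback.
text = raw.decode(declared or "utf-8")
```

Because the bytes are passed through untouched, printing them on a UTF-8 terminal can look correct even when the server declares no charset.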

Answered by bubak

The "educated guesses" mentioned above are probably just a check of the Content-Type header as sent by the server (a quite misleading use of "educated", imho).

For the response header Content-Type: text/html the result is ISO-8859-1 (the default for HTML4), regardless of any content analysis (i.e. even though the default for HTML5 is UTF-8).

For the response header Content-Type: text/html; charset=utf-8 the result is UTF-8.
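
The header-only rule described above can be approximated in a few lines. This is a sketch of the behavior as described in this answer, not the actual code inside requests; the function name guess_encoding_from_header is hypothetical, and the parsing relies on the standard library's email.message.Message.

```python
from email.message import Message

def guess_encoding_from_header(content_type):
    """Return an encoding guessed from a Content-Type header value alone."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()
    if charset:
        return charset                    # the server declared a charset
    if msg.get_content_maintype() == "text":
        return "ISO-8859-1"               # historical default for text/* types
    return None                           # nothing to go on

no_charset = guess_encoding_from_header("text/html")
with_charset = guess_encoding_from_header("text/html; charset=utf-8")
```

Note that the body is never inspected here, which is why a UTF-8 page served without a charset ends up decoded as ISO-8859-1.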

Luckily for us, requests uses the chardet library (newer releases may use charset_normalizer instead), which usually works quite well and is exposed as the attribute requests.Response.apparent_encoding, so you usually want to do:

import requests

r = requests.get("https://martin.slouf.name/")
# override the header-based encoding with the guess chardet makes from the body
r.encoding = r.apparent_encoding
# access the decoded text
print(r.text)
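
For a feel of what guessing from the body means, here is a much simpler stand-in than chardet: try strict UTF-8 first and fall back to ISO-8859-1 only if that fails. The helper name decode_body is hypothetical, and this two-step rule is an assumption for illustration, not what apparent_encoding does internally.

```python
def decode_body(raw, fallback="iso-8859-1"):
    """Decode bytes as UTF-8 if they are valid UTF-8, else use the fallback."""
    try:
        return raw.decode("utf-8")        # strict: raises on invalid sequences
    except UnicodeDecodeError:
        return raw.decode(fallback)

from_utf8 = decode_body("café".encode("utf-8"))
from_latin1 = decode_body("café".encode("iso-8859-1"))
```

This works reasonably often because non-ASCII text in legacy single-byte encodings is rarely valid UTF-8, but a real detector handles many more encodings.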

Answered by 9000

The default assumed content encoding for text/html is ISO-8859-1, aka Latin-1 :( See RFC 2854. UTF-8 was too young to become the default: it was born in 1993, about the same time as HTML and HTTP.

Use .content to access the byte stream, or .text to access the decoded Unicode stream. If the HTTP server does not declare the correct encoding, the value of .text may be wrong.

Answered by Harry_pb

After getting the response, take response.content instead of response.text. It holds the raw bytes exactly as the server sent them, so if the page is UTF-8, those bytes are UTF-8 and you can decode them yourself.

import requests

# download_link, myUsername and myPassword are placeholders to fill in
response = requests.get(download_link, auth=(myUsername, myPassword), headers={'User-Agent': 'Mozilla'})
print(response.encoding)
if response.status_code == 200:
    body = response.content  # raw bytes, not yet decoded
else:
    print("Unable to get response with Code : %d" % response.status_code)