python requests.get() returns improperly decoded text instead of UTF-8?
Original question: http://stackoverflow.com/questions/44203397/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by arunk2
When the server's content-type header is 'Content-Type:text/html', requests.get() returns improperly decoded data.
However, if the content type is given explicitly as 'Content-Type:text/html; charset=utf-8', it returns properly decoded data.
Also, when we use urllib.urlopen(), it returns properly encoded data.
Has anyone noticed this before? Why does requests.get() behave like this?
Answered by Dekel
From the Requests documentation:
When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text. You can find out what encoding Requests is using, and change it, using the r.encoding property.
>>> r.encoding
'utf-8'
>>> r.encoding = 'ISO-8859-1'
Check which encoding Requests used for your page, and if it's not the right one, force it to be the one you need.
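To make the effect of forcing the encoding concrete, here is a minimal stdlib-only sketch that needs no network access. It does not use Requests itself: the sample string and the helper name fix_mojibake are illustrative assumptions, and the code only mimics what happens when the same bytes are decoded with ISO-8859-1 versus UTF-8.

```python
# Bytes as a server would send them: UTF-8 encoded text (sample string is made up).
raw = "Žluťoučký kůň".encode("utf-8")

# What the text looks like when the bytes are decoded with the wrong codec.
wrong = raw.decode("ISO-8859-1")

# Decoding the same bytes with the right codec gives the intended text.
right = raw.decode("utf-8")


def fix_mojibake(text, wrong_codec="ISO-8859-1", right_codec="utf-8"):
    """Hypothetical helper: undo a wrong decode by reversing it and decoding again."""
    return text.encode(wrong_codec).decode(right_codec)


print(wrong != right)                # the wrong codec garbles the text
print(fix_mojibake(wrong) == right)  # reversing the wrong decode recovers it
```

The round trip works because ISO-8859-1 maps every byte value to exactly one character, so no information is lost by the wrong decode. With Requests, setting r.encoding before reading r.text should make this repair unnecessary.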
Regarding the differences between requests and urllib.urlopen: they probably use different ways to guess the encoding. That's all.
Answered by bubak
The educated guesses (mentioned above) are probably just a check of the Content-Type header as sent by the server (a quite misleading use of "educated", IMHO).
For the response header Content-Type: text/html the result is ISO-8859-1 (the default for HTML4), regardless of any content analysis (i.e. even though the default for HTML5 is UTF-8).
For the response header Content-Type: text/html; charset=utf-8 the result is UTF-8.
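The header-only behaviour described above can be sketched with the standard library. This is not Requests' actual code: the function name guess_encoding_from_content_type is a hypothetical stand-in, and the ISO-8859-1 fallback for text/* types is written to match the behaviour this answer describes.

```python
from email.message import Message


def guess_encoding_from_content_type(content_type):
    """Hypothetical sketch of a header-only encoding guess."""
    msg = Message()
    msg["Content-Type"] = content_type
    # An explicit charset parameter wins.
    charset = msg.get_param("charset")
    if charset:
        return charset
    # No charset given: fall back to the historical HTTP default for text/*.
    if msg.get_content_maintype() == "text":
        return "ISO-8859-1"
    # Nothing can be inferred from the header alone.
    return None


print(guess_encoding_from_content_type("text/html"))
print(guess_encoding_from_content_type("text/html; charset=utf-8"))
```

Note that the body is never inspected here, which is why a UTF-8 page served without a charset parameter ends up decoded as ISO-8859-1.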
Luckily for us, requests uses the chardet library, which usually works quite well (see the attribute requests.Response.apparent_encoding), so you usually want to do:
import requests

r = requests.get("https://martin.slouf.name/")
# override the header-based encoding with the real educated guess chardet makes from the body
r.encoding = r.apparent_encoding
# access the decoded text
r.text
Answered by 9000
The default assumed content encoding for text/html is ISO-8859-1, aka Latin-1 :( See RFC 2854. UTF-8 was too young to become the default: it was born in 1993, about the same time as HTML and HTTP.
Use .content to access the byte stream, or .text to access the decoded Unicode stream. If the HTTP server does not care about declaring the correct encoding, the value of .text may be off.
Answered by Harry_pb
After getting the response, take response.content instead of response.text. It holds the raw bytes exactly as the server sent them (UTF-8 in this case), untouched by the encoding that Requests guessed.
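import requests

# download_link, myUsername and myPassword are assumed to be defined elsewhere
response = requests.get(download_link, auth=(myUsername, myPassword), headers={'User-Agent': 'Mozilla'})
print(response.encoding)
# compare with ==, not "is": identity checks on integers are not reliable
if response.status_code == 200:
    body = response.content
else:
    print("Unable to get response with Code : %d " % (response.status_code))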
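Since .content is bytes, it still has to be decoded before it can be used as text. A minimal sketch of one way to do that follows, assuming the body is probably UTF-8. The helper name decode_body and the fallback codec are illustrative assumptions, and this simple heuristic is not what chardet does.

```python
def decode_body(raw, fallback="ISO-8859-1"):
    """Hypothetical helper: try strict UTF-8 first, then fall back to another codec."""
    try:
        # Strict decoding raises if the bytes are not valid UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 accepts any byte sequence, so this never raises.
        return raw.decode(fallback)


# Sample bytes standing in for response.content (made-up text).
utf8_body = "naïve café".encode("utf-8")
latin1_body = "naïve café".encode("ISO-8859-1")

print(decode_body(utf8_body) == decode_body(latin1_body))
```

Trying strict UTF-8 first is a reasonable default because non-UTF-8 text containing accented characters rarely happens to be valid UTF-8. The fallback may still produce the wrong characters if the body is in some third encoding.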