Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/14592762/
A good way to get the charset/encoding of an HTTP response in Python
Asked by Clay Wardell
Looking for an easy way to get the charset/encoding information of an HTTP response using Python urllib2, or any other Python library.
>>> import urllib2
>>> url = 'http://some.url.value'
>>> request = urllib2.Request(url)
>>> conn = urllib2.urlopen(request)
>>> response_encoding = ?
I know that it is sometimes present in the 'Content-Type' header, but that header has other information, and it's embedded in a string that I would need to parse. For example, the Content-Type header returned by Google is
>>> conn.headers.getheader('content-type')
'text/html; charset=utf-8'
I could work with that, but I'm not sure how consistent the format will be. I'm pretty sure it's possible for charset to be missing entirely, so I'd have to handle that edge case. Some kind of string split operation to get the 'utf-8' out of it seems like it has to be the wrong way to do this kind of thing.
>>> content_type_header = conn.headers.getheader('content-type')
>>> if '=' in content_type_header:
...     charset = content_type_header.split('=')[1]
That's the kind of code that feels like it's doing too much work. I'm also not sure if it will work in every case. Does anyone have a better way to do this?
Accepted answer by jfs
To parse the HTTP header you could use cgi.parse_header():
import cgi

_, params = cgi.parse_header('text/html; charset=utf-8')
print params['charset'] # -> utf-8
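Note that the cgi module was deprecated by PEP 594 and removed in Python 3.13. On modern Python, the stdlib email.message.Message parses the same header syntax; a minimal sketch:

```python
# cgi.parse_header() is gone in Python 3.13+; email.message.Message
# handles the same 'type/subtype; key=value' header syntax.
from email.message import Message

msg = Message()
msg['content-type'] = 'text/html; charset=utf-8'
print(msg.get_content_type())    # -> text/html
print(msg.get_param('charset'))  # -> utf-8
```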
Or using the response object:
import urllib2

response = urllib2.urlopen('http://example.com')
response_encoding = response.headers.getparam('charset')
# or in Python 3: response.headers.get_content_charset(default)
In general the server may lie about the encoding or not report it at all (the default depends on the content type), or the encoding might be specified inside the response body, e.g., in a <meta> element in HTML documents or in the XML declaration for XML documents. As a last resort the encoding could be guessed from the content itself.
You could use requests to get Unicode text:
import requests # pip install requests
r = requests.get(url)
unicode_str = r.text # may use `chardet` to auto-detect encoding
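requests also exposes the helper it uses internally to pull the charset out of the headers, so the parsing step can be exercised without any network round-trip (a small sketch, assuming a recent requests version):

```python
# get_encoding_from_headers() is the function requests itself uses to
# derive r.encoding from the response headers.
from requests.utils import get_encoding_from_headers

print(get_encoding_from_headers({'content-type': 'text/html; charset=utf-8'}))
# -> utf-8
print(get_encoding_from_headers({'content-type': 'text/html'}))
# -> ISO-8859-1 (requests' historical fallback for text/* with no charset)
```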
Or BeautifulSoup to parse HTML (and convert to Unicode as a side effect):
import urllib2
from bs4 import BeautifulSoup # pip install beautifulsoup4

soup = BeautifulSoup(urllib2.urlopen(url)) # may use `cchardet` for speed
# ...
Or use bs4.UnicodeDammit directly for arbitrary content (not necessarily HTML):
from bs4 import UnicodeDammit
dammit = UnicodeDammit(b"Sacr\xc3\xa9 bleu!")
print(dammit.unicode_markup)
# -> Sacré bleu!
print(dammit.original_encoding)
# -> utf-8
Answered by dnozay
Answered by Cees Timmerman
Charsets can be specified in many ways, but it is often done in the headers.
>>> from urllib.request import urlopen  # Python 3
>>> urlopen('http://www.python.org/').info().get_content_charset()
'utf-8'
>>> urlopen('http://www.google.com/').info().get_content_charset()
'iso-8859-1'
>>> urlopen('http://www.python.com/').info().get_content_charset()
>>>
That last one didn't specify a charset anywhere, so get_content_charset() returned None.
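The object returned by urlopen(...).info() is an email.message.Message subclass, so get_content_charset() can also be exercised offline; a small sketch:

```python
# Build a Message by hand to show exactly what get_content_charset() does,
# without hitting the network.
from email.message import Message

msg = Message()
msg['Content-Type'] = 'text/html; charset=ISO-8859-1'
print(msg.get_content_charset())  # -> iso-8859-1 (normalized to lower case)

empty = Message()
print(empty.get_content_charset())  # -> None when no charset is declared
```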
Answered by Brian Peterson
If you happen to be familiar with the Flask/Werkzeug web development stack, you will be happy to know the Werkzeug library has an answer for exactly this kind of HTTP header parsing, and it accounts for the case where the content type is not specified at all, just as you wanted.
>>> from werkzeug.http import parse_options_header
>>> import requests
>>> url = 'http://some.url.value'
>>> resp = requests.get(url)
>>> if resp.status_code == requests.codes.ok:
...     content_type_header = resp.headers.get('content-type')
...     print content_type_header
text/html; charset=utf-8
>>> parse_options_header(content_type_header)
('text/html', {'charset': 'utf-8'})
So then you can do:
>>> parse_options_header(content_type_header)[1].get('charset')
'utf-8'
Note that if charset is not supplied, this will instead produce:
>>> parse_options_header('text/html')
('text/html', {})
It even works if you don't supply anything but an empty string or dict:
>>> parse_options_header({})
('', {})
>>> parse_options_header('')
('', {})
Thus it seems to be EXACTLY what you were looking for! If you look at the source code, you will see they had your purpose in mind: https://github.com/mitsuhiko/werkzeug/blob/master/werkzeug/http.py#L320-329
def parse_options_header(value):
"""Parse a ``Content-Type`` like header into a tuple with the content
type and the options:
>>> parse_options_header('text/html; charset=utf8')
('text/html', {'charset': 'utf8'})
This should not be used to parse ``Cache-Control`` like headers that use
a slightly different format. For these headers use the
:func:`parse_dict_header` function.
...
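If pulling in Werkzeug just for this feels heavy, a minimal fallback in the same spirit might look like the following (a hypothetical, simplified helper: it ignores quoted-string escapes, which the real parse_options_header handles):

```python
def parse_content_type(value):
    """Split 'text/html; charset=utf-8' into ('text/html', {'charset': 'utf-8'})."""
    if not value:
        return '', {}
    parts = [p.strip() for p in value.split(';')]
    mimetype, params = parts[0], {}
    for p in parts[1:]:
        if '=' in p:
            key, _, val = p.partition('=')
            params[key.strip().lower()] = val.strip().strip('"')
    return mimetype, params

print(parse_content_type('text/html; charset=utf-8'))  # -> ('text/html', {'charset': 'utf-8'})
print(parse_content_type('text/html'))                 # -> ('text/html', {})
print(parse_content_type(''))                          # -> ('', {})
```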
Hope this helps someone some day! :)
Answered by Mikhail Korobov
To decode HTML properly (i.e. in a browser-like way; we can't do better) you need to take into account:
- Content-Type HTTP header value;
- BOM marks;
- <meta> tags in the page body;
- Differences between encoding names used on the web and encoding names available in the Python stdlib;
- As a last resort, if everything else fails, guessing based on statistics is an option.
All of the above is implemented in the w3lib.encoding.html_to_unicode function: it has the signature html_to_unicode(content_type_header, html_body_str, default_encoding='utf8', auto_detect_fun=None) and returns a (detected_encoding, unicode_html_content) tuple.
requests, BeautifulSoup, UnicodeDammit, chardet, and Flask's parse_options_header are not correct solutions, as they all fail on some of these points.
Answered by Usama Tahir
This is what works perfectly for me. I am using Python 2.7 and 3.4.
print(text.encode('cp850', 'replace'))  # assumes `text` is an already-decoded unicode string

