Python: converting a unicode object whose content is a UTF-8 string to str
Original URL: http://stackoverflow.com/questions/14539807/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
StackOverflow
Convert unicode with utf-8 string as content to str
Asked by wong2
I'm using pyquery to parse a page:
from pyquery import PyQuery
dom = PyQuery('http://zh.wikipedia.org/w/index.php', {'title': 'CSS', 'printable': 'yes', 'variant': 'zh-cn'})
content = dom('#mw-content-text > p').eq(0).text()
but what I get in content is a unicode string with UTF-8 encoded content:
u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8...'
How can I convert it to str without losing the content?
To make it clear:
I want content == '\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
not content == u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
Accepted answer by Martijn Pieters
If you have a unicode value that contains UTF-8 bytes, encode it to Latin-1 to preserve the 'bytes':
content = content.encode('latin1')
This works because the Unicode code points U+0000 to U+00FF all map one-to-one to the byte values of the Latin-1 encoding; encoding to Latin-1 thus turns your data back into the literal bytes.
For your example this gives me:
>>> content = u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
>>> content.encode('latin1')
'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
>>> content.encode('latin1').decode('utf8')
u'\u5c42\u53e0\u6837\u5f0f\u8868'
>>> print content.encode('latin1').decode('utf8')
层叠样式表
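The session above is Python 2. As a rough sketch of the same round trip in Python 3, where the text type is spelled str and the byte-string type is bytes, the following first checks the one-to-one mapping for U+0000 to U+00FF and then repairs the sample string. The helper name repair_mojibake is hypothetical; it is not part of PyQuery or of the original answer.

```python
def repair_mojibake(text):
    """Undo a Latin-1 mis-decoding of UTF-8 data (hypothetical helper)."""
    # Latin-1 maps U+0000..U+00FF back to the bytes 0x00..0xFF unchanged,
    # so encoding recovers the original UTF-8 byte sequence.
    raw = text.encode('latin-1')
    # Decoding those bytes as UTF-8 yields the intended characters.
    return raw.decode('utf-8')


# Every code point in U+0000..U+00FF encodes to the single byte of equal value.
assert all(chr(i).encode('latin-1') == bytes([i]) for i in range(256))

content = '\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
raw_bytes = content.encode('latin-1')  # the byte string the question asks for
fixed = repair_mojibake(content)       # the readable text
```

Note that the helper only applies to text that really is mis-decoded UTF-8: encoding raises UnicodeEncodeError if the input contains any code point above U+00FF, and decoding raises UnicodeDecodeError if the recovered bytes are not valid UTF-8.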
PyQuery uses either requests or urllib to retrieve the HTML, and in the case of requests, uses the .text attribute of the response. This auto-decodes the response data based on the encoding set in the Content-Type header alone or, if that information is not available, falls back to latin-1 (this fallback applies to text responses, and HTML is a text response). You can override this by passing in an encoding argument:
dom = PyQuery('http://zh.wikipedia.org/w/index.php',
              {'title': 'CSS', 'printable': 'yes', 'variant': 'zh-cn'},
              encoding='utf8')
at which point you'd not have to re-encode at all.
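To illustrate why choosing the right encoding up front avoids the repair step, here is a small offline sketch in Python 3. It involves no network access and no PyQuery; the variable raw is an assumed stand-in for the bytes of the response body. Decoding it as Latin-1 reproduces the string from the question, while decoding it as UTF-8 gives the intended text directly.

```python
# Stand-in for the bytes of an HTTP response body (assumed for illustration):
# the UTF-8 encoding of the five characters from the example above.
raw = '\u5c42\u53e0\u6837\u5f0f\u8868'.encode('utf-8')

# Decoding as Latin-1 never fails, because every byte value is a valid
# Latin-1 character; it simply yields one character per byte.
wrong = raw.decode('latin-1')

# Decoding with the codec the data was actually written in needs no repair.
right = raw.decode('utf-8')

print(len(raw), len(wrong), len(right))  # prints: 15 15 5
```

The lengths show the symptom: the mis-decoded string has one character per byte (15), whereas the correctly decoded string has the 5 characters that were originally encoded.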

