Python: convert unicode with a UTF-8 string as content to str

Disclaimer: this page is an English-Chinese parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original URL: http://stackoverflow.com/questions/14539807/

Time: 2020-08-18 11:46:22  Source: igfitidea

Convert unicode with utf-8 string as content to str

Tags: python, utf-8, python-2.x, mojibake, pyquery

Asked by wong2

I'm using pyquery to parse a page:

from pyquery import PyQuery

dom = PyQuery('http://zh.wikipedia.org/w/index.php', {'title': 'CSS', 'printable': 'yes', 'variant': 'zh-cn'})
content = dom('#mw-content-text > p').eq(0).text()

but what I get in content is a unicode string with UTF-8 encoded content:

u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8...'

How can I convert it to str without losing the content?

To make it clear:

I want content == '\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'

not content == u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'

Accepted answer by Martijn Pieters

If you have a unicode value that actually holds UTF-8 bytes, encode it to Latin-1 to preserve the 'bytes':

content = content.encode('latin1')

This works because the Unicode code points U+0000 to U+00FF all map one-to-one to bytes in the latin-1 encoding; this encoding thus interprets your data as literal bytes.

For your example this gives me:

>>> content = u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
>>> content.encode('latin1')
'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
>>> content.encode('latin1').decode('utf8')
u'\u5c42\u53e0\u6837\u5f0f\u8868'
>>> print content.encode('latin1').decode('utf8')
层叠样式表
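
For readers on Python 3, the same round trip can be sketched roughly as below. This is only an illustration, not part of the original answer: it assumes Python 3, where str is text and bytes plays the role of the Python 2 str, and the helper name undo_latin1_mojibake is made up for this example.

```python
# Python 3 sketch of the latin-1 round trip shown above.
# undo_latin1_mojibake is a hypothetical helper name, not a library function.

def undo_latin1_mojibake(text):
    """Re-read text as UTF-8 if it looks like UTF-8 bytes decoded as latin-1.

    Returns the input unchanged when the round trip is not possible, for
    example when the text contains characters above U+00FF or when the
    resulting bytes are not valid UTF-8.
    """
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# The garbled value from the question (text whose code points are UTF-8 bytes).
mojibake = '\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'

# latin-1 maps each code point U+0000..U+00FF to the byte with the same value.
raw = mojibake.encode('latin-1')
assert list(raw) == [ord(ch) for ch in mojibake]

# The mapping is lossless in both directions for all 256 byte values.
all_bytes = bytes(range(256))
assert all_bytes.decode('latin-1').encode('latin-1') == all_bytes

repaired = undo_latin1_mojibake(mojibake)
print(repaired)
```

The try/except is a design choice for this sketch only: text that is already correct (or is not UTF-8 underneath) is passed through untouched instead of raising.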

PyQuery uses either requests or urllib to retrieve the HTML, and in the case of requests, uses the .text attribute of the response. This auto-decodes the response data based on the encoding set in the Content-Type header alone, or, if that information is not available, falls back to latin-1 (for text responses, and HTML is a text response). You can override this by passing in an encoding argument:

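# The data dict must come before the keyword argument; a positional
# argument after encoding='utf8' would be a SyntaxError.
dom = PyQuery('http://zh.wikipedia.org/w/index.php',
              {'title': 'CSS', 'printable': 'yes', 'variant': 'zh-cn'},
              encoding='utf8')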

at which point you'd not have to re-encode at all.
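
To see why the fallback produces the garbled value in the first place, here is a small standard-library sketch (Python 3 syntax). It does not call requests or PyQuery; decode_body is a hypothetical helper that only imitates the behaviour described above: use the charset from the Content-Type header, fall back to latin-1 when there is none, and let an explicit encoding override both.

```python
from email.message import Message


def decode_body(raw, content_type, encoding=None):
    """Hypothetical stand-in for the decoding step described above.

    raw          -- response body as bytes
    content_type -- value of the Content-Type header
    encoding     -- explicit override, like the encoding argument above
    """
    if encoding is not None:
        return raw.decode(encoding)
    # Parse the charset parameter out of the header value, if present.
    msg = Message()
    msg['Content-Type'] = content_type
    charset = msg.get_content_charset()  # None when no charset is given
    return raw.decode(charset or 'latin-1')


raw = '层叠样式表'.encode('utf-8')

# No charset in the header: latin-1 fallback yields the mojibake from the question.
garbled = decode_body(raw, 'text/html')

# Charset in the header, or an explicit override: the text decodes correctly.
from_header = decode_body(raw, 'text/html; charset=UTF-8')
overridden = decode_body(raw, 'text/html', encoding='utf8')

print(repr(garbled))
print(from_header, overridden)
```

With the override in place the bytes are decoded as UTF-8 from the start, so no latin-1 round trip is needed afterwards.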
