Python: convert a unicode object whose content is a UTF-8 byte string to str

Question

Asked by wong2

I'm using pyquery to parse a page:

from pyquery import PyQuery
dom = PyQuery('http://zh.wikipedia.org/w/index.php', {'title': 'CSS', 'printable': 'yes', 'variant': 'zh-cn'})
content = dom('#mw-content-text > p').eq(0).text()

But what I get in content is a unicode string whose code points are really UTF-8 encoded bytes:

u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8...'

How can I convert it to str without losing the content?

To make it clear:

I want content == '\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'

not content == u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'

Answer 1

Accepted answer by Martijn Pieters

If you have a unicode value that actually holds UTF-8 bytes, encode it to Latin-1 to preserve the 'bytes':

content = content.encode('latin1')

This works because the Unicode code points U+0000 to U+00FF map one-to-one onto the Latin-1 byte values 0x00 to 0xFF; encoding to Latin-1 therefore turns each character back into the literal byte it stands for.
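
The one-to-one mapping between the first 256 code points and Latin-1 bytes can be checked directly. The following is a minimal sketch, not part of the original answer; the variable names are illustrative, and it is written to run unchanged on Python 2 and Python 3:

```python
# All 256 possible byte values, 0x00 through 0xFF.
raw = bytes(bytearray(range(256)))

# Decoding as Latin-1 maps byte N to the code point with the same number.
text = raw.decode('latin1')
assert [ord(ch) for ch in text] == list(range(256))

# Encoding as Latin-1 reverses the mapping exactly, so no byte is lost.
assert text.encode('latin1') == raw
```

No other common 8-bit codec is guaranteed to behave this way for every byte value, which is why Latin-1 is the one to use for this kind of repair.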

For your example this gives me:

>>> content = u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
>>> content.encode('latin1')
'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
>>> content.encode('latin1').decode('utf8')
u'\u5c42\u53e0\u6837\u5f0f\u8868'
>>> print content.encode('latin1').decode('utf8')
层叠样式表
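
If the repair has to run on text that may or may not be mis-decoded, the round trip can be wrapped defensively. This is only a sketch: `repair_mojibake` is a hypothetical helper name, not something PyQuery or the standard library provides, and it assumes the input is a text (unicode) string.

```python
def repair_mojibake(text):
    """Undo a wrong Latin-1 decode of UTF-8 data (hypothetical helper).

    Returns the text unchanged if it cannot be such a mis-decoding.
    """
    try:
        return text.encode('latin1').decode('utf8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # UnicodeEncodeError: a code point above U+00FF, so not raw bytes.
        # UnicodeDecodeError: the recovered bytes are not valid UTF-8.
        return text


broken = u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
fixed = repair_mojibake(broken)
assert fixed == u'\u5c42\u53e0\u6837\u5f0f\u8868'

# Already-correct text passes through untouched.
assert repair_mojibake(fixed) == fixed
```

The check is a heuristic: genuine Latin-1 text whose bytes happen to form valid UTF-8 would also be converted, so it is safest to apply it only to data known to come from a mis-decoded source. To get the byte string the question asks for, `text.encode('latin1')` alone is enough.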

PyQuery uses either requests or urllib to retrieve the HTML and, in the case of requests, uses the .text attribute of the response. That attribute auto-decodes the response data based solely on the encoding given in the Content-Type header; if the header does not name one, it falls back to latin-1 (this applies to text responses, and HTML is a text response). You can override this by passing in an encoding argument:

dom = PyQuery('http://zh.wikipedia.org/w/index.php',
              {'title': 'CSS', 'printable': 'yes', 'variant': 'zh-cn'},
              encoding='utf8')

At that point you would not have to re-encode at all.
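
To see why decoding with the right encoding removes the need for any repair, here is a minimal sketch that imitates the decoding step with plain byte strings. It does not call requests or PyQuery; the `payload` value and the variable names are assumptions made for illustration.

```python
# The bytes a server would send for the UTF-8 encoded page text.
payload = u'\u5c42\u53e0\u6837\u5f0f\u8868'.encode('utf8')

# Falling back to Latin-1 yields one character per byte: the mojibake
# shown in the question.
wrong = payload.decode('latin1')
assert wrong == u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'

# Decoding with the correct encoding up front needs no repair step.
right = payload.decode('utf8')
assert right == u'\u5c42\u53e0\u6837\u5f0f\u8868'
assert len(wrong) == 15 and len(right) == 5
```

The 15 characters of the wrong result correspond to the 15 bytes of the payload, three per Chinese character.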
