Python - Unicode to ASCII conversion

Question

Asked by Adriano Almeida

I am unable to convert the following Unicode to ASCII without losing data:

u'ABRA\xc3O JOS\xc9'

I tried encode and decode, and they won't do it.

Does anyone have a suggestion?

Answer 1

Accepted answer by abarnert

The Unicode characters u'\xc3' and u'\xc9' do not have any corresponding ASCII values. So, if you don't want to lose data, you have to encode that data in some way that's valid as ASCII. Options include:

>>> s = u'ABRA\xc3O JOS\xc9'
>>> print s.encode('ascii', errors='backslashreplace')
ABRA\xc3O JOS\xc9
>>> print s.encode('ascii', errors='xmlcharrefreplace')
ABRA&#195;O JOS&#201;
>>> print s.encode('unicode-escape')
ABRA\xc3O JOS\xc9
>>> print s.encode('punycode')
ABRAO JOS-jta5e

All of these are ASCII strings, and contain all of the information from your original Unicode string (so they can all be reversed without loss of data), but none of them are all that pretty for an end-user (and none of them can be reversed just by decode('ascii')).
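
As a rough illustration of the "can all be reversed" claim, here is a minimal sketch that round-trips each of the four encodings. It assumes Python 3 (where `str` is Unicode and `encode` returns `bytes`), unlike the Python 2 session above, and it uses `html.unescape` as one possible way to undo the XML character references. Note that undoing backslashreplace with the unicode-escape codec is only safe when the original text contains no literal backslashes.

```python
import html

# The string from the question, written as a Python 3 str.
s = 'ABRA\xc3O JOS\xc9'

# Four ASCII-only representations that keep all the information.
backslash = s.encode('ascii', errors='backslashreplace')
xmlref = s.encode('ascii', errors='xmlcharrefreplace')
escaped = s.encode('unicode-escape')
puny = s.encode('punycode')

# Each one needs its own matching decoder; decode('ascii') alone
# would only give back the escaped text, not the original string.
from_backslash = backslash.decode('unicode-escape')
from_xmlref = html.unescape(xmlref.decode('ascii'))
from_escaped = escaped.decode('unicode-escape')
from_puny = puny.decode('punycode')

for name, value in [('backslashreplace', from_backslash),
                    ('xmlcharrefreplace', from_xmlref),
                    ('unicode-escape', from_escaped),
                    ('punycode', from_puny)]:
    print(name, value == s)
```

Which form to pick depends mostly on who consumes the result: character references suit HTML or XML output, while the escape forms suit logs and source code.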

See str.encode, Python Specific Encodings, and Unicode HOWTO for more info.

As a side note, when some people say "ASCII", they really don't mean "ASCII" but rather "any 8-bit character set that's a superset of ASCII" or "some particular 8-bit character set that I have in mind". If that's what you meant, the solution is to encode to the right 8-bit character set:

>>> s.encode('utf-8')
'ABRA\xc3\x83O JOS\xc3\x89'
>>> s.encode('cp1252')
'ABRA\xc3O JOS\xc9'
>>> s.encode('iso-8859-15')
'ABRA\xc3O JOS\xc9'

The hard part is knowing which character set you meant. If you're writing both the code that produces the 8-bit strings and the code that consumes them, and you don't know any better, you meant UTF-8. If the code that consumes the 8-bit strings is, say, the open function, or a web browser that you're serving a page to, or something else, things are more complicated, and there's no easy answer without a lot more information.
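
To make it concrete why the choice of character set matters, the following sketch (assuming Python 3, so the results are `bytes` rather than Python 2 `str`) encodes the question's string with each codec and then shows what happens when the consumer assumes a different one. The variable names are purely illustrative.

```python
s = 'ABRA\xc3O JOS\xc9'

# Producer and consumer agree on the codec: lossless round trip.
for codec in ('utf-8', 'cp1252', 'iso-8859-15'):
    data = s.encode(codec)
    assert data.decode(codec) == s

utf8_bytes = s.encode('utf-8')
cp1252_bytes = s.encode('cp1252')

# Consumer wrongly assumes cp1252 for UTF-8 data: no error, but the
# text is silently corrupted (each accented letter becomes two characters).
mojibake = utf8_bytes.decode('cp1252')
print(mojibake == s)

# Consumer wrongly assumes UTF-8 for cp1252 data: here the bytes are
# not valid UTF-8, so the decoder raises instead of guessing.
try:
    cp1252_bytes.decode('utf-8')
    strict_utf8_ok = True
except UnicodeDecodeError:
    strict_utf8_ok = False
print(strict_utf8_ok)
```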

Answer 2

Answered by Rhythm Chopra

I needed to calculate the MD5 hash of a unicode string received in an HTTP request. MD5 was raising UnicodeEncodeError, and Python's built-in encoding methods didn't work because they replace the characters in the string with the corresponding hex values, thus changing the MD5 hash. So I came up with the following code, which keeps the string intact while converting from unicode.

unicode_string = ''.join([chr(ord(x)) for x in unicode_string]).strip()

This removes the unicode part from the string and keeps all the data intact.
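
For readers on Python 3, here is a hedged sketch of what the one-liner above appears to do: mapping each character to the byte with the same number is the same as encoding to Latin-1, so it only works when every code point is below 256. The sample string and variable names are assumptions for illustration, not part of the original answer.

```python
import hashlib

unicode_string = 'ABRA\xc3O JOS\xc9'

# Python 3 counterpart of ''.join(chr(ord(x)) for x in ...):
# one byte per character. bytes() raises ValueError for any
# code point above 255, so this is not a general solution.
as_bytes = bytes(ord(ch) for ch in unicode_string.strip())

# The same result, spelled as an explicit encoding.
latin1_bytes = unicode_string.strip().encode('latin-1')
print(as_bytes == latin1_bytes)

# MD5 works on bytes, so the digest depends on the encoding chosen.
digest_latin1 = hashlib.md5(latin1_bytes).hexdigest()
digest_utf8 = hashlib.md5(unicode_string.strip().encode('utf-8')).hexdigest()
print(digest_latin1 == digest_utf8)
```

Whichever encoding is used, both sides computing the hash have to agree on it, since the two digests above differ for the same text.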
