Python - Unicode to ASCII conversion

Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, link to the original, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/19527279/

Date: 2020-08-19 13:58:23  Source: igfitidea

Tags: python, unicode, encoding, ascii

Asked by Adriano Almeida

I am unable to convert the following Unicode to ASCII without losing data:

u'ABRA\xc3O JOS\xc9'

I tried encode and decode, and they won't do it.

Does anyone have a suggestion?

Accepted answer by abarnert

The Unicode characters u'\xc3' and u'\xc9' do not have any corresponding ASCII values. So, if you don't want to lose data, you have to encode that data in some way that's valid as ASCII. Options include:

>>> s = u'ABRA\xc3O JOS\xc9'
>>> print s.encode('ascii', errors='backslashreplace')
ABRA\xc3O JOS\xc9
>>> print s.encode('ascii', errors='xmlcharrefreplace')
ABRA&#195;O JOS&#201;
>>> print s.encode('unicode-escape')
ABRA\xc3O JOS\xc9
>>> print s.encode('punycode')
ABRAO JOS-jta5e

All of these are ASCII strings, and contain all of the information from your original Unicode string (so they can all be reversed without loss of data), but none of them are all that pretty for an end-user (and none of them can be reversed just by decode('ascii')).
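
To make the "can be reversed" claim concrete, here is a minimal sketch of each round trip. It assumes Python 3 (the session above is Python 2), where str is already Unicode and encode returns bytes; the names round_trips and strict_ok are illustrative only, and pairing each encoding with html.unescape or a codec's decode is one way to undo it, not the only one.

```python
import html

s = 'ABRA\xc3O JOS\xc9'  # the question's string; a Python 3 str is already Unicode

# Strict ASCII cannot represent the two accented characters, so this fails.
try:
    s.encode('ascii')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Each entry pairs an ASCII-only encoding with a step that undoes it.
round_trips = {
    'backslashreplace': (
        s.encode('ascii', errors='backslashreplace'),
        lambda b: b.decode('unicode_escape'),
    ),
    'xmlcharrefreplace': (
        s.encode('ascii', errors='xmlcharrefreplace'),
        lambda b: html.unescape(b.decode('ascii')),
    ),
    'unicode_escape': (
        s.encode('unicode_escape'),
        lambda b: b.decode('unicode_escape'),
    ),
    'punycode': (
        s.encode('punycode'),
        lambda b: b.decode('punycode'),
    ),
}

for name, (encoded, restore) in round_trips.items():
    # Print the ASCII bytes and whether restoring them gives back the original.
    print(name, encoded, restore(encoded) == s)
```

One caveat: the two error handlers leave characters that are already ASCII untouched, so a string that itself contains a literal backslash or a literal "&#195;" would not round-trip unambiguously through backslashreplace or xmlcharrefreplace. The unicode_escape codec escapes backslashes itself and does not have that problem.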

See str.encode, Python Specific Encodings, and the Unicode HOWTO for more info.

As a side note, when some people say "ASCII", they really don't mean "ASCII" but rather "any 8-bit character set that's a superset of ASCII" or "some particular 8-bit character set that I have in mind". If that's what you meant, the solution is to encode to the right 8-bit character set:

>>> s.encode('utf-8')
'ABRA\xc3\x83O JOS\xc3\x89'
>>> s.encode('cp1252')
'ABRA\xc3O JOS\xc9'
>>> s.encode('iso-8859-15')
'ABRA\xc3O JOS\xc9'

The hard part is knowing which character set you meant. If you're writing both the code that produces the 8-bit strings and the code that consumes them, and you don't know any better, you meant UTF-8. If the code that consumes the 8-bit strings is, say, the open function, or a web browser that you're serving a page to, or something else, things are more complicated, and there's no easy answer without a lot more information.
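
Why the choice of character set matters can be shown with a short sketch, again assuming Python 3. The variable names are illustrative; the point is only that the producer and the consumer have to agree on the encoding.

```python
s = 'ABRA\xc3O JOS\xc9'

utf8_bytes = s.encode('utf-8')      # two bytes for each accented character
cp1252_bytes = s.encode('cp1252')   # one byte for every character

# Decoding with the charset that produced the bytes restores the text.
utf8_ok = utf8_bytes.decode('utf-8') == s
cp1252_ok = cp1252_bytes.decode('cp1252') == s

# Decoding UTF-8 bytes as cp1252 "succeeds" but yields garbled text.
mojibake = utf8_bytes.decode('cp1252')

# Decoding cp1252 bytes as UTF-8 fails outright for this string.
try:
    cp1252_bytes.decode('utf-8')
    cp1252_as_utf8_ok = True
except UnicodeDecodeError:
    cp1252_as_utf8_ok = False

print(len(utf8_bytes), len(cp1252_bytes))
print(repr(mojibake), cp1252_as_utf8_ok)
```

A mismatch in one direction silently corrupts the text and in the other raises an error, which is part of why there is no safe default when you do not control the consumer.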

Answer by Rhythm Chopra

I needed to calculate the MD5 hash of a unicode string received in an HTTP request. MD5 was raising UnicodeEncodeError, and Python's built-in encoding methods didn't work because they replace the characters in the string with the corresponding hex values for the characters, thus changing the MD5 hash. So I came up with the following code, which keeps the string intact while converting from unicode.

unicode_string = ''.join([chr(ord(x)) for x in unicode_string]).strip()

This removes the unicode part from the string and keeps all the data intact.
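
For comparison, here is a hedged sketch of the same idea in Python 3, where hashlib.md5 only accepts bytes. In Python 2, chr(ord(x)) maps each code point below 256 to the byte with the same value (and raises ValueError for anything higher), which matches what the Latin-1 codec does. The helper md5_hex is hypothetical, and which encoding the other party actually hashes is an assumption you would need to confirm.

```python
import hashlib

s = 'ABRA\xc3O JOS\xc9'

def md5_hex(text, encoding='utf-8'):
    """Hypothetical helper: hash the bytes of text under an explicit encoding."""
    return hashlib.md5(text.encode(encoding)).hexdigest()

# The per-character trick yields the same bytes as Latin-1
# for strings whose code points are all below 256.
per_char_bytes = bytes(ord(c) for c in s)
same_as_latin1 = per_char_bytes == s.encode('latin-1')

# The digest depends on the bytes, so it depends on the encoding chosen.
utf8_digest = md5_hex(s, 'utf-8')
latin1_digest = md5_hex(s, 'latin-1')

print(same_as_latin1)
print(utf8_digest)
print(latin1_digest)
```

The two digests differ for this string, so both sides must hash the same encoding of the text for the values to match.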
