Python: convert or strip out "illegal" Unicode characters

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/2508847/

Time: 2020-11-04 00:48:25  Source: igfitidea

Convert or strip out "illegal" Unicode characters

python unicode pymssql

Asked by Oli

I've got a database in MSSQL that I'm porting to SQLite/Django. I'm using pymssql to connect to the database and save a text field to the local SQLite database.

However, for some characters it explodes. I get complaints like this:

UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 1916: ordinal not in range(128)

Is there some way I can convert the chars to proper unicode versions? Or strip them out?

Answered by YOU

When you decode, just pass 'ignore' to strip those characters.

There are a few more ways of stripping or converting them:

'replace': replace malformed data with a suitable replacement marker, such as '?' or '\ufffd' 

'ignore': ignore malformed data and continue without further notice 

'backslashreplace': replace with backslashed escape sequences (for encoding only) 

Test

>>> "abcd\x97".decode("ascii")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 4: ordinal not in range(128)
>>>
>>> "abcd\x97".decode("ascii","ignore")
u'abcd'
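
The session above is Python 2, where a plain string literal is a byte string. As a rough sketch of the same error handlers under Python 3 (an assumption, since the question does not say which version is in use), the input has to be a bytes literal. The sample values are made up to mirror the 0x97 byte from the question.

```python
# Python 3 sketch; raw mirrors the offending 0x97 byte from the question.
raw = b"abcd\x97"

# 'ignore' silently drops the byte that ASCII cannot decode.
ignored = raw.decode("ascii", "ignore")      # ignored == "abcd"

# 'replace' substitutes U+FFFD (the replacement character) when decoding.
replaced = raw.decode("ascii", "replace")    # replaced == "abcd\ufffd"

# 'backslashreplace' shown on the encoding side: the character that
# ASCII cannot represent is written out as an escape sequence.
escaped = "abcd\u2014".encode("ascii", "backslashreplace")  # b"abcd\\u2014"
```

Note that 'ignore' loses data without any warning, so 'replace' may be the safer choice when you want to see where the bad bytes were.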

Answered by Alex Martelli

Once you have the string of bytes s, instead of using it as a unicode obj directly, convert it explicitly with the right codec, e.g.:

u = s.decode('latin-1')

and use u instead of s in the code that follows this point (presumably the part that writes to sqlite). That's assuming latin-1 is the encoding that was used to make the byte string originally -- it's impossible for us to guess, so try to find out ;-).
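
To illustrate why the choice of codec matters, here is a small Python 3 sketch comparing two candidate encodings on the 0x97 byte from the error message. That the data might be cp1252 (Windows-1252) is only a guess for the sake of the example; the real source encoding has to be checked against the database.

```python
# Python 3 sketch; raw mirrors the offending 0x97 byte from the question.
raw = b"abcd\x97"

# latin-1 maps every byte to the code point with the same value, so
# decoding never fails, but 0x97 becomes the control character U+0097.
as_latin1 = raw.decode("latin-1")

# cp1252 maps 0x97 to U+2014 (EM DASH), which is often what text
# typed on Windows actually contained.
as_cp1252 = raw.decode("cp1252")

print(repr(as_latin1))
print(repr(as_cp1252))
```

Both calls succeed, so a decode that does not raise is not proof that the codec was the right one.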

As a general rule, I suggest: don't process any text in your applications as encoded byte strings -- decode it to unicode objects right after input, and, if necessary, encode it back to byte strings right before output.
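
A minimal Python 3 sketch of that rule might look like the following. The function names and the default encodings (cp1252 in, utf-8 out) are hypothetical and chosen only for illustration; they are not taken from the question.

```python
def read_text_field(raw_bytes, source_encoding="cp1252"):
    """Decode at the input boundary; source_encoding is an assumed default."""
    return raw_bytes.decode(source_encoding)


def write_text_field(text, target_encoding="utf-8"):
    """Encode only at the output boundary; target_encoding is an assumed default."""
    return text.encode(target_encoding)


# Everything between the two boundaries works on str, never on bytes.
text = read_text_field(b"  abcd\x97  ").strip()
out = write_text_field(text)
```

Keeping the decode and encode steps in two small functions means the codec names live in one place if the guess about the source encoding turns out to be wrong.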
