Python: convert or strip out "illegal" Unicode characters

Question

Asked by Oli

I've got a database in MSSQL that I'm porting to SQLite/Django. I'm using pymssql to connect to the database and save a text field to the local SQLite database.

However, for some characters it blows up. I get errors like this:

UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 1916: ordinal not in range(128)

Is there some way I can convert the characters to proper Unicode versions? Or strip them out?

Answer 1

Answered by YOU

When you decode, pass 'ignore' as the error handler to strip those characters.

There are a few other error handlers for stripping or converting them:

'replace': replace malformed data with a suitable replacement marker, such as '?' or '\ufffd' 

'ignore': ignore malformed data and continue without further notice 

'backslashreplace': replace with backslashed escape sequences (for encoding only)

Test

>>> "abcd\x97".decode("ascii")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 4: ordinal not in range(128)
>>>
>>> "abcd\x97".decode("ascii","ignore")
u'abcd'
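
The session above is Python 2. For readers on Python 3, where byte strings carry a b prefix and decode returns str, the same handlers can be compared side by side. This is only a minimal sketch; the sample value raw is made up for illustration and is not from the original database.

```python
# Python 3 sketch: compare the error handlers on one sample byte string.
raw = b"abcd\x97"  # hypothetical sample; 0x97 is not valid ASCII

ignored = raw.decode("ascii", "ignore")    # silently drops the bad byte
replaced = raw.decode("ascii", "replace")  # substitutes U+FFFD for the bad byte

# backslashreplace when encoding: characters the codec cannot represent
# are written out as escape sequences instead of raising an error.
escaped = "abcd\u2014".encode("ascii", "backslashreplace")

print(repr(ignored))
print(repr(replaced))
print(escaped)
```

Note that 'ignore' and 'replace' both lose information, so they are best kept for cases where the original characters really do not matter.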

Answer 2

Answered by Alex Martelli

Once you have the byte string s, instead of using it as a unicode object directly, convert it explicitly with the right codec, e.g.:

u = s.decode('latin-1')

and use u instead of s in the code that follows this point (presumably the part that writes to sqlite). That assumes latin-1 is the encoding that was originally used to produce the byte string. It's impossible for us to guess, so try to find out ;-).
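
To see why the choice of codec matters, the sketch below decodes the same byte with two candidate codecs. It assumes Python 3, and cp1252 is only a guess at what a Windows-hosted MSSQL database might have used; check the actual column collation rather than relying on it.

```python
s = b"abcd\x97"  # hypothetical sample containing the byte from the traceback

as_latin1 = s.decode("latin-1")  # 0x97 maps to the C1 control character U+0097
as_cp1252 = s.decode("cp1252")   # 0x97 maps to U+2014, an em dash

# ascii() shows non-ASCII characters as escapes, which makes the
# difference between the two results visible on any terminal.
print(ascii(as_latin1))
print(ascii(as_cp1252))
```

latin-1 maps every possible byte to a character, so decoding with it never raises. A clean decode therefore does not prove the guess was right.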

As a general rule, I suggest: don't process any text in your applications as encoded byte strings. Decode them to unicode objects right after input and, if necessary, encode them back to byte strings right before output.
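
The rule above might look roughly like this in Python 3. The helper names read_text and write_bytes, the sample bytes, and the default encodings are all hypothetical choices for illustration, not part of the original answer.

```python
def read_text(raw: bytes, encoding: str = "cp1252") -> str:
    """Decode at the input boundary. The default encoding is an assumption."""
    return raw.decode(encoding)


def write_bytes(text: str, encoding: str = "utf-8") -> bytes:
    """Encode at the output boundary, only if the consumer needs bytes."""
    return text.encode(encoding)


incoming = b"caf\xe9 \x97 menu"  # hypothetical bytes from the source database
text = read_text(incoming)       # from here on, the program only handles str
text = text.upper()              # text processing happens on decoded text
outgoing = write_bytes(text)     # bytes again, right before output
```

With Python 3's sqlite3 module the final step is usually unnecessary, since the driver accepts str values directly.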
