Python: How to pickle unicode strings and save them in a UTF-8 database

Question

Asked by Jorge Leitao

I have a database (mysql) where I want to store pickled data.

The data can be for instance a dictionary, which may contain unicode, e.g.

data = {1 : u'é'}

and the database (mysql) is in utf-8.

When I pickle,

import pickle
pickled_data = pickle.dumps(data)
print type(pickled_data) # returns <type 'str'>

the resulting pickled_data is a string.

When I try to store this in a database (e.g. in a TextField), this can cause problems. In particular, at some point I get a

UnicodeDecodeError "'utf8' codec can't decode byte 0xe9 in position X"

when trying to save pickled_data in the database. This makes sense, because pickled_data can contain non-UTF-8 characters. My question is: how do I store pickled_data in a UTF-8 database?

I see two possible candidates:

1. Encode the result of pickle.dumps to UTF-8 and store it. When I want to call pickle.loads, I have to decode it first.
2. Store the pickled string in binary format (how?), which forces all characters to be within ASCII.

My issue is that I can't see what the long-term consequences of choosing one of these options are. Since the change already requires some effort, I'd like an opinion on this issue, and I'm open to better candidates if there are any.

(P.S. This is useful in Django, for instance.)

Answer 1

Accepted answer by Martijn Pieters

Pickle data is opaque, binary data, even when you use protocol version 0:

>>> pickle.dumps(data, 0)
'(dp0\nI1\nV\xe9\np1\ns.'

When you try to store that in a TextField, Django will try to decode the data as UTF-8 in order to store it. This is what fails, because the data is not UTF-8 encoded text; it is binary data instead:

>>> pickled_data.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 9: invalid continuation byte

The solution is to not try to store this in a TextField. Use a BinaryField instead:

A field to store raw binary data. It only supports bytes assignment. Be aware that this field has limited functionality. For example, it is not possible to filter a queryset on a BinaryField value.

You have a bytes value (Python 2 strings are byte strings, renamed to bytes in Python 3).
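
As a rough illustration of the binary-column approach, here is a minimal sketch that stores the pickle as raw bytes and reads it back. It makes several assumptions that are not in the original answer: an in-memory SQLite database stands in for MySQL, the table name `cache` and column name `payload` are made up, and the code is written for Python 3, where pickle.dumps already returns bytes. A Django BinaryField would play the role of the BLOB column here.

```python
import pickle
import sqlite3

data = {1: u'\xe9'}  # same shape as the dictionary in the question

# In-memory SQLite database, used here only as a stand-in for MySQL.
# The table and column names are hypothetical.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE cache (id INTEGER PRIMARY KEY, payload BLOB)')

# Pickle to bytes and bind them as a binary parameter, so the driver
# never tries to interpret them as text.
pickled_data = pickle.dumps(data, protocol=2)
conn.execute('INSERT INTO cache (payload) VALUES (?)',
             (sqlite3.Binary(pickled_data),))

# Read the raw bytes back and unpickle them.
(stored,) = conn.execute('SELECT payload FROM cache').fetchone()
restored = pickle.loads(bytes(stored))
conn.close()
```

Because the bytes are never decoded, any pickle protocol can be used, not just protocol 0.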

If you insist on storing the data in a text field, explicitly decode it as latin1; the Latin 1 codec maps bytes one-to-one to Unicode codepoints:

>>> pickled_data.decode('latin1')
u'(dp0\nI1\nV\xe9\np1\ns.'

and make sure you encode it again before unpickling:

>>> encoded = pickled_data.decode('latin1')
>>> pickle.loads(encoded)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mj/Development/Libraries/buildout.python/parts/opt/lib/python2.7/pickle.py", line 1381, in loads
    file = StringIO(str)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 9: ordinal not in range(128)
>>> pickle.loads(encoded.encode('latin1'))
{1: u'\xe9'}
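
The decode and encode steps above could be wrapped in a pair of helpers so that they always happen together. This is only a sketch: the function names `pickle_to_text` and `text_to_object` are hypothetical, and the code is written so that it should also run on Python 3. It also checks the property the Latin 1 trick relies on, namely that every byte value survives the round trip.

```python
import pickle

def pickle_to_text(obj):
    """Pickle obj and return the bytes as text, one code point per byte."""
    return pickle.dumps(obj, 0).decode('latin1')

def text_to_object(text):
    """Reverse pickle_to_text: turn the text back into bytes, then unpickle."""
    return pickle.loads(text.encode('latin1'))

# Latin-1 maps each byte value 0..255 to the code point with the same number,
# so any byte string survives a decode/encode round trip unchanged.
all_bytes = bytes(bytearray(range(256)))
assert all_bytes.decode('latin1').encode('latin1') == all_bytes

data = {1: u'\xe9'}
as_text = pickle_to_text(data)   # text, suitable for a text column
restored = text_to_object(as_text)
```

The text produced this way is only a carrier for bytes; it is not meant to be read or edited as text.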

Do note that if you let this value go to the browser and back again in a text field, the browser is likely to have replaced characters in that data. Internet Explorer will replace \n characters with \r\n, for example, because it assumes it is dealing with text.

Not that you should ever accept pickle data from a network connection in any case, because that is a security hole waiting to be exploited.
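
To make the security concern concrete, here is a small and deliberately harmless sketch of why unpickling untrusted bytes is dangerous. The class name `Payload` is made up for illustration, and the string passed to eval stands in for whatever an attacker would choose to run.

```python
import pickle

class Payload(object):
    # __reduce__ tells pickle how to rebuild the object: "call this callable
    # with these arguments". Nothing restricts which callable is named.
    def __reduce__(self):
        return (eval, ("'attacker-chosen ' + 'code ran'",))

# Bytes like these could arrive from any untrusted source.
untrusted_bytes = pickle.dumps(Payload())

# Unpickling calls eval(...) and returns its result; no Payload instance
# is ever reconstructed.
result = pickle.loads(untrusted_bytes)
```

The receiving side does not need the `Payload` class at all, since the pickle only names the callable and its arguments.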
