Python 如何在保留矩阵维度的同时序列化 numpy 数组?
声明:本页面是 Stack Overflow 热门问题的中英对照翻译,遵循 CC BY-SA 4.0 协议。如果您需要使用它,必须同样遵循 CC BY-SA 许可,注明原文地址和作者信息,并将其归于原作者(不是我):Stack Overflow
原文地址: http://stackoverflow.com/questions/30698004/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me): Stack Overflow
How can I serialize a numpy array while preserving matrix dimensions?
提问by blz
numpy.array.tostring doesn't seem to preserve information about matrix dimensions (see this question), requiring the user to issue a call to numpy.array.reshape.
numpy.array.tostring 似乎没有保留矩阵维度的信息(请参阅此问题),需要用户再调用 numpy.array.reshape。
Is there a way to serialize a numpy array to JSON format while preserving this information?
有没有办法在保留此信息的同时将 numpy 数组序列化为 JSON 格式?
Note:The arrays may contain ints, floats or bools. It's reasonable to expect a transposed array.
注意:数组可能包含整数、浮点数或布尔值。期望转置数组是合理的。
Note 2:this is being done with the intent of passing the numpy array through a Storm topology using streamparse, in case such information ends up being relevant.
注 2:这样做的目的是使用 streamparse 通过 Storm 拓扑传递 numpy 数组,以防此类信息最终相关。
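To make the problem concrete, here is a minimal sketch (using tobytes/frombuffer, the modern spellings of tostring/fromstring) showing that a raw-bytes round trip keeps the data but loses the shape:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
flat = np.frombuffer(a.tobytes(), dtype=a.dtype)  # tobytes() is the modern name for tostring()
print(flat.shape)  # (6,) - the array comes back 1-D; the (2, 3) shape is gone
restored = flat.reshape(2, 3)  # the caller must remember the shape separately
```

This is exactly the bookkeeping (shape, and also dtype) that a serialization format needs to carry alongside the raw bytes.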
采纳答案by user2357112 supports Monica
pickle.dumps or numpy.save encode all the information needed to reconstruct an arbitrary NumPy array, even in the presence of endianness issues, non-contiguous arrays, or weird tuple dtypes. Endianness issues are probably the most important; you don't want array([1]) to suddenly become array([16777216]) because you loaded your array on a big-endian machine. pickle is probably the more convenient option, though save has its own benefits, given in the npy format rationale.
pickle.dumps 或 numpy.save 会编码重建任意 NumPy 数组所需的全部信息,即使存在字节序问题、非连续数组或奇怪的元组 dtype 也没有问题。字节序问题可能是最重要的:你不希望 array([1]) 因为在大端机器上加载而突然变成 array([16777216])。pickle 可能是更方便的选择,不过 save 也有其自身的优点,见 npy 格式的设计说明。
The pickle option:
该 pickle 选项:
import pickle
import numpy

a = numpy.array([[1, 2], [3, 4]])  # some NumPy array
serialized = pickle.dumps(a, protocol=0)  # protocol 0 is printable ASCII
deserialized_a = pickle.loads(serialized)
numpy.save uses a binary format, and it needs to write to a file, but you can get around that with io.BytesIO:
numpy.save 使用二进制格式,并且需要写入文件,但你可以用 io.BytesIO 绕过这一点:
import io
import json
import numpy

a = numpy.array([[1, 2], [3, 4]])  # any NumPy array
memfile = io.BytesIO()
numpy.save(memfile, a)
memfile.seek(0)
serialized = json.dumps(memfile.read().decode('latin-1'))
# latin-1 maps byte n to unicode code point n
And to deserialize:
并反序列化:
memfile = io.BytesIO()
memfile.write(json.loads(serialized).encode('latin-1'))
memfile.seek(0)
a = numpy.load(memfile)
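Putting the two halves together, the whole numpy.save round trip can be sketched as follows (variable names are mine):

```python
import io
import json

import numpy

original = numpy.arange(12, dtype=numpy.float32).reshape(3, 4)

# serialize: write the .npy bytes into memory, then wrap them in a JSON string
memfile = io.BytesIO()
numpy.save(memfile, original)
serialized = json.dumps(memfile.getvalue().decode('latin-1'))

# deserialize: reverse the latin-1 mapping and let numpy.load rebuild the array
restored = numpy.load(io.BytesIO(json.loads(serialized).encode('latin-1')))
assert restored.dtype == original.dtype and restored.shape == original.shape
```

Because the .npy header itself records dtype, shape, and byte order, nothing has to be tracked by hand.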
回答by daniel451
EDIT: As one can read in the comments of the question, this solution deals with "normal" numpy arrays (floats, ints, bools ...) and not with multi-type structured arrays.
编辑:正如人们可以在问题的评论中阅读的那样,该解决方案处理“普通”numpy 数组(浮点数、整数、布尔值……)而不是多类型结构化数组。
Solution for serializing a numpy array of any dimensions and data types
序列化任意维度和数据类型的numpy数组的解决方案
As far as I know, you cannot simply serialize a numpy array with any data type and any dimension... but you can store its data type, dimensions and data in a list representation and then serialize that using JSON.
据我所知,您不能直接序列化任意数据类型、任意维度的 numpy 数组……但您可以将其数据类型、维度和数据存储在一个列表表示中,然后使用 JSON 对其进行序列化。
Imports needed:
需要导入:
import json
import base64
For encoding you could use (nparray is some numpy array of any data type and any dimensionality; tobytes()/decode('ascii') are added here so the result is JSON-serializable under Python 3):
对于编码,你可以使用(nparray 是任意数据类型、任意维度的 numpy 数组;这里加上 tobytes()/decode('ascii') 以便结果在 Python 3 下可被 JSON 序列化):
json.dumps([str(nparray.dtype), base64.b64encode(nparray.tobytes()).decode('ascii'), nparray.shape])
After this you get a JSON dump (string) of your data, containing a list representation of its data type and shape as well as the arrays data/contents base64-encoded.
在此之后,您将获得数据的 JSON 转储(字符串),其中包含其数据类型和形状的列表表示以及 base64 编码的数组数据/内容。
And for decoding, this does the work (encStr is the encoded JSON string, loaded from somewhere):
而解码时,下面的代码即可完成(encStr 是从某处加载的编码 JSON 字符串):
# get the encoded json dump
enc = json.loads(encStr)
# build the numpy data type
dataType = numpy.dtype(enc[0])
# decode the base64 encoded numpy array data and create a new numpy array with this data & type
dataArray = numpy.frombuffer(base64.b64decode(enc[1]), dtype=dataType)
# if the array had more than one dimension, it has to be reshaped
if len(enc) > 2:
    dataArray = dataArray.reshape(enc[2])  # reshape returns a new view; reassign it
JSON dumps are efficient and cross-compatible for many reasons but just taking JSON leads to unexpected results if you want to store and load numpy arrays of any typeand any dimension.
出于多种原因,JSON 转储是高效且交叉兼容的,但如果您想存储和加载任何类型和任何维度的numpy 数组,仅采用 JSON 会导致意外结果。
This solution stores and loads numpy arrays regardless of the type or dimension and also restores it correctly (data type, dimension, ...)
此解决方案存储和加载 numpy 数组,而不管类型或维度如何,并且还可以正确恢复它(数据类型、维度等)
I tried several solutions myself months ago and this was the only efficient, versatile solution I came across.
几个月前我自己尝试了几种解决方案,这是我遇到的唯一高效、通用的解决方案。
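Putting the encode and decode halves above together, a Python 3 round trip might look like this (the helper names nparray_to_json / nparray_from_json are mine, not from the answer):

```python
import base64
import json

import numpy as np

def nparray_to_json(nparray):
    # store dtype, base64-encoded raw bytes, and shape, as in the answer above
    return json.dumps([str(nparray.dtype),
                       base64.b64encode(nparray.tobytes()).decode('ascii'),
                       nparray.shape])

def nparray_from_json(enc_str):
    dtype, data, shape = json.loads(enc_str)
    arr = np.frombuffer(base64.b64decode(data), dtype=np.dtype(dtype))
    return arr.reshape(shape)  # JSON turns the shape tuple into a list; reshape accepts it

a = np.array([[True, False], [False, True]])
b = nparray_from_json(nparray_to_json(a))
assert b.dtype == a.dtype and b.shape == a.shape and (a == b).all()
```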
回答by Ken
Try using numpy.array_repr or numpy.array_str.
尝试使用 numpy.array_repr 或 numpy.array_str。
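For illustration, array_repr yields a human-readable string; reconstructing an array from it requires eval with numpy's array in scope, which is only sensible for small, trusted inputs (a sketch, not a recommended serialization path, since repr truncates large arrays):

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
s = np.array_repr(a)
print(s)
# eval-based reconstruction only works for small arrays and trusted input
b = eval(s, {'array': np.array})
assert (a == b).all()
```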
回答by Rebs
I found the code in Msgpack-numpy helpful. https://github.com/lebedov/msgpack-numpy/blob/master/msgpack_numpy.py
我发现 Msgpack-numpy 中的代码很有帮助。 https://github.com/lebedov/msgpack-numpy/blob/master/msgpack_numpy.py
I modified the serialised dict slightly and added base64 encoding to reduce the serialised size.
我稍微修改了序列化的 dict 并添加了 base64 编码以减少序列化的大小。
By using the same interface as json (providing load(s)/dump(s)), you can provide a drop-in replacement for json serialisation.
通过使用与 json 相同的接口(提供 load(s)/dump(s) 函数),可以作为 json 序列化的直接替代品。
This same logic can be extended to add any automatic non-trivial serialisation, such as datetime objects.
可以扩展相同的逻辑以添加任何自动的非平凡序列化,例如日期时间对象。
EDIT: I've written a generic, modular parser that does this and more. https://github.com/someones/jaweson
编辑:我写了一个通用的、模块化的解析器来完成这些以及更多功能。https://github.com/someones/jaweson
My code is as follows:
我的代码如下:
np_json.py
from json import *
import json
import numpy as np
import base64

def to_json(obj):
    if isinstance(obj, (np.ndarray, np.generic)):
        if isinstance(obj, np.ndarray):
            return {
                '__ndarray__': base64.b64encode(obj.tobytes()).decode('ascii'),
                'dtype': obj.dtype.str,
                'shape': obj.shape,
            }
        elif isinstance(obj, (np.bool_, np.number)):
            return {
                '__npgeneric__': base64.b64encode(obj.tobytes()).decode('ascii'),
                'dtype': obj.dtype.str,
            }
    if isinstance(obj, set):
        return {'__set__': list(obj)}
    if isinstance(obj, tuple):
        return {'__tuple__': list(obj)}
    if isinstance(obj, complex):
        return {'__complex__': obj.__repr__()}
    # Let the base class default method raise the TypeError
    raise TypeError('Unable to serialise object of type {}'.format(type(obj)))

def from_json(obj):
    # check for numpy
    if isinstance(obj, dict):
        if '__ndarray__' in obj:
            return np.frombuffer(
                base64.b64decode(obj['__ndarray__']),
                dtype=np.dtype(obj['dtype'])
            ).reshape(obj['shape'])
        if '__npgeneric__' in obj:
            return np.frombuffer(
                base64.b64decode(obj['__npgeneric__']),
                dtype=np.dtype(obj['dtype'])
            )[0]
        if '__set__' in obj:
            return set(obj['__set__'])
        if '__tuple__' in obj:
            return tuple(obj['__tuple__'])
        if '__complex__' in obj:
            return complex(obj['__complex__'])
    return obj

# over-write the load(s)/dump(s) functions
def load(*args, **kwargs):
    kwargs['object_hook'] = from_json
    return json.load(*args, **kwargs)

def loads(*args, **kwargs):
    kwargs['object_hook'] = from_json
    return json.loads(*args, **kwargs)

def dump(*args, **kwargs):
    kwargs['default'] = to_json
    return json.dump(*args, **kwargs)

def dumps(*args, **kwargs):
    kwargs['default'] = to_json
    return json.dumps(*args, **kwargs)
You should be able to then do the following:
然后,您应该能够执行以下操作:
import numpy as np
import np_json as json
np_data = np.zeros((10,10), dtype=np.float32)
new_data = json.loads(json.dumps(np_data))
assert (np_data == new_data).all()
回答by Chris.Wilson
If it needs to be human readable and you know that this is a numpy array:
如果它需要人类可读并且您知道这是一个 numpy 数组:
import numpy as np
import json

a = np.random.normal(size=(50, 120, 150))
a_reconstructed = np.asarray(json.loads(json.dumps(a.tolist())))
print(np.allclose(a, a_reconstructed))
print((a == a_reconstructed).all())
Maybe not the most efficient as the array sizes grow larger, but works for smaller arrays.
随着数组大小的增长,可能不是最有效的,但适用于较小的数组。
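One caveat worth noting (my observation, not from the answer): the tolist round trip preserves values and shape, but not necessarily the exact dtype, which falls back to a platform default:

```python
import json

import numpy as np

a = np.arange(6, dtype=np.int16).reshape(2, 3)
b = np.asarray(json.loads(json.dumps(a.tolist())))

# values and shape survive the round trip...
assert (a == b).all() and b.shape == a.shape
# ...but the dtype falls back to the platform default integer, not int16
print(b.dtype)
```

If the exact dtype matters, store str(a.dtype) alongside the list and pass it back to np.asarray.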
回答by thayne
Msgpack has the best serialization performance: http://www.benfrederickson.com/dont-pickle-your-data/
Msgpack 序列化性能最好:http://www.benfrederickson.com/dont-pickle-your-data/
Use msgpack-numpy. See https://github.com/lebedov/msgpack-numpy
使用 msgpack-numpy。见https://github.com/lebedov/msgpack-numpy
Install it:
安装它:
pip install msgpack-numpy
Then:
然后:
import msgpack
import msgpack_numpy as m
import numpy as np
x = np.random.rand(5)
x_enc = msgpack.packb(x, default=m.encode)
x_rec = msgpack.unpackb(x_enc, object_hook=m.decode)
回答by SemanticBeeng
Try traitschema
https://traitschema.readthedocs.io/en/latest/
试试traitschema
https://traitschema.readthedocs.io/en/latest/
"Create serializable, type-checked schema using traits and Numpy. A typical use case involves saving several Numpy arrays of varying shape and type."
“使用特征和 Numpy 创建可序列化、类型检查的模式。典型用例涉及保存多个不同形状和类型的 Numpy 数组。”