Python 3 - Can pickle handle byte objects larger than 4GB?
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, include the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
Original question: http://stackoverflow.com/questions/31468117/
Asked by RandomBits
Based on this comment and the referenced documentation, pickle protocol 4 in Python 3.4+ should be able to pickle byte objects larger than 4 GB.
However, using python 3.4.3 or python 3.5.0b2 on Mac OS X 10.10.4, I get an error when I try to pickle a large byte array:
>>> import pickle
>>> x = bytearray(8 * 1000 * 1000 * 1000)
>>> fp = open("x.dat", "wb")
>>> pickle.dump(x, fp, protocol = 4)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 22] Invalid argument
Is there a bug in my code or am I misunderstanding the documentation?
Answered by Martin Thoma
To sum up what was answered in the comments:
Yes, Python can pickle byte objects bigger than 4GB. The observed error is caused by a bug in the implementation (see Issue 24658).
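For reference, a minimal sketch of what this means in practice, assuming an interpreter in which Issue 24658 has already been fixed (recent Python 3 releases); the file name and size below are only illustrative:

import pickle

# Assumes a Python build where Issue 24658 is fixed; on an affected
# macOS build this raises the same OSError as in the question.
data = bytearray(8 * 1000 * 1000 * 1000)   # ~8 GB of zero bytes

with open("x.dat", "wb") as fp:
    # Protocol 4 (Python 3.4+) adds 64-bit framing for large objects.
    pickle.dump(data, fp, protocol=4)

with open("x.dat", "rb") as fp:
    restored = pickle.load(fp)

assert restored == data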
Answered by lunguini
Here is a simple workaround for issue 24658: use pickle.loads or pickle.dumps and break the bytes object into chunks of size 2**31 - 1 to get it into or out of the file.
import pickle
import os.path

file_path = "pkl.pkl"
n_bytes = 2**31
max_bytes = 2**31 - 1
data = bytearray(n_bytes)

## write
bytes_out = pickle.dumps(data)
with open(file_path, 'wb') as f_out:
    for idx in range(0, len(bytes_out), max_bytes):
        f_out.write(bytes_out[idx:idx+max_bytes])

## read
bytes_in = bytearray(0)
input_size = os.path.getsize(file_path)
with open(file_path, 'rb') as f_in:
    for _ in range(0, input_size, max_bytes):
        bytes_in += f_in.read(max_bytes)
data2 = pickle.loads(bytes_in)

assert(data == data2)
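The chunk size of 2**31 - 1 matters because the macOS I/O calls underlying Python's file reads and writes are what reportedly fail once a single transfer reaches 2 GB; keeping every f_out.write and f_in.read below that limit sidesteps the bug while still producing one ordinary pickle file. Note that pickle.dumps first builds the whole serialized byte string in memory, so this approach needs roughly twice the memory of the object being pickled; the next answer avoids that on the read side.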
Answered by markhor
Reading a file in 2GB chunks takes twice as much memory as needed if bytes concatenation is performed. My approach to loading pickles is based on bytearray:
class MacOSFile(object):

    def __init__(self, f):
        self.f = f

    def __getattr__(self, item):
        return getattr(self.f, item)

    def read(self, n):
        if n >= (1 << 31):
            buffer = bytearray(n)
            pos = 0
            while pos < n:
                # (1 << 31) - 1, i.e. 2**31 - 1; without the parentheses
                # 1 << 31 - 1 would evaluate to 1 << 30.
                size = min(n - pos, (1 << 31) - 1)
                chunk = self.f.read(size)
                buffer[pos:pos + size] = chunk
                pos += size
            return buffer
        return self.f.read(n)
Usage:
with open("/path", "rb") as fin:
obj = pickle.load(MacOSFile(fin))
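A note on the design: __getattr__ delegates every attribute it does not define to the wrapped file object, so MacOSFile can be handed to pickle.load wherever a real file handle is expected. Only read is overridden here, so this version helps with loading; dumping still needs a chunked write, which the next answer adds.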
Answered by Sam Cohan
Here is the full workaround, though it seems pickle.load no longer hits the problem for huge files (I am on Python 3.5.2), so strictly speaking only the dump side needs this to work properly.
import pickle


class MacOSFile(object):

    def __init__(self, f):
        self.f = f

    def __getattr__(self, item):
        return getattr(self.f, item)

    def read(self, n):
        # print("reading total_bytes=%s" % n, flush=True)
        if n >= (1 << 31):
            buffer = bytearray(n)
            idx = 0
            while idx < n:
                # (1 << 31) - 1, i.e. 2**31 - 1 bytes per batch
                batch_size = min(n - idx, (1 << 31) - 1)
                # print("reading bytes [%s,%s)..." % (idx, idx + batch_size), end="", flush=True)
                buffer[idx:idx + batch_size] = self.f.read(batch_size)
                # print("done.", flush=True)
                idx += batch_size
            return buffer
        return self.f.read(n)

    def write(self, buffer):
        n = len(buffer)
        print("writing total_bytes=%s..." % n, flush=True)
        idx = 0
        while idx < n:
            batch_size = min(n - idx, (1 << 31) - 1)
            print("writing bytes [%s, %s)... " % (idx, idx + batch_size), end="", flush=True)
            self.f.write(buffer[idx:idx + batch_size])
            print("done.", flush=True)
            idx += batch_size


def pickle_dump(obj, file_path):
    with open(file_path, "wb") as f:
        return pickle.dump(obj, MacOSFile(f), protocol=pickle.HIGHEST_PROTOCOL)


def pickle_load(file_path):
    with open(file_path, "rb") as f:
        return pickle.load(MacOSFile(f))
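For illustration, a short usage sketch of the two helpers above; the file name and payload size are made up:

# Hypothetical example data and file name, just to show the call pattern.
payload = {"blob": bytearray(3 * 1024 * 1024 * 1024)}   # ~3 GB entry

pickle_dump(payload, "big_object.pkl")    # chunked writes via MacOSFile.write
restored = pickle_load("big_object.pkl")  # chunked reads via MacOSFile.read

assert restored["blob"] == payload["blob"]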
Answered by raditya gumay
I also ran into this issue. To get around it, I split the work into several iterations. Say I have 50,000 documents for which I have to compute tf-idf and do kNN classification: when I run over all 50,000 directly, I get that error, so I process them in chunks.
tokenized_documents = self.load_tokenized_preprocessing_documents()
idf = self.load_idf_41227()
doc_length = len(documents)

for iteration in range(0, 9):
    tfidf_documents = []
    for index in range(iteration, 4000):
        doc_tfidf = []
        for term in idf.keys():
            tf = self.term_frequency(term, tokenized_documents[index])
            doc_tfidf.append(tf * idf[term])
        doc = documents[index]
        tfidf = [doc_tfidf, doc[0], doc[1]]
        tfidf_documents.append(tfidf)
        print("{} from {} document {}".format(index, doc_length, doc[0]))
    self.save_tfidf_41227(tfidf_documents, iteration)
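The same idea in a more generic, self-contained form (all names and sizes here are illustrative, not from the original answer): process the data in fixed-size chunks and pickle each chunk to its own file, so no single dump has to serialize everything at once.

import pickle

def pickle_in_chunks(items, chunk_size, prefix="chunk"):
    # Pickle `items` as several smaller files instead of one huge one,
    # so no single dump call has to serialize everything at once.
    for i in range(0, len(items), chunk_size):
        with open("{}_{:04d}.pkl".format(prefix, i // chunk_size), "wb") as f:
            pickle.dump(items[i:i + chunk_size], f, protocol=4)

# e.g. pickle_in_chunks(tfidf_documents, 4000)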
Answered by ihopethiswillfi
Had the same issue and fixed it by upgrading to Python 3.6.8.
This seems to be the PR that did it: https://github.com/python/cpython/pull/9937
Answered by Yohan Obadia
You can specify the protocol for the dump. If you do pickle.dump(obj, file, protocol=4) it should work.