Python 3 - Can pickle handle byte objects larger than 4GB?

Question

Asked by RandomBits

Based on this comment and the referenced documentation, Pickle 4.0+ from Python 3.4+ should be able to pickle byte objects larger than 4GB.

However, using Python 3.4.3 or Python 3.5.0b2 on Mac OS X 10.10.4, I get an error when I try to pickle a large byte array:

>>> import pickle
>>> x = bytearray(8 * 1000 * 1000 * 1000)
>>> fp = open("x.dat", "wb")
>>> pickle.dump(x, fp, protocol = 4)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 22] Invalid argument

Is there a bug in my code or am I misunderstanding the documentation?

Answer 1

Answered by Martin Thoma

To sum up what was answered in the comments:

Yes, Python can pickle byte objects bigger than 4GB. The observed error is caused by a bug in the implementation (see Issue 24658).

Answer 2

Answered by lunguini

Here is a simple workaround for issue 24658. Use pickle.loads or pickle.dumps and break the bytes object into chunks of size 2**31 - 1 to get it in or out of the file.

import pickle
import os.path

file_path = "pkl.pkl"
n_bytes = 2**31
max_bytes = 2**31 - 1  # largest chunk passed to a single read/write call
data = bytearray(n_bytes)

# Write: serialize in memory, then write the result in chunks.
bytes_out = pickle.dumps(data)
with open(file_path, 'wb') as f_out:
    for idx in range(0, len(bytes_out), max_bytes):
        f_out.write(bytes_out[idx:idx+max_bytes])

# Read: collect the chunks, then deserialize from memory.
bytes_in = bytearray(0)
input_size = os.path.getsize(file_path)
with open(file_path, 'rb') as f_in:
    for _ in range(0, input_size, max_bytes):
        bytes_in += f_in.read(max_bytes)
data2 = pickle.loads(bytes_in)

assert data == data2

Answer 3

Answered by markhor

Reading a file in 2GB chunks takes twice as much memory as needed if bytes concatenation is performed. My approach to loading pickles is based on bytearray:

class MacOSFile(object):
    def __init__(self, f):
        self.f = f

    def __getattr__(self, item):
        return getattr(self.f, item)

    def read(self, n):
        if n >= (1 << 31):
            buffer = bytearray(n)
            pos = 0
            while pos < n:
                # Parenthesized: 1 << 31 - 1 would parse as 1 << 30.
                size = min(n - pos, (1 << 31) - 1)
                chunk = self.f.read(size)
                buffer[pos:pos + size] = chunk
                pos += size
            return buffer
        return self.f.read(n)

Usage:

with open("/path", "rb") as fin:
    obj = pickle.load(MacOSFile(fin))
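
The memory argument above can also be sketched with readinto(), which fills a preallocated bytearray in place instead of creating a temporary bytes object per chunk. This is a minimal sketch, not part of the original answer: the helper name read_in_chunks and its max_chunk parameter are hypothetical, and the demonstration uses an in-memory file with a tiny chunk size rather than a multi-gigabyte file.

```python
import io
import pickle


def read_in_chunks(f, n, max_chunk=(1 << 31) - 1):
    """Read up to n bytes from f using reads no larger than max_chunk.

    Hypothetical helper; assumes f supports readinto().
    """
    buffer = bytearray(n)
    view = memoryview(buffer)
    pos = 0
    while pos < n:
        size = min(n - pos, max_chunk)
        # readinto() writes directly into the slice of the buffer.
        got = f.readinto(view[pos:pos + size])
        if not got:  # end of file before n bytes arrived
            break
        pos += got
    return buffer[:pos] if pos < n else buffer


# Small demonstration with an in-memory file and a tiny chunk size.
payload = pickle.dumps(list(range(1000)), protocol=4)
source = io.BytesIO(payload)
data = read_in_chunks(source, len(payload), max_chunk=64)
restored = pickle.loads(data)
```

Advancing by the number of bytes actually read, rather than by the requested size, keeps the buffer consistent if the file turns out to be shorter than expected.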

Answer 4

Answered by Sam Cohan

Here is the full workaround, though it seems pickle.load no longer tries to read a huge file in one go (I am on Python 3.5.2), so strictly speaking only pickle.dump needs this to work properly.

import pickle

class MacOSFile(object):

    def __init__(self, f):
        self.f = f

    def __getattr__(self, item):
        return getattr(self.f, item)

    def read(self, n):
        # print("reading total_bytes=%s" % n, flush=True)
        if n >= (1 << 31):
            buffer = bytearray(n)
            idx = 0
            while idx < n:
                # Parenthesized: 1 << 31 - 1 would parse as 1 << 30.
                batch_size = min(n - idx, (1 << 31) - 1)
                # print("reading bytes [%s,%s)..." % (idx, idx + batch_size), end="", flush=True)
                buffer[idx:idx + batch_size] = self.f.read(batch_size)
                # print("done.", flush=True)
                idx += batch_size
            return buffer
        return self.f.read(n)

    def write(self, buffer):
        n = len(buffer)
        print("writing total_bytes=%s..." % n, flush=True)
        idx = 0
        while idx < n:
            # Parenthesized: 1 << 31 - 1 would parse as 1 << 30.
            batch_size = min(n - idx, (1 << 31) - 1)
            print("writing bytes [%s, %s)... " % (idx, idx + batch_size), end="", flush=True)
            self.f.write(buffer[idx:idx + batch_size])
            print("done.", flush=True)
            idx += batch_size


def pickle_dump(obj, file_path):
    with open(file_path, "wb") as f:
        return pickle.dump(obj, MacOSFile(f), protocol=pickle.HIGHEST_PROTOCOL)


def pickle_load(file_path):
    with open(file_path, "rb") as f:
        return pickle.load(MacOSFile(f))
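
The write side of this idea can be checked without a multi-gigabyte object by making the chunk size configurable. The following is a minimal sketch under stated assumptions: the class name ChunkedWriter and its max_chunk parameter are hypothetical and not part of the answer above, and memoryview slices are used so that each chunk is passed on without copying.

```python
import io
import pickle


class ChunkedWriter:
    """Hypothetical wrapper that forwards write() calls in bounded pieces."""

    def __init__(self, f, max_chunk=(1 << 31) - 1):
        self.f = f
        self.max_chunk = max_chunk

    def write(self, data):
        # View the data as plain bytes; slicing a memoryview does not copy.
        view = memoryview(data).cast("B")
        total = view.nbytes
        for start in range(0, total, self.max_chunk):
            self.f.write(view[start:start + self.max_chunk])
        return total


# Small demonstration with an in-memory file and a tiny chunk size.
sink = io.BytesIO()
obj = {"values": list(range(1000))}
pickle.dump(obj, ChunkedWriter(sink, max_chunk=64), protocol=4)
restored = pickle.loads(sink.getvalue())
```

With the default max_chunk, no single call to the underlying write() exceeds 2**31 - 1 bytes, which is the limit the chunked workarounds in this thread rely on.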

Answer 5

Answered by raditya gumay

I also ran into this issue. To solve it, I split the work into several iterations. Say, in this case, I have 50,000 documents for which I have to calculate tf-idf and do kNN classification. When I run it and iterate over all 50,000 directly, it gives me "that error". So, to solve this problem, I chunk it.

tokenized_documents = self.load_tokenized_preprocessing_documents()
idf = self.load_idf_41227()
doc_length = len(documents)
for iteration in range(0, 9):
    tfidf_documents = []
    for index in range(iteration, 4000):
        doc_tfidf = []
        for term in idf.keys():
            tf = self.term_frequency(term, tokenized_documents[index])
            doc_tfidf.append(tf * idf[term])
        doc = documents[index]
        tfidf = [doc_tfidf, doc[0], doc[1]]
        tfidf_documents.append(tfidf)
        print("{} from {} document {}".format(index, doc_length, doc[0]))

    self.save_tfidf_41227(tfidf_documents, iteration)

Answer 6

Answered by ihopethiswillfi

Had the same issue and fixed it by upgrading to Python 3.6.8.

This seems to be the PR that did it: https://github.com/python/cpython/pull/9937

Answer 7

Answered by Yohan Obadia

You can specify the protocol for the dump. If you do pickle.dump(obj, file, protocol=4) it should work.
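
As a small illustration of what the protocol argument changes, the sketch below pickles a short bytes object with protocol 4 and inspects the header of the result. In Python 3.4 through 3.7 the default protocol is 3, so protocol 4 has to be requested explicitly there. Note that the snippet in the question already passes protocol=4, so on an affected macOS build this setting alone may not avoid the OSError.

```python
import pickle

payload = pickle.dumps(b"example", protocol=4)

# Pickles written with protocol 2 or later start with the PROTO opcode
# (0x80) followed by the protocol number.
header = payload[:2]

# The protocol used when none is given, and the newest one available,
# depend on the running Python version.
print(pickle.DEFAULT_PROTOCOL, pickle.HIGHEST_PROTOCOL)
```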
