
Note: this content is from StackOverflow and is provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute the original authors (not me). Original: http://stackoverflow.com/questions/51562221/


python multiprocessing - OverflowError('cannot serialize a bytes object larger than 4GiB')

Tags: python, pandas, pickle, python-multiprocessing

Asked by Pablo

We are running a script using the multiprocessing library (Python 3.6), where a big pd.DataFrame is passed as an argument to a function:


from multiprocessing import Pool
import time

def my_function(big_df):
    # do something time consuming
    time.sleep(50)

if __name__ == '__main__':
    with Pool(10) as p:
        res = {}
        output = {}
        for id, big_df in some_dict_of_big_dfs.items():  # iterate key/value pairs
            res[id] = p.apply_async(my_function, (big_df,))
        output = {id: res[id].get() for id in res}

The problem is that we are getting an error from the pickle library.


Reason: 'OverflowError('cannot serialize a bytes object larger than 4GiB',)'


We are aware that pickle protocol 4 can serialize larger objects (see the related question and link), but we don't know how to modify the protocol that multiprocessing is using.

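For background, pickle protocols 3 and below store the length of a bytes object in a 32-bit field, which is where the 4 GiB ceiling comes from; protocol 4 added 64-bit length framing. On Python 3.6 the default pickle protocol is 3, which is what multiprocessing's ForkingPickler ends up using. A minimal sketch of the difference (illustrative only, since it needs roughly 5 GiB of free RAM):

import io
import pickle

big = bytes(5 * 1024**3)  # a bytes object just over the 4 GiB limit

buf = io.BytesIO()
pickle.dump(big, buf, protocol=4)  # succeeds: protocol 4 uses 64-bit lengths

# pickle.dump(big, io.BytesIO(), protocol=3)
# -> OverflowError: cannot serialize a bytes object larger than 4 GiB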

Does anybody know what to do? Thanks!


Answered by Pablo

Apparently there is an open issue about this topic, and there are a few related initiatives described in this particular answer. I found a way to change the default pickle protocol used in the multiprocessing library, based on this answer. As was pointed out in the comments, this solution only works with the Linux and OS X multiprocessing lib (i.e. the fork-based start method).


Solution:


First, you create a new, separate module:


pickle4reducer.py


from multiprocessing.reduction import ForkingPickler, AbstractReducer

class ForkingPickler4(ForkingPickler):
    def __init__(self, *args):
        # Force protocol 4; args arrives as a tuple, so convert it to a
        # list before mutating the (file, protocol) positional arguments.
        args = list(args)
        if len(args) > 1:
            args[1] = 4
        else:
            args.append(4)
        super().__init__(*args)

    @classmethod
    def dumps(cls, obj, protocol=4):
        return ForkingPickler.dumps(obj, protocol)


def dump(obj, file, protocol=4):
    ForkingPickler4(file, protocol).dump(obj)


class Pickle4Reducer(AbstractReducer):
    # Hand multiprocessing the protocol-4 pickler in place of the default one.
    ForkingPickler = ForkingPickler4
    register = ForkingPickler4.register
    dump = dump

Then, in your main script, you need to add the following:


import pickle4reducer
import multiprocessing as mp

# Install the protocol-4 reducer on the default (fork) context
# before creating the Pool.
ctx = mp.get_context()
ctx.reducer = pickle4reducer.Pickle4Reducer()

with mp.Pool(4) as p:
    # do something
    pass

That should solve the overflow problem.


But, a word of warning: you might want to read this before doing anything, or you might run into the same error I did:


'i' format requires -2147483648 <= number <= 2147483647


(The reason for this error is well explained in the link above.) Long story short: multiprocessing sends data between all of its processes using the pickle protocol, so if you are already hitting the 4 GiB limit, you should probably consider redefining your functions more as "void" methods rather than input/output methods. All this inbound/outbound data increases RAM usage, is probably inefficient by construction (my case), and it may be better to point every process at the same object rather than create a new copy for each call. A sketch of that idea follows.

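A minimal sketch of that last idea, assuming the fork start method (Linux/OS X): module-level objects created before the Pool starts are inherited by the worker processes, so the tasks only need to pass around small keys instead of the DataFrames themselves (some_dict_of_big_dfs is the placeholder from the question):

from multiprocessing import Pool
import time

# Filled in __main__ *before* the Pool starts; with the fork start
# method the workers inherit it, so nothing large is ever pickled.
BIG_DFS = {}

def my_function(df_id):
    big_df = BIG_DFS[df_id]  # read the inherited copy
    time.sleep(50)           # do something time consuming with big_df

if __name__ == '__main__':
    BIG_DFS.update(some_dict_of_big_dfs)  # the question's placeholder dict
    with Pool(10) as p:
        res = {df_id: p.apply_async(my_function, (df_id,))
               for df_id in BIG_DFS}
        output = {df_id: r.get() for df_id, r in res.items()}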

Hope this helps.
