pandas python multiprocessing - OverflowError('cannot serialize a bytes object larger than 4GiB')
Original: http://stackoverflow.com/questions/51562221/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
python multiprocessing - OverflowError('cannot serialize a bytes object larger than 4GiB')
Asked by Pablo
We are running a script using the multiprocessing library (Python 3.6), where a big pd.DataFrame is passed as an argument to a function:
from multiprocessing import Pool
import time

def my_function(big_df):
    # do something time consuming
    time.sleep(50)

if __name__ == '__main__':
    with Pool(10) as p:
        res = {}
        output = {}
        for id, big_df in some_dict_of_big_dfs.items():
            res[id] = p.apply_async(my_function, (big_df,))
        output = {id: res[id].get() for id in res}
The problem is that we are getting an error from the pickle library.
Reason: 'OverflowError('cannot serialize a bytes object larger than 4GiB',)'
We are aware that pickle v4 can serialize larger objects (question related, link), but we don't know how to modify the protocol that multiprocessing is using.
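For context, the 4 GiB cap comes from the protocol, not from pickle itself: protocol 3 (the default up to Python 3.7, which multiprocessing picks up) stores the length of a bytes object in a 32-bit field, while protocol 4 (the default from Python 3.8 onwards) uses 64-bit framing. A minimal sketch of the difference (the ~5 GiB allocation is only illustrative and needs enough free RAM):

import pickle

big = b'x' * (5 * 1024 ** 3)   # a ~5 GiB bytes object

# Protocol 3 uses a 32-bit length field, so this raises
# OverflowError: cannot serialize a bytes object larger than 4 GiB
# pickle.dumps(big, protocol=3)

# Protocol 4 adds 64-bit opcodes (e.g. BINBYTES8) and succeeds:
data = pickle.dumps(big, protocol=4)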
Does anybody know what to do? Thanks!
Answered by Pablo
Apparently there is an open issue about this topic, and there are a few related initiatives described in this particular answer. I found a way to change the default pickle protocol that is used in the multiprocessing library, based on this answer. As was pointed out in the comments, this solution only works with the Linux and macOS (OS X) multiprocessing lib.
Solution:
You first create a new, separate module:
pickle4reducer.py
from multiprocessing.reduction import ForkingPickler, AbstractReducer

class ForkingPickler4(ForkingPickler):
    def __init__(self, *args):
        # force pickle protocol 4 regardless of what the caller passed
        args = list(args)
        if len(args) > 1:
            args[1] = 4
        else:
            args.append(4)
        super().__init__(*args)

    @classmethod
    def dumps(cls, obj, protocol=4):
        return ForkingPickler.dumps(obj, protocol)

def dump(obj, file, protocol=4):
    ForkingPickler4(file, protocol).dump(obj)

class Pickle4Reducer(AbstractReducer):
    ForkingPickler = ForkingPickler4
    register = ForkingPickler4.register
    dump = dump
And then, in your main script you need to add the following:
import pickle4reducer
import multiprocessing as mp

ctx = mp.get_context()
ctx.reducer = pickle4reducer.Pickle4Reducer()

with mp.Pool(4) as p:
    # do something
    pass
That will probably solve the problem of the overflow.
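For illustration, putting the reducer together with the code from the question might look like the sketch below (some_dict_of_big_dfs remains the placeholder from the question):

import multiprocessing as mp
import time

import pickle4reducer

def my_function(big_df):
    # do something time consuming
    time.sleep(50)

if __name__ == '__main__':
    # install the protocol-4 reducer before creating the pool
    ctx = mp.get_context()
    ctx.reducer = pickle4reducer.Pickle4Reducer()

    with mp.Pool(10) as p:
        res = {id: p.apply_async(my_function, (big_df,))
               for id, big_df in some_dict_of_big_dfs.items()}
        output = {id: r.get() for id, r in res.items()}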
But, warning, you might consider reading this before doing anything, or you might reach the same error as me:
'i' format requires -2147483648 <= number <= 2147483647
(the reason for this error is well explained in the link above). Long story short: multiprocessing sends data between its processes using the pickle protocol, so if you are already hitting the 4 GiB limit, you might consider redefining your functions as "void" methods rather than input/output methods. All this inbound/outbound data increases RAM usage, is probably inefficient by construction (my case), and it may be better to point all processes at the same object rather than create a new copy for each call.
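On Linux and macOS, one way to do that is to rely on fork semantics: load the big objects once in the parent and let each worker read them as a module-level global, so only a small key crosses the process boundary. A sketch under that assumption (read-only access, 'fork' start method; row_count is a hypothetical stand-in for the real work):

import multiprocessing as mp
import pandas as pd

# Loaded once in the parent; with the 'fork' start method every worker
# inherits a copy-on-write view instead of receiving a pickled copy.
BIG_DFS = {}

def row_count(df_id):
    big_df = BIG_DFS[df_id]   # read the inherited object, nothing is pickled
    return len(big_df)        # placeholder for the real, time-consuming work

if __name__ == '__main__':
    BIG_DFS['a'] = pd.DataFrame({'x': range(10)})  # stand-in for a huge frame
    ctx = mp.get_context('fork')                   # not available on Windows
    with ctx.Pool(4) as p:
        # only the small string keys are sent to the workers
        output = {df_id: p.apply_async(row_count, (df_id,)).get()
                  for df_id in BIG_DFS}
    print(output)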
Hope this helps.