python 如何将python dict与多处理同步
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/2545961/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
How to synchronize a python dict with multiprocessing
提问by Peter Smit
I am using Python 2.6 and the multiprocessing module for multi-threading. Now I would like to have a synchronized dict (where the only atomic operation I really need is the += operator on a value).
我正在使用 Python 2.6 和用于多线程的多处理模块。现在我想要一个同步的字典(我真正需要的唯一原子操作是值上的 += 运算符)。
Should I wrap the dict with a multiprocessing.sharedctypes.synchronized() call? Or is another way the way to go?
我应该用 multiprocessing.sharedctypes.synchronized() 调用包装 dict 吗?或者是另一种方式?
回答by manifest
Intro
介绍
There seems to be a lot of arm-chair suggestions and no working examples. None of the answers listed here even suggest using multiprocessing and this is quite a bit disappointing and disturbing. As python lovers we should support our built-in libraries, and while parallel processing and synchronization is never a trivial matter, I believe it can be made trivial with proper design. This is becoming extremely important in modern multi-core architectures and cannot be stressed enough! That said, I am far from satisfied with the multiprocessing library, as it is still in its infancy stages with quite a few pitfalls, bugs, and being geared towards functional programming (which I detest). Currently I still prefer the Pyromodule (which is way ahead of its time) over multiprocessing due to multiprocessing's severe limitation in being unable to share newly created objects while the server is running. The "register" class-method of the manager objects will only actually register an object BEFORE the manager (or its server) is started. Enough chatter, more code:
似乎有很多扶手椅建议,但没有工作示例。这里列出的答案甚至都没有建议使用多处理,这有点令人失望和不安。作为 Python 爱好者,我们应该支持我们的内置库,虽然并行处理和同步从来都不是一件小事,但我相信通过适当的设计可以使之变得简单。这在现代多核架构中变得极其重要,怎么强调都不过分!也就是说,我对多处理库并不满意,因为它仍处于起步阶段,有很多陷阱、错误,并且面向函数式编程(我讨厌它)。目前我还是更喜欢火焰兵由于多处理的严重限制无法在服务器运行时共享新创建的对象,因此模块(这远远领先于它的时间)超过了多处理。管理器对象的“注册”类方法只会在管理器(或其服务器)启动之前实际注册一个对象。废话不多说,再来点代码:
Server.py
服务器.py
from multiprocessing.managers import SyncManager
class MyManager(SyncManager):
pass
syncdict = {}
def get_dict():
return syncdict
if __name__ == "__main__":
MyManager.register("syncdict", get_dict)
manager = MyManager(("127.0.0.1", 5000), authkey="password")
manager.start()
raw_input("Press any key to kill server".center(50, "-"))
manager.shutdown()
In the above code example, Server.py makes use of multiprocessing's SyncManager which can supply synchronized shared objects. This code will not work running in the interpreter because the multiprocessing library is quite touchy on how to find the "callable" for each registered object. Running Server.py will start a customized SyncManager that shares the syncdict dictionary for use of multiple processes and can be connected to clients either on the same machine, or if run on an IP address other than loopback, other machines. In this case the server is run on loopback (127.0.0.1) on port 5000. Using the authkey parameter uses secure connections when manipulating syncdict. When any key is pressed the manager is shutdown.
在上面的代码示例中,Server.py 使用了可以提供同步共享对象的多处理的 SyncManager。这段代码在解释器中无法运行,因为多处理库对于如何为每个注册对象找到“可调用对象”非常敏感。运行 Server.py 将启动一个定制的 SyncManager,它共享同步字典以供多个进程使用,并且可以连接到同一台机器上的客户端,或者如果在环回以外的 IP 地址上运行,则可以连接到其他机器。在这种情况下,服务器在端口 5000 上的环回 (127.0.0.1) 上运行。使用 authkey 参数在操作同步字典时使用安全连接。当按下任何键时,管理器将关闭。
Client.py
客户端.py
from multiprocessing.managers import SyncManager
import sys, time
class MyManager(SyncManager):
pass
MyManager.register("syncdict")
if __name__ == "__main__":
manager = MyManager(("127.0.0.1", 5000), authkey="password")
manager.connect()
syncdict = manager.syncdict()
print "dict = %s" % (dir(syncdict))
key = raw_input("Enter key to update: ")
inc = float(raw_input("Enter increment: "))
sleep = float(raw_input("Enter sleep time (sec): "))
try:
#if the key doesn't exist create it
if not syncdict.has_key(key):
syncdict.update([(key, 0)])
#increment key value every sleep seconds
#then print syncdict
while True:
syncdict.update([(key, syncdict.get(key) + inc)])
time.sleep(sleep)
print "%s" % (syncdict)
except KeyboardInterrupt:
print "Killed client"
The client must also create a customized SyncManager, registering "syncdict", this time without passing in a callable to retrieve the shared dict. It then uses the customized SycnManager to connect using the loopback IP address (127.0.0.1) on port 5000 and an authkey establishing a secure connection to the manager started in Server.py. It retrieves the shared dict syncdict by calling the registered callable on the manager. It prompts the user for the following:
客户端还必须创建一个定制的 SyncManager,注册“syncdict”,这次没有传入一个可调用来检索共享 dict。然后,它使用自定义的 SycnManager 使用端口 5000 上的环回 IP 地址 (127.0.0.1) 和在 Server.py 中启动的与管理器建立安全连接的 authkey 进行连接。它通过调用管理器上注册的可调用对象来检索共享字典同步字典。它会提示用户进行以下操作:
- The key in syncdict to operate on
- The amount to increment the value accessed by the key every cycle
- The amount of time to sleep per cycle in seconds
- 要操作的同步键
- 每个周期增加键访问的值的数量
- 每个周期的睡眠时间(以秒为单位)
The client then checks to see if the key exists. If it doesn't it creates the key on the syncdict. The client then enters an "endless" loop where it updates the key's value by the increment, sleeps the amount specified, and prints the syncdict only to repeat this process until a KeyboardInterrupt occurs (Ctrl+C).
然后客户端检查密钥是否存在。如果不是,它会在同步字典上创建密钥。客户端然后进入一个“无限”循环,它通过增量更新键的值,休眠指定的数量,并打印同步字典仅重复此过程,直到发生键盘中断 (Ctrl+C)。
Annoying problems
烦人的问题
- The Manager's register methods MUST be called before the manager is started otherwise you will get exceptions even though a dir call on the Manager will reveal that it indeed does have the method that was registered.
- All manipulations of the dict must be done with methods and not dict assignments (syncdict["blast"] = 2 will fail miserably because of the way multiprocessing shares custom objects)
- Using SyncManager's dict method would alleviate annoying problem #2 except that annoying problem #1 prevents the proxy returned by SyncManager.dict() being registered and shared. (SyncManager.dict() can only be called AFTER the manager is started, and register will only work BEFORE the manager is started so SyncManager.dict() is only useful when doing functional programming and passing the proxy to Processes as an argument like the doc examples do)
- The server AND the client both have to register even though intuitively it would seem like the client would just be able to figure it out after connecting to the manager (Please add this to your wish-list multiprocessing developers)
- 必须在管理器启动之前调用管理器的注册方法,否则即使对管理器的 dir 调用将显示它确实具有已注册的方法,您也会得到异常。
- dict 的所有操作都必须使用方法而不是 dict 赋值来完成(由于 multiprocessing 共享自定义对象的方式,syncdict["blast"] = 2 将悲惨地失败)
- 使用 SyncManager 的 dict 方法将缓解烦人的问题 #2,除了烦人的问题 #1 会阻止 SyncManager.dict() 返回的代理被注册和共享。(SyncManager.dict() 只能在管理器启动后调用,并且 register 只会在管理器启动前工作,因此 SyncManager.dict() 仅在进行函数式编程并将代理作为参数传递给 Processes 时才有用,例如文档示例做)
- 服务器和客户端都必须注册,即使直观上看起来客户端在连接到管理器后才能弄清楚(请将此添加到您的愿望清单多处理开发人员)
Closing
收盘
I hope you enjoyed this quite thorough and slightly time-consuming answer as much as I have. I was having a great deal of trouble getting straight in my mind why I was struggling so much with the multiprocessing module where Pyro makes it a breeze and now thanks to this answer I have hit the nail on the head. I hope this is useful to the python community on how to improve the multiprocessing module as I do believe it has a great deal of promise but in its infancy falls short of what is possible. Despite the annoying problems described I think this is still quite a viable alternative and is pretty simple. You could also use SyncManager.dict() and pass it to Processes as an argument the way the docs show and it would probably be an even simpler solution depending on your requirements it just feels unnatural to me.
我希望你和我一样喜欢这个非常彻底且有点耗时的答案。我在想清楚为什么我在 Pyro 让它变得轻而易举的多处理模块上如此挣扎时遇到了很多麻烦,现在多亏了这个答案,我才一针见血。我希望这对 python 社区如何改进多处理模块有用,因为我相信它有很大的希望,但在它的初期还没有达到可能。尽管描述了烦人的问题,但我认为这仍然是一个非常可行的替代方案,而且非常简单。您还可以使用 SyncManager.dict() 并将其作为文档显示的方式传递给 Processes ,它可能是一个更简单的解决方案,具体取决于您的要求,这对我来说是不自然的。
回答by Alex Martelli
I would dedicate a separate process to maintaining the "shared dict": just use e.g. xmlrpclibto make that tiny amount of code available to the other processes, exposing via xmlrpclib e.g. a function taking key, increment
to perform the increment and one taking just the key
and returning the value, with semantic details (is there a default value for missing keys, etc, etc) depending on your app's needs.
我会用一个单独的进程来维护“共享字典”:只需使用例如xmlrpclib使少量代码可供其他进程使用,通过 xmlrpclib 公开,例如一个key, increment
用于执行增量的函数,一个只使用key
并返回值,带有语义细节(是否有缺少键的默认值等),具体取决于您的应用程序的需要。
Then you can use any approach you like to implement the shared-dict dedicated process: all the way from a single-threaded server with a simple dict in memory, to a simple sqlite DB, etc, etc. I suggest you start with code "as simple as you can get away with" (depending on whether you need a persistentshared dict, or persistence is not necessary to you), then measure and optimize as and if needed.
然后,您可以使用任何您喜欢的方法来实现共享字典专用进程:从在内存中具有简单字典的单线程服务器到简单的 sqlite DB 等等,我建议您从代码开始“尽可能简单”(取决于您是否需要持久共享字典,或者持久性对您来说不是必需的),然后根据需要进行测量和优化。
回答by Frank V
In response to an appropriate solution to the concurrent-write issue. I did very quick research and found that this articleis suggesting a lock/semaphore solution. (http://effbot.org/zone/thread-synchronization.htm)
响应并发写入问题的适当解决方案。我做了非常快速的研究,发现这篇文章提出了一个锁/信号量解决方案。( http://effbot.org/zone/thread-synchronization.htm)
While the example isn't specificity on a dictionary, I'm pretty sure you could code a class-based wrapper object to help you work with dictionaries based on this idea.
虽然该示例不是字典的特殊性,但我很确定您可以编写一个基于类的包装对象来帮助您使用基于这个想法的字典。
If I had a requirement to implement something like this in a thread safe manner, I'd probably use the Python Semaphore solution. (Assuming my earlier merge technique wouldn't work.) I believe that semaphores generally slow down thread efficiencies due to their blocking nature.
如果我需要以线程安全的方式实现类似的东西,我可能会使用 Python Semaphore 解决方案。(假设我之前的合并技术不起作用。)我相信信号量由于其阻塞性质通常会降低线程效率。
From the site:
从网站:
A semaphore is a more advanced lock mechanism. A semaphore has an internal counter rather than a lock flag, and it only blocks if more than a given number of threads have attempted to hold the semaphore. Depending on how the semaphore is initialized, this allows multiple threads to access the same code section simultaneously.
信号量是一种更高级的锁定机制。信号量有一个内部计数器而不是锁定标志,并且只有在超过给定数量的线程试图持有信号量时它才会阻塞。根据信号量的初始化方式,这允许多个线程同时访问相同的代码段。
semaphore = threading.BoundedSemaphore()
semaphore.acquire() # decrements the counter
... access the shared resource; work with dictionary, add item or whatever.
semaphore.release() # increments the counter
回答by Frank V
Is there a reason that the dictionary needs to be shared in the first place? Could you have each thread maintain their own instance of a dictionary and either merge at the end of the thread processing or periodically use a call-back to merge copies of the individual thread dictionaries together?
有没有理由首先需要共享字典?您是否可以让每个线程维护自己的字典实例,并在线程处理结束时合并,或者定期使用回调将各个线程字典的副本合并在一起?
I don't know exactly what you are doing, so keep in my that my written plan may not work verbatim. What I'm suggesting is more of a high-level design idea.
我不知道你在做什么,所以请记住我的书面计划可能无法逐字执行。我的建议更像是一种高级设计理念。