pandas and numpy thread safety
Original question: http://stackoverflow.com/questions/25782912/
Note: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors and release derived work under the same license.
Asked by Emanuele Paolini
I'm using pandas on a web server (apache + mod_wsgi + django) and have a hard-to-reproduce bug which, as I have now discovered, is caused by pandas not being thread-safe.
After a lot of code reduction I finally found a short standalone program which can be used to reproduce the problem. You can see it below.
The point is: contrary to the answer to this question, this example shows that pandas can crash even with very simple operations which do not modify a dataframe. I cannot imagine how this simple code snippet could possibly be unsafe with threads...
The question is about using pandas and numpy in a web server. Is it possible? How am I supposed to fix my code using pandas? (an example of lock usage would be helpful)
Here is the code which causes a Segmentation Fault:
import threading

import numpy as np
import pandas as pd


def let_crash(crash=True):
    t = 0.02 * np.arange(100000)  # works with 10000
    data = pd.DataFrame({'t': t})
    if crash:
        data['t'] * 1.5  # CRASH
    else:
        data['t'].values * 1.5  # THIS IS OK!


if __name__ == '__main__':
    threads = []
    for i in range(100):
        if True:  # asynchronous
            t = threading.Thread(target=let_crash, args=())
            t.daemon = True
            t.start()
            threads.append(t)
        else:  # synchronous
            let_crash()
    # Wait for every worker thread to finish.
    for t in threads:
        t.join()
My environment: python 2.7.3, numpy 1.8.0, pandas 0.13.1
Answered by Jeff
See the caveat in the docs here: http://pandas.pydata.org/pandas-docs/dev/gotchas.html#thread-safety
pandas is not thread-safe because the underlying copy mechanism is not. NumPy, I believe, has an atomic copy operation, but pandas has a layer above this.
Copy is the basis of pandas operations, as most operations generate a new object to return to the user.
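To illustrate the point about copies, here is a minimal sketch (not part of the original answer) showing that an arithmetic operation on a column returns a new object and leaves the frame untouched. The variable names are only illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'t': 0.02 * np.arange(5)})
original = data['t'].copy()  # snapshot to compare against later

# The multiplication builds and returns a new Series.
scaled = data['t'] * 1.5

# The frame still holds the old values...
unchanged = data['t'].equals(original)
# ...and the result lives in its own buffer, separate from the column.
shares_memory = np.shares_memory(scaled.values, data['t'].values)
```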
It is not trivial to fix this, and a fix would come with a pretty heavy performance cost, so it would take a bit of work to deal with this properly.
The easiest approach is simply not to share objects across threads, or to lock them on use.
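Since the question asked for an example of lock usage, here is a minimal sketch of the "lock on use" idea, assuming a single module-level lock that every thread takes before touching pandas. The names `pandas_lock`, `scale_column` and `run_threads` are hypothetical, and this is only one possible arrangement: it serializes all pandas work, so it trades concurrency for safety.

```python
import threading

import numpy as np
import pandas as pd

# Hypothetical module-level lock: a thread must hold it while it uses pandas.
pandas_lock = threading.Lock()


def scale_column(out, factor=1.5):
    t = 0.02 * np.arange(100000)
    with pandas_lock:
        # Only one thread at a time builds the frame and runs the operation.
        data = pd.DataFrame({'t': t})
        scaled = data['t'] * factor
        out.append(float(scaled.iloc[-1]))


def run_threads(n_threads=20):
    out = []  # appended to only while the lock is held
    threads = [threading.Thread(target=scale_column, args=(out,))
               for _ in range(n_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return out
```

In a web application the same pattern would wrap whatever pandas code a view runs. Each thread here also builds its own DataFrame, so no pandas object is shared across threads.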
Answered by Graham Dumpleton
Configure mod_wsgi to run in single-threaded mode.
WSGIDaemonProcess mysite processes=5 threads=1
WSGIProcessGroup mysite
WSGIApplicationGroup %{GLOBAL}
This configuration uses mod_wsgi daemon mode, so that the number of processes and threads can be set independently of whatever Apache MPM you are using.

