pandas and numpy thread safety

Disclaimer: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/25782912/

Time: 2020-09-13 22:27:21  Source: igfitidea

Tags: python, multithreading, numpy, pandas

Asked by Emanuele Paolini

I'm using pandas on a web server (apache + mod_wsgi + django) and have a hard-to-reproduce bug which, as I have now discovered, is caused by pandas not being thread-safe.

After a lot of code reduction I finally found a short standalone program which can be used to reproduce the problem. You can see it below.

The point is: contrary to the answer to this question, this example shows that pandas can crash even with very simple operations which do not modify a dataframe. I can't imagine how this simple code snippet could possibly be unsafe with threads...

The question is about using pandas and numpy in a web server. Is it possible? How am I supposed to fix my code using pandas? (an example of lock usage would be helpful)

Here is the code which causes a Segmentation Fault:

import threading
import pandas as pd
import numpy as np

def let_crash(crash=True):
    t = 0.02 * np.arange(100000)  # OK with 10000
    data = pd.DataFrame({'t': t})
    if crash:
        data['t'] * 1.5  # CRASH
    else:
        data['t'].values * 1.5  # THIS IS OK!

if __name__ == '__main__':
    threads = []
    for i in range(100):
        if True:  # asynchronous
            t = threading.Thread(target=let_crash, args=())
            t.daemon = True
            t.start()
            threads.append(t)
        else:  # synchronous
            let_crash()
    for t in threads:
        t.join()

My environment: Python 2.7.3, NumPy 1.8.0, pandas 0.13.1

Answered by Jeff

See the caveat in the docs here: http://pandas.pydata.org/pandas-docs/dev/gotchas.html#thread-safety

pandas is not thread-safe because its underlying copy mechanism is not. I believe NumPy has an atomic copy operation, but pandas adds a layer on top of it.

Copying is the basis of pandas operations, since most operations generate a new object to return to the user.
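
To illustrate the point about new objects, here is a minimal sketch, assuming a reasonably recent pandas on Python 3 (not the 0.13.1 setup from the question). The variable names are only for illustration.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])

# Arithmetic on a Series returns a new Series; the original is left untouched.
scaled = s * 1.5

print(scaled is s)       # False
print(s.tolist())        # [1.0, 2.0, 3.0]
print(scaled.tolist())   # [1.5, 3.0, 4.5]
```

Each such operation has to build and populate a fresh object, which is the machinery the answer describes as not being thread-safe.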

Fixing this is not trivial and would carry a fairly heavy performance cost, so handling it properly would take some work.

The easiest approach is either not to share objects across threads, or to lock them while they are in use.
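
Since the question asks for an example of lock usage, here is a minimal sketch of the locking approach, written for Python 3. The names `pandas_lock`, `scale_column`, and `run_workers` are hypothetical and not from the original posts, and it has not been verified that this avoids the crash on the pandas 0.13.1 setup described in the question.

```python
import threading

import numpy as np
import pandas as pd

# Hypothetical module-level lock: every thread must hold it while touching pandas.
pandas_lock = threading.Lock()


def scale_column(results, index):
    """Build a DataFrame and scale one column, holding the lock throughout."""
    t = 0.02 * np.arange(100000)
    with pandas_lock:
        data = pd.DataFrame({'t': t})
        scaled = data['t'] * 1.5
        # Store a plain Python float so no pandas object leaves the locked region.
        results[index] = float(scaled.iloc[-1])


def run_workers(n_threads=8):
    """Start n_threads workers and return one result per worker."""
    results = [None] * n_threads
    threads = [
        threading.Thread(target=scale_column, args=(results, i))
        for i in range(n_threads)
    ]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results


if __name__ == '__main__':
    print(run_workers())
```

The lock serializes all pandas work, so the threads no longer run that part concurrently. That is the trade-off of this approach: it favors safety over parallel speed.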

Answered by Graham Dumpleton

Configure mod_wsgi to run in single-threaded mode.

WSGIDaemonProcess mysite processes=5 threads=1
WSGIProcessGroup mysite
WSGIApplicationGroup %{GLOBAL}

This example uses mod_wsgi daemon mode, so the number of processes and threads can be set independently of whatever Apache MPM you are using.
