python pandas dataframe thread safe?

Source: Stack Overflow, http://stackoverflow.com/questions/13592618/ (page dated 2020-09-13 20:30:38). This content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors.

Tags: python, thread-safety, pandas

Asked by Andrew

I am using multiple threads to access and delete data in my pandas DataFrame. Because of this, I am wondering: is a pandas DataFrame thread safe?

Accepted answer by Wes McKinney

The data in the underlying ndarrays can be accessed in a threadsafe manner, and modified at your own risk. Deleting data would be difficult as changing the size of a DataFrame usually requires creating a new object. I'd like to change this at some point in the future.
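As a small illustration of the point about deletion, here is a minimal sketch (the frame contents and the names `df` and `smaller` are made up for this example). Dropping a row without `inplace=True` produces a new DataFrame, so any thread still holding the old object keeps seeing the old rows.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# drop() without inplace=True returns a new DataFrame and leaves df untouched.
smaller = df.drop(index=[0])

print(smaller is df)          # False
print(len(df), len(smaller))  # 3 2
```

This says nothing about whether concurrent access is safe; it only shows why "deleting" usually means publishing a new object to the other threads.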

Answer by user48956

No, pandas is not thread safe. And it's not thread safe in surprising ways.

  • Can I delete from a pandas dataframe while another thread is using it?

Fuggedaboutit! Nope. And generally no. Not even for GIL-locked Python data structures.

  • Can I read from a pandas object while someone else is writing to it?
  • Can I copy a pandas dataframe in my thread, and work on the copy?

Definitely not. There's a long-standing open issue: https://github.com/pandas-dev/pandas/issues/2728

Actually I think this is pretty reasonable (i.e. expected) behavior. I wouldn't expect to be able to simultaneously write to and read from, or copy, any data structure unless either: i) it had been designed for concurrency, or ii) I have an exclusive lock on that object and all the view objects derived from it (.loc and .iloc are views, and pandas has many others).
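Option ii) can be sketched roughly as follows. `LockedFrame` and its method names are hypothetical helpers invented for this example, not part of pandas; the idea is only that every read, update, and copy of the shared frame goes through one `threading.Lock`.

```python
import threading

import pandas as pd


class LockedFrame:
    """Hypothetical helper: serializes every access to one DataFrame."""

    def __init__(self, df):
        self._df = df
        self._lock = threading.Lock()

    def read(self, func):
        # Run func(df) while holding the lock and return its result.
        with self._lock:
            return func(self._df)

    def replace(self, func):
        # Build a new frame from the old one and swap it in under the lock.
        with self._lock:
            self._df = func(self._df)

    def snapshot(self):
        # Deep copy taken under the lock; the caller owns the result.
        with self._lock:
            return self._df.copy(deep=True)


shared = LockedFrame(pd.DataFrame({"a": range(100)}))


def worker():
    for _ in range(50):
        shared.replace(lambda df: df.assign(a=df["a"] + 1))
        shared.read(lambda df: int(df["a"].sum()))


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

final = shared.snapshot()
print(int(final["a"].iloc[0]))  # 200: 4 threads x 50 increments, none lost
```

The function passed to `read` should return plain values or copies. Returning a view such as a `.loc` slice would let it escape the lock, which is exactly the case the paragraph above warns about.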

  • Can I read from a pandas object while no-one else is writing to it?

For almost all data structures in Python, the answer is yes. For pandas, no. And it seems it's not a design goal at present.

Typically, you can perform 'reading' operations on objects if no-one is performing mutating operations. You have to be a little cautious though. Some data structures, including pandas, perform memoization to cache expensive operations that are otherwise functionally pure. It's generally easy to implement lockless memoization in Python:

@property
def thing(self):
    # Compute once on first access, then reuse the cached value.
    if self._thing is MISSING:
        self._thing = self._calc_thing()
    return self._thing

... it's simple and safe (assuming assignment is safely atomic -- which has not always been the case for every language, but is in CPython, unless you override __setattr__).
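A self-contained version of the pattern above might look like this. The `Report` class, the `MISSING` sentinel, and the `calc_calls` counter are all made up for illustration. Several threads read the lazily computed property with no lock; the expensive step may run more than once, but every run stores the same value, so every reader gets the same answer.

```python
import threading

MISSING = object()  # sentinel meaning "not computed yet"


class Report:
    """Hypothetical class with one lazily computed attribute."""

    def __init__(self, values):
        self._values = list(values)
        self._total = MISSING
        self.calc_calls = 0  # rough counter, itself not synchronized

    def _calc_total(self):
        self.calc_calls += 1
        return sum(self._values)

    @property
    def total(self):
        # No lock: two threads may both compute, but both store the same value.
        if self._total is MISSING:
            self._total = self._calc_total()
        return self._total


report = Report(range(1000))
results = []


def reader():
    results.append(report.total)


threads = [threading.Thread(target=reader) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(set(results))  # {499500}
```

The trade-off is duplicated work on a race in exchange for no locking. This only holds when the computation is functionally pure, which is the assumption the answer states.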

In pandas, Series and DataFrame indexes are computed lazily, on first use. I hope (but I do not see guarantees in the docs) that they're done in a similarly safe way.

For all libraries (including pandas) I would hope that all types of read-only operations (or more specifically, 'functionally pure' operations) would be thread safe if no-one is performing mutating operations. I think this is a 'reasonable', easily achievable, common lower bar for thread safety.

For pandas, however, you cannot assume this. Even if you can guarantee no-one is performing 'functionally impure' operations on your object (e.g. writing to cells, adding/deleting columns), pandas is not thread safe.

Here's a recent example: https://github.com/pandas-dev/pandas/issues/25870 (it's marked as a duplicate of the .copy-not-threadsafe issue, but it seems it could be a separate issue).

s = pd.Series(...)
f(s)  # Success!

# Thread 1:
   while True: f(s)  

# Thread 2:
   while True: f(s)  # Exception !

... fails for f(s): s.reindex(..., copy=True), which returns its result as a new object -- you would think it would be functionally pure and thread safe. Unfortunately, it is not.
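For anyone who wants to probe this locally, the schematic above can be turned into a bounded test harness along these lines. The series contents, the `target` index, and the call count are assumptions chosen for the example, and the `copy` keyword is left out because its status differs between pandas versions. Whether any thread actually raises depends on the pandas version and on timing, so the script only records what happens.

```python
import threading

import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
target = ["c", "b", "a", "z"]  # hypothetical target index; "z" is absent


def f(series):
    # Returns a new Series; looks read-only from the caller's side.
    return series.reindex(target)


f(s)  # a single-threaded call works

errors = []


def hammer(n_calls=2000):
    # Call f(s) repeatedly on the shared series; stop at the first failure.
    for _ in range(n_calls):
        try:
            f(s)
        except Exception as exc:  # record instead of crashing the thread
            errors.append(exc)
            return


threads = [threading.Thread(target=hammer) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"{len(errors)} thread(s) raised an exception")
```

A run that reports zero failures does not show the operation is thread safe; races of this kind can be rare and data-dependent.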

The result of this is that we could not use pandas in production for our healthcare analytics system - and I now discourage it for internal development since it makes in-memory parallelization of read-only operations unsafe. (!!)


The reindex behavior is weird and surprising. If anyone has ideas about why it fails, please answer here: What's the source of thread-unsafety in this usage of pandas.Series.reindex(, copy=True)?

The maintainers marked this as a duplicate of https://github.com/pandas-dev/pandas/issues/2728. I'm suspicious, but if .copy is the source, then almost all of pandas is not thread safe in any situation (which is their advice).