
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/46754524/

Date: 2020-09-14 04:38:15  Source: igfitidea

dask.multiprocessing or pandas + multiprocessing.pool: what's the difference?

Tags: python, multithreading, pandas, multiprocessing, dask

Asked by ilpomo

I'm developing a model for financial purpose. I have the entire S&P500 components inside a folder, stored as many .hdf files. Each .hdf file has its own multi-index (year-week-minute).


An example of the sequential (non-parallelized) code:


import os
from classAsset import Asset


def model(current_period, previous_period):
    # do stuff on the current period, based on stats derived from previous_period
    return results

if __name__ == '__main__':
    for hdf_file in os.listdir('data_path'):
        asset = Asset(hdf_file)
        for year in asset.data.index.get_level_values(0).unique().values:  # level 0 is year
            for week in asset.data.loc[year].index.get_level_values(0).unique().values:  # after the year slice, level 0 is week

                previous_period = asset.data.loc[(start):(end)].Open.values  # start and end are defined in another function
                current_period = asset.data.loc[year, week].Open.values

                model(current_period, previous_period)

To speed up the process, I'm using multiprocessing.pool to run the same algorithm on multiple .hdf files at the same time, so I'm quite satisfied with the processing speed (I have a 4c/8t CPU). But now I discovered Dask.

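A minimal sketch of that multiprocessing.pool approach: `process_file` here is a hypothetical stand-in for loading one .hdf file and running the model on each of its periods (it is not code from the question).

```python
import multiprocessing as mp

def process_file(hdf_file):
    # Hypothetical stand-in for the real per-file work: load the asset
    # from hdf_file and run model() on each (year, week) slice.
    return f"processed:{hdf_file}"

if __name__ == "__main__":
    files = ["AAPL.hdf", "MSFT.hdf", "GOOG.hdf"]
    # One worker per core; pool.map distributes the files across workers.
    with mp.Pool(processes=4) as pool:
        results = pool.map(process_file, files)
    print(results)
```

Each file is processed independently, so the only coordination cost is serializing inputs and results between processes.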

In the Dask documentation, 'DataFrame Overview', they indicate:


Trivially parallelizable operations (fast):


  • Elementwise operations: df.x + df.y, df * df
  • Row-wise selections: df[df.x > 0]
  • Loc: df.loc[4.0:10.5] (this is what interests me the most)

Also, in the Dask documentation, 'Use Cases', they indicate:


A programmer has a function that they want to run many times on different inputs. Their function and inputs might use arrays or dataframes internally, but conceptually their problem isn't a single large array or dataframe.

They want to run these functions in parallel on their laptop while they prototype but they also intend to eventually use an in-house cluster. They wrap their function in dask.delayed and let the appropriate dask scheduler parallelize and load balance the work.

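That dask.delayed pattern, applied to the question's per-file loop, might look roughly like this sketch (`process_file` is again a hypothetical stand-in for the per-file model run):

```python
import dask

@dask.delayed
def process_file(hdf_file):
    # Stand-in for loading one .hdf file and running model()
    # on each (year, week) period inside it.
    return f"processed:{hdf_file}"

if __name__ == "__main__":
    files = ["AAPL.hdf", "MSFT.hdf", "GOOG.hdf"]
    tasks = [process_file(f) for f in files]   # builds a lazy task graph; nothing runs yet
    # scheduler="processes" mirrors multiprocessing.Pool; "threads" or a
    # distributed cluster would be drop-in swaps of the same code.
    results = dask.compute(*tasks, scheduler="processes")
    print(list(results))
```

The point of the docs excerpt is that last comment: the scheduler is chosen at compute time, so the same wrapped function scales from a laptop pool to a cluster without rewriting the loop.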

So I'm sure I'm missing something, or probably more than just something. What's the difference between processing many single pandas dataframes with multiprocessing.pool and dask.multiprocessing?


Do you think I should use Dask for my specific case? Thank you guys.


Answered by MRocklin

There is no difference. Dask is doing just what you are doing in your custom code. It uses pandas and a thread or multiprocessing pool for parallelism.


You might prefer Dask for a few reasons:


  1. It would figure out how to write the parallel algorithms automatically
  2. You may want to scale to a cluster in the future

But if what you have works well for you, then I would just stay with that.
