Large pandas DataFrame parallel processing

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/33612935/

Date: 2020-09-14 00:11:43  Source: igfitidea

Large Pandas Dataframe parallel processing

python, pandas, joblib

Asked by autodidacticon

I am accessing a very large Pandas dataframe as a global variable. This variable is accessed in parallel via joblib.


E.g.


from joblib import Parallel, delayed

df = db.query("select id, a_lot_of_data from table")

def process(id):
    # df is read as a module-level global inside each worker
    temp_df = df.loc[id]
    temp_df.apply(another_function)

Parallel(n_jobs=8)(delayed(process)(id) for id in df['id'].to_list())

Accessing the original df in this manner seems to copy the data across processes. This is unexpected, since the original df isn't being altered in any of the subprocesses (or is it?).


Answered by Kevin S

The entire DataFrame needs to be pickled and unpickled for each process created by joblib. In practice, this is very slow, and it also requires many times the memory of the DataFrame, since every process gets its own copy.

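A quick way to see what each worker pays is to measure the pickled size of the frame. A minimal sketch, with a made-up toy frame standing in for the query result:

import pickle
import numpy as np
import pandas as pd

# Toy frame standing in for the real query result (sizes are made up)
df = pd.DataFrame({
    "id": np.arange(1_000_000),
    "value": np.random.rand(1_000_000),
})

# Each process created by joblib receives its own pickled copy,
# so this cost is paid once per worker, not once in total
print("pickled size: %.1f MB" % (len(pickle.dumps(df)) / 1e6))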

One solution is to store your data in HDF (df.to_hdf) using the table format. You can then use select to select subsets of data for further processing. In practice this will be too slow for interactive use. It is also very complex, and your workers will need to store their work so that it can be consolidated in the final step.

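For illustration, a minimal sketch of that approach; the file name "data.h5", the key, and the data_columns choice are assumptions, and the toy df and another_function stand in for the ones from the question:

import numpy as np
import pandas as pd
from joblib import Parallel, delayed

# Toy stand-ins for the question's df and another_function
df = pd.DataFrame({"id": np.arange(1000), "a_lot_of_data": np.random.rand(1000)})
def another_function(x):
    return x

# Write once in the queryable "table" format so rows can be selected later
df.to_hdf("data.h5", key="df", format="table", data_columns=["id"])

def process(one_id):
    # Each worker selects only the rows it needs from the store,
    # instead of receiving a pickled copy of the whole DataFrame
    with pd.HDFStore("data.h5", mode="r") as store:
        temp_df = store.select("df", where="id == %d" % one_id)
    return temp_df.apply(another_function)

results = Parallel(n_jobs=8)(delayed(process)(i) for i in df["id"].unique())

Note that the table format requires PyTables and is noticeably slower to write than the default fixed format, which is part of the complexity cost mentioned above.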

An alternative would be to explore numba.vectorize with target='parallel'. This would require the use of NumPy arrays, not Pandas objects, so it also has some complexity costs.

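A minimal sketch of that route; the elementwise function transform and the toy frame are made up for illustration:

import numpy as np
import pandas as pd
from numba import vectorize

# Toy stand-in for the question's df
df = pd.DataFrame({"a_lot_of_data": np.random.rand(1_000_000)})

# A made-up elementwise computation; target="parallel" compiles it into
# a ufunc that runs across multiple threads
@vectorize(["float64(float64)"], target="parallel")
def transform(x):
    return x * 2.0 + 1.0

# Work on the underlying NumPy array rather than the DataFrame itself
result = transform(df["a_lot_of_data"].to_numpy())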

In the long run, dask is hoped to bring parallel execution to Pandas, but this is not something to expect soon.


Answered by Randy

Python multiprocessing is typically done using separate processes, as you noted, meaning that the processes don't share memory. There's a potential workaround if you can get things to work with np.memmap, as mentioned a little farther down the joblib docs, though dumping to disk will obviously add some overhead of its own: https://pythonhosted.org/joblib/parallel.html#working-with-numerical-data-in-shared-memory-memmaping

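A minimal sketch of the memmap route using joblib's own dump/load helpers (load with mmap_mode returns a read-only np.memmap); the file name "values.joblib", the toy frame, and the per-chunk work are assumptions for illustration:

import numpy as np
import pandas as pd
from joblib import Parallel, delayed, dump, load

# Toy stand-in for the question's df
df = pd.DataFrame({"a_lot_of_data": np.random.rand(1_000_000)})

# Dump the numeric payload to disk once, then reopen it memory-mapped
dump(df["a_lot_of_data"].to_numpy(), "values.joblib")
shared = load("values.joblib", mmap_mode="r")

def work(chunk, data):
    # data is a memmap shared by all workers; each task reads one slice
    # instead of unpickling a private copy of the whole array
    return data[chunk].sum()

chunks = np.array_split(np.arange(len(shared)), 8)
results = Parallel(n_jobs=8)(delayed(work)(c, shared) for c in chunks)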