Large pandas DataFrame parallel processing

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/33612935/

Date: 2020-09-14 00:11:43  Source: igfitidea

Large Pandas Dataframe parallel processing

python, pandas, joblib

Asked by autodidacticon

I am accessing a very large Pandas dataframe as a global variable. This variable is accessed in parallel via joblib.


E.g.


from joblib import Parallel, delayed

df = db.query("select id, a_lot_of_data from table")

def process(id):
    # df is read as a module-level global inside each worker
    temp_df = df.loc[id]
    temp_df.apply(another_function)

Parallel(n_jobs=8)(delayed(process)(id) for id in df['id'].to_list())

Accessing the original df in this manner seems to copy the data across processes. This is unexpected, since the original df isn't being altered in any of the subprocesses (or is it?).


Answered by Kevin S

The entire DataFrame needs to be pickled and unpickled for each process created by joblib. In practice, this is very slow, and it also requires many times the memory of the DataFrame, since every process gets its own copy.

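A quick way to see what each worker pays is to measure the pickled size of the frame. A minimal sketch, with a made-up toy frame standing in for the query result:

import pickle
import numpy as np
import pandas as pd

# Toy frame standing in for the real query result (sizes are made up)
df = pd.DataFrame({
    "id": np.arange(1_000_000),
    "value": np.random.rand(1_000_000),
})

# Each process created by joblib receives its own pickled copy,
# so this cost is paid once per worker, not once in total
print("pickled size: %.1f MB" % (len(pickle.dumps(df)) / 1e6))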

One solution is to store your data in HDF (df.to_hdf) using the table format. You can then use select to select subsets of data for further processing. In practice this will be too slow for interactive use. It is also very complex, and your workers will need to store their work so that it can be consolidated in the final step.

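For illustration, a minimal sketch of that approach; the file name "data.h5", the key, and the data_columns choice are assumptions, and the toy df and another_function stand in for the ones from the question:

import numpy as np
import pandas as pd
from joblib import Parallel, delayed

# Toy stand-ins for the question's df and another_function
df = pd.DataFrame({"id": np.arange(1000), "a_lot_of_data": np.random.rand(1000)})
def another_function(x):
    return x

# Write once in the queryable "table" format so rows can be selected later
df.to_hdf("data.h5", key="df", format="table", data_columns=["id"])

def process(one_id):
    # Each worker selects only the rows it needs from the store,
    # instead of receiving a pickled copy of the whole DataFrame
    with pd.HDFStore("data.h5", mode="r") as store:
        temp_df = store.select("df", where="id == %d" % one_id)
    return temp_df.apply(another_function)

results = Parallel(n_jobs=8)(delayed(process)(i) for i in df["id"].unique())

Note that the table format requires PyTables and is noticeably slower to write than the default fixed format, which is part of the complexity cost mentioned above.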

An alternative would be to explore numba.vectorize with target='parallel'. This would require the use of NumPy arrays, not Pandas objects, so it also has some complexity costs.

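A minimal sketch of that route; the elementwise function transform and the toy frame are made up for illustration:

import numpy as np
import pandas as pd
from numba import vectorize

# Toy stand-in for the question's df
df = pd.DataFrame({"a_lot_of_data": np.random.rand(1_000_000)})

# A made-up elementwise computation; target="parallel" compiles it into
# a ufunc that runs across multiple threads
@vectorize(["float64(float64)"], target="parallel")
def transform(x):
    return x * 2.0 + 1.0

# Work on the underlying NumPy array rather than the DataFrame itself
result = transform(df["a_lot_of_data"].to_numpy())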

In the long run, dask is hoped to bring parallel execution to Pandas, but this is not something to expect soon.


Answered by Randy

Python multiprocessing is typically done using separate processes, as you noted, meaning that the processes don't share memory. There's a potential workaround if you can get things to work with np.memmap, as mentioned a little farther down the joblib docs, though dumping to disk will obviously add some overhead of its own: https://pythonhosted.org/joblib/parallel.html#working-with-numerical-data-in-shared-memory-memmaping

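A minimal sketch of the memmap route using joblib's own dump/load helpers (load with mmap_mode returns a read-only np.memmap); the file name "values.joblib", the toy frame, and the per-chunk work are assumptions for illustration:

import numpy as np
import pandas as pd
from joblib import Parallel, delayed, dump, load

# Toy stand-in for the question's df
df = pd.DataFrame({"a_lot_of_data": np.random.rand(1_000_000)})

# Dump the numeric payload to disk once, then reopen it memory-mapped
dump(df["a_lot_of_data"].to_numpy(), "values.joblib")
shared = load("values.joblib", mmap_mode="r")

def work(chunk, data):
    # data is a memmap shared by all workers; each task reads one slice
    # instead of unpickling a private copy of the whole array
    return data[chunk].sum()

chunks = np.array_split(np.arange(len(shared)), 8)
results = Parallel(n_jobs=8)(delayed(work)(c, shared) for c in chunks)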