Multiprocessing in pandas

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/37491486/

Multiprocessing in pandas

Tags: python, pandas, dataframe, parallel-processing, multiprocessing

Asked by Michael Tamillow

Is it possible to partition a pandas dataframe to do multiprocessing?

Specifically, my DataFrames are simply too big and take several minutes to run even one transformation on a single processor.

I know I could do this in Spark, but a lot of code has already been written, so I would prefer to stick with what I have and get parallel functionality.

Answered by Victor Lira

Slightly modifying https://stackoverflow.com/a/29281494/5351271, I could get a solution to work over rows.

import pandas
from multiprocessing import Pool, cpu_count

def applyParallel(dfGrouped, func):
    # Run func on each group in a separate worker process, then
    # reassemble the results into a single DataFrame.
    with Pool(cpu_count()) as p:
        ret_list = p.map(func, [group for name, group in dfGrouped])
    return pandas.concat(ret_list)

def apply_row_foo(input_df):
    # row_foo is your per-row transformation (user-supplied)
    return input_df.apply(row_foo, axis=1)

n_chunks = 10

# With a default RangeIndex, integer division buckets the rows of df into
# consecutive groups of n_chunks rows each (so n_chunks is really the
# chunk size, not the number of chunks).
grouped = df.groupby(df.index // n_chunks)
applyParallel(grouped, apply_row_foo)
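
For a runnable picture of how the pieces fit together, here is a minimal self-contained sketch; row_foo and the random example DataFrame are hypothetical stand-ins, not part of the original answer. Note the __main__ guard, which multiprocessing requires on platforms that spawn rather than fork worker processes (e.g. Windows):

import numpy as np
import pandas
from multiprocessing import Pool, cpu_count

def row_foo(row):
    # hypothetical per-row transformation; replace with the real one
    return row["a"] * 2 + row["b"]

def apply_row_foo(input_df):
    return input_df.apply(row_foo, axis=1)

if __name__ == "__main__":
    df = pandas.DataFrame(np.random.rand(100, 4), columns=list("abcd"))
    grouped = df.groupby(df.index // 10)  # chunks of 10 rows each
    with Pool(cpu_count()) as p:
        parts = p.map(apply_row_foo, [group for _, group in grouped])
    result = pandas.concat(parts)
    print(result.head())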

If the index is not merely a row number, just group by np.arange(len(df)) // n_chunks instead, as sketched below.

Decidedly not elegant, but worked in my use case.
