Multiprocessing in pandas

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/37491486/

Multiprocessing in pandas

Tags: python, pandas, dataframe, parallel-processing, multiprocessing

Asked by Michael Tamillow

Is it possible to partition a pandas dataframe to do multiprocessing?

Specifically, my DataFrames are simply too big and take several minutes to run even one transformation on a single processor.

I know I could do this in Spark, but a lot of code has already been written, so I would prefer to stick with what I have and get parallel functionality.

Answered by Victor Lira

Slightly modifying https://stackoverflow.com/a/29281494/5351271, I could get a solution to work over rows.

import pandas
from multiprocessing import Pool, cpu_count

def applyParallel(dfGrouped, func):
    # Run func on each group in a separate worker process, then
    # reassemble the results into a single DataFrame.
    with Pool(cpu_count()) as p:
        ret_list = p.map(func, [group for name, group in dfGrouped])
    return pandas.concat(ret_list)

def apply_row_foo(input_df):
    # row_foo is your per-row transformation (user-supplied)
    return input_df.apply(row_foo, axis=1)

n_chunks = 10

# With a default RangeIndex, integer division buckets the rows of df into
# consecutive groups of n_chunks rows each (so n_chunks is really the
# chunk size, not the number of chunks).
grouped = df.groupby(df.index // n_chunks)
applyParallel(grouped, apply_row_foo)
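
For a runnable picture of how the pieces fit together, here is a minimal self-contained sketch; row_foo and the random example DataFrame are hypothetical stand-ins, not part of the original answer. Note the __main__ guard, which multiprocessing requires on platforms that spawn rather than fork worker processes (e.g. Windows):

import numpy as np
import pandas
from multiprocessing import Pool, cpu_count

def row_foo(row):
    # hypothetical per-row transformation; replace with the real one
    return row["a"] * 2 + row["b"]

def apply_row_foo(input_df):
    return input_df.apply(row_foo, axis=1)

if __name__ == "__main__":
    df = pandas.DataFrame(np.random.rand(100, 4), columns=list("abcd"))
    grouped = df.groupby(df.index // 10)  # chunks of 10 rows each
    with Pool(cpu_count()) as p:
        parts = p.map(apply_row_foo, [group for _, group in grouped])
    result = pandas.concat(parts)
    print(result.head())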

If the index is not merely a row number, just group by np.arange(len(df)) // n_chunks instead, as sketched below.

Decidedly not elegant, but worked in my use case.
