Multiprocessing in pandas

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow.

Original URL: http://stackoverflow.com/questions/37491486/
Asked by Michael Tamillow
Is it possible to partition a pandas dataframe to do multiprocessing?
Specifically, my DataFrames are simply too big and take several minutes to run even one transformation on a single processor.
I know I could do this in Spark, but a lot of code has already been written, so I would prefer to stick with what I have and add parallel functionality.
Answered by Victor Lira
Slightly modifying https://stackoverflow.com/a/29281494/5351271, I got a solution that works over rows.
from multiprocessing import Pool, cpu_count

import pandas

def applyParallel(dfGrouped, func):
    # Apply func to each group in a separate worker process,
    # then stitch the per-group results back together.
    with Pool(cpu_count()) as p:
        ret_list = p.map(func, [group for name, group in dfGrouped])
    return pandas.concat(ret_list)

def apply_row_foo(input_df):
    # row_foo is your existing per-row transformation
    return input_df.apply(row_foo, axis=1)

n_chunks = 10
# Groups consecutive rows into blocks of n_chunks rows each
grouped = df.groupby(df.index // n_chunks)
applyParallel(grouped, apply_row_foo)
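As a sanity check, here is a minimal self-contained sketch of how the pieces fit together; the DataFrame and row_foo below are made-up placeholders, not part of the original answer. Note that multiprocessing pickles the mapped function by reference, so func and row_foo must live at module level, and the call should sit behind a __main__ guard on platforms that spawn workers (Windows, newer macOS).

import numpy as np
import pandas

# Toy DataFrame and per-row function, for illustration only
df = pandas.DataFrame({"a": np.arange(1000), "b": np.arange(1000)})

def row_foo(row):
    return row["a"] + row["b"]

if __name__ == "__main__":
    # 10 chunks of 100 consecutive rows each
    grouped = df.groupby(df.index // 100)
    result = applyParallel(grouped, apply_row_foo)
    print(result.head())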
If the index is not merely a row number, just group by np.arange(len(df)) // n_chunks.
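A quick sketch of that variant, assuming the same applyParallel and apply_row_foo as above:

import numpy as np

# Positional chunking that ignores the actual index values
grouped = df.groupby(np.arange(len(df)) // n_chunks)
applyParallel(grouped, apply_row_foo)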
Decidedly not elegant, but it worked in my use case.