Notice: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/43423311/

Time: 2020-09-14 03:23:51  Source: igfitidea

Speed up Pandas on Multi-core machine

Tags: multithreading, python-3.x, pandas

Asked by Allen

I have a pandas data frame that fits comfortably in memory. I do several maps on the data frame, but each map is time-consuming because of the complexity of the callback functions passed to map. I own an AWS C4 instance with 8 cores and 16 GB of RAM. I ran the Python script on the machine and found that more than 80% of CPU time is idle. So I think (correct me if I am wrong) the Python script is single-threaded and only uses 1 core. Is there a way to speed up pandas on a multi-core machine? Here is the snippet of the two time-consuming maps:

 tfidf_features = df.apply(lambda r: compute_tfidf_features(r.q1_tfidf_bow, r.q2_tfidf_bow), axis=1)
 bin_features = df.apply(lambda r: compute_bin_features(r.q1_bin_bow, r.q2_bin_bow), axis=1)

Here is the compute_tfidf_features function:

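# Imports the snippet appears to rely on; cosine and pearsonr are presumably SciPy's.
import numpy as np
import pandas as pd
from scipy.spatial.distance import cosine
from scipy.stats import pearsonr

# real_jaccard and sym_kl_div are the asker's own helpers (not shown).

def compute_tfidf_features(sparse1, sparse2):
    # Densify each single-row sparse matrix into a 1-D array.
    nparray1 = sparse1.toarray()[0]
    nparray2 = sparse2.toarray()[0]

    # One named feature per entry; returning a Series makes apply build a DataFrame.
    features = pd.Series({
        'bow_tfidf_sum1': np.sum(sparse1),
        'bow_tfidf_sum2': np.sum(sparse2),
        'bow_tfidf_mean1': np.mean(sparse1),
        'bow_tfidf_mean2': np.mean(sparse2),
        'bow_tfidf_cosine': cosine(nparray1, nparray2),
        'bow_tfidf_jaccard': real_jaccard(nparray1, nparray2),
        'bow_tfidf_sym_kl_divergence': sym_kl_div(nparray1, nparray2),
        'bow_tfidf_pearson': pearsonr(nparray1, nparray2)[0]
    })

    return features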

I am aware of a python library called dask, but it says that it's not intended for a data frame that can comfortably fit in memory.

Answered by Eilif Mikkelsen

Pandas does not support this. Dask DataFrames are mostly API-compatible with Pandas and support parallel execution for apply.

You might also consider some bleeding edge solutions such as this new tool
