Speed up Pandas on a Multi-core Machine
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must follow the same license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/43423311/
Asked by Allen
I have a pandas data frame that fits comfortably in memory. I do several maps on the data frame, but each map is time-consuming due to the complexity of the callback functions passed to map. I own an AWS C4 instance with 8 cores and 16 GB of RAM. I ran the Python script on the machine and found that more than 80% of the CPU time is idle. So I think (correct me if I am wrong) the Python script is single-threaded and only consumes 1 core. Is there a way to speed up pandas on a multi-core machine? Here is the snippet of the two time-consuming maps:
tfidf_features = df.apply(lambda r: compute_tfidf_features(r.q1_tfidf_bow, r.q2_tfidf_bow), axis=1)
bin_features = df.apply(lambda r: compute_bin_features(r.q1_bin_bow, r.q2_bin_bow), axis=1)
Here is the compute_tfidf_features function:
import numpy as np
import pandas as pd
from scipy.spatial.distance import cosine
from scipy.stats import pearsonr
# real_jaccard and sym_kl_div are the asker's own helpers, defined elsewhere.

def compute_tfidf_features(sparse1, sparse2):
    # Densify the two sparse bag-of-words rows once so the pairwise metrics can reuse them.
    nparray1 = sparse1.toarray()[0]
    nparray2 = sparse2.toarray()[0]
    features = pd.Series({
        'bow_tfidf_sum1': np.sum(sparse1),
        'bow_tfidf_sum2': np.sum(sparse2),
        'bow_tfidf_mean1': np.mean(sparse1),
        'bow_tfidf_mean2': np.mean(sparse2),
        'bow_tfidf_cosine': cosine(nparray1, nparray2),
        'bow_tfidf_jaccard': real_jaccard(nparray1, nparray2),
        'bow_tfidf_sym_kl_divergence': sym_kl_div(nparray1, nparray2),
        'bow_tfidf_pearson': pearsonr(nparray1, nparray2)[0]
    })
    return features
I am aware of a Python library called dask, but its documentation says it is not intended for data frames that fit comfortably in memory.
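For reference (not part of the original post), here is a minimal sketch of the kind of multi-core parallelism being asked about, using only the standard library: the frame is split into one chunk per core and the row-wise apply runs in worker processes. df, the column names, and compute_tfidf_features come from the question; the helper name _apply_chunk and the chunk count are illustrative assumptions.

import numpy as np
import pandas as pd
from multiprocessing import Pool

def _apply_chunk(chunk):
    # Runs inside a worker process on one slice of the frame.
    return chunk.apply(
        lambda r: compute_tfidf_features(r.q1_tfidf_bow, r.q2_tfidf_bow), axis=1
    )

if __name__ == '__main__':
    chunks = np.array_split(df, 8)  # one chunk per core on the 8-core C4 instance
    with Pool(processes=8) as pool:
        tfidf_features = pd.concat(pool.map(_apply_chunk, chunks))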
Answered by Eilif Mikkelsen
Pandas does not support this. Dask DataFrames are mostly API compatible with Pandas and support parallel execution of apply.
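A minimal sketch of that approach, assuming the df and compute_tfidf_features from the question; the partition count and the process scheduler are illustrative assumptions, and the exact keyword arguments can vary between Dask versions.

import dask.dataframe as dd

# Partition the in-memory pandas frame so each core gets its own piece of work.
ddf = dd.from_pandas(df, npartitions=8)

# Each partition is an ordinary pandas DataFrame, so the original row-wise
# apply runs unchanged inside every worker; Dask may warn that it has to
# guess the output schema unless a meta= argument is supplied.
tfidf_features = ddf.map_partitions(
    lambda part: part.apply(
        lambda r: compute_tfidf_features(r.q1_tfidf_bow, r.q2_tfidf_bow), axis=1
    )
).compute(scheduler='processes')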
You might also consider some bleeding-edge solutions, such as this new tool.