Speeding up Pandas on a multi-core machine

Question

Asked by Allen

I have a pandas DataFrame that fits comfortably in memory. I run several maps over the DataFrame, but each one is time-consuming because of the complexity of the callback functions passed to map. I have an AWS C4 instance with 8 cores and 16 GB of RAM. When I ran the Python script on that machine, more than 80% of CPU time was idle. So I think (correct me if I am wrong) that the Python script is single-threaded and uses only one core. Is there a way to speed up pandas on a multi-core machine? Here is a snippet of the two time-consuming maps:

 tfidf_features = df.apply(lambda r: compute_tfidf_features(r.q1_tfidf_bow, r.q2_tfidf_bow), axis=1)
 bin_features = df.apply(lambda r: compute_bin_features(r.q1_bin_bow, r.q2_bin_bow), axis=1)

Here is the compute_tfidf_features function:

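# Imports are not shown in the original post; these are the likely sources.
import numpy as np
import pandas as pd
from scipy.spatial.distance import cosine
from scipy.stats import pearsonr

# real_jaccard and sym_kl_div are the asker's own helpers (not shown).
def compute_tfidf_features(sparse1, sparse2):
    # Densify each single-row sparse matrix into a flat array.
    nparray1 = sparse1.toarray()[0]
    nparray2 = sparse2.toarray()[0]

    # One Series per row pair; df.apply(axis=1) stacks these into a DataFrame.
    features = pd.Series({
        'bow_tfidf_sum1': np.sum(sparse1),
        'bow_tfidf_sum2': np.sum(sparse2),
        'bow_tfidf_mean1': np.mean(sparse1),
        'bow_tfidf_mean2': np.mean(sparse2),
        'bow_tfidf_cosine': cosine(nparray1, nparray2),
        'bow_tfidf_jaccard': real_jaccard(nparray1, nparray2),
        'bow_tfidf_sym_kl_divergence': sym_kl_div(nparray1, nparray2),
        'bow_tfidf_pearson': pearsonr(nparray1, nparray2)[0]
    })

    return features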

I am aware of a Python library called Dask, but it says that it is not intended for a DataFrame that fits comfortably in memory.

Answer 1

Answered by Eilif Mikkelsen

Pandas does not support this. Dask DataFrames are mostly API-compatible with Pandas and support parallel execution of apply.

You might also consider some bleeding-edge solutions, such as this new tool.
