pandas: How to merge two pandas dataframes in parallel (multithreading or multiprocessing)
Original URL: http://stackoverflow.com/questions/35785109/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
How to merge two pandas dataframes in parallel (multithreading or multiprocessing)
Asked by Lav Patel
Without parallel programming I can merge the left and right dataframes on the key column using the code below, but it will be too slow since both are very large. Is there any way I can parallelize this efficiently?
I have 64 cores, so practically I can use 63 of them to merge these two dataframes.
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
result = pd.merge(left, right, on='key')
The output will be:
left:
A B key
0 A0 B0 K0
1 A1 B1 K1
2 A2 B2 K2
3 A3 B3 K3
right:
C D key
0 C0 D0 K0
1 C1 D1 K1
2 C2 D2 K2
3 C3 D3 K3
result:
A B key C D
0 A0 B0 K0 C0 D0
1 A1 B1 K1 C1 D1
2 A2 B2 K2 C2 D2
3 A3 B3 K3 C3 D3
I want to do this in parallel to speed it up.
Answered by jezrael
I believe you can use dask and its merge function.
The docs say:
What definitely works?
Cleverly parallelizable operations (also fast):
Join on index: dd.merge(df1, df2, left_index=True, right_index=True)
Or:
Operations requiring a shuffle (slow-ish, unless on index)
Set index: df.set_index(df.x)
Join not on the index: pd.merge(df1, df2, on='name')
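For concreteness, here is a minimal sketch of the "join on index" fast path quoted above, applied to the question's left and right frames; the names dleft and dright and the npartitions value are arbitrary choices for illustration:

import dask.dataframe as dd

# wrap the pandas frames in dask collections (npartitions picked arbitrarily)
dleft = dd.from_pandas(left, npartitions=4)
dright = dd.from_pandas(right, npartitions=4)

# move 'key' into the index (one shuffle each), then join on the index;
# nothing actually runs until compute()
joined = dd.merge(dleft.set_index('key'),
                  dright.set_index('key'),
                  left_index=True, right_index=True).compute()
print(joined)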
You can also check how to create Dask DataFrames.
Example
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
result = pd.merge(left, right, on='key')
print(result)
A B key C D
0 A0 B0 K0 C0 D0
1 A1 B1 K1 C1 D1
2 A2 B2 K2 C2 D2
3 A3 B3 K3 C3 D3
import dask.dataframe as dd

# construct dask objects from the pandas objects
left1 = dd.from_pandas(left, npartitions=3)
right1 = dd.from_pandas(right, npartitions=3)

# merge on the key column
print(dd.merge(left1, right1, on='key').compute())
A B key C D
0 A3 B3 K3 C3 D3
1 A1 B1 K1 C1 D1
0 A2 B2 K2 C2 D2
1 A0 B0 K0 C0 D0
# first set the indexes and then merge on them
print(dd.merge(left1.set_index('key').compute(),
               right1.set_index('key').compute(),
               left_index=True,
               right_index=True))
A B C D
key
K0 A0 B0 C0 D0
K1 A1 B1 C1 D1
K2 A2 B2 C2 D2
K3 A3 B3 C3 D3
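Note that in the last snippet, compute() converts the dask frames back to pandas before they are merged, so the join itself is done by pandas rather than in parallel. To keep the work in dask, and to use something like the 63 of 64 cores mentioned in the question, build the lazy graph first and choose the scheduler and worker count only when you finally call compute(). A sketch, assuming a dask version recent enough that compute() accepts the scheduler and num_workers keywords (older releases used the get= keyword instead):

# keep the merge lazy, then run it on a process pool with 63 workers
merged = dd.merge(left1, right1, on='key').compute(scheduler='processes', num_workers=63)
print(merged)

If you have the distributed package installed, dask.distributed's Client (for example Client(n_workers=63)) is another way to control the worker count, and it also gives you a diagnostic dashboard.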
Answered by Gustavo Bezerra
You can improve the speed of your merge (by a factor of about 3 on the given example) by making the key column the index of your dataframes and using join instead.
left2 = left.set_index('key')
right2 = right.set_index('key')
In [46]: %timeit result2 = left2.join(right2)
1000 loops, best of 3: 361 μs per loop
In [47]: %timeit result = pd.merge(left, right, on='key')
1000 loops, best of 3: 1.01 ms per loop
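As a quick sanity check (not part of the original answer), the index join gives the same rows as the column merge once the index is reset and the columns are put back in the same order:

# compare the two approaches; column order differs, so reorder before comparing
result = pd.merge(left, right, on='key')
result2 = left2.join(right2).reset_index()
print(result2[result.columns].equals(result))   # True for this example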