pandas: How to merge two pandas dataframes in parallel (multithreading or multiprocessing)
Original URL: http://stackoverflow.com/questions/35785109/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
How to merge two pandas dataframes in parallel (multithreading or multiprocessing)
Asked by Lav Patel
Without parallel programming I can merge the left and right dataframes on the key column using the code below, but it will be too slow since both are very large. Is there any way I can parallelize this efficiently?
I have 64 cores, so practically I can use 63 of them to merge these two dataframes.
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
result = pd.merge(left, right, on='key')
The output will be:
left:
A B key
0 A0 B0 K0
1 A1 B1 K1
2 A2 B2 K2
3 A3 B3 K3
right:
C D key
0 C0 D0 K0
1 C1 D1 K1
2 C2 D2 K2
3 C3 D3 K3
result:
A B key C D
0 A0 B0 K0 C0 D0
1 A1 B1 K1 C1 D1
2 A2 B2 K2 C2 D2
3 A3 B3 K3 C3 D3
I want to do this in parallel to speed it up.
Answered by jezrael
I believe you can use dask and its merge function.
The docs say:
What definitely works?
Cleverly parallelizable operations (also fast):
Join on index: dd.merge(df1, df2, left_index=True, right_index=True)
Or:
Operations requiring a shuffle (slow-ish, unless on index)
Set index: df.set_index(df.x)
Join not on the index: pd.merge(df1, df2, on='name')
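For concreteness, here is a minimal sketch of the "join on index" fast path quoted above, applied to the question's left and right frames; the names dleft and dright and the npartitions value are arbitrary choices for illustration:

import dask.dataframe as dd

# wrap the pandas frames in dask collections (npartitions picked arbitrarily)
dleft = dd.from_pandas(left, npartitions=4)
dright = dd.from_pandas(right, npartitions=4)

# move 'key' into the index (one shuffle each), then join on the index;
# nothing actually runs until compute()
joined = dd.merge(dleft.set_index('key'),
                  dright.set_index('key'),
                  left_index=True, right_index=True).compute()
print(joined)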
You can also check how to create Dask DataFrames.
Example
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
result = pd.merge(left, right, on='key')
print(result)
A B key C D
0 A0 B0 K0 C0 D0
1 A1 B1 K1 C1 D1
2 A2 B2 K2 C2 D2
3 A3 B3 K3 C3 D3
import dask.dataframe as dd

# construct dask objects from the pandas objects
left1 = dd.from_pandas(left, npartitions=3)
right1 = dd.from_pandas(right, npartitions=3)

# merge on the key column
print(dd.merge(left1, right1, on='key').compute())
A B key C D
0 A3 B3 K3 C3 D3
1 A1 B1 K1 C1 D1
0 A2 B2 K2 C2 D2
1 A0 B0 K0 C0 D0
# first set the indexes and then merge on them
print(dd.merge(left1.set_index('key').compute(),
               right1.set_index('key').compute(),
               left_index=True,
               right_index=True))
A B C D
key
K0 A0 B0 C0 D0
K1 A1 B1 C1 D1
K2 A2 B2 C2 D2
K3 A3 B3 C3 D3
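Note that in the last snippet, compute() converts the dask frames back to pandas before they are merged, so the join itself is done by pandas rather than in parallel. To keep the work in dask, and to use something like the 63 of 64 cores mentioned in the question, build the lazy graph first and choose the scheduler and worker count only when you finally call compute(). A sketch, assuming a dask version recent enough that compute() accepts the scheduler and num_workers keywords (older releases used the get= keyword instead):

# keep the merge lazy, then run it on a process pool with 63 workers
merged = dd.merge(left1, right1, on='key').compute(scheduler='processes', num_workers=63)
print(merged)

If you have the distributed package installed, dask.distributed's Client (for example Client(n_workers=63)) is another way to control the worker count, and it also gives you a diagnostic dashboard.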
Answered by Gustavo Bezerra
You can improve the speed of your merge (by a factor of about 3 on the given example) by making the key column the index of your dataframes and using join instead.
left2 = left.set_index('key')
right2 = right.set_index('key')
In [46]: %timeit result2 = left2.join(right2)
1000 loops, best of 3: 361 μs per loop
In [47]: %timeit result = pd.merge(left, right, on='key')
1000 loops, best of 3: 1.01 ms per loop
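As a quick sanity check (not part of the original answer), the index join gives the same rows as the column merge once the index is reset and the columns are put back in the same order:

# compare the two approaches; column order differs, so reorder before comparing
result = pd.merge(left, right, on='key')
result2 = left2.join(right2).reset_index()
print(result2[result.columns].equals(result))   # True for this example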