Performant cartesian product (CROSS JOIN) with pandas
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/53699012/
Asked by cs95
The contents of this post were originally meant to be a part of Pandas Merging 101, but due to the nature and size of the content required to fully do justice to this topic, it has been moved to its own QnA.
Given two simple DataFrames:
import pandas as pd

left = pd.DataFrame({'col1': ['A', 'B', 'C'], 'col2': [1, 2, 3]})
right = pd.DataFrame({'col1': ['X', 'Y', 'Z'], 'col2': [20, 30, 50]})
left
col1 col2
0 A 1
1 B 2
2 C 3
right
col1 col2
0 X 20
1 Y 30
2 Z 50
The cross product of these frames can be computed, and will look something like:
A 1 X 20
A 1 Y 30
A 1 Z 50
B 2 X 20
B 2 Y 30
B 2 Z 50
C 3 X 20
C 3 Y 30
C 3 Z 50
What is the most performant method of computing this result?
Answered by cs95
Let's start by establishing a benchmark. The easiest method for solving this is using a temporary "key" column:
def cartesian_product_basic(left, right):
    # Assign the same constant "key" to both frames, merge on it, then drop it.
    return (
        left.assign(key=1).merge(right.assign(key=1), on='key').drop('key', axis=1))

cartesian_product_basic(left, right)
col1_x col2_x col1_y col2_y
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50
How this works is that both DataFrames are assigned a temporary "key" column with the same value (say, 1). merge then performs a many-to-many JOIN on "key".
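For illustration, this is what the intermediate frame looks like just before the merge (a minimal sketch; the constant column name key is arbitrary):

left.assign(key=1)
#   col1  col2  key
# 0    A     1    1
# 1    B     2    1
# 2    C     3    1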
While the many-to-many JOIN trick works for reasonably sized DataFrames, you will see relatively lower performance on larger data.
A faster implementation will require NumPy. Here are some famous NumPy implementations of 1D cartesian product. We can build on some of these performant solutions to get our desired output. My favourite, however, is @senderle's first implementation.
import numpy as np

def cartesian_product(*arrays):
    # @senderle's implementation: fill an N-dimensional grid, one layer per
    # input array, then flatten it into rows of index combinations.
    la = len(arrays)
    dtype = np.result_type(*arrays)
    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
    for i, a in enumerate(np.ix_(*arrays)):
        arr[..., i] = a
    return arr.reshape(-1, la)
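As a quick illustration of what this helper returns for two length-3 inputs (a minimal sketch; these are exactly the positional indices used in the generalized solution below):

cartesian_product(np.arange(3), np.arange(3))
# array([[0, 0],
#        [0, 1],
#        [0, 2],
#        [1, 0],
#        [1, 1],
#        [1, 2],
#        [2, 0],
#        [2, 1],
#        [2, 2]])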
Generalizing: CROSS JOIN on Unique or Non-Unique Indexed DataFrames
Disclaimer
These solutions are optimised for DataFrames with non-mixed scalar dtypes. If dealing with mixed dtypes, use at your own risk!
This trick will work on any kind of DataFrame. We compute the cartesian product of the DataFrames' numeric indices using the aforementioned cartesian_product, use this to reindex the DataFrames, and then stack the reindexed values column-wise:
def cartesian_product_generalized(left, right):
    la, lb = len(left), len(right)
    # Cartesian product of the two positional index ranges [0, la) and [0, lb).
    idx = cartesian_product(np.ogrid[:la], np.ogrid[:lb])
    return pd.DataFrame(
        np.column_stack([left.values[idx[:, 0]], right.values[idx[:, 1]]]))

cartesian_product_generalized(left, right)
0 1 2 3
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50
np.array_equal(cartesian_product_generalized(left, right),
               cartesian_product_basic(left, right))
True
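Note that the generalized version returns positional column labels (0 through 3). If you want the original labels back, one option (a sketch, not part of the original answer; pandas allows the resulting duplicate column names) is to assign them explicitly:

out = cartesian_product_generalized(left, right)
out.columns = [*left.columns, *right.columns]  # ['col1', 'col2', 'col1', 'col2']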
And, along similar lines,
left2 = left.copy()
left2.index = ['s1', 's2', 's1']
right2 = right.copy()
right2.index = ['x', 'y', 'y']
left2
col1 col2
s1 A 1
s2 B 2
s1 C 3
right2
col1 col2
x X 20
y Y 30
y Z 50
np.array_equal(cartesian_product_generalized(left2, right2),
               cartesian_product_basic(left2, right2))
True
This solution can generalise to multiple DataFrames. For example,
def cartesian_product_multi(*dfs):
    # Cartesian product of the positional index ranges of every input frame.
    idx = cartesian_product(*[np.ogrid[:len(df)] for df in dfs])
    return pd.DataFrame(
        np.column_stack([df.values[idx[:, i]] for i, df in enumerate(dfs)]))

cartesian_product_multi(*[left, right, left]).head()
0 1 2 3 4 5
0 A 1 X 20 A 1
1 A 1 X 20 B 2
2 A 1 X 20 C 3
3 A 1 Y 30 A 1
4 A 1 Y 30 B 2
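As a quick sanity check on the shape of the full result (3 × 3 × 3 = 27 rows, 2 + 2 + 2 = 6 columns):

cartesian_product_multi(left, right, left).shape
# (27, 6)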
Further Simplification
A simpler solution not involving @senderle's cartesian_product is possible when dealing with just two DataFrames. Using np.broadcast_arrays, we can achieve almost the same level of performance.
def cartesian_product_simplified(left, right):
    la, lb = len(left), len(right)
    # Broadcast the two open grids against each other to get paired row indices.
    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la, :lb])
    return pd.DataFrame(
        np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))
np.array_equal(cartesian_product_simplified(left, right),
               cartesian_product_basic(left, right))
True
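To see why this works, here are the raveled index arrays for the 3 × 3 case (a small sketch; ia2 indexes rows of left, ib2 indexes rows of right):

ia2, ib2 = np.broadcast_arrays(*np.ogrid[:3, :3])
ia2.ravel()  # array([0, 0, 0, 1, 1, 1, 2, 2, 2])
ib2.ravel()  # array([0, 1, 2, 0, 1, 2, 0, 1, 2])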
Performance Comparison
Benchmarking these solutions on some contrived DataFrames with unique indices, we have:
[benchmark plot: N vs. time (relative), log-log scale; see the timing script below]
Do note that timings may vary based on your setup, data, and choice of cartesian_product helper function, as applicable.
Performance Benchmarking Code
This is the timing script. All functions called here are defined above.
from timeit import timeit
import pandas as pd
import matplotlib.pyplot as plt

res = pd.DataFrame(
    index=['cartesian_product_basic', 'cartesian_product_generalized',
           'cartesian_product_multi', 'cartesian_product_simplified'],
    columns=[1, 10, 50, 100, 200, 300, 400, 500, 600, 800, 1000, 2000],
    dtype=float
)

for f in res.index:
    for c in res.columns:
        # print(f, c)
        left2 = pd.concat([left] * c, ignore_index=True)
        right2 = pd.concat([right] * c, ignore_index=True)
        stmt = '{}(left2, right2)'.format(f)
        setp = 'from __main__ import left2, right2, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=5)

ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")
plt.show()
Answered by YOBEN_S
Using itertools.product and recreating the values in a DataFrame:
import itertools

# Pair every row of `left` with every row of `right`, then flatten each
# pair of row-lists into a single flat row.
l = list(itertools.product(left.values.tolist(), right.values.tolist()))
pd.DataFrame(list(map(lambda x: sum(x, []), l)))
0 1 2 3
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50
Answered by Bharath
Here's an approach with a triple concat: repeat each row of left len(right) times, tile right len(left) times, and glue the two column-wise.
m = pd.concat([pd.concat([left] * len(right)).sort_index().reset_index(drop=True),
               pd.concat([right] * len(left)).reset_index(drop=True)], axis=1)
m
col1 col2 col1 col2
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50
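As a quick check that this matches the merge-based baseline (a sketch comparing values only, since the column labels differ):

np.array_equal(m.values, cartesian_product_basic(left, right).values)
# True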