Performant cartesian product (CROSS JOIN) with pandas
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/53699012/
Asked by cs95
The contents of this post were originally meant to be a part of Pandas Merging 101, but due to the nature and size of the content required to fully do justice to this topic, it has been moved to its own QnA.
Given two simple DataFrames:
import pandas as pd

left = pd.DataFrame({'col1': ['A', 'B', 'C'], 'col2': [1, 2, 3]})
right = pd.DataFrame({'col1': ['X', 'Y', 'Z'], 'col2': [20, 30, 50]})
left
col1 col2
0 A 1
1 B 2
2 C 3
right
col1 col2
0 X 20
1 Y 30
2 Z 50
The cross product of these frames can be computed, and will look something like:
A 1 X 20
A 1 Y 30
A 1 Z 50
B 2 X 20
B 2 Y 30
B 2 Z 50
C 3 X 20
C 3 Y 30
C 3 Z 50
What is the most performant method of computing this result?
Answered by cs95
Let's start by establishing a benchmark. The easiest method for solving this is using a temporary "key" column:
def cartesian_product_basic(left, right):
    # Assign the same constant "key" to both frames, merge on it, then drop it.
    return (
        left.assign(key=1).merge(right.assign(key=1), on='key').drop('key', axis=1))

cartesian_product_basic(left, right)
col1_x col2_x col1_y col2_y
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50
How this works is that both DataFrames are assigned a temporary "key" column with the same value (say, 1). merge then performs a many-to-many JOIN on "key".
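For illustration, this is what the intermediate frame looks like just before the merge (a minimal sketch; the constant column name key is arbitrary):

left.assign(key=1)
#   col1  col2  key
# 0    A     1    1
# 1    B     2    1
# 2    C     3    1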
While the many-to-many JOIN trick works for reasonably sized DataFrames, you will see relatively lower performance on larger data.
A faster implementation will require NumPy. Here are some famous NumPy implementations of 1D cartesian product. We can build on some of these performant solutions to get our desired output. My favourite, however, is @senderle's first implementation.
import numpy as np

def cartesian_product(*arrays):
    # @senderle's implementation: fill an N-dimensional grid, one layer per
    # input array, then flatten it into rows of index combinations.
    la = len(arrays)
    dtype = np.result_type(*arrays)
    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
    for i, a in enumerate(np.ix_(*arrays)):
        arr[..., i] = a
    return arr.reshape(-1, la)
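As a quick illustration of what this helper returns for two length-3 inputs (a minimal sketch; these are exactly the positional indices used in the generalized solution below):

cartesian_product(np.arange(3), np.arange(3))
# array([[0, 0],
#        [0, 1],
#        [0, 2],
#        [1, 0],
#        [1, 1],
#        [1, 2],
#        [2, 0],
#        [2, 1],
#        [2, 2]])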
Generalizing: CROSS JOIN on Unique or Non-Unique Indexed DataFrames
Disclaimer
These solutions are optimised for DataFrames with non-mixed scalar dtypes. If dealing with mixed dtypes, use at your own risk!
This trick will work on any kind of DataFrame. We compute the cartesian product of the DataFrames' numeric indices using the aforementioned cartesian_product, use this to reindex the DataFrames, and then stack the reindexed values column-wise:
def cartesian_product_generalized(left, right):
    la, lb = len(left), len(right)
    # Cartesian product of the two positional index ranges [0, la) and [0, lb).
    idx = cartesian_product(np.ogrid[:la], np.ogrid[:lb])
    return pd.DataFrame(
        np.column_stack([left.values[idx[:, 0]], right.values[idx[:, 1]]]))

cartesian_product_generalized(left, right)
0 1 2 3
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50
np.array_equal(cartesian_product_generalized(left, right),
               cartesian_product_basic(left, right))
True
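Note that the generalized version returns positional column labels (0 through 3). If you want the original labels back, one option (a sketch, not part of the original answer; pandas allows the resulting duplicate column names) is to assign them explicitly:

out = cartesian_product_generalized(left, right)
out.columns = [*left.columns, *right.columns]  # ['col1', 'col2', 'col1', 'col2']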
And, along similar lines,
left2 = left.copy()
left2.index = ['s1', 's2', 's1']
right2 = right.copy()
right2.index = ['x', 'y', 'y']
left2
col1 col2
s1 A 1
s2 B 2
s1 C 3
right2
col1 col2
x X 20
y Y 30
y Z 50
np.array_equal(cartesian_product_generalized(left2, right2),
               cartesian_product_basic(left2, right2))
True
This solution can generalise to multiple DataFrames. For example,
def cartesian_product_multi(*dfs):
    # Cartesian product of the positional index ranges of every input frame.
    idx = cartesian_product(*[np.ogrid[:len(df)] for df in dfs])
    return pd.DataFrame(
        np.column_stack([df.values[idx[:, i]] for i, df in enumerate(dfs)]))

cartesian_product_multi(*[left, right, left]).head()
0 1 2 3 4 5
0 A 1 X 20 A 1
1 A 1 X 20 B 2
2 A 1 X 20 C 3
3 A 1 Y 30 A 1
4 A 1 Y 30 B 2
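As a quick sanity check on the shape of the full result (3 × 3 × 3 = 27 rows, 2 + 2 + 2 = 6 columns):

cartesian_product_multi(left, right, left).shape
# (27, 6)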
Further Simplification
A simpler solution not involving @senderle's cartesian_product is possible when dealing with just two DataFrames. Using np.broadcast_arrays, we can achieve almost the same level of performance.
def cartesian_product_simplified(left, right):
    la, lb = len(left), len(right)
    # Broadcast the two open grids against each other to get paired row indices.
    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la, :lb])
    return pd.DataFrame(
        np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))
np.array_equal(cartesian_product_simplified(left, right),
               cartesian_product_basic(left, right))
True
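To see why this works, here are the raveled index arrays for the 3 × 3 case (a small sketch; ia2 indexes rows of left, ib2 indexes rows of right):

ia2, ib2 = np.broadcast_arrays(*np.ogrid[:3, :3])
ia2.ravel()  # array([0, 0, 0, 1, 1, 1, 2, 2, 2])
ib2.ravel()  # array([0, 1, 2, 0, 1, 2, 0, 1, 2])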
Performance Comparison
Benchmarking these solutions on some contrived DataFrames with unique indices, we have:
[benchmark plot: N vs. time (relative), log-log scale; see the timing script below]
Do note that timings may vary based on your setup, data, and choice of cartesian_product helper function, as applicable.
Performance Benchmarking Code
This is the timing script. All functions called here are defined above.
from timeit import timeit
import pandas as pd
import matplotlib.pyplot as plt

res = pd.DataFrame(
    index=['cartesian_product_basic', 'cartesian_product_generalized',
           'cartesian_product_multi', 'cartesian_product_simplified'],
    columns=[1, 10, 50, 100, 200, 300, 400, 500, 600, 800, 1000, 2000],
    dtype=float
)

for f in res.index:
    for c in res.columns:
        # print(f, c)
        left2 = pd.concat([left] * c, ignore_index=True)
        right2 = pd.concat([right] * c, ignore_index=True)
        stmt = '{}(left2, right2)'.format(f)
        setp = 'from __main__ import left2, right2, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=5)

ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")
plt.show()
Answered by YOBEN_S
Using itertools.product and recreating the values in a DataFrame:
import itertools

# Pair every row of `left` with every row of `right`, then flatten each
# pair of row-lists into a single flat row.
l = list(itertools.product(left.values.tolist(), right.values.tolist()))
pd.DataFrame(list(map(lambda x: sum(x, []), l)))
0 1 2 3
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50
Answered by Bharath
Here's an approach with a triple concat: repeat each row of left len(right) times, tile right len(left) times, and glue the two column-wise.
m = pd.concat([pd.concat([left] * len(right)).sort_index().reset_index(drop=True),
               pd.concat([right] * len(left)).reset_index(drop=True)], axis=1)
m
col1 col2 col1 col2
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50
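As a quick check that this matches the merge-based baseline (a sketch comparing values only, since the column labels differ):

np.array_equal(m.values, cartesian_product_basic(left, right).values)
# True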