带有 Pandas 的高性能笛卡尔积(CROSS JOIN)

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/53699012/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-14 06:12:19  来源:igfitidea点击:

Performant cartesian product (CROSS JOIN) with pandas

pythonpandasnumpydataframemerge

提问by cs95

The contents of this post were originally meant to be a part of Pandas Merging 101, but due to the nature and size of the content required to fully do justice to this topic, it has been moved to its own QnA.

这篇文章的内容最初是作为Pandas Merging 101的一部分 ,但由于完全公正地处理这个主题所需的内容的性质和大小,它已移至自己的 QnA。

Given two simple DataFrames;

给定两个简单的 DataFrame;

left = pd.DataFrame({'col1' : ['A', 'B', 'C'], 'col2' : [1, 2, 3]})
right = pd.DataFrame({'col1' : ['X', 'Y', 'Z'], 'col2' : [20, 30, 50]})

left

  col1  col2
0    A     1
1    B     2
2    C     3

right

  col1  col2
0    X    20
1    Y    30
2    Z    50

The cross product of these frames can be computed, and will look something like:

可以计算这些帧的叉积,如下所示:

A       1      X      20
A       1      Y      30
A       1      Z      50
B       2      X      20
B       2      Y      30
B       2      Z      50
C       3      X      20
C       3      Y      30
C       3      Z      50

What is the most performant method of computing this result?

计算此结果的最高效方法是什么?

回答by cs95

Let's start by establishing a benchmark. The easiest method for solving this is using a temporary "key" column:

让我们从建立一个基准开始。解决此问题的最简单方法是使用临时“键”列:

def cartesian_product_basic(left, right):
    return (
       left.assign(key=1).merge(right.assign(key=1), on='key').drop('key', 1))

cartesian_product_basic(left, right)

  col1_x  col2_x col1_y  col2_y
0      A       1      X      20
1      A       1      Y      30
2      A       1      Z      50
3      B       2      X      20
4      B       2      Y      30
5      B       2      Z      50
6      C       3      X      20
7      C       3      Y      30
8      C       3      Z      50

How this works is that both DataFrames are assigned a temporary "key" column with the same value (say, 1). mergethen performs a many-to-many JOIN on "key".

其工作原理是为两个 DataFrame 分配一个具有相同值(例如 1)的临时“键”列。merge然后对“键”执行多对多 JOIN。

While the many-to-many JOIN trick works for reasonably sized DataFrames, you will see relatively lower performance on larger data.

虽然多对多 JOIN 技巧适用于合理大小的数据帧,但您会看到在较大数据上的性能相对较低。

A faster implementation will require NumPy. Here are some famous NumPy implementations of 1D cartesian product. We can build on some of these performant solutions to get our desired output. My favourite, however, is @senderle's first implementation.

更快的实现将需要 NumPy。以下是1D 笛卡尔积的一些著名NumPy 实现。我们可以建立在其中一些高性能解决方案的基础上,以获得我们想要的输出。然而,我最喜欢的是@senderle 的第一个实现。

def cartesian_product(*arrays):
    la = len(arrays)
    dtype = np.result_type(*arrays)
    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
    for i, a in enumerate(np.ix_(*arrays)):
        arr[...,i] = a
    return arr.reshape(-1, la)  

Generalizing: CROSS JOIN on Unique orNon-Unique Indexed DataFrames

概括:唯一或非唯一索引数据帧上的交叉连接

Disclaimer
These solutions are optimised for DataFrames with non-mixed scalar dtypes. If dealing with mixed dtypes, use at your own risk!

免责声明
这些解决方案针对具有非混合标量 dtypes 的 DataFrame 进行了优化。如果处理混合 dtypes,请自担风险!

This trick will work on any kind of DataFrame. We compute the cartesian product of the DataFrames' numeric indices using the aforementioned cartesian_product, use this to reindex the DataFrames, and

这个技巧适用于任何类型的 DataFrame。我们使用前面提到的 计算 DataFrames 数字索引的笛卡尔积cartesian_product,使用它来重新索引 DataFrames,和

def cartesian_product_generalized(left, right):
    la, lb = len(left), len(right)
    idx = cartesian_product(np.ogrid[:la], np.ogrid[:lb])
    return pd.DataFrame(
        np.column_stack([left.values[idx[:,0]], right.values[idx[:,1]]]))

cartesian_product_generalized(left, right)

   0  1  2   3
0  A  1  X  20
1  A  1  Y  30
2  A  1  Z  50
3  B  2  X  20
4  B  2  Y  30
5  B  2  Z  50
6  C  3  X  20
7  C  3  Y  30
8  C  3  Z  50

np.array_equal(cartesian_product_generalized(left, right),
               cartesian_product_basic(left, right))
True

And, along similar lines,

而且,沿着类似的路线,

left2 = left.copy()
left2.index = ['s1', 's2', 's1']

right2 = right.copy()
right2.index = ['x', 'y', 'y']


left2
   col1  col2
s1    A     1
s2    B     2
s1    C     3

right2
  col1  col2
x    X    20
y    Y    30
y    Z    50

np.array_equal(cartesian_product_generalized(left, right),
               cartesian_product_basic(left2, right2))
True

This solution can generalise to multiple DataFrames. For example,

此解决方案可以推广到多个 DataFrame。例如,

def cartesian_product_multi(*dfs):
    idx = cartesian_product(*[np.ogrid[:len(df)] for df in dfs])
    return pd.DataFrame(
        np.column_stack([df.values[idx[:,i]] for i,df in enumerate(dfs)]))

cartesian_product_multi(*[left, right, left]).head()

   0  1  2   3  4  5
0  A  1  X  20  A  1
1  A  1  X  20  B  2
2  A  1  X  20  C  3
3  A  1  X  20  D  4
4  A  1  Y  30  A  1

Further Simplification

进一步简化

A simpler solution not involving @senderle's cartesian_productis possible when dealing with just twoDataFrames. Using np.broadcast_arrays, we can achieve almost the same level of performance.

cartesian_product处理两个DataFrame时,可以使用不涉及 @senderle 的更简单的解决方案。使用np.broadcast_arrays,我们可以达到几乎相同的性能水平。

def cartesian_product_simplified(left, right):
    la, lb = len(left), len(right)
    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la,:lb])

    return pd.DataFrame(
        np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))

np.array_equal(cartesian_product_simplified(left, right),
               cartesian_product_basic(left2, right2))
True


Performance Comparison

性能比较

Benchmarking these solutions on some contrived DataFrames with unique indices, we have

在具有唯一索引的一些人为数据帧上对这些解决方案进行基准测试,我们有

enter image description here

在此处输入图片说明

Do note that timings may vary based on your setup, data, and choice of cartesian_producthelper function as applicable.

请注意,时间可能会因您的设置、数据和cartesian_product辅助函数的选择而异(如适用)。

Performance Benchmarking Code
This is the timing script. All functions called here are defined above.

性能基准代码
这是计时脚本。这里调用的所有函数都在上面定义。

from timeit import timeit
import pandas as pd
import matplotlib.pyplot as plt

res = pd.DataFrame(
       index=['cartesian_product_basic', 'cartesian_product_generalized', 
              'cartesian_product_multi', 'cartesian_product_simplified'],
       columns=[1, 10, 50, 100, 200, 300, 400, 500, 600, 800, 1000, 2000],
       dtype=float
)

for f in res.index: 
    for c in res.columns:
        # print(f,c)
        left2 = pd.concat([left] * c, ignore_index=True)
        right2 = pd.concat([right] * c, ignore_index=True)
        stmt = '{}(left2, right2)'.format(f)
        setp = 'from __main__ import left2, right2, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=5)

ax = res.div(res.min()).T.plot(loglog=True) 
ax.set_xlabel("N"); 
ax.set_ylabel("time (relative)");

plt.show()

回答by YOBEN_S

Using itertoolsproductand recreate the value in dataframe

使用itertoolsproduct和重新创建数据帧中的值

import itertools
l=list(itertools.product(left.values.tolist(),right.values.tolist()))
pd.DataFrame(list(map(lambda x : sum(x,[]),l)))
   0  1  2   3
0  A  1  X  20
1  A  1  Y  30
2  A  1  Z  50
3  B  2  X  20
4  B  2  Y  30
5  B  2  Z  50
6  C  3  X  20
7  C  3  Y  30
8  C  3  Z  50

回答by Bharath

Here's an approach with triple concat

这是三重的方法 concat

m = pd.concat([pd.concat([left]*len(right)).sort_index().reset_index(drop=True),
       pd.concat([right]*len(left)).reset_index(drop=True) ], 1)

    col1  col2 col1  col2
0     A     1    X    20
1     A     1    Y    30
2     A     1    Z    50
3     B     2    X    20
4     B     2    Y    30
5     B     2    Z    50
6     C     3    X    20
7     C     3    Y    30
8     C     3    Z    50

enter image description here

在此处输入图片说明