Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/15581829/

Time: 2020-09-13 20:44:02  Source: igfitidea

How to perform an inner or outer join of DataFrames with Pandas on a non-simplistic criterion

Tags: python, sql, numpy, pandas

Asked by zzzeek

Given two dataframes as below:

>>> import pandas as pd

>>> df_a = pd.DataFrame([{"a": 1, "b": 4}, {"a": 2, "b": 5}, {"a": 3, "b": 6}])
>>> df_b = pd.DataFrame([{"c": 2, "d": 7}, {"c": 3, "d": 8}])
>>> df_a
   a  b
0  1  4
1  2  5
2  3  6

>>> df_b
   c  d
0  2  7
1  3  8

We would like to produce a SQL-style join of the two dataframes using a non-simplistic criterion, say "df_b.c > df_a.a". From what I can tell, while merge() is certainly part of the solution, I can't use it directly since it doesn't accept arbitrary expressions for the "ON" criterion (unless I'm missing something?).

In SQL, the results look like this:

# inner join
sqlite> select * from df_a join df_b on c > a;
1|4|2|7
1|4|3|8
2|5|3|8

# outer join
sqlite> select * from df_a left outer join df_b on c > a;
1|4|2|7
1|4|3|8
2|5|3|8
3|6||

My current approach for the inner join is to produce a Cartesian product of df_a and df_b by adding a column of "1"s to both, then using merge() on the "1"s column, then applying the "c > a" criterion.

>>> import numpy as np
>>> df_a['ones'] = np.ones(3)
>>> df_b['ones'] = np.ones(2)
>>> cartesian = pd.merge(df_a, df_b, left_on='ones', right_on='ones')
>>> cartesian
   a  b  ones  c  d
0  1  4     1  2  7
1  1  4     1  3  8
2  2  5     1  2  7
3  2  5     1  3  8
4  3  6     1  2  7
5  3  6     1  3  8
>>> cartesian[cartesian.c > cartesian.a]
   a  b  ones  c  d
0  1  4     1  2  7
1  1  4     1  3  8
3  2  5     1  3  8
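
As a side note, newer pandas releases (1.2 and later) let merge() build the Cartesian product directly with how="cross", so the helper "ones" column is not needed. A minimal sketch, assuming such a pandas version and reusing the same df_a and df_b:

```python
import pandas as pd

df_a = pd.DataFrame([{"a": 1, "b": 4}, {"a": 2, "b": 5}, {"a": 3, "b": 6}])
df_b = pd.DataFrame([{"c": 2, "d": 7}, {"c": 3, "d": 8}])

# how="cross" pairs every row of df_a with every row of df_b
cartesian = pd.merge(df_a, df_b, how="cross")

# keep only the pairs that satisfy the join condition "c > a"
inner = cartesian[cartesian.c > cartesian.a].reset_index(drop=True)
print(inner)
```

The filter step is unchanged, so any boolean expression over the combined columns can stand in for "c > a".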

For the outer join, I'm not sure of the best way to go. So far I've been playing with getting the inner join, then applying the negation of the criterion to get all the other rows, then trying to add that "negation" set back onto the original, but it doesn't really work.

Edit: HYRY answered the specific question here, but I needed something more generic and more within the Pandas API, as my join criterion could be anything, not just that one comparison. For the outer join, I first add an extra index column to the "left" side that survives the inner join:

df_a['_left_index'] = df_a.index

Then we build the Cartesian product and take the inner join:

cartesian = pd.merge(df_a, df_b, left_on='ones', right_on='ones')
innerjoin = cartesian[cartesian.c > cartesian.a]

Then I get the index ids of the "df_a" rows that did not make it into the inner join, and fetch those rows from "df_a":

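remaining_left_ids = set(df_a['_left_index']).\
                    difference(innerjoin['_left_index'])
# .ix was removed in pandas 1.0; .loc does the same label lookup (it needs a list, not a set)
remaining = df_a.loc[sorted(remaining_left_ids)]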

Then we use a straight concat(), which fills the columns missing from the left-only rows with "NaN" (I thought it wasn't doing this earlier, but I guess it does):

outerjoin = pd.concat([innerjoin, remaining]).reset_index()
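
The steps above can be wrapped into one function that takes the join condition as a callable. This is only a sketch under stated assumptions: left_join_on is a hypothetical helper name (not part of pandas), it uses merge(how="cross") from pandas 1.2 or later instead of the "ones" column, and it expects the two frames to have no column names in common.

```python
import pandas as pd

def left_join_on(left, right, predicate):
    """Left outer join on an arbitrary condition (hypothetical helper).

    predicate receives the cross-joined frame and returns a boolean Series.
    """
    left = left.reset_index(drop=True)   # work on a copy with a clean 0..n-1 index
    left["_left_index"] = left.index     # remember which left row each pair came from
    cartesian = pd.merge(left, right, how="cross")
    inner = cartesian[predicate(cartesian)]
    # left rows that matched nothing; concat fills their right-hand columns with NaN
    unmatched = left[~left["_left_index"].isin(inner["_left_index"])]
    joined = pd.concat([inner, unmatched], ignore_index=True)
    return joined.drop(columns="_left_index")

df_a = pd.DataFrame([{"a": 1, "b": 4}, {"a": 2, "b": 5}, {"a": 3, "b": 6}])
df_b = pd.DataFrame([{"c": 2, "d": 7}, {"c": 3, "d": 8}])

outer = left_join_on(df_a, df_b, lambda t: t.c > t.a)
print(outer)
```

Because the condition is an ordinary function of the combined frame, it can involve several columns from either side, at the cost of materialising the full Cartesian product.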

HYRY's idea of doing the Cartesian product on just those columns that we need to compare is basically the right answer, though in my specific case it might be a little tricky to implement (generalized and all).

Questions:

  1. How would you produce a "join" of df_a and df_b on "c > a"? Would you do the same "cartesian product, filter" approach or is there some better way?

  2. How would you produce the "left outer join" of the same?

Accepted answer by HYRY

I use the outer method of a NumPy ufunc to calculate the result. Here is an example:

First, some data:

import pandas as pd
import numpy as np
df_a = pd.DataFrame([{"a": 1, "b": 4}, {"a": 2, "b": 5}, {"a": 3, "b": 6}, {"a": 4, "b": 8}, {"a": 1, "b": 7}])
df_b = pd.DataFrame([{"c": 2, "d": 7}, {"c": 3, "d": 8}, {"c": 2, "d": 10}])
print("df_a")
print(df_a)
print("df_b")
print(df_b)

output:

df_a
   a  b
0  1  4
1  2  5
2  3  6
3  4  8
4  1  7
df_b
   c   d
0  2   7
1  3   8
2  2  10

Inner join. Because this only calculates the Cartesian product of the c and a columns, memory usage is lower than for the Cartesian product of the whole DataFrames:

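# compare the raw NumPy arrays; recent pandas versions reject ufunc.outer on a Series
ia, ib = np.where(np.less.outer(df_a.a.values, df_b.c.values))
print(pd.concat((df_a.take(ia).reset_index(drop=True),
                 df_b.take(ib).reset_index(drop=True)), axis=1))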

output:

   a  b  c   d
0  1  4  2   7
1  1  4  3   8
2  1  4  2  10
3  2  5  3   8
4  1  7  2   7
5  1  7  3   8
6  1  7  2  10
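
To see why this works, it can help to look at the intermediate values. The sketch below uses plain NumPy arrays holding the same numbers as df_a.a and df_b.c; the names a, c and mask are only for illustration.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 1])   # same values as df_a.a above
c = np.array([2, 3, 2])         # same values as df_b.c above

# mask[i, j] is True when a[i] < c[j]; its shape is (len(a), len(c))
mask = np.less.outer(a, c)
print(mask)

# np.where returns the row and column positions of the True cells
ia, ib = np.where(mask)
print(ia)   # positions into df_a
print(ib)   # positions into df_b
```

Only this boolean grid is ever as large as the Cartesian product; the other columns of both frames are fetched afterwards, for the matching positions only.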

To calculate the left outer join, use numpy.setdiff1d() to find all the rows of df_a that are not in the inner join:

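na = np.setdiff1d(np.arange(len(df_a)), ia)   # positions in df_a with no match
nb = -1 * np.ones_like(na)                    # -1 marks "no matching row in df_b"
oa = np.concatenate((ia, na))
ob = np.concatenate((ib, nb))
# reindex() turns the missing label -1 into a NaN row (assumes df_b has the default 0..n-1 index);
# in recent pandas, take(-1) would return the last row instead
print(pd.concat([df_a.take(oa).reset_index(drop=True),
                 df_b.reindex(ob).reset_index(drop=True)], axis=1))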

output:

   a  b   c   d
0  1  4   2   7
1  1  4   3   8
2  1  4   2  10
3  2  5   3   8
4  1  7   2   7
5  1  7   3   8
6  1  7   2  10
7  3  6 NaN NaN
8  4  8 NaN NaN

Answer by jharting

This can be done with broadcasting and np.where. Use whatever binary operator you want, as long as it evaluates to True/False:

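import operator as op

import numpy as np
import pandas as pd

df_a = pd.DataFrame([{"a": 1, "b": 4}, {"a": 2, "b": 5}, {"a": 3, "b": 6}])
df_b = pd.DataFrame([{"c": 2, "d": 7}, {"c": 3, "d": 8}])

binOp   = op.lt
# .values[:, None] makes a column vector, so the comparison broadcasts to a len(df_a) x len(df_b) grid
matches = np.where(binOp(df_a.a.values[:, None], df_b.c.values))

# .iloc replaces the removed .ix for positional row selection
print(pd.concat([df.iloc[idxs].reset_index(drop=True)
                 for df, idxs in zip([df_a, df_b], matches)],
                axis=1).to_csv())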

,a,b,c,d
0,1,4,2,7
1,1,4,3,8
2,2,5,3,8
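
The same broadcasting idea can be packaged so that the operator and the compared columns are parameters. This is a sketch only: theta_inner_join is a hypothetical helper name, and it assumes the operator works elementwise on NumPy arrays, as the functions in the operator module do.

```python
import operator as op

import numpy as np
import pandas as pd

def theta_inner_join(left, right, left_col, right_col, bin_op):
    """Inner join keeping pairs where bin_op(left value, right value) is True.

    theta_inner_join is a hypothetical name used only for this sketch.
    """
    # a column vector against a row vector broadcasts to a len(left) x len(right) grid
    grid = bin_op(left[left_col].to_numpy()[:, None], right[right_col].to_numpy())
    left_pos, right_pos = np.nonzero(grid)
    return pd.concat(
        [left.iloc[left_pos].reset_index(drop=True),
         right.iloc[right_pos].reset_index(drop=True)],
        axis=1,
    )

df_a = pd.DataFrame([{"a": 1, "b": 4}, {"a": 2, "b": 5}, {"a": 3, "b": 6}])
df_b = pd.DataFrame([{"c": 2, "d": 7}, {"c": 3, "d": 8}])

lt_join = theta_inner_join(df_a, df_b, "a", "c", op.lt)   # pairs with a < c
ge_join = theta_inner_join(df_a, df_b, "a", "c", op.ge)   # pairs with a >= c
print(lt_join)
print(ge_join)
```

Swapping op.lt for op.ge, op.ne, or a custom two-argument function changes the join condition without touching the rest of the code.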