Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Stack Overflow original: http://stackoverflow.com/questions/44367672/

Time: 2020-09-14 03:44:12  Source: igfitidea

Best way to join / merge by range in pandas

Tags: python, pandas, numpy, join

Asked by Dimgold

I frequently use pandas to merge (join) on a range condition.

For instance, suppose there are two dataframes:

A(A_id, A_value)

B(B_id, B_low, B_high, B_name)

Both are large and roughly the same size (say, 2M records each).

I would like to do an inner join between A and B such that A_value is between B_low and B_high.

In SQL syntax, that would be:

SELECT *
FROM A,B
WHERE A_value between B_low and B_high

That would be easy, short, and efficient.

In pandas, meanwhile, the only way I have found that avoids loops is to create a dummy column in both tables, join on it (equivalent to a cross join), and then filter out the unneeded rows. That sounds heavy and complex:

A['dummy'] = 1
B['dummy'] = 1
Temp = pd.merge(A,B,on='dummy')
Result = Temp[Temp.A_value.between(Temp.B_low,Temp.B_high)]
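
For reference, newer pandas releases (1.2 and later) accept how="cross" in merge, which produces the same Cartesian product without the dummy column. The sketch below uses small made-up frames in place of the question's A and B; the values are illustrative assumptions, and the memory cost is still proportional to len(A) * len(B).

```python
import pandas as pd

# Small stand-ins for the question's A and B (values are illustrative only).
A = pd.DataFrame({"A_id": [0, 1, 2], "A_value": [5, 35, 60]})
B = pd.DataFrame({
    "B_id": [0, 1, 2],
    "B_low": [0, 30, 30],
    "B_high": [10, 40, 50],
    "B_name": ["x", "y", "z"],
})

# how="cross" (pandas >= 1.2) builds the Cartesian product directly,
# so neither frame needs a temporary 'dummy' column.
temp = A.merge(B, how="cross")

# Keep only the pairs where A_value lies in [B_low, B_high], inclusive on both ends.
in_range = temp["A_value"].between(temp["B_low"], temp["B_high"])
result = temp[in_range].reset_index(drop=True)
print(result)
```

This only tidies the syntax: the intermediate frame still holds every (A row, B row) pair before filtering.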

Another solution I tried is to apply a search function on B for each value x of A, using the mask B[(x>=B.B_low) & (x<=B.B_high)], but that sounds inefficient as well and might require index optimization.
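
A minimal sketch of that per-value mask idea, assuming small illustrative frames; the helper name matches_for is hypothetical and not part of the original question. It builds one boolean mask over B for every row of A, which is why it scales poorly.

```python
import pandas as pd

# Illustrative data only; the question does not give concrete values.
A = pd.DataFrame({"A_id": [0, 1, 2], "A_value": [5, 35, 60]})
B = pd.DataFrame({"B_id": [0, 1, 2], "B_low": [0, 30, 30], "B_high": [10, 40, 50]})

def matches_for(x):
    """Return the rows of B whose [B_low, B_high] range contains x (hypothetical helper)."""
    return B[(x >= B.B_low) & (x <= B.B_high)]

pieces = []
for a_row in A.itertuples(index=False):
    hit = matches_for(a_row.A_value)
    if not hit.empty:
        # Attach the A columns to every matching B row.
        pieces.append(hit.assign(A_id=a_row.A_id, A_value=a_row.A_value))

# Assumes at least one match exists; pd.concat raises on an empty list.
result = pd.concat(pieces, ignore_index=True)[
    ["A_id", "A_value", "B_id", "B_low", "B_high"]
]
print(result)
```

Each call scans all of B, so the total work is still len(A) * len(B) comparisons, with Python-level loop overhead on top.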

Is there a more elegant and/or efficient way to perform this action?

Answered by piRSquared

Setup
Consider the dataframes A and B:

import numpy as np
import pandas as pd

A = pd.DataFrame(dict(
        A_id=range(10),
        A_value=range(5, 105, 10)
    ))
B = pd.DataFrame(dict(
        B_id=range(5),
        B_low=[0, 30, 30, 46, 84],
        B_high=[10, 40, 50, 54, 84]
    ))

A

   A_id  A_value
0     0        5
1     1       15
2     2       25
3     3       35
4     4       45
5     5       55
6     6       65
7     7       75
8     8       85
9     9       95

B

   B_high  B_id  B_low
0      10     0      0
1      40     1     30
2      50     2     30
3      54     3     46
4      84     4     84


numpy
The easiest way is to use numpy broadcasting.
We look for every instance of A_value being greater than or equal to B_low while at the same time A_value is less than or equal to B_high.

a = A.A_value.values
bh = B.B_high.values
bl = B.B_low.values

i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))

pd.DataFrame(
    np.column_stack([A.values[i], B.values[j]]),
    columns=A.columns.append(B.columns)
)

   A_id  A_value  B_high  B_id  B_low
0     0        5      10     0      0
1     3       35      40     1     30
2     3       35      50     2     30
3     4       45      50     2     30


To address the comments and give something akin to a left join, I appended the part of A that doesn't match.

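matched = pd.DataFrame(
    np.column_stack([A.values[i], B.values[j]]),
    columns=A.columns.append(B.columns)
)
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job here.
pd.concat(
    [matched, A[~np.isin(np.arange(len(A)), np.unique(i))]],
    ignore_index=True, sort=False
)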

    A_id  A_value  B_id  B_low  B_high
0      0        5   0.0    0.0    10.0
1      3       35   1.0   30.0    40.0
2      3       35   2.0   30.0    50.0
3      4       45   2.0   30.0    50.0
4      1       15   NaN    NaN     NaN
5      2       25   NaN    NaN     NaN
6      5       55   NaN    NaN     NaN
7      6       65   NaN    NaN     NaN
8      7       75   NaN    NaN     NaN
9      8       85   NaN    NaN     NaN
10     9       95   NaN    NaN     NaN

Answered by Adonis

I'm not sure it is more efficient, but you can use SQL directly (for instance via the sqlite3 module) with pandas (inspired by this question), like so:

import sqlite3
import numpy as np
import pandas as pd

conn = sqlite3.connect(":memory:")
df2 = pd.DataFrame(np.random.randn(10, 5), columns=["col1", "col2", "col3", "col4", "col5"])
df1 = pd.DataFrame(np.random.randn(10, 5), columns=["col1", "col2", "col3", "col4", "col5"])
df1.to_sql("df1", conn, index=False)
df2.to_sql("df2", conn, index=False)
qry = "SELECT * FROM df1, df2 WHERE df1.col1 > 0 and df1.col1<0.5"
tt = pd.read_sql_query(qry,conn)

You can adapt the query as needed in your application.
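
As one possible adaptation, the same sqlite3 approach can express the range join from the question with BETWEEN. This is a sketch that assumes the A and B frames from the earlier answer; the table names "A" and "B" and the ORDER BY clause are choices made for this example.

```python
import sqlite3
import pandas as pd

# Frames borrowed from the earlier answer's setup.
A = pd.DataFrame({"A_id": list(range(10)), "A_value": list(range(5, 105, 10))})
B = pd.DataFrame({
    "B_id": list(range(5)),
    "B_low": [0, 30, 30, 46, 84],
    "B_high": [10, 40, 50, 54, 84],
})

conn = sqlite3.connect(":memory:")
A.to_sql("A", conn, index=False)
B.to_sql("B", conn, index=False)

# BETWEEN is inclusive on both ends, matching the question's SQL.
qry = """
SELECT *
FROM A
JOIN B ON A.A_value BETWEEN B.B_low AND B.B_high
ORDER BY A.A_id, B.B_id
"""
result = pd.read_sql_query(qry, conn)
conn.close()
print(result)
```

The round trip through SQLite copies both frames into the database first, so whether it beats an in-memory approach likely depends on the data size and on whether indexes are added.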

Answered by baloo

I don't know how efficient it is, but someone wrote a wrapper that allows you to use SQL syntax with pandas objects. That's called pandasql. The documentation explicitly states that joins are supported. This might be at least easier to read since SQL syntax is very readable.

Answered by suvy

Let's take a simple example:

df=pd.DataFrame([2,3,4,5,6],columns=['A'])

returns

    A
0   2
1   3
2   4
3   5
4   6

Now let's define a second dataframe:

df2=pd.DataFrame([1,6,2,3,5],columns=['B_low'])
df2['B_high']=[2,8,4,6,6]

results in

    B_low   B_high
0   1       2
1   6       8
2   2       4
3   3       6
4   5       6

Here we go: we want the output to be index 3, where A is 5 (each row of df is compared with the row of df2 at the same index):

df.where(df['A']>=df2['B_low']).where(df['A']<df2['B_high']).dropna()

results in

    A
3   5.0

Answered by Akshay Kandul

Suppose your A dataframe is:

A = pd.DataFrame([[0,2],[1,3],[2,4],[3,5],[4,6]],columns=['A_id', 'A_value'])

and your B dataframe is:

B = pd.DataFrame([[0,1,2,'a'],[1,4,9,'b'],[2,2,5,'c'],[3,6,7,'d'],[4,8,9,'e']],columns=['B_id', 'B_low', 'B_high', 'B_name'])

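The expression below compares A and B row by row (matching on index) and keeps the rows of A whose A_value lies within the B_low to B_high range of the corresponding row of B: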

A = A[(A['A_value']>=B['B_low'])&(A['A_value']<=B['B_high'])]
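
To make the behaviour of that expression concrete, the sketch below runs it on the frames above and, for comparison, computes the all-pairs range join with numpy broadcasting as in the earlier answer. The variable names rowwise and joined are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

A = pd.DataFrame([[0, 2], [1, 3], [2, 4], [3, 5], [4, 6]],
                 columns=['A_id', 'A_value'])
B = pd.DataFrame([[0, 1, 2, 'a'], [1, 4, 9, 'b'], [2, 2, 5, 'c'],
                  [3, 6, 7, 'd'], [4, 8, 9, 'e']],
                 columns=['B_id', 'B_low', 'B_high', 'B_name'])

# Element-wise: row k of A is tested only against row k of B (aligned on index).
rowwise = A[(A['A_value'] >= B['B_low']) & (A['A_value'] <= B['B_high'])]

# Range join: every A_value is tested against every (B_low, B_high) pair.
a = A['A_value'].to_numpy()
i, j = np.where((a[:, None] >= B['B_low'].to_numpy())
                & (a[:, None] <= B['B_high'].to_numpy()))
joined = pd.concat(
    [A.iloc[i].reset_index(drop=True), B.iloc[j].reset_index(drop=True)],
    axis=1,
)

print(len(rowwise), len(joined))
```

On this data the element-wise filter keeps two rows of A, while the range join yields nine matching pairs, so the two are not interchangeable when a join across all rows of B is what is needed.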