pandas: Best way to join / merge by range

Question

Asked by Dimgold

I frequently use pandas to merge (join) on a range condition.

For instance, suppose there are two dataframes:

A(A_id, A_value)

B(B_id,B_low, B_high, B_name)

Both are large and approximately the same size (let's say 2M records each).

I would like to do an inner join between A and B such that A_value falls between B_low and B_high.

In SQL syntax that would be:

SELECT *
FROM A,B
WHERE A_value between B_low and B_high

That would be really easy, short and efficient.

Meanwhile, in pandas the only way I have found that doesn't use loops is to create a dummy column in both tables, join on it (equivalent to a cross join) and then filter out the unneeded rows. That sounds heavy and complex:

A['dummy'] = 1
B['dummy'] = 1
Temp = pd.merge(A,B,on='dummy')
Result = Temp[Temp.A_value.between(Temp.B_low,Temp.B_high)]
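
On pandas 1.2 or later the helper column is not needed, because pd.merge accepts how="cross". A minimal sketch with small made-up frames (the data below is illustrative only, not from the question); it still materializes every A-B pair before filtering, so the memory cost is the same as with the dummy column:

```python
import pandas as pd

# Illustrative data only; the real frames hold about 2M rows each.
A = pd.DataFrame({"A_id": [0, 1, 2], "A_value": [5, 35, 60]})
B = pd.DataFrame({
    "B_id": [0, 1],
    "B_low": [0, 30],
    "B_high": [10, 40],
    "B_name": ["x", "y"],
})

# how="cross" builds the Cartesian product without a helper column.
temp = pd.merge(A, B, how="cross")

# Keep only the pairs where A_value falls inside [B_low, B_high].
result = temp[temp.A_value.between(temp.B_low, temp.B_high)].reset_index(drop=True)
print(result)
```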

Another solution I had is to apply a search function on B for each value of A, using the mask B[(x>=B.B_low) & (x<=B.B_high)], but that sounds inefficient as well and might require index optimization.

Is there a more elegant and/or efficient way to perform this action?

Answer 1

Answered by piRSquared

Setup
Consider the dataframes A and B:

import numpy as np
import pandas as pd

A = pd.DataFrame(dict(
        A_id=range(10),
        A_value=range(5, 105, 10)
    ))
B = pd.DataFrame(dict(
        B_id=range(5),
        B_low=[0, 30, 30, 46, 84],
        B_high=[10, 40, 50, 54, 84]
    ))

A

   A_id  A_value
0     0        5
1     1       15
2     2       25
3     3       35
4     4       45
5     5       55
6     6       65
7     7       75
8     8       85
9     9       95

B

   B_high  B_id  B_low
0      10     0      0
1      40     1     30
2      50     2     30
3      54     3     46
4      84     4     84

numpy
The easiest way is to use numpy broadcasting.
We look for every instance where A_value is greater than or equal to B_low and, at the same time, less than or equal to B_high.

a = A.A_value.values
bh = B.B_high.values
bl = B.B_low.values

i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))

pd.DataFrame(
    np.column_stack([A.values[i], B.values[j]]),
    columns=A.columns.append(B.columns)
)

   A_id  A_value  B_high  B_id  B_low
0     0        5      10     0      0
1     3       35      40     1     30
2     3       35      50     2     30
3     4       45      50     2     30
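
The comparison above builds a boolean matrix of shape (len(A), len(B)), which at 2M rows each is unlikely to fit in memory. One possible workaround, sketched here under the assumption that processing A in slices is acceptable, is to broadcast one chunk at a time. The helper name range_join_chunked and its chunk_size parameter are hypothetical, not part of pandas or numpy:

```python
import numpy as np
import pandas as pd

def range_join_chunked(A, B, chunk_size=1000):
    """Inner range join via numpy broadcasting, one slice of A at a time (hypothetical helper)."""
    a_all = A.A_value.values
    bl = B.B_low.values
    bh = B.B_high.values
    # Start with empty index arrays so np.concatenate also works for an empty A.
    left_idx = [np.empty(0, dtype=np.intp)]
    right_idx = [np.empty(0, dtype=np.intp)]
    for start in range(0, len(A), chunk_size):
        a = a_all[start:start + chunk_size]
        # Boolean matrix of shape (len(a), len(B)) for this slice only.
        i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))
        left_idx.append(i + start)  # shift back to positions in the full A
        right_idx.append(j)
    i = np.concatenate(left_idx)
    j = np.concatenate(right_idx)
    return pd.concat(
        [A.iloc[i].reset_index(drop=True), B.iloc[j].reset_index(drop=True)],
        axis=1,
    )

A = pd.DataFrame(dict(A_id=range(10), A_value=range(5, 105, 10)))
B = pd.DataFrame(dict(
    B_id=range(5),
    B_low=[0, 30, 30, 46, 84],
    B_high=[10, 40, 50, 54, 84],
))

result = range_join_chunked(A, B, chunk_size=3)
print(result)
```

The total amount of comparison work is unchanged; only the peak memory is bounded by chunk_size * len(B) booleans.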

To address the comments and give something akin to a left join, I appended the part of A that doesn't match.

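# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
pd.concat(
    [
        pd.DataFrame(
            np.column_stack([A.values[i], B.values[j]]),
            columns=A.columns.append(B.columns)
        ),
        # rows of A whose position never appears in i, i.e. with no match in B
        A[~np.isin(np.arange(len(A)), np.unique(i))]
    ],
    ignore_index=True, sort=False
)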

    A_id  A_value  B_id  B_low  B_high
0      0        5   0.0    0.0    10.0
1      3       35   1.0   30.0    40.0
2      3       35   2.0   30.0    50.0
3      4       45   2.0   30.0    50.0
4      1       15   NaN    NaN     NaN
5      2       25   NaN    NaN     NaN
6      5       55   NaN    NaN     NaN
7      6       65   NaN    NaN     NaN
8      7       75   NaN    NaN     NaN
9      8       85   NaN    NaN     NaN
10     9       95   NaN    NaN     NaN

Answer 2

Answered by Adonis

I'm not sure it is more efficient, but you can use SQL directly (from the sqlite3 module, for instance) with pandas (inspired by this question), like:

import sqlite3
import numpy as np
import pandas as pd

conn = sqlite3.connect(":memory:")
df2 = pd.DataFrame(np.random.randn(10, 5), columns=["col1", "col2", "col3", "col4", "col5"])
df1 = pd.DataFrame(np.random.randn(10, 5), columns=["col1", "col2", "col3", "col4", "col5"])
# copy both dataframes into the in-memory SQLite database
df1.to_sql("df1", conn, index=False)
df2.to_sql("df2", conn, index=False)
qry = "SELECT * FROM df1, df2 WHERE df1.col1 > 0 and df1.col1<0.5"
tt = pd.read_sql_query(qry,conn)

You can adapt the query as needed in your application
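
For example, adapted to the tables from the question, the query could use BETWEEN directly. This is only a sketch with small made-up data; the table names "A" and "B" passed to to_sql are chosen to mirror the question, and the ORDER BY clause is added just to make the row order predictable:

```python
import sqlite3
import pandas as pd

# Illustrative data only, shaped like the question's A and B.
A = pd.DataFrame({"A_id": [0, 1, 2], "A_value": [5, 35, 60]})
B = pd.DataFrame({
    "B_id": [0, 1],
    "B_low": [0, 30],
    "B_high": [10, 40],
    "B_name": ["x", "y"],
})

conn = sqlite3.connect(":memory:")
A.to_sql("A", conn, index=False)
B.to_sql("B", conn, index=False)

qry = """
SELECT *
FROM A, B
WHERE A_value BETWEEN B_low AND B_high
ORDER BY A_id, B_id
"""
result = pd.read_sql_query(qry, conn)
conn.close()
print(result)
```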


Answer 3

Answered by baloo

I don't know how efficient it is, but someone wrote a wrapper that allows you to use SQL syntax with pandas objects. It's called pandasql. The documentation explicitly states that joins are supported. This might at least be easier to read, since SQL syntax is very readable.

Answer 4

Answered by suvy

Let's take a simple example:

df=pd.DataFrame([2,3,4,5,6],columns=['A'])

returns

    A
0   2
1   3
2   4
3   5
4   6

Now let's define a second dataframe:

df2=pd.DataFrame([1,6,2,3,5],columns=['B_low'])
df2['B_high']=[2,8,4,6,6]

results in

    B_low   B_high
0   1       2
1   6       8
2   2       4
3   3       6
4   5       6

Here we go: we want the output to be index 3, where A has the value 5.

df.where(df['A']>=df2['B_low']).where(df['A']<df2['B_high']).dropna()

results in

    A
3   5.0

Answer 5

Answered by Akshay Kandul

Consider that your A dataframe is

A = pd.DataFrame([[0,2],[1,3],[2,4],[3,5],[4,6]],columns=['A_id', 'A_value'])

and B dataframe is

B = pd.DataFrame([[0,1,2,'a'],[1,4,9,'b'],[2,2,5,'c'],[3,6,7,'d'],[4,8,9,'e']],columns=['B_id', 'B_low', 'B_high', 'B_name'])

Using the expression below you will get the desired output:

A = A[(A['A_value']>=B['B_low'])&(A['A_value']<=B['B_high'])]
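
Note that this comparison is aligned on the index, so row k of A is only checked against row k of B, not against every range in B. A small sketch with the same frames that contrasts the two behaviours (the variable names row_wise and all_pairs are only illustrative):

```python
import numpy as np
import pandas as pd

A = pd.DataFrame([[0, 2], [1, 3], [2, 4], [3, 5], [4, 6]],
                 columns=['A_id', 'A_value'])
B = pd.DataFrame([[0, 1, 2, 'a'], [1, 4, 9, 'b'], [2, 2, 5, 'c'],
                  [3, 6, 7, 'd'], [4, 8, 9, 'e']],
                 columns=['B_id', 'B_low', 'B_high', 'B_name'])

# Row-wise: A.iloc[k] is compared only with B.iloc[k].
row_wise = A[(A['A_value'] >= B['B_low']) & (A['A_value'] <= B['B_high'])]

# All pairs: every A_value is compared with every [B_low, B_high] range.
a = A['A_value'].values
i, j = np.where((a[:, None] >= B['B_low'].values) &
                (a[:, None] <= B['B_high'].values))
all_pairs = pd.DataFrame({'A_id': A['A_id'].values[i],
                          'B_id': B['B_id'].values[j]})

print(row_wise)
print(all_pairs)
```

If a join against every range is what you need, the all-pairs form (or one of the SQL-based answers) is closer to the SQL query in the question.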
