Original source: http://stackoverflow.com/questions/44367672/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Best way to join / merge by range in pandas
Asked by Dimgold
I frequently use pandas to merge (join) on a range condition.
For instance, suppose there are two dataframes:
A(A_id, A_value)
B(B_id, B_low, B_high, B_name)
Both are large and roughly the same size (say, 2M records each).
I would like to do an inner join between A and B such that A_value lies between B_low and B_high.
In SQL syntax that would be:
SELECT *
FROM A,B
WHERE A_value between B_low and B_high
That would be easy, short and efficient.
Meanwhile, in pandas, the only way I found that doesn't use loops is to create a dummy column in both tables, join on it (equivalent to a cross join), and then filter out the unneeded rows. That sounds heavy and complex:
A['dummy'] = 1
B['dummy'] = 1
Temp = pd.merge(A,B,on='dummy')
Result = Temp[Temp.A_value.between(Temp.B_low,Temp.B_high)]
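For comparison, here is a minimal sketch of the same cross-join-then-filter idea without the dummy column. It assumes pandas 1.2 or later, where merge accepts how="cross"; the small frames and their values are made up for illustration. It still materializes every A/B pair, so it does not address the memory concern for 2M-row inputs.

```python
import pandas as pd

# Small illustrative frames (values are made up for this sketch)
A = pd.DataFrame({"A_id": [0, 1, 2], "A_value": [5, 35, 60]})
B = pd.DataFrame({
    "B_id": [0, 1],
    "B_low": [0, 30],
    "B_high": [10, 40],
    "B_name": ["x", "y"],
})

# how="cross" builds the Cartesian product directly,
# so no dummy column is needed
temp = A.merge(B, how="cross")

# Keep only the pairs where A_value falls inside [B_low, B_high]
result = temp[temp["A_value"].between(temp["B_low"], temp["B_high"])]
result = result.reset_index(drop=True)
print(result)
```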
Another solution I tried is to apply a search function on B for each value x of A, using the mask B[(x>=B.B_low) & (x<=B.B_high)], but that sounds inefficient as well and might require index optimization.
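The per-value mask search described above might look like the following sketch. The loop structure and the small example frames are assumptions for illustration, not code from the question. It scans all of B once per row of A, which is why it scales poorly.

```python
import pandas as pd

# Small illustrative frames (values are made up for this sketch)
A = pd.DataFrame({"A_id": [0, 1, 2], "A_value": [5, 35, 60]})
B = pd.DataFrame({"B_id": [0, 1, 2], "B_low": [0, 30, 30], "B_high": [10, 40, 50]})

pieces = []
for a_row in A.itertuples(index=False):
    x = a_row.A_value
    # Boolean mask over all of B for this single A value
    hits = B[(x >= B.B_low) & (x <= B.B_high)]
    if not hits.empty:
        # Attach the A columns to every matching B row
        pieces.append(hits.assign(A_id=a_row.A_id, A_value=x))

# Assumes at least one match exists; pd.concat raises on an empty list
cols = ["A_id", "A_value", "B_id", "B_low", "B_high"]
result = pd.concat(pieces, ignore_index=True)[cols]
print(result)
```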
Is there a more elegant and/or efficient way to perform this action?
Answered by piRSquared
Setup
Consider the dataframes A and B
import numpy as np
import pandas as pd

A = pd.DataFrame(dict(
    A_id=range(10),
    A_value=range(5, 105, 10)
))
B = pd.DataFrame(dict(
    B_id=range(5),
    B_low=[0, 30, 30, 46, 84],
    B_high=[10, 40, 50, 54, 84]
))
A
   A_id  A_value
0     0        5
1     1       15
2     2       25
3     3       35
4     4       45
5     5       55
6     6       65
7     7       75
8     8       85
9     9       95
B
   B_id  B_low  B_high
0     0      0      10
1     1     30      40
2     2     30      50
3     3     46      54
4     4     84      84
numpy
The "easiest" way is to use numpy broadcasting. We look for every instance of A_value being greater than or equal to B_low while at the same time A_value is less than or equal to B_high.
a = A.A_value.values
bh = B.B_high.values
bl = B.B_low.values

# Broadcast a (as a column) against bl and bh to test every A/B pair
i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))

pd.DataFrame(
    np.column_stack([A.values[i], B.values[j]]),
    columns=A.columns.append(B.columns)
)
   A_id  A_value  B_id  B_low  B_high
0     0        5     0      0      10
1     3       35     1     30      40
2     3       35     2     30      50
3     4       45     2     30      50
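The broadcast mask has one entry per A/B pair, so for inputs of the size mentioned in the question it may not fit in memory. One possible workaround, sketched here as an assumption rather than as part of the original answer, is to broadcast one chunk of A at a time. The helper name range_join_chunked and its chunk_size parameter are hypothetical.

```python
import numpy as np
import pandas as pd

def range_join_chunked(A, B, chunk_size=1000):
    """Inner range join of A and B, broadcasting one chunk of A at a time.

    Hypothetical helper; assumes A is non-empty and uses the column
    names A_value, B_low and B_high from the question.
    """
    a = A["A_value"].to_numpy()
    bl = B["B_low"].to_numpy()
    bh = B["B_high"].to_numpy()
    left_idx, right_idx = [], []
    for start in range(0, len(A), chunk_size):
        chunk = a[start:start + chunk_size]
        # Boolean mask of shape (len(chunk), len(B))
        i, j = np.where((chunk[:, None] >= bl) & (chunk[:, None] <= bh))
        left_idx.append(i + start)  # shift back to positions in A
        right_idx.append(j)
    i = np.concatenate(left_idx)
    j = np.concatenate(right_idx)
    return pd.concat(
        [A.iloc[i].reset_index(drop=True), B.iloc[j].reset_index(drop=True)],
        axis=1,
    )

A = pd.DataFrame(dict(A_id=range(10), A_value=range(5, 105, 10)))
B = pd.DataFrame(dict(
    B_id=range(5),
    B_low=[0, 30, 30, 46, 84],
    B_high=[10, 40, 50, 54, 84],
))
result = range_join_chunked(A, B, chunk_size=3)
print(result)
```

Each mask holds chunk_size * len(B) booleans, so chunk_size should be picked with the size of B in mind. The total number of comparisons is unchanged; only the peak memory drops.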
To address the comments and give something akin to a left join, I appended the part of A that doesn't match.
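matched = pd.DataFrame(
    np.column_stack([A.values[i], B.values[j]]),
    columns=A.columns.append(B.columns)
)
# Rows of A whose position never appears in i had no match in B
unmatched = A[~np.isin(np.arange(len(A)), np.unique(i))]
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
pd.concat([matched, unmatched], ignore_index=True, sort=False)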
    A_id  A_value  B_id  B_low  B_high
0      0        5   0.0    0.0    10.0
1      3       35   1.0   30.0    40.0
2      3       35   2.0   30.0    50.0
3      4       45   2.0   30.0    50.0
4      1       15   NaN    NaN     NaN
5      2       25   NaN    NaN     NaN
6      5       55   NaN    NaN     NaN
7      6       65   NaN    NaN     NaN
8      7       75   NaN    NaN     NaN
9      8       85   NaN    NaN     NaN
10     9       95   NaN    NaN     NaN
Answered by Adonis
I'm not sure it is more efficient, but you can use SQL directly (via the sqlite3 module, for instance) with pandas (inspired by this question), like:
import sqlite3

import numpy as np
import pandas as pd

conn = sqlite3.connect(":memory:")
df2 = pd.DataFrame(np.random.randn(10, 5), columns=["col1", "col2", "col3", "col4", "col5"])
df1 = pd.DataFrame(np.random.randn(10, 5), columns=["col1", "col2", "col3", "col4", "col5"])
df1.to_sql("df1", conn, index=False)
df2.to_sql("df2", conn, index=False)
qry = "SELECT * FROM df1, df2 WHERE df1.col1 > 0 and df1.col1<0.5"
tt = pd.read_sql_query(qry, conn)
You can adapt the query as needed in your application.
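As one possible adaptation, the sketch below applies the same sqlite3 approach to the range join from the question. The table names A and B and the sample data (borrowed from the setup in piRSquared's answer) are assumptions for illustration; the ORDER BY clause is added only to make the row order predictable.

```python
import sqlite3

import pandas as pd

A = pd.DataFrame(dict(A_id=range(10), A_value=range(5, 105, 10)))
B = pd.DataFrame(dict(
    B_id=range(5),
    B_low=[0, 30, 30, 46, 84],
    B_high=[10, 40, 50, 54, 84],
))

# Load both frames into an in-memory SQLite database
conn = sqlite3.connect(":memory:")
A.to_sql("A", conn, index=False)
B.to_sql("B", conn, index=False)

# Range condition expressed directly in SQL
qry = """
SELECT *
FROM A, B
WHERE A.A_value BETWEEN B.B_low AND B.B_high
ORDER BY A.A_id, B.B_id
"""
result = pd.read_sql_query(qry, conn)
conn.close()
print(result)
```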
Answered by baloo
I don't know how efficient it is, but someone wrote a wrapper that lets you use SQL syntax with pandas objects. It's called pandasql. The documentation explicitly states that joins are supported. This might at least be easier to read, since SQL syntax is very readable.
Answered by suvy
Let's take a simple example:
df=pd.DataFrame([2,3,4,5,6],columns=['A'])
returns
   A
0  2
1  3
2  4
3  5
4  6
Now let's define a second dataframe:
df2=pd.DataFrame([1,6,2,3,5],columns=['B_low'])
df2['B_high']=[2,8,4,6,6]
results in
   B_low  B_high
0      1       2
1      6       8
2      2       4
3      3       6
4      5       6
Here we go. Comparing the two frames row by row (row i of df against row i of df2), we want the output to be index 3 with A value 5:
df.where(df['A']>=df2['B_low']).where(df['A']<df2['B_high']).dropna()
results in
     A
3  5.0
Answered by Akshay Kandul
Suppose your A dataframe is
A = pd.DataFrame([[0,2],[1,3],[2,4],[3,5],[4,6]],columns=['A_id', 'A_value'])
and your B dataframe is
B = pd.DataFrame([[0,1,2,'a'],[1,4,9,'b'],[2,2,5,'c'],[3,6,7,'d'],[4,8,9,'e']],columns=['B_id', 'B_low', 'B_high', 'B_name'])
The expression below keeps the rows of A whose A_value lies between B_low and B_high of the row at the same position in B:
A = A[(A['A_value']>=B['B_low'])&(A['A_value']<=B['B_high'])]
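To see how this position-wise filter relates to the join asked about in the question, the sketch below runs both on the same data. The all-pairs part reuses the numpy broadcasting idea from piRSquared's answer; the variable names same_row and all_pairs are made up for illustration.

```python
import numpy as np
import pandas as pd

A = pd.DataFrame([[0, 2], [1, 3], [2, 4], [3, 5], [4, 6]],
                 columns=['A_id', 'A_value'])
B = pd.DataFrame([[0, 1, 2, 'a'], [1, 4, 9, 'b'], [2, 2, 5, 'c'],
                  [3, 6, 7, 'd'], [4, 8, 9, 'e']],
                 columns=['B_id', 'B_low', 'B_high', 'B_name'])

# Position-wise: row k of A is tested only against row k of B
same_row = A[(A['A_value'] >= B['B_low']) & (A['A_value'] <= B['B_high'])]

# All pairs: every A_value is tested against every (B_low, B_high) range
a = A['A_value'].to_numpy()
i, j = np.where((a[:, None] >= B['B_low'].to_numpy())
                & (a[:, None] <= B['B_high'].to_numpy()))
all_pairs = pd.concat(
    [A.iloc[i].reset_index(drop=True), B.iloc[j].reset_index(drop=True)],
    axis=1,
)

print(len(same_row), len(all_pairs))
```

On this data the position-wise filter keeps 2 rows, while the all-pairs condition matches 9 A/B combinations, so the two are only interchangeable when A and B are already aligned row by row.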