Original question: http://stackoverflow.com/questions/39786406/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
How to match multiple columns in pandas DataFrame for an "interval"?
Asked by ShanZhengYang
I have the following pandas DataFrame:
import pandas as pd
df = pd.read_csv('filename.csv')  # load the table from a CSV file
print(df)
order  start  end   value
1      1342   1357  category1
1      1459   1489  category7
1      1572   1601  category23
1      1587   1599  category2
1      1591   1639  category1
....
15     792    813   category13
15     892    913   category5
....
So there is an order column, where each order value spans many rows, and each row has a range/interval from start to end. Each row is then labeled with a certain value (e.g. category1, category2, etc.).
Now I have another dataframe called key_df. It has basically the exact same format:
import pandas as pd
key_df = pd.DataFrame(...)
print(key_df)
order  start  end   value
1      1284   1299  category4
1      1297   1309  category9
1      1312   1369  category3
1      1345   1392  category29
1      1371   1383  category31
....
1      1471   1501  category31
...
My goal is to take the key_df dataframe and check whether its start:end intervals match any of the rows in the original dataframe df. If one does, that row in df should be labeled with the value from key_df's value column.
In our example above, the dataframe df would end up like this:
order  start  end   value      key_value
1      1342   1357  category1  category29
1      1459   1489  category7  category31
....
This is because if you look at key_df, the row
1 1345 1392 category29
with interval 1::1345-1392 falls in the interval 1::1342-1357 in the original df. Likewise, the key_df row:
1 1471 1501 category31
corresponds to the second row in df:
1 1459 1489 category7 category31
I'm not entirely sure:
(1) how to accomplish this task in pandas
(2) how to scale this efficiently in pandas
One could begin with an if statement, e.g.
if df.order == key_df.order:
# now check intervals...somehow
but this doesn't take advantage of the dataframe structure. One must then check by interval, i.e. something like (df.start <= key_df.start) & (df.end >= key_df.end)
I'm stuck. What is the most efficient way to match multiple columns in an "interval" in pandas? (Creating a new column if this condition is met is then straightforward)
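The two example pairs in the question overlap without either interval containing the other, so the containment check written above would not flag them. Below is a minimal sketch of an overlap test for a single key interval, assuming closed intervals and assuming that overlap is the intended rule (the question does not say so explicitly). It uses only the rows shown in the question, and `key` and `overlaps` are illustrative names:

```python
import pandas as pd

# Rows shown in the question (the elided rows are left out).
df = pd.DataFrame({
    'order': [1, 1, 1, 1, 1, 15, 15],
    'start': [1342, 1459, 1572, 1587, 1591, 792, 892],
    'end':   [1357, 1489, 1601, 1599, 1639, 813, 913],
    'value': ['category1', 'category7', 'category23', 'category2',
              'category1', 'category13', 'category5'],
})

# One key interval from the question: order 1, 1345-1392, category29.
key = {'order': 1, 'start': 1345, 'end': 1392, 'value': 'category29'}

# Element-wise comparisons return boolean Series. Combine them with `&`
# (not `and` or `&&`) and wrap each comparison in parentheses.
overlaps = (
    (df['order'] == key['order'])
    & (df['start'] <= key['end'])   # df interval starts before the key ends
    & (df['end'] >= key['start'])   # df interval ends after the key starts
)
print(df[overlaps])
```

This handles one key row at a time. Looping over every row of key_df this way would work but is unlikely to scale well.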
Accepted answer by jezrael
You can use merge with boolean indexing, but if the DataFrames are large, scaling is problematic:
df1 = pd.merge(df, key_df, on='order', how='outer', suffixes=('','_key'))
df1 = df1[(df1.start <= df1.start_key) & (df1.end <= df1.end_key)]
print (df1)
    order  start   end      value  start_key  end_key   value_key
3       1   1342  1357  category1     1345.0   1392.0  category29
4       1   1342  1357  category1     1371.0   1383.0  category31
5       1   1342  1357  category1     1471.0   1501.0  category31
11      1   1459  1489  category7     1471.0   1501.0  category31
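The scaling concern comes from the merge itself: joining on order alone pairs every df row with every key_df row that shares the same order, and the interval filter only runs afterwards. Here is a rough sketch of how one might estimate that intermediate size before merging, using only the rows shown in the question (the elided rows are left out). The names `n_df`, `n_key` and `pairs_per_order` are illustrative:

```python
import pandas as pd

# Rows shown in the question (the elided rows are left out).
df = pd.DataFrame({
    'order': [1, 1, 1, 1, 1, 15, 15],
    'start': [1342, 1459, 1572, 1587, 1591, 792, 892],
    'end':   [1357, 1489, 1601, 1599, 1639, 813, 913],
    'value': ['category1', 'category7', 'category23', 'category2',
              'category1', 'category13', 'category5'],
})
key_df = pd.DataFrame({
    'order': [1, 1, 1, 1, 1, 1],
    'start': [1284, 1297, 1312, 1345, 1371, 1471],
    'end':   [1299, 1309, 1369, 1392, 1383, 1501],
    'value': ['category4', 'category9', 'category3', 'category29',
              'category31', 'category31'],
})

# Number of rows per order in each frame.
n_df = df.groupby('order').size()
n_key = key_df.groupby('order').size()

# Each order contributes (rows in df) * (rows in key_df) candidate pairs.
# An order missing from one frame contributes no pairs.
pairs_per_order = n_df.mul(n_key, fill_value=0).astype(int)
print(pairs_per_order)

# The outer merge also keeps unmatched rows, so it is slightly larger.
merged = pd.merge(df, key_df, on='order', how='outer', suffixes=('', '_key'))
print(len(merged))
```

If a single order holds thousands of rows in both frames, the product grows quadratically for that order, which is where memory use becomes the limiting factor.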
EDIT by comment:
df1 = pd.merge(df, key_df, on='order', how='outer', suffixes=('','_key'))
df1 = df1[(df1.start <= df1.start_key) & (df1.end <= df1.end_key)]
df1 = pd.merge(df, df1, on=['order','start','end', 'value'], how='left')
print (df1)
   order  start   end       value  start_key  end_key   value_key
0      1   1342  1357   category1     1345.0   1392.0  category29
1      1   1342  1357   category1     1371.0   1383.0  category31
2      1   1342  1357   category1     1471.0   1501.0  category31
3      1   1459  1489   category7     1471.0   1501.0  category31
4      1   1572  1601  category23        NaN      NaN         NaN
5      1   1587  1599   category2        NaN      NaN         NaN
6      1   1591  1639   category1        NaN      NaN         NaN
7     15    792   813  category13        NaN      NaN         NaN
8     15    892   913   category5        NaN      NaN         NaN
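The result above can hold several key rows for one row of df, while the question's desired output has a single key_value per row. One possible way to collapse it is sketched below. It reuses the answer's filter condition, and it assumes that the match with the smallest start_key should win; that tie-break is an assumption, not something stated in the question or the answer. It uses an inner merge so the key columns stay integers, and the names `pairs`, `hits`, `first` and `result` are illustrative:

```python
import pandas as pd

# Rows shown in the question (the elided rows are left out).
df = pd.DataFrame({
    'order': [1, 1, 1, 1, 1, 15, 15],
    'start': [1342, 1459, 1572, 1587, 1591, 792, 892],
    'end':   [1357, 1489, 1601, 1599, 1639, 813, 913],
    'value': ['category1', 'category7', 'category23', 'category2',
              'category1', 'category13', 'category5'],
})
key_df = pd.DataFrame({
    'order': [1, 1, 1, 1, 1, 1],
    'start': [1284, 1297, 1312, 1345, 1371, 1471],
    'end':   [1299, 1309, 1369, 1392, 1383, 1501],
    'value': ['category4', 'category9', 'category3', 'category29',
              'category31', 'category31'],
})

# reset_index() adds an 'index' column so each pair can be traced
# back to the df row it came from.
pairs = pd.merge(df.reset_index(), key_df, on='order', suffixes=('', '_key'))

# Same filter as in the answer.
hits = pairs[(pairs['start'] <= pairs['start_key'])
             & (pairs['end'] <= pairs['end_key'])]

# Assumption: keep the match with the smallest start_key per df row.
first = (hits.sort_values(['index', 'start_key'])
             .drop_duplicates('index')
             .set_index('index')['value_key'])

# Assignment aligns on the index; rows without a match get NaN.
result = df.assign(key_value=first)
print(result)
```

On the rows shown, this labels the first row with category29 and the second with category31, and leaves the remaining rows unlabeled.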