Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/39786406/

How to match multiple columns in pandas DataFrame for an "interval"?

Tags: python, pandas, dataframe, match, intervals

Asked by ShanZhengYang

I have the following pandas DataFrame:

import pandas as pd
df = pd.read_csv('filename.csv')
print(df)

order    start    end    value    
1        1342    1357    category1
1        1459    1489    category7
1        1572    1601    category23
1        1587    1599    category2
1        1591    1639    category1
....
15        792     813    category13
15        892     913    category5
....

So there is an order column, where each order encompasses many rows, and a range/interval from start to end for each row. Each row is then labeled with a certain value (e.g. category1, category2, etc.).

Now I have another dataframe called key_df. It is basically the exact same format:

import pandas as pd
key_df = pd.DataFrame(...)
print(key_df)

order    start    end    value    
1        1284    1299    category4
1        1297    1309    category9
1        1312    1369    category3
1        1345    1392    category29
1        1371    1383    category31
....
1        1471    1501    category31
...

My goal is to take the key_df dataframe and check whether its start:end intervals match any of the rows in the original dataframe df. If one does, that row in df should be labeled with the value from the key_df dataframe.

In our example above, the dataframe df would end up like this:

order    start    end    value        key_value
1        1342    1357    category1    category29
1        1459    1489    category7    category31
....

This is because if you look at key_df, the row

1        1345    1392    category29

with interval 1::1345-1392 falls in the interval 1::1342-1357 in the original df. Likewise, the key_df row:

1        1471    1501    category31

corresponds to the second row in df:

1        1459    1489    category7    category31

I'm not entirely sure

(1) how to accomplish this task in pandas

(2) how to scale this efficiently in pandas

One could begin with an if statement, e.g.

if df.order == key_df.order:
    # now check intervals...somehow

but this doesn't take advantage of the dataframe structure. One must then check by interval, i.e. something like (df.start <= key_df.start) & (df.end >= key_df.end)

I'm stuck. What is the most efficient way to match multiple columns in an "interval" in pandas? (Creating a new column if this condition is met is then straightforward)

Accepted answer by jezrael

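You can use merge with boolean indexing, but if the DataFrames are large, scaling is problematic, because the merge on order first builds every pairing of df and key_df rows within each order: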

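# Pair each df row with every key_df row that has the same order
df1 = pd.merge(df, key_df, on='order', how='outer', suffixes=('','_key'))
# Keep only the pairs that satisfy the interval condition
df1 = df1[(df1.start <= df1.start_key) & (df1.end <= df1.end_key)]
print(df1)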
    order  start   end      value  start_key  end_key   value_key
3       1   1342  1357  category1     1345.0   1392.0  category29
4       1   1342  1357  category1     1371.0   1383.0  category31
5       1   1342  1357  category1     1471.0   1501.0  category31
11      1   1459  1489  category7     1471.0   1501.0  category31
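
For readers who want to run the approach above, here is a minimal sketch that rebuilds both frames from the rows visible in the question. The rows elided there with "...." are simply left out, so the data is an assumption rather than the asker's full file, and the names pairs, mask and matches are only illustrative.

```python
import pandas as pd

# Sample rows copied from the question; elided rows are not reconstructed.
df = pd.DataFrame({
    'order': [1, 1, 1, 1, 1, 15, 15],
    'start': [1342, 1459, 1572, 1587, 1591, 792, 892],
    'end':   [1357, 1489, 1601, 1599, 1639, 813, 913],
    'value': ['category1', 'category7', 'category23', 'category2',
              'category1', 'category13', 'category5'],
})
key_df = pd.DataFrame({
    'order': [1, 1, 1, 1, 1, 1],
    'start': [1284, 1297, 1312, 1345, 1371, 1471],
    'end':   [1299, 1309, 1369, 1392, 1383, 1501],
    'value': ['category4', 'category9', 'category3', 'category29',
              'category31', 'category31'],
})

# Pair every df row with every key_df row that shares the same order.
pairs = pd.merge(df, key_df, on='order', how='outer', suffixes=('', '_key'))

# Boolean mask for the interval condition used in the answer.
# Rows without a key (NaN in the *_key columns) compare as False.
mask = (pairs['start'] <= pairs['start_key']) & (pairs['end'] <= pairs['end_key'])
matches = pairs[mask]
print(matches)
```

The *_key columns print as floats because the outer merge introduces NaN for order 15, which has no rows in this key_df sample.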

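EDIT (after a comment): to also keep the rows of df that have no match, merge the filtered result back onto df with a left join: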

df1 = pd.merge(df, key_df, on='order', how='outer', suffixes=('','_key'))
df1 = df1[(df1.start <= df1.start_key) & (df1.end <= df1.end_key)]
# Left-merge the matches back onto df so rows without a match are kept (as NaN)
df1 = pd.merge(df, df1, on=['order','start','end', 'value'], how='left')
print(df1)
   order  start   end       value  start_key  end_key   value_key
0      1   1342  1357   category1     1345.0   1392.0  category29
1      1   1342  1357   category1     1371.0   1383.0  category31
2      1   1342  1357   category1     1471.0   1501.0  category31
3      1   1459  1489   category7     1471.0   1501.0  category31
4      1   1572  1601  category23        NaN      NaN         NaN
5      1   1587  1599   category2        NaN      NaN         NaN
6      1   1591  1639   category1        NaN      NaN         NaN
7     15    792   813  category13        NaN      NaN         NaN
8     15    892   913   category5        NaN      NaN         NaN
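
On the scaling concern, one possible sketch, not part of the original answer, is to merge and filter one order group at a time so that only one group's pairings exist in memory at once. The helper name match_by_order and the sample frames (rebuilt from the rows visible in the question) are assumptions for illustration, and whether this helps in practice depends on how the rows are spread across orders.

```python
import pandas as pd

def match_by_order(df, key_df):
    """Hypothetical helper: merge and filter one `order` group at a time."""
    key_groups = {order: group for order, group in key_df.groupby('order')}
    parts = []
    for order, left in df.groupby('order'):
        right = key_groups.get(order)
        if right is None:
            continue  # no key intervals exist for this order
        # Only the pairings of this single order are materialised here.
        pairs = pd.merge(left, right, on='order', suffixes=('', '_key'))
        keep = ((pairs['start'] <= pairs['start_key'])
                & (pairs['end'] <= pairs['end_key']))
        parts.append(pairs[keep])
    if not parts:
        # Empty result that still carries the expected columns.
        return pd.merge(df.iloc[:0], key_df.iloc[:0], on='order',
                        suffixes=('', '_key'))
    return pd.concat(parts, ignore_index=True)

# Sample rows copied from the question; elided rows are not reconstructed.
df = pd.DataFrame({
    'order': [1, 1, 1, 1, 1, 15, 15],
    'start': [1342, 1459, 1572, 1587, 1591, 792, 892],
    'end':   [1357, 1489, 1601, 1599, 1639, 813, 913],
    'value': ['category1', 'category7', 'category23', 'category2',
              'category1', 'category13', 'category5'],
})
key_df = pd.DataFrame({
    'order': [1, 1, 1, 1, 1, 1],
    'start': [1284, 1297, 1312, 1345, 1371, 1471],
    'end':   [1299, 1309, 1369, 1392, 1383, 1501],
    'value': ['category4', 'category9', 'category3', 'category29',
              'category31', 'category31'],
})

matches = match_by_order(df, key_df)
# Same left merge as in the answer, so unmatched df rows are kept with NaN.
result = pd.merge(df, matches, on=['order', 'start', 'end', 'value'], how='left')
print(result)
```

The interval condition is the same one used in the answer; only the order in which the work is done changes.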