How to join two dataframes for which column values are within a certain range?
Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me): StackOverFlow
Original: http://stackoverflow.com/questions/46525786/
Asked by DougKruger
Given two dataframes df_1 and df_2, how can I join them such that the datetime column timestamp of df_1 falls between start and end in df_2?
print(df_1)
timestamp A B
0 2016-05-14 10:54:33 0.020228 0.026572
1 2016-05-14 10:54:34 0.057780 0.175499
2 2016-05-14 10:54:35 0.098808 0.620986
3 2016-05-14 10:54:36 0.158789 1.014819
4 2016-05-14 10:54:39 0.038129 2.384590
print(df_2)
start end event
0 2016-05-14 10:54:31 2016-05-14 10:54:33 E1
1 2016-05-14 10:54:34 2016-05-14 10:54:37 E2
2 2016-05-14 10:54:38 2016-05-14 10:54:42 E3
The goal is to get the corresponding event wherever df_1.timestamp is between df_2.start and df_2.end:
timestamp A B event
0 2016-05-14 10:54:33 0.020228 0.026572 E1
1 2016-05-14 10:54:34 0.057780 0.175499 E2
2 2016-05-14 10:54:35 0.098808 0.620986 E2
3 2016-05-14 10:54:36 0.158789 1.014819 E2
4 2016-05-14 10:54:39 0.038129 2.384590 E3
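For experimenting with the answers that follow, the two example frames can be rebuilt roughly as below. This is only a sketch: it assumes the timestamps are parsed with pd.to_datetime so that all three datetime columns share a datetime dtype.

```python
import pandas as pd

# Rebuild the example frames shown in the question.
df_1 = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2016-05-14 10:54:33",
        "2016-05-14 10:54:34",
        "2016-05-14 10:54:35",
        "2016-05-14 10:54:36",
        "2016-05-14 10:54:39",
    ]),
    "A": [0.020228, 0.057780, 0.098808, 0.158789, 0.038129],
    "B": [0.026572, 0.175499, 0.620986, 1.014819, 2.384590],
})

df_2 = pd.DataFrame({
    "start": pd.to_datetime([
        "2016-05-14 10:54:31",
        "2016-05-14 10:54:34",
        "2016-05-14 10:54:38",
    ]),
    "end": pd.to_datetime([
        "2016-05-14 10:54:33",
        "2016-05-14 10:54:37",
        "2016-05-14 10:54:42",
    ]),
    "event": ["E1", "E2", "E3"],
})

print(df_1)
print(df_2)
```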
Accepted answer by Bharath
One simple solution is to create an IntervalIndex from start and end with closed='both', then use get_loc to look up the event (this assumes all the datetimes are of Timestamp dtype):
df_2.index = pd.IntervalIndex.from_arrays(df_2['start'],df_2['end'],closed='both')
df_1['event'] = df_1['timestamp'].apply(lambda x : df_2.iloc[df_2.index.get_loc(x)]['event'])
Output:
timestamp A B event
0 2016-05-14 10:54:33 0.020228 0.026572 E1
1 2016-05-14 10:54:34 0.057780 0.175499 E2
2 2016-05-14 10:54:35 0.098808 0.620986 E2
3 2016-05-14 10:54:36 0.158789 1.014819 E2
4 2016-05-14 10:54:39 0.038129 2.384590 E3
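One thing to watch with get_loc is that it raises a KeyError when a timestamp lies in no interval. A minimal sketch of a guarded lookup follows; the helper name lookup_event and the extra out-of-range timestamp are illustrative assumptions, not part of the original answer, and the intervals are assumed not to overlap.

```python
import pandas as pd

df_1 = pd.DataFrame({"timestamp": pd.to_datetime([
    "2016-05-14 10:54:33",
    "2016-05-14 10:54:35",
    "2016-05-14 10:55:00",  # deliberately outside every interval
])})
df_2 = pd.DataFrame({
    "start": pd.to_datetime(["2016-05-14 10:54:31", "2016-05-14 10:54:34", "2016-05-14 10:54:38"]),
    "end": pd.to_datetime(["2016-05-14 10:54:33", "2016-05-14 10:54:37", "2016-05-14 10:54:42"]),
    "event": ["E1", "E2", "E3"],
})

intervals = pd.IntervalIndex.from_arrays(df_2["start"], df_2["end"], closed="both")

def lookup_event(ts):
    """Hypothetical helper: event whose interval contains ts, or None."""
    try:
        pos = intervals.get_loc(ts)  # integer position for non-overlapping intervals
    except KeyError:
        return None  # ts is not covered by any interval
    return df_2["event"].iloc[pos]

df_1["event"] = df_1["timestamp"].apply(lookup_event)
print(df_1)
```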
Answer by cs95
First use IntervalIndex to create a reference index based on the intervals of interest, then use get_indexer to select the matching rows of the dataframe that holds the discrete events.
idx = pd.IntervalIndex.from_arrays(df_2['start'], df_2['end'], closed='both')
event = df_2['event'].iloc[idx.get_indexer(df_1.timestamp)]
event
0 E1
1 E2
1 E2
1 E2
2 E3
Name: event, dtype: object
df_1['event'] = event.to_numpy()
df_1
timestamp A B event
0 2016-05-14 10:54:33 0.020228 0.026572 E1
1 2016-05-14 10:54:34 0.057780 0.175499 E2
2 2016-05-14 10:54:35 0.098808 0.620986 E2
3 2016-05-14 10:54:36 0.158789 1.014819 E2
4 2016-05-14 10:54:39 0.038129 2.384590 E3
Reference: A question on IntervalIndex.get_indexer.
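Note that get_indexer marks values that fall in no interval with -1, and positional indexing with -1 would silently pick the last row. A small sketch of masking those positions, assuming non-overlapping intervals and using an extra unmatched timestamp that is not in the original question:

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame({"timestamp": pd.to_datetime([
    "2016-05-14 10:54:33",
    "2016-05-14 10:54:36",
    "2016-05-14 10:54:50",  # not covered by any interval
])})
df_2 = pd.DataFrame({
    "start": pd.to_datetime(["2016-05-14 10:54:31", "2016-05-14 10:54:34", "2016-05-14 10:54:38"]),
    "end": pd.to_datetime(["2016-05-14 10:54:33", "2016-05-14 10:54:37", "2016-05-14 10:54:42"]),
    "event": ["E1", "E2", "E3"],
})

idx = pd.IntervalIndex.from_arrays(df_2["start"], df_2["end"], closed="both")
pos = idx.get_indexer(df_1["timestamp"])  # -1 means "no containing interval"

events = df_2["event"].to_numpy()
# Keep the looked-up event only where a real match was found.
df_1["event"] = np.where(pos >= 0, events[pos], None)
print(df_1)
```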
Answer by chris dorn
Answer by YOBEN_S
Option 1
idx = pd.IntervalIndex.from_arrays(df_2['start'], df_2['end'], closed='both')
df_2.index=idx
df_1['event']=df_2.loc[df_1.timestamp,'event'].values
Option 2
df_2['timestamp'] = df_2['end']
pd.merge_asof(df_1, df_2[['timestamp', 'event']], on='timestamp', direction='forward', allow_exact_matches=True)
Out[405]:
timestamp A B event
0 2016-05-14 10:54:33 0.020228 0.026572 E1
1 2016-05-14 10:54:34 0.057780 0.175499 E2
2 2016-05-14 10:54:35 0.098808 0.620986 E2
3 2016-05-14 10:54:36 0.158789 1.014819 E2
4 2016-05-14 10:54:39 0.038129 2.384590 E3
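The merge_asof option matches each timestamp to the next end value but never looks at start, so a timestamp that precedes an interval would still be assigned its event. A hedged sketch of one way to add that check, using left_on/right_on instead of a copied column and an extra early timestamp that is not in the original data:

```python
import pandas as pd

df_1 = pd.DataFrame({"timestamp": pd.to_datetime([
    "2016-05-14 10:54:30",  # earlier than the first start
    "2016-05-14 10:54:33",
    "2016-05-14 10:54:35",
    "2016-05-14 10:54:39",
])})
df_2 = pd.DataFrame({
    "start": pd.to_datetime(["2016-05-14 10:54:31", "2016-05-14 10:54:34", "2016-05-14 10:54:38"]),
    "end": pd.to_datetime(["2016-05-14 10:54:33", "2016-05-14 10:54:37", "2016-05-14 10:54:42"]),
    "event": ["E1", "E2", "E3"],
})

# merge_asof needs both sides sorted by their key columns.
merged = pd.merge_asof(
    df_1.sort_values("timestamp"),
    df_2.sort_values("end"),
    left_on="timestamp",
    right_on="end",
    direction="forward",
)

# Discard matches whose interval has not started yet.
merged.loc[merged["timestamp"] < merged["start"], "event"] = None
merged = merged.drop(columns=["start", "end"])
print(merged)
```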
Answer by Tai
This method assumes that Timestamp objects are used.
df2 start end event
0 2016-05-14 10:54:31 2016-05-14 10:54:33 E1
1 2016-05-14 10:54:34 2016-05-14 10:54:37 E2
2 2016-05-14 10:54:38 2016-05-14 10:54:42 E3
import numpy as np

event_num = len(df2.event)

def get_event(t):
    # The mask has one True; dotting with 0..event_num-1 yields its position.
    event_idx = ((t >= df2.start) & (t <= df2.end)).dot(np.arange(event_num))
    return df2.event[event_idx]

df1["event"] = df1.timestamp.transform(get_event)
Explanation of get_event
For each timestamp in df1, say t0 = 2016-05-14 10:54:33, the expression (t0 >= df2.start) & (t0 <= df2.end) will contain exactly one True (see Example 1). Taking the dot product with np.arange(event_num) then gives the index of the event that t0 belongs to. This relies on the intervals not overlapping and on every timestamp falling inside one of them.
Examples:
Example 1
   t0 >= df2.start   t0 <= df2.end        After &   np.arange(3)
0  True              True            ->   T         0              event_idx
1  False             True            ->   F         1              -> 0
2  False             True            ->   F         2
Take t2 = 2016-05-14 10:54:35 as another example:
   t2 >= df2.start   t2 <= df2.end        After &   np.arange(3)
0  True              False           ->   F         0              event_idx
1  True              True            ->   T         1              -> 1
2  False             True            ->   F         2
Finally, we use transform to map each timestamp to an event.
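The same mask-and-dot-product idea can also be sketched for all timestamps at once with NumPy broadcasting instead of calling a function per row. This variant is not part of the original answer; it assumes every timestamp lies in exactly one interval and checks that assumption explicitly.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"timestamp": pd.to_datetime([
    "2016-05-14 10:54:33",
    "2016-05-14 10:54:34",
    "2016-05-14 10:54:35",
    "2016-05-14 10:54:36",
    "2016-05-14 10:54:39",
])})
df2 = pd.DataFrame({
    "start": pd.to_datetime(["2016-05-14 10:54:31", "2016-05-14 10:54:34", "2016-05-14 10:54:38"]),
    "end": pd.to_datetime(["2016-05-14 10:54:33", "2016-05-14 10:54:37", "2016-05-14 10:54:42"]),
    "event": ["E1", "E2", "E3"],
})

ts = df1["timestamp"].to_numpy()[:, None]  # shape (n, 1)
starts = df2["start"].to_numpy()           # shape (m,)
ends = df2["end"].to_numpy()               # shape (m,)

# inside[i, j] is True when timestamp i lies in interval j (bounds included).
inside = (ts >= starts) & (ts <= ends)

# The dot-product trick is only valid with exactly one True per row.
assert (inside.sum(axis=1) == 1).all()

event_idx = inside.dot(np.arange(len(df2)))
df1["event"] = df2["event"].to_numpy()[event_idx]
print(df1)
```

The boolean matrix has n x m entries, so this trades memory for speed and suits frames of moderate size.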