pandas 根据日期范围合并数据框
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/31328014/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Merging dataframes based on date range
提问by Nicholas Tulach
I have two pandas dataframes: one (df1) with three columns (StartDate, EndDate, and ID) and a second (df2) with a Date. I want to merge df1and df2based on df2.Date between df1.StartDateand df2.EndDate.
我有两个 Pandas 数据框:一个 ( df1) 带有三列 ( StartDate, EndDate, 和ID),第二个 ( df2) 带有日期。我想合并df1并df2基于 df2.Datedf1.StartDate和之间df2.EndDate。
Each date range in df1is unique and doesn't overlap with any of the other rows in the dataframe.
中的每个日期范围df1都是唯一的,不会与数据框中的任何其他行重叠。
Dates are formatted YYYY-MM-DD.
日期被格式化YYYY-MM-DD。
回答by Jianxun Li
Just to provide an alternative way using np.piecewise. The performance is even faster than np.searchedsort.
只是为了提供另一种使用np.piecewise. 性能甚至比np.searchedsort.
import pandas as pd
import numpy as np
# data
# ====================================
df1 = pd.DataFrame({'StartDate': pd.date_range('2010-01-01', periods=9, freq='5D'), 'EndDate': pd.date_range('2010-01-04', periods=9, freq='5D'), 'ID': np.arange(1, 10, 1)})
df2 = pd.DataFrame(dict(values=np.random.randn(50), date_time=pd.date_range('2010-01-01', periods=50, freq='D')))
df1.StartDate
Out[139]:
0 2010-01-01
1 2010-01-06
2 2010-01-11
3 2010-01-16
4 2010-01-21
5 2010-01-26
6 2010-01-31
7 2010-02-05
8 2010-02-10
Name: StartDate, dtype: datetime64[ns]
df2.date_time
Out[140]:
0 2010-01-01
1 2010-01-02
2 2010-01-03
3 2010-01-04
4 2010-01-05
5 2010-01-06
6 2010-01-07
7 2010-01-08
8 2010-01-09
9 2010-01-10
...
40 2010-02-10
41 2010-02-11
42 2010-02-12
43 2010-02-13
44 2010-02-14
45 2010-02-15
46 2010-02-16
47 2010-02-17
48 2010-02-18
49 2010-02-19
Name: date_time, dtype: datetime64[ns]
df2['ID_matched'] = np.piecewise(np.zeros(len(df2)), [(df2.date_time.values >= start_date)&(df2.date_time.values <= end_date) for start_date, end_date in zip(df1.StartDate.values, df1.EndDate.values)], df1.ID.values)
Out[143]:
date_time values ID_matched
0 2010-01-01 -0.2240 1
1 2010-01-02 -0.4202 1
2 2010-01-03 0.9998 1
3 2010-01-04 0.4310 1
4 2010-01-05 -0.6509 0
5 2010-01-06 -1.4987 2
6 2010-01-07 -1.2306 2
7 2010-01-08 0.1940 2
8 2010-01-09 -0.9984 2
9 2010-01-10 -0.3676 0
.. ... ... ...
40 2010-02-10 0.5242 9
41 2010-02-11 0.3451 9
42 2010-02-12 0.7244 9
43 2010-02-13 -2.0404 9
44 2010-02-14 -1.0798 0
45 2010-02-15 -0.6934 0
46 2010-02-16 -2.3380 0
47 2010-02-17 1.6623 0
48 2010-02-18 -0.2754 0
49 2010-02-19 -0.7466 0
[50 rows x 3 columns]
%timeit df2['ID_matched'] = np.piecewise(np.zeros(len(df2)), [(df2.date_time.values >= start_date)&(df2.date_time.values <= end_date) for start_date, end_date in zip(df1.StartDate.values, df1.EndDate.values)], df1.ID.values)
1000 loops, best of 3: 466 μs per loop
回答by Mark_Anderson
Minor correction to @JianxunLi answer. Bit too involved for a comment.
对@JianxunLi 的回答稍作修正。有点太参与评论了。
This uses the len(funclist) == len(condlist) + 1property of piecewiseto assign a default value for when there is no match. Otherwise the default no-match value is zero, which can cause problems...
这使用 的len(funclist) == len(condlist) + 1属性piecewise为不匹配时分配默认值。否则默认的不匹配值为零,这可能会导致问题......
### Data / inits
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'StartDate': pd.date_range('2010-01-01', periods=9, freq='5D'), 'EndDate': pd.date_range('2010-01-04', periods=9, freq='5D'), 'ID': np.arange(1, 10, 1)})
df2 = pd.DataFrame(dict(values=np.random.randn(50), date_time=pd.date_range('2010-01-01', periods=50, freq='D')))
### Processing
valIfNoMatch = np.nan
df2['ID_matched'] = np.piecewise(np.zeros(len(df2)),\
[(df2.date_time.values >= start_date)&(df2.date_time.values < end_date) for start_date, end_date in zip(df1.StartDate.values, df1.EndDate.values)],\
np.append(df1.ID.values, valIfNoMatch))
PS. Also corrected the typo testing both >=& <=; a timestamp on an exact boundary between intervals would return true for two different intervals, which breaks a key assumption of the method.
附注。还更正了>=& <=; 间隔之间精确边界上的时间戳将对两个不同的间隔返回 true,这打破了该方法的一个关键假设。

