Pandas 在名称和最近日期合并
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/24614474/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Pandas Merge on Name and Closest Date
提问by sfortney
I am trying to merge two dataframes on both name and the closest date (WRT the left hand dataframe). In my research I found one similar question herebut it doesn't account for the name as well. From the above question it doesn't seem like there is a way to do this with merge but I can't see another way to do the two argument join that doesn't use the pandas merge function.
我正在尝试合并名称和最近日期的两个数据框(WRT 左侧数据框)。在我的研究中,我在这里发现了一个类似的问题,但它也没有说明名称。从上面的问题来看,似乎没有办法通过合并来做到这一点,但我看不到另一种方法来执行不使用Pandas合并函数的两个参数连接。
Is there a way to do this with merge? And if not what would be the appropriate way to do this?
有没有办法通过合并来做到这一点?如果不是,那么执行此操作的适当方法是什么?
I will post a copy of what I have tried but this was trying it with an exact merge on date which will not work. The most important line is the last one where I make the data3 dataframe.
我将发布我尝试过的内容的副本,但这是尝试在日期上进行精确合并,但这是行不通的。最重要的一行是我制作 data3 数据框的最后一行。
data=pd.read_csv("edgar14Afacts.csv", parse_dates={"dater": [2]}, infer_datetime_format=True)
data2=pd.read_csv("sdcmergersdata.csv", parse_dates={"dater": [17]}, infer_datetime_format=True)
list(data2.columns.values)
data2.rename(columns=lambda x: x.replace('\r\n', ''), inplace=True)
data2.rename(columns=lambda x: x.replace('\n', ''), inplace=True)
data2.rename(columns=lambda x: x.replace('\r', ''), inplace=True)
data2=data2.rename(columns = {'Acquiror Name':'name'})
data2=data2.rename(columns = {'dater':'date'})
data=data.rename(columns = {'dater':'date'})
list(data2.columns.values)
data["name"]=data['name'].map(str.lower)
data2["name"]=data2['name'].map(str.lower)
data2['date'].fillna(method='pad')
data['namer1']=data['name']
data['dater1']=data['date']
data2['namer2']=data2['name']
data2['dater2']=data2['date']
print data.head()
print data2.head()
data['name'] = data['name'].map(lambda x: str(x)[:4])
data2['name'] = data2['name'].map(lambda x: str(x)[:4])
data3 = pd.merge(data, data2, how='left', on=['date','name'])
data3.to_csv("check.csv")
采纳答案by hernamesbarbara
I'd also love to see the final solution you came up with to know how it shook out in the end.
我也很想看到你想出的最终解决方案,以了解它最终如何动摇。
One thing you can do to find the closest date might be something to calc the number of days between each date in the first DataFrame and the dates in the second DataFrame. Then you can use np.argminto retrieve the date with the smallest time delta.
您可以做的一件事是找到最接近的日期,可能是计算第一个 DataFrame 中的每个日期与第二个 DataFrame 中的日期之间的天数。然后您可以使用np.argmin来检索具有最小时间增量的日期。
For example:
例如:
Setup
设置
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
from pandas.io.parsers import StringIO
Data
数据
a = """timepoint,measure
2014-01-01 00:00:00,78
2014-01-02 00:00:00,29
2014-01-03 00:00:00,5
2014-01-04 00:00:00,73
2014-01-05 00:00:00,40
2014-01-06 00:00:00,45
2014-01-07 00:00:00,48
2014-01-08 00:00:00,2
2014-01-09 00:00:00,96
2014-01-10 00:00:00,82
2014-01-11 00:00:00,61
2014-01-12 00:00:00,68
2014-01-13 00:00:00,8
2014-01-14 00:00:00,94
2014-01-15 00:00:00,16
2014-01-16 00:00:00,31
2014-01-17 00:00:00,10
2014-01-18 00:00:00,34
2014-01-19 00:00:00,27
2014-01-20 00:00:00,58
2014-01-21 00:00:00,90
2014-01-22 00:00:00,41
2014-01-23 00:00:00,97
2014-01-24 00:00:00,7
2014-01-25 00:00:00,86
2014-01-26 00:00:00,62
2014-01-27 00:00:00,91
2014-01-28 00:00:00,0
2014-01-29 00:00:00,73
2014-01-30 00:00:00,22
2014-01-31 00:00:00,43
2014-02-01 00:00:00,87
2014-02-02 00:00:00,56
2014-02-03 00:00:00,45
2014-02-04 00:00:00,25
2014-02-05 00:00:00,92
2014-02-06 00:00:00,83
2014-02-07 00:00:00,13
2014-02-08 00:00:00,50
2014-02-09 00:00:00,48
2014-02-10 00:00:00,78"""
b = """timepoint,measure
2014-01-01 00:00:00,78
2014-01-08 00:00:00,29
2014-01-15 00:00:00,5
2014-01-22 00:00:00,73
2014-01-29 00:00:00,40
2014-02-05 00:00:00,45
2014-02-12 00:00:00,48
2014-02-19 00:00:00,2
2014-02-26 00:00:00,96
2014-03-05 00:00:00,82
2014-03-12 00:00:00,61
2014-03-19 00:00:00,68
2014-03-26 00:00:00,8
2014-04-02 00:00:00,94
"""
look at data
看数据
df1 = pd.read_csv(StringIO(a), parse_dates=['timepoint'])
df1.head()
timepoint measure
0 2014-01-01 78
1 2014-01-02 29
2 2014-01-03 5
3 2014-01-04 73
4 2014-01-05 40
df2 = pd.read_csv(StringIO(b), parse_dates=['timepoint'])
df2.head()
timepoint measure
0 2014-01-01 78
1 2014-01-08 29
2 2014-01-15 5
3 2014-01-22 73
4 2014-01-29 40
Func to find the closest date to a given date
Func 查找与给定日期最接近的日期
def find_closest_date(timepoint, time_series, add_time_delta_column=True):
# takes a pd.Timestamp() instance and a pd.Series with dates in it
# calcs the delta between `timepoint` and each date in `time_series`
# returns the closest date and optionally the number of days in its time delta
deltas = np.abs(time_series - timepoint)
idx_closest_date = np.argmin(deltas)
res = {"closest_date": time_series.ix[idx_closest_date]}
idx = ['closest_date']
if add_time_delta_column:
res["closest_delta"] = deltas[idx_closest_date]
idx.append('closest_delta')
return pd.Series(res, index=idx)
df1[['closest', 'days_bt_x_and_y']] = df1.timepoint.apply(
find_closest_date, args=[df2.timepoint])
df1.head(10)
timepoint measure closest days_bt_x_and_y
0 2014-01-01 78 2014-01-01 0 days
1 2014-01-02 29 2014-01-01 1 days
2 2014-01-03 5 2014-01-01 2 days
3 2014-01-04 73 2014-01-01 3 days
4 2014-01-05 40 2014-01-08 3 days
5 2014-01-06 45 2014-01-08 2 days
6 2014-01-07 48 2014-01-08 1 days
7 2014-01-08 2 2014-01-08 0 days
8 2014-01-09 96 2014-01-08 1 days
9 2014-01-10 82 2014-01-08 2 days
Merge the two DataFrames on the new closestdate column
合并新closest日期列上的两个 DataFrame
df3 = pd.merge(df1, df2, left_on=['closest'], right_on=['timepoint'])
colorder = [
'timepoint_x',
'closest',
'timepoint_y',
'days_bt_x_and_y',
'measure_x',
'measure_y'
]
df3 = df3.ix[:, colorder]
df3
timepoint_x closest timepoint_y days_bt_x_and_y measure_x measure_y
0 2014-01-01 2014-01-01 2014-01-01 0 days 78 78
1 2014-01-02 2014-01-01 2014-01-01 1 days 29 78
2 2014-01-03 2014-01-01 2014-01-01 2 days 5 78
3 2014-01-04 2014-01-01 2014-01-01 3 days 73 78
4 2014-01-05 2014-01-08 2014-01-08 3 days 40 29
5 2014-01-06 2014-01-08 2014-01-08 2 days 45 29
6 2014-01-07 2014-01-08 2014-01-08 1 days 48 29
7 2014-01-08 2014-01-08 2014-01-08 0 days 2 29
8 2014-01-09 2014-01-08 2014-01-08 1 days 96 29
9 2014-01-10 2014-01-08 2014-01-08 2 days 82 29
10 2014-01-11 2014-01-08 2014-01-08 3 days 61 29
11 2014-01-12 2014-01-15 2014-01-15 3 days 68 5
12 2014-01-13 2014-01-15 2014-01-15 2 days 8 5
13 2014-01-14 2014-01-15 2014-01-15 1 days 94 5
14 2014-01-15 2014-01-15 2014-01-15 0 days 16 5
15 2014-01-16 2014-01-15 2014-01-15 1 days 31 5
16 2014-01-17 2014-01-15 2014-01-15 2 days 10 5
17 2014-01-18 2014-01-15 2014-01-15 3 days 34 5
18 2014-01-19 2014-01-22 2014-01-22 3 days 27 73
19 2014-01-20 2014-01-22 2014-01-22 2 days 58 73
20 2014-01-21 2014-01-22 2014-01-22 1 days 90 73
21 2014-01-22 2014-01-22 2014-01-22 0 days 41 73
22 2014-01-23 2014-01-22 2014-01-22 1 days 97 73
23 2014-01-24 2014-01-22 2014-01-22 2 days 7 73
24 2014-01-25 2014-01-22 2014-01-22 3 days 86 73
25 2014-01-26 2014-01-29 2014-01-29 3 days 62 40
26 2014-01-27 2014-01-29 2014-01-29 2 days 91 40
27 2014-01-28 2014-01-29 2014-01-29 1 days 0 40
28 2014-01-29 2014-01-29 2014-01-29 0 days 73 40
29 2014-01-30 2014-01-29 2014-01-29 1 days 22 40
30 2014-01-31 2014-01-29 2014-01-29 2 days 43 40
31 2014-02-01 2014-01-29 2014-01-29 3 days 87 40
32 2014-02-02 2014-02-05 2014-02-05 3 days 56 45
33 2014-02-03 2014-02-05 2014-02-05 2 days 45 45
34 2014-02-04 2014-02-05 2014-02-05 1 days 25 45
35 2014-02-05 2014-02-05 2014-02-05 0 days 92 45
36 2014-02-06 2014-02-05 2014-02-05 1 days 83 45
37 2014-02-07 2014-02-05 2014-02-05 2 days 13 45
38 2014-02-08 2014-02-05 2014-02-05 3 days 50 45
39 2014-02-09 2014-02-12 2014-02-12 3 days 48 48
40 2014-02-10 2014-02-12 2014-02-12 2 days 78 48
回答by osonuyi
This is super late, but hopefully its helpful for new answer seekers. I answered a similar question here
这太晚了,但希望它对新的答案寻求者有所帮助。我在这里回答了一个类似的问题
with a somewhat new method in pandas:
在 Pandas 中使用一种新方法:
The parameters of interest for you would be direction,tolerance,left_on, and right_on
感兴趣的参数,你会是direction,tolerance,left_on,和right_on
Building off @hernamesbarbara answer & data:
建立@hernamesbarbara 答案和数据:
data
数据
a = """timepoint,measure
2014-01-01 00:00:00,78
2014-01-02 00:00:00,29
2014-01-03 00:00:00,5
2014-01-04 00:00:00,73
2014-01-05 00:00:00,40
2014-01-06 00:00:00,45
2014-01-07 00:00:00,48
2014-01-08 00:00:00,2
2014-01-09 00:00:00,96
2014-01-10 00:00:00,82
2014-01-11 00:00:00,61
2014-01-12 00:00:00,68
2014-01-13 00:00:00,8
2014-01-14 00:00:00,94
2014-01-15 00:00:00,16
2014-01-16 00:00:00,31
2014-01-17 00:00:00,10
2014-01-18 00:00:00,34
2014-01-19 00:00:00,27
2014-01-20 00:00:00,58
2014-01-21 00:00:00,90
2014-01-22 00:00:00,41
2014-01-23 00:00:00,97
2014-01-24 00:00:00,7
2014-01-25 00:00:00,86
2014-01-26 00:00:00,62
2014-01-27 00:00:00,91
2014-01-28 00:00:00,0
2014-01-29 00:00:00,73
2014-01-30 00:00:00,22
2014-01-31 00:00:00,43
2014-02-01 00:00:00,87
2014-02-02 00:00:00,56
2014-02-03 00:00:00,45
2014-02-04 00:00:00,25
2014-02-05 00:00:00,92
2014-02-06 00:00:00,83
2014-02-07 00:00:00,13
2014-02-08 00:00:00,50
2014-02-09 00:00:00,48
2014-02-10 00:00:00,78"""
b = """timepoint,measure
2014-01-01 00:00:00,78
2014-01-08 00:00:00,29
2014-01-15 00:00:00,5
2014-01-22 00:00:00,73
2014-01-29 00:00:00,40
2014-02-05 00:00:00,45
2014-02-12 00:00:00,48
2014-02-19 00:00:00,2
2014-02-26 00:00:00,96
2014-03-05 00:00:00,82
2014-03-12 00:00:00,61
2014-03-19 00:00:00,68
2014-03-26 00:00:00,8
2014-04-02 00:00:00,94
"""
solution
解决方案
import pandas as pd
from pandas import read_csv
from io import StringIO
df1 = pd.read_csv(StringIO(a), parse_dates=['timepoint'])
df2 = pd.read_csv(StringIO(b), parse_dates=['timepoint'])
df1['timepoint'] = pd.to_datetime(df1['timepoint'])
df2['timepoint'] = pd.to_datetime(df2['timepoint'])
# converting this to the index so we can preserve the date_start_time columns so you can validate the merging logic
df1.index = df1['timepoint']
df2.index = df2['timepoint']
# the magic happens below, check the direction and tolerance arguments
# if you want you can make a maximum tolerance on which to merge data
tol = pd.Timedelta('3 day')
df3 = pd.merge_asof(left=df1,right=df2,right_index=True,left_index=True,direction='nearest',tolerance=tol)
output
输出
df3.head()
timepoint_x measure_x timepoint_y measure_y
timepoint
2014-01-01 2014-01-01 78 2014-01-01 78
2014-01-02 2014-01-02 29 2014-01-01 78
2014-01-03 2014-01-03 5 2014-01-01 78
2014-01-04 2014-01-04 73 2014-01-01 78
2014-01-05 2014-01-05 40 2014-01-08 29
回答by dinya
A small addition to the hernamesbarbara's code
hernamesbarbara 代码的一个小补充
def find_closest_date(timepoint, time_series, add_time_delta_column=True, mode="abs"):
"""takes a pd.Timestamp() instance and a pd.Series with dates in it
calcs the delta between `timepoint` and each date in `time_series`
returns the closest date and optionally the number of days in its time delta
Parameters
----------
mode: "abs" (default), "left", "right"
closest datetime by abs, at left, at right
References
----------
.. [1] http://stackoverflow.com/a/25962323/716469
"""
deltas = time_series - timepoint
idx_closest_date = None
if mode == "abs":
idx_closest_date = np.argmin(abs(deltas))
elif mode == "left":
deltas_ = deltas[deltas <= pd.Timedelta('0 days 00:00:00.0')]
if len(deltas_):
idx_closest_date = np.argmax(deltas_)
elif mode == "right":
deltas_ = deltas[deltas >= pd.Timedelta('0 days 00:00:00.0')]
if len(deltas_):
idx_closest_date = np.argmin(deltas_)
else:
raise Exception("Mode is incorrect")
if idx_closest_date is not None:
closest_date = time_series.ix[idx_closest_date]
if add_time_delta_column:
closest_delta = deltas[idx_closest_date]
else:
closest_date = pd.NaT
if add_time_delta_column:
closest_delta = pd.Timedelta(pd.NaT)
res = {"closest_date": closest_date}
idx = ['closest_date']
if add_time_delta_column:
res["closest_delta"] = closest_delta
idx.append('closest_delta')
return pd.Series(res, index=idx)

