Pandas 在名称和最近日期合并

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/24614474/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-13 22:14:12  来源:igfitidea点击:

Pandas Merge on Name and Closest Date

pythondatepandasmerge

提问by sfortney

I am trying to merge two dataframes on both name and the closest date (WRT the left hand dataframe). In my research I found one similar question herebut it doesn't account for the name as well. From the above question it doesn't seem like there is a way to do this with merge but I can't see another way to do the two argument join that doesn't use the pandas merge function.

我正在尝试合并名称和最近日期的两个数据框(WRT 左侧数据框)。在我的研究中,我在这里发现了一个类似的问题但它也没有说明名称。从上面的问题来看,似乎没有办法通过合并来做到这一点,但我看不到另一种方法来执行不使用Pandas合并函数的两个参数连接。

Is there a way to do this with merge? And if not what would be the appropriate way to do this?

有没有办法通过合并来做到这一点?如果不是,那么执行此操作的适当方法是什么?

I will post a copy of what I have tried but this was trying it with an exact merge on date which will not work. The most important line is the last one where I make the data3 dataframe.

我将发布我尝试过的内容的副本,但这是尝试在日期上进行精确合并,但这是行不通的。最重要的一行是我制作 data3 数据框的最后一行。

data=pd.read_csv("edgar14Afacts.csv", parse_dates={"dater": [2]}, infer_datetime_format=True)
data2=pd.read_csv("sdcmergersdata.csv", parse_dates={"dater": [17]}, infer_datetime_format=True)
list(data2.columns.values)

data2.rename(columns=lambda x: x.replace('\r\n', ''), inplace=True)
data2.rename(columns=lambda x: x.replace('\n', ''), inplace=True)
data2.rename(columns=lambda x: x.replace('\r', ''), inplace=True)
data2=data2.rename(columns = {'Acquiror Name':'name'})
data2=data2.rename(columns = {'dater':'date'})
data=data.rename(columns = {'dater':'date'})

list(data2.columns.values)

data["name"]=data['name'].map(str.lower)
data2["name"]=data2['name'].map(str.lower)
data2['date'].fillna(method='pad')
data['namer1']=data['name']
data['dater1']=data['date']
data2['namer2']=data2['name']
data2['dater2']=data2['date']

print data.head()
print data2.head()
data['name'] = data['name'].map(lambda x: str(x)[:4])
data2['name'] = data2['name'].map(lambda x: str(x)[:4])

data3 = pd.merge(data, data2, how='left', on=['date','name'])
data3.to_csv("check.csv")

采纳答案by hernamesbarbara

I'd also love to see the final solution you came up with to know how it shook out in the end.

我也很想看到你想出的最终解决方案,以了解它最终如何动摇。

One thing you can do to find the closest date might be something to calc the number of days between each date in the first DataFrame and the dates in the second DataFrame. Then you can use np.argminto retrieve the date with the smallest time delta.

您可以做的一件事是找到最接近的日期,可能是计算第一个 DataFrame 中的每个日期与第二个 DataFrame 中的日期之间的天数。然后您可以使用np.argmin来检索具有最小时间增量的日期。

For example:

例如:

Setup

设置

#!/usr/bin/env python
# -*- coding: utf-8 -*- 
import numpy as np
import pandas as pd
from pandas.io.parsers import StringIO

Data

数据

a = """timepoint,measure
2014-01-01 00:00:00,78
2014-01-02 00:00:00,29
2014-01-03 00:00:00,5
2014-01-04 00:00:00,73
2014-01-05 00:00:00,40
2014-01-06 00:00:00,45
2014-01-07 00:00:00,48
2014-01-08 00:00:00,2
2014-01-09 00:00:00,96
2014-01-10 00:00:00,82
2014-01-11 00:00:00,61
2014-01-12 00:00:00,68
2014-01-13 00:00:00,8
2014-01-14 00:00:00,94
2014-01-15 00:00:00,16
2014-01-16 00:00:00,31
2014-01-17 00:00:00,10
2014-01-18 00:00:00,34
2014-01-19 00:00:00,27
2014-01-20 00:00:00,58
2014-01-21 00:00:00,90
2014-01-22 00:00:00,41
2014-01-23 00:00:00,97
2014-01-24 00:00:00,7
2014-01-25 00:00:00,86
2014-01-26 00:00:00,62
2014-01-27 00:00:00,91
2014-01-28 00:00:00,0
2014-01-29 00:00:00,73
2014-01-30 00:00:00,22
2014-01-31 00:00:00,43
2014-02-01 00:00:00,87
2014-02-02 00:00:00,56
2014-02-03 00:00:00,45
2014-02-04 00:00:00,25
2014-02-05 00:00:00,92
2014-02-06 00:00:00,83
2014-02-07 00:00:00,13
2014-02-08 00:00:00,50
2014-02-09 00:00:00,48
2014-02-10 00:00:00,78"""

b = """timepoint,measure
2014-01-01 00:00:00,78
2014-01-08 00:00:00,29
2014-01-15 00:00:00,5
2014-01-22 00:00:00,73
2014-01-29 00:00:00,40
2014-02-05 00:00:00,45
2014-02-12 00:00:00,48
2014-02-19 00:00:00,2
2014-02-26 00:00:00,96
2014-03-05 00:00:00,82
2014-03-12 00:00:00,61
2014-03-19 00:00:00,68
2014-03-26 00:00:00,8
2014-04-02 00:00:00,94
"""

look at data

看数据

df1 = pd.read_csv(StringIO(a), parse_dates=['timepoint'])
df1.head()

   timepoint  measure
0 2014-01-01       78
1 2014-01-02       29
2 2014-01-03        5
3 2014-01-04       73
4 2014-01-05       40

df2 = pd.read_csv(StringIO(b), parse_dates=['timepoint'])
df2.head()

   timepoint  measure
0 2014-01-01       78
1 2014-01-08       29
2 2014-01-15        5
3 2014-01-22       73
4 2014-01-29       40

Func to find the closest date to a given date

Func 查找与给定日期最接近的日期

def find_closest_date(timepoint, time_series, add_time_delta_column=True):
    # takes a pd.Timestamp() instance and a pd.Series with dates in it
    # calcs the delta between `timepoint` and each date in `time_series`
    # returns the closest date and optionally the number of days in its time delta
    deltas = np.abs(time_series - timepoint)
    idx_closest_date = np.argmin(deltas)
    res = {"closest_date": time_series.ix[idx_closest_date]}
    idx = ['closest_date']
    if add_time_delta_column:
        res["closest_delta"] = deltas[idx_closest_date]
        idx.append('closest_delta')
    return pd.Series(res, index=idx)

df1[['closest', 'days_bt_x_and_y']] = df1.timepoint.apply(
                                          find_closest_date, args=[df2.timepoint])
df1.head(10)

   timepoint  measure    closest  days_bt_x_and_y
0 2014-01-01       78 2014-01-01           0 days
1 2014-01-02       29 2014-01-01           1 days
2 2014-01-03        5 2014-01-01           2 days
3 2014-01-04       73 2014-01-01           3 days
4 2014-01-05       40 2014-01-08           3 days
5 2014-01-06       45 2014-01-08           2 days
6 2014-01-07       48 2014-01-08           1 days
7 2014-01-08        2 2014-01-08           0 days
8 2014-01-09       96 2014-01-08           1 days
9 2014-01-10       82 2014-01-08           2 days

Merge the two DataFrames on the new closestdate column

合并新closest日期列上的两个 DataFrame

df3 = pd.merge(df1, df2, left_on=['closest'], right_on=['timepoint'])

colorder = [
    'timepoint_x',
    'closest',
    'timepoint_y',
    'days_bt_x_and_y',
    'measure_x',
    'measure_y'
]

df3 = df3.ix[:, colorder]
df3

   timepoint_x    closest timepoint_y  days_bt_x_and_y  measure_x  measure_y
0   2014-01-01 2014-01-01  2014-01-01           0 days         78         78
1   2014-01-02 2014-01-01  2014-01-01           1 days         29         78
2   2014-01-03 2014-01-01  2014-01-01           2 days          5         78
3   2014-01-04 2014-01-01  2014-01-01           3 days         73         78
4   2014-01-05 2014-01-08  2014-01-08           3 days         40         29
5   2014-01-06 2014-01-08  2014-01-08           2 days         45         29
6   2014-01-07 2014-01-08  2014-01-08           1 days         48         29
7   2014-01-08 2014-01-08  2014-01-08           0 days          2         29
8   2014-01-09 2014-01-08  2014-01-08           1 days         96         29
9   2014-01-10 2014-01-08  2014-01-08           2 days         82         29
10  2014-01-11 2014-01-08  2014-01-08           3 days         61         29
11  2014-01-12 2014-01-15  2014-01-15           3 days         68          5
12  2014-01-13 2014-01-15  2014-01-15           2 days          8          5
13  2014-01-14 2014-01-15  2014-01-15           1 days         94          5
14  2014-01-15 2014-01-15  2014-01-15           0 days         16          5
15  2014-01-16 2014-01-15  2014-01-15           1 days         31          5
16  2014-01-17 2014-01-15  2014-01-15           2 days         10          5
17  2014-01-18 2014-01-15  2014-01-15           3 days         34          5
18  2014-01-19 2014-01-22  2014-01-22           3 days         27         73
19  2014-01-20 2014-01-22  2014-01-22           2 days         58         73
20  2014-01-21 2014-01-22  2014-01-22           1 days         90         73
21  2014-01-22 2014-01-22  2014-01-22           0 days         41         73
22  2014-01-23 2014-01-22  2014-01-22           1 days         97         73
23  2014-01-24 2014-01-22  2014-01-22           2 days          7         73
24  2014-01-25 2014-01-22  2014-01-22           3 days         86         73
25  2014-01-26 2014-01-29  2014-01-29           3 days         62         40
26  2014-01-27 2014-01-29  2014-01-29           2 days         91         40
27  2014-01-28 2014-01-29  2014-01-29           1 days          0         40
28  2014-01-29 2014-01-29  2014-01-29           0 days         73         40
29  2014-01-30 2014-01-29  2014-01-29           1 days         22         40
30  2014-01-31 2014-01-29  2014-01-29           2 days         43         40
31  2014-02-01 2014-01-29  2014-01-29           3 days         87         40
32  2014-02-02 2014-02-05  2014-02-05           3 days         56         45
33  2014-02-03 2014-02-05  2014-02-05           2 days         45         45
34  2014-02-04 2014-02-05  2014-02-05           1 days         25         45
35  2014-02-05 2014-02-05  2014-02-05           0 days         92         45
36  2014-02-06 2014-02-05  2014-02-05           1 days         83         45
37  2014-02-07 2014-02-05  2014-02-05           2 days         13         45
38  2014-02-08 2014-02-05  2014-02-05           3 days         50         45
39  2014-02-09 2014-02-12  2014-02-12           3 days         48         48
40  2014-02-10 2014-02-12  2014-02-12           2 days         78         48

回答by osonuyi

This is super late, but hopefully its helpful for new answer seekers. I answered a similar question here

这太晚了,但希望它对新的答案寻求者有所帮助。我在这里回答了一个类似的问题

with a somewhat new method in pandas:

在 Pandas 中使用一种新方法:

pandas.merge_asof()

pandas.merge_asof()

The parameters of interest for you would be direction,tolerance,left_on, and right_on

感兴趣的参数,你会是directiontoleranceleft_on,和right_on

Building off @hernamesbarbara answer & data:

建立@hernamesbarbara 答案和数据:

data

数据

a = """timepoint,measure
2014-01-01 00:00:00,78
2014-01-02 00:00:00,29
2014-01-03 00:00:00,5
2014-01-04 00:00:00,73
2014-01-05 00:00:00,40
2014-01-06 00:00:00,45
2014-01-07 00:00:00,48
2014-01-08 00:00:00,2
2014-01-09 00:00:00,96
2014-01-10 00:00:00,82
2014-01-11 00:00:00,61
2014-01-12 00:00:00,68
2014-01-13 00:00:00,8
2014-01-14 00:00:00,94
2014-01-15 00:00:00,16
2014-01-16 00:00:00,31
2014-01-17 00:00:00,10
2014-01-18 00:00:00,34
2014-01-19 00:00:00,27
2014-01-20 00:00:00,58
2014-01-21 00:00:00,90
2014-01-22 00:00:00,41
2014-01-23 00:00:00,97
2014-01-24 00:00:00,7
2014-01-25 00:00:00,86
2014-01-26 00:00:00,62
2014-01-27 00:00:00,91
2014-01-28 00:00:00,0
2014-01-29 00:00:00,73
2014-01-30 00:00:00,22
2014-01-31 00:00:00,43
2014-02-01 00:00:00,87
2014-02-02 00:00:00,56
2014-02-03 00:00:00,45
2014-02-04 00:00:00,25
2014-02-05 00:00:00,92
2014-02-06 00:00:00,83
2014-02-07 00:00:00,13
2014-02-08 00:00:00,50
2014-02-09 00:00:00,48
2014-02-10 00:00:00,78"""

b = """timepoint,measure
2014-01-01 00:00:00,78
2014-01-08 00:00:00,29
2014-01-15 00:00:00,5
2014-01-22 00:00:00,73
2014-01-29 00:00:00,40
2014-02-05 00:00:00,45
2014-02-12 00:00:00,48
2014-02-19 00:00:00,2
2014-02-26 00:00:00,96
2014-03-05 00:00:00,82
2014-03-12 00:00:00,61
2014-03-19 00:00:00,68
2014-03-26 00:00:00,8
2014-04-02 00:00:00,94
"""

solution

解决方案

import pandas as pd
from pandas import read_csv
from io import StringIO

df1 = pd.read_csv(StringIO(a), parse_dates=['timepoint'])
df2 = pd.read_csv(StringIO(b), parse_dates=['timepoint'])



df1['timepoint'] = pd.to_datetime(df1['timepoint'])
df2['timepoint'] = pd.to_datetime(df2['timepoint'])

# converting this to the index so we can preserve the date_start_time columns so you can validate the merging logic
df1.index = df1['timepoint']
df2.index = df2['timepoint']
# the magic happens below, check the direction and tolerance arguments
# if you want you can make a maximum tolerance on which to merge data
tol = pd.Timedelta('3 day')
df3 = pd.merge_asof(left=df1,right=df2,right_index=True,left_index=True,direction='nearest',tolerance=tol)

output

输出

df3.head()

    timepoint_x measure_x   timepoint_y measure_y
timepoint               
2014-01-01  2014-01-01  78  2014-01-01  78
2014-01-02  2014-01-02  29  2014-01-01  78
2014-01-03  2014-01-03  5   2014-01-01  78
2014-01-04  2014-01-04  73  2014-01-01  78
2014-01-05  2014-01-05  40  2014-01-08  29

回答by dinya

A small addition to the hernamesbarbara's code

hernamesbarbara 代码的一个小补充

def find_closest_date(timepoint, time_series, add_time_delta_column=True, mode="abs"):
    """takes a pd.Timestamp() instance and a pd.Series with dates in it
    calcs the delta between `timepoint` and each date in `time_series`
    returns the closest date and optionally the number of days in its time delta

    Parameters
    ----------
    mode: "abs" (default), "left", "right"
        closest datetime by abs, at left, at right

    References
    ----------
    .. [1] http://stackoverflow.com/a/25962323/716469
    """
    deltas = time_series - timepoint

    idx_closest_date = None
    if mode == "abs":
        idx_closest_date = np.argmin(abs(deltas))
    elif mode == "left":
        deltas_ = deltas[deltas <= pd.Timedelta('0 days 00:00:00.0')]
        if len(deltas_):
            idx_closest_date = np.argmax(deltas_)
    elif mode == "right":
        deltas_ = deltas[deltas >= pd.Timedelta('0 days 00:00:00.0')]
        if len(deltas_):
            idx_closest_date = np.argmin(deltas_)
    else:
        raise Exception("Mode is incorrect")

    if idx_closest_date is not None:
        closest_date = time_series.ix[idx_closest_date]
        if add_time_delta_column:
            closest_delta = deltas[idx_closest_date]
    else:
        closest_date = pd.NaT
        if add_time_delta_column:
            closest_delta = pd.Timedelta(pd.NaT)

    res = {"closest_date": closest_date}
    idx = ['closest_date']
    if add_time_delta_column:
        res["closest_delta"] = closest_delta
        idx.append('closest_delta')

    return pd.Series(res, index=idx)