Notice: this page is derived from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/30627968/

Date: 2020-08-19 08:43:47  Source: igfitidea

Merge pandas dataframes where one value is between two others

Tags: python, pandas, join, timespan, date-range

Asked by itzy

I need to merge two pandas dataframes on an identifier and a condition where a date in one dataframe is between two dates in the other dataframe.

Dataframe A has a date ("fdate") and an ID ("cusip"):

[Image: dataframe A]

I need to merge this with this dataframe B:

[Image: dataframe B]

on A.cusip==B.ncusip and A.fdate is between B.namedt and B.nameenddt.

In SQL this would be trivial, but the only way I can see how to do this in pandas is to first merge unconditionally on the identifier, and then filter on the date condition:

df = pd.merge(A, B, how='inner', left_on='cusip', right_on='ncusip')
df = df[(df['fdate']>=df['namedt']) & (df['fdate']<=df['nameenddt'])]

Is this really the best way to do this? It seems that it would be much better if one could filter within the merge so as to avoid having a potentially very large dataframe after the merge but before the filter has completed.
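
To make the merge-then-filter approach concrete, here is a minimal sketch on made-up data. Only the column names cusip, fdate, ncusip, namedt and nameenddt come from the question; the values and the permno column are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
A = pd.DataFrame({
    "cusip": ["000001", "000001", "000002"],
    "fdate": pd.to_datetime(["2010-03-31", "2014-06-30", "2012-12-31"]),
})
B = pd.DataFrame({
    "ncusip": ["000001", "000001", "000002"],
    "namedt": pd.to_datetime(["2009-01-01", "2013-01-01", "2015-01-01"]),
    "nameenddt": pd.to_datetime(["2012-12-31", "2016-12-31", "2018-12-31"]),
    "permno": [10001, 10002, 20001],
})

# Step 1: unconditional merge on the identifier (rows multiply per matching id).
merged = pd.merge(A, B, how="inner", left_on="cusip", right_on="ncusip")

# Step 2: keep only rows whose fdate falls inside [namedt, nameenddt].
in_range = (merged["fdate"] >= merged["namedt"]) & (merged["fdate"] <= merged["nameenddt"])
result = merged[in_range].reset_index(drop=True)

print(len(merged), len(result))  # 5 2
```

The intermediate frame has five rows even though only two survive the filter, which is the blow-up the question is worried about.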

Accepted answer by ChuHo

As you say, this is pretty easy in SQL, so why not do it in SQL?

import pandas as pd
import sqlite3
from datetime import datetime

#We'll use firelynx's tables:
presidents = pd.DataFrame({"name": ["Bush", "Obama", "Trump"],
                           "president_id":[43, 44, 45]})
terms = pd.DataFrame({'start_date': pd.date_range('2001-01-20', periods=5, freq='48M'),
                      'end_date': pd.date_range('2005-01-21', periods=5, freq='48M'),
                      'president_id': [43, 43, 44, 44, 45]})
war_declarations = pd.DataFrame({"date": [datetime(2001, 9, 14), datetime(2003, 3, 3)],
                                 "name": ["War in Afghanistan", "Iraq War"]})
#Make the db in memory
conn = sqlite3.connect(':memory:')
#write the tables
terms.to_sql('terms', conn, index=False)
presidents.to_sql('presidents', conn, index=False)
war_declarations.to_sql('wars', conn, index=False)

qry = '''
    select  
        start_date PresTermStart,
        end_date PresTermEnd,
        wars.date WarStart,
        presidents.name Pres
    from
        terms join wars on
        date between start_date and end_date join presidents on
        terms.president_id = presidents.president_id
    '''
df = pd.read_sql_query(qry, conn)

df:

         PresTermStart          PresTermEnd             WarStart  Pres
0  2001-01-31 00:00:00  2005-01-31 00:00:00  2001-09-14 00:00:00  Bush
1  2001-01-31 00:00:00  2005-01-31 00:00:00  2003-03-03 00:00:00  Bush
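
The same idea can be applied to the frames from the question. The following is only a sketch under assumptions: the sample values are made up, and the dates are stored as ISO-8601 strings so that SQLite's text comparison orders them chronologically.

```python
import sqlite3
import pandas as pd

# Hypothetical sample data; dates are ISO-8601 strings so that
# SQLite compares them in chronological order.
A = pd.DataFrame({"cusip": ["000001", "000001", "000002"],
                  "fdate": ["2010-03-31", "2014-06-30", "2012-12-31"]})
B = pd.DataFrame({"ncusip": ["000001", "000001", "000002"],
                  "namedt": ["2009-01-01", "2013-01-01", "2015-01-01"],
                  "nameenddt": ["2012-12-31", "2016-12-31", "2018-12-31"]})

conn = sqlite3.connect(":memory:")
A.to_sql("A", conn, index=False)
B.to_sql("B", conn, index=False)

# The range condition is part of the join itself, so no oversized
# intermediate dataframe is built on the pandas side.
qry = """
    select A.cusip, A.fdate, B.namedt, B.nameenddt
    from A join B
      on A.cusip = B.ncusip
     and A.fdate between B.namedt and B.nameenddt
    order by A.fdate
"""
result = pd.read_sql_query(qry, conn)
conn.close()

print(result["fdate"].tolist())  # ['2010-03-31', '2014-06-30']
```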

Answer by firelynx

There is no idiomatic pandas way of doing this at the moment.

This answer used to be about tackling the problem with polymorphism, which turned out to be a very bad idea.

Then the numpy.piecewise function appeared in another answer, but with little explanation, so I thought I would clarify how this function can be used.

Numpy way with piecewise (Memory heavy)

The np.piecewise function can be used to generate the behavior of a custom join. There is a lot of overhead involved and it is not very efficient per se, but it does the job.

Producing conditions for joining

import numpy as np
import pandas as pd
from datetime import datetime


presidents = pd.DataFrame({"name": ["Bush", "Obama", "Trump"],
                           "president_id":[43, 44, 45]})
terms = pd.DataFrame({'start_date': pd.date_range('2001-01-20', periods=5, freq='48M'),
                      'end_date': pd.date_range('2005-01-21', periods=5, freq='48M'),
                      'president_id': [43, 43, 44, 44, 45]})
war_declarations = pd.DataFrame({"date": [datetime(2001, 9, 14), datetime(2003, 3, 3)],
                                 "name": ["War in Afghanistan", "Iraq War"]})

start_end_date_tuples = zip(terms.start_date.values, terms.end_date.values)
conditions = [(war_declarations.date.values >= start_date) &
              (war_declarations.date.values <= end_date) for start_date, end_date in start_end_date_tuples]

> conditions
[array([ True,  True], dtype=bool),
 array([False, False], dtype=bool),
 array([False, False], dtype=bool),
 array([False, False], dtype=bool),
 array([False, False], dtype=bool)]

This is a list of arrays where each array tells us whether the term time span matched each of the two war declarations we have. The conditions can explode with larger datasets, as the total number of boolean values is the length of the left df multiplied by the length of the right df.
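
To get a feel for that growth, here is a small sketch that builds the same kind of condition matrix with broadcasting. The sizes and integer "dates" are made up purely for illustration.

```python
import numpy as np

# Hypothetical sizes, chosen only to show how the condition matrix grows.
n_spans, n_points = 5, 2
starts = np.arange(n_spans) * 10          # span i covers [10*i, 10*i + 9]
ends = starts + 9
points = np.array([3, 27])

# One boolean row per span, one column per point: shape (n_spans, n_points).
conditions = (points >= starts[:, None]) & (points <= ends[:, None])

print(conditions.shape)   # (5, 2)
print(conditions.size)    # 10
```

With 100,000 rows on each side, the same matrix would hold 10^10 booleans, which is roughly 10 GB at one byte per value.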

The piecewise "magic"

Now piecewise will take the president_id from terms and place it in the war_declarations dataframe for each of the corresponding wars.

war_declarations['president_id'] = np.piecewise(np.zeros(len(war_declarations)),
                                                conditions,
                                                terms.president_id.values)

    date        name                president_id
0   2001-09-14  War in Afghanistan          43.0
1   2003-03-03  Iraq War                    43.0
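
For readers unfamiliar with np.piecewise, here is a stripped-down sketch of what it does when the "functions" are plain constants. The arrays are made up for illustration.

```python
import numpy as np

x = np.zeros(4)                      # float array, so the result is float too
condlist = [np.array([True, False, False, False]),
            np.array([False, True, True, False])]
funclist = [43, 44]                  # constants: one value per condition

# Wherever condlist[i] is True, the output takes funclist[i];
# positions matched by no condition keep the default value 0.
out = np.piecewise(x, condlist, funclist)

print(out.tolist())   # [43.0, 44.0, 44.0, 0.0]
```

This also explains why president_id shows up as 43.0 above: the output inherits the float dtype of the np.zeros input, and a war that matched no term would silently get 0.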

Now, to finish this example, we just need a regular merge to bring in the presidents' names.

war_declarations.merge(presidents, on="president_id", suffixes=["_war", "_president"])

    date        name_war            president_id    name_president
0   2001-09-14  War in Afghanistan          43.0    Bush
1   2003-03-03  Iraq War                    43.0    Bush

Polymorphism (does not work)

I wanted to share my research efforts, so even if this does not solve the problem, I hope it will be allowed to live on here as a useful reply at least. Since it is hard to spot the error, someone else may try this and think they have a working solution, while in fact they don't.

The only other way I could figure out is to create two new classes, one PointInTime and one Timespan.

Both should have __eq__ methods that return true if a PointInTime is compared to a Timespan which contains it.

After that you can fill your DataFrame with these objects, and join on the columns they live in.

Something like this:

class PointInTime(object):

    def __init__(self, year, month, day):
        self.dt = datetime(year, month, day)

    def __eq__(self, other):
        return other.start_date < self.dt < other.end_date

    def __ne__(self, other):
        return not self.__eq__(other)

    def __repr__(self):
        return "{}-{}-{}".format(self.dt.year, self.dt.month, self.dt.day)

class Timespan(object):
    def __init__(self, start_date, end_date):
        self.start_date = start_date
        self.end_date = end_date

    def __eq__(self, other):
        return self.start_date < other.dt < self.end_date

    def __ne__(self, other):
        return not self.__eq__(other)

    def __repr__(self):
        return "{}-{}-{} -> {}-{}-{}".format(self.start_date.year, self.start_date.month, self.start_date.day,
                                             self.end_date.year, self.end_date.month, self.end_date.day)

Important note: I do not subclass datetime because pandas will consider the dtype of the column of datetime objects to be a datetime dtype, and since the timespan is not, pandas silently refuses to merge on them.

If we instantiate two objects of these classes, they can now be compared:

pit = PointInTime(2015,1,1)
ts = Timespan(datetime(2014,1,1), datetime(2015,2,2))
pit == ts
True

We can also fill two DataFrames with these objects:

df = pd.DataFrame({"pit":[PointInTime(2015,1,1), PointInTime(2015,2,2), PointInTime(2015,3,3)]})

df2 = pd.DataFrame({"ts":[Timespan(datetime(2015,2,1), datetime(2015,2,5)), Timespan(datetime(2015,2,1), datetime(2015,4,1))]})

And then the merging kind of works:

pd.merge(left=df, left_on='pit', right=df2, right_on='ts')

        pit                    ts
0  2015-2-2  2015-2-1 -> 2015-2-5
1  2015-2-2  2015-2-1 -> 2015-4-1

But only kind of.

PointInTime(2015,3,3) should also have been included in this join on Timespan(datetime(2015,2,1), datetime(2015,4,1)).

But it is not.

I figure pandas compares PointInTime(2015,3,3) to PointInTime(2015,2,2) and assumes that, since they are not equal, PointInTime(2015,3,3) cannot be equal to Timespan(datetime(2015,2,1), datetime(2015,4,1)), since this timespan was equal to PointInTime(2015,2,2).

Sort of like this:

Rose == Flower
Lilly != Rose

Therefore:

Lilly != Flower
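
The underlying problem can be shown in plain Python, without pandas. This is only a sketch of the non-transitive equality, not a description of how pandas implements joins; the Span class is a simplified, hypothetical stand-in for the Timespan class above.

```python
from datetime import datetime

class Span:
    """Simplified stand-in for the Timespan class above (hypothetical)."""
    def __init__(self, start, end):
        self.start, self.end = start, end

    def __eq__(self, other):
        # "Equal" to any datetime that falls strictly inside the span.
        return self.start < other < self.end

feb_2 = datetime(2015, 2, 2)
mar_3 = datetime(2015, 3, 3)
span = Span(datetime(2015, 2, 1), datetime(2015, 4, 1))

print(span == feb_2)    # True
print(span == mar_3)    # True
print(feb_2 == mar_3)   # False: equality is not transitive here

# In Python 3, defining __eq__ without __hash__ also makes the class unhashable.
try:
    hash(span)
    hashable = True
except TypeError:
    hashable = False
print(hashable)         # False
```

Any join strategy that groups keys by equality or by hash relies on equality being transitive, so an __eq__ like this one cannot be expected to behave consistently.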

Edit:

I tried making all PointInTime objects equal to each other. This changed the behaviour of the join to include 2015-3-3, but 2015-2-2 was then only included for the Timespan 2015-2-1 -> 2015-2-5, which strengthens my hypothesis above.

If anyone has any other ideas, please comment and I can try it.

Answer by Karthik Arumugham

A pandas solution would be great if it were implemented similarly to foverlaps() from the data.table package in R. So far I've found numpy's piecewise() to be efficient. The code below is based on an earlier discussion, "Merging dataframes based on date range":

# One boolean mask per row of B: same id and fdate inside [namedt, nameenddt].
conditions = [(A['cusip'].values == ncusip) & (A['fdate'].values >= start) & (A['fdate'].values <= end)
              for ncusip, start, end in zip(B['ncusip'].values, B['namedt'].values, B['nameenddt'].values)]
A['permno'] = np.piecewise(np.zeros(len(A)), conditions, B['permno'].values).astype(int)
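
Here is a runnable sketch of the same piecewise lookup on made-up data. The sample values are assumptions for illustration; only the column names come from the question and this answer.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data for illustration.
A = pd.DataFrame({
    "cusip": ["000001", "000001", "000002"],
    "fdate": pd.to_datetime(["2010-03-31", "2014-06-30", "2012-12-31"]),
})
B = pd.DataFrame({
    "ncusip": ["000001", "000001", "000002"],
    "namedt": pd.to_datetime(["2009-01-01", "2013-01-01", "2015-01-01"]),
    "nameenddt": pd.to_datetime(["2012-12-31", "2016-12-31", "2018-12-31"]),
    "permno": [10001, 10002, 20001],
})

cusip = A["cusip"].to_numpy()
fdate = A["fdate"].to_numpy()

# One boolean mask over the rows of A for every row of B.
conditions = [
    (cusip == ncusip) & (fdate >= start) & (fdate <= end)
    for ncusip, start, end in zip(
        B["ncusip"].to_numpy(), B["namedt"].to_numpy(), B["nameenddt"].to_numpy()
    )
]

# Rows of A that match no row of B keep the default value 0.
A["permno"] = np.piecewise(np.zeros(len(A)), conditions, B["permno"].to_numpy()).astype(int)

print(A["permno"].tolist())  # [10001, 10002, 0]
```

Note that unmatched rows end up with permno 0 rather than being dropped, so this behaves like a left join with 0 as the missing marker, not like the inner join in the question.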

Answer by chris dorn

You should be able to do this now using the pandasql package:

import pandasql as ps

sqlcode = '''
select A.cusip
from A
inner join B on A.cusip=B.ncusip
where A.fdate >= B.namedt and A.fdate <= B.nameenddt
group by A.cusip
'''

newdf = ps.sqldf(sqlcode,locals())

I think the answer from @ChuHo is good. I believe pandasql is doing the same for you. I haven't benchmarked the two, but it is easier to read.
