如何在 python Pandas 中进行条件连接?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/23508351/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
How to do a conditional join in python Pandas?
提问by JAB
I am trying to calculate time based aggregations in Pandas based on date values stored in a separate tables.
我正在尝试根据存储在单独表中的日期值计算 Pandas 中基于时间的聚合。
The top of the first table table_a looks like this:
第一个表 table_a 的顶部如下所示:
COMPANY_ID DATE MEASURE
1 2010-01-01 00:00:00 10
1 2010-01-02 00:00:00 10
1 2010-01-03 00:00:00 10
1 2010-01-04 00:00:00 10
1 2010-01-05 00:00:00 10
Here is the code to create the table:
下面是创建表的代码:
table_a = pd.concat(\
[pd.DataFrame({'DATE': pd.date_range("01/01/2010", "12/31/2010", freq="D"),\
'COMPANY_ID': 1 , 'MEASURE': 10}),\
pd.DataFrame({'DATE': pd.date_range("01/01/2010", "12/31/2010", freq="D"),\
'COMPANY_ID': 2 , 'MEASURE': 10})])
The second table, table_b looks like this:
第二个表 table_b 如下所示:
COMPANY END_DATE
1 2010-03-01 00:00:00
1 2010-06-02 00:00:00
2 2010-03-01 00:00:00
2 2010-06-02 00:00:00
and the code to create it is:
创建它的代码是:
table_b = pd.DataFrame({'END_DATE':pd.to_datetime(['03/01/2010','06/02/2010','03/01/2010','06/02/2010']),\
'COMPANY':(1,1,2,2)})
I want to be able to get the sum of the measure column for each COMPANY_ID for each 30 day period prior to the END_DATE in table_b.
我希望能够获得 table_b 中 END_DATE 之前每 30 天期间每个 COMPANY_ID 的度量列的总和。
This is (I think) the SQL equivalent:
这是(我认为)SQL 等效项:
select
b.COMPANY_ID,
b.DATE
sum(a.MEASURE) AS MEASURE_TO_END_DATE
from table_a a, table_b b
where a.COMPANY = b.COMPANY and
a.DATE < b.DATE and
a.DATE > b.DATE - 30
group by b.COMPANY;
Thanks for any help
谢谢你的帮助
采纳答案by Karl D.
Well, I can think of a few ways. (1) essentially blow up the dataframe by merging on company
and then filter on the 30 day windows after the merge. This should be fast but could use lots of memory. (2) Move the merging and filtering on the 30 day window into a groupby. This results in a merge for each group so it would be slower but it should use less memory
嗯,我能想到几个方法。(1) 通过合并company
然后在合并后的 30 天窗口上过滤来基本上炸毁数据帧。这应该很快,但可能会使用大量内存。(2) 将30天窗口的合并和过滤移到groupby中。这会导致每个组的合并,因此它会更慢,但它应该使用更少的内存
Option #1
选项1
Suppose your data looks like the following (I expanded your sample data):
假设您的数据如下所示(我扩展了您的示例数据):
print df
company date measure
0 0 2010-01-01 10
1 0 2010-01-15 10
2 0 2010-02-01 10
3 0 2010-02-15 10
4 0 2010-03-01 10
5 0 2010-03-15 10
6 0 2010-04-01 10
7 1 2010-03-01 5
8 1 2010-03-15 5
9 1 2010-04-01 5
10 1 2010-04-15 5
11 1 2010-05-01 5
12 1 2010-05-15 5
print windows
company end_date
0 0 2010-02-01
1 0 2010-03-15
2 1 2010-04-01
3 1 2010-05-15
Create a beginning date for the 30 day windows:
为 30 天的窗口创建一个开始日期:
windows['beg_date'] = (windows['end_date'].values.astype('datetime64[D]') -
np.timedelta64(30,'D'))
print windows
company end_date beg_date
0 0 2010-02-01 2010-01-02
1 0 2010-03-15 2010-02-13
2 1 2010-04-01 2010-03-02
3 1 2010-05-15 2010-04-15
Now do a merge and then select based on if date
falls within beg_date
and end_date
:
现在做一个合并,然后选择基础上,如果date
落入beg_date
和end_date
:
df = df.merge(windows,on='company',how='left')
df = df[(df.date >= df.beg_date) & (df.date <= df.end_date)]
print df
company date measure end_date beg_date
2 0 2010-01-15 10 2010-02-01 2010-01-02
4 0 2010-02-01 10 2010-02-01 2010-01-02
7 0 2010-02-15 10 2010-03-15 2010-02-13
9 0 2010-03-01 10 2010-03-15 2010-02-13
11 0 2010-03-15 10 2010-03-15 2010-02-13
16 1 2010-03-15 5 2010-04-01 2010-03-02
18 1 2010-04-01 5 2010-04-01 2010-03-02
21 1 2010-04-15 5 2010-05-15 2010-04-15
23 1 2010-05-01 5 2010-05-15 2010-04-15
25 1 2010-05-15 5 2010-05-15 2010-04-15
You can compute the 30 day window sums by grouping on company
and end_date
:
您可以通过对company
和分组来计算 30 天窗口总和end_date
:
print df.groupby(['company','end_date']).sum()
measure
company end_date
0 2010-02-01 20
2010-03-15 30
1 2010-04-01 10
2010-05-15 15
Option #2Move all merging into a groupby. This should be better on memory but I would think much slower:
选项 #2将所有合并移动到 groupby 中。这应该在内存上更好,但我认为慢得多:
windows['beg_date'] = (windows['end_date'].values.astype('datetime64[D]') -
np.timedelta64(30,'D'))
def cond_merge(g,windows):
g = g.merge(windows,on='company',how='left')
g = g[(g.date >= g.beg_date) & (g.date <= g.end_date)]
return g.groupby('end_date')['measure'].sum()
print df.groupby('company').apply(cond_merge,windows)
company end_date
0 2010-02-01 20
2010-03-15 30
1 2010-04-01 10
2010-05-15 15
Another optionNow if your windows never overlap (like in the example data), you could do something like the following as an alternative that doesn't blow up a dataframe but is pretty fast:
另一种选择现在,如果您的窗口从不重叠(如示例数据中所示),您可以执行以下操作作为替代方法,它不会炸毁数据帧但速度非常快:
windows['date'] = windows['end_date']
df = df.merge(windows,on=['company','date'],how='outer')
print df
company date measure end_date
0 0 2010-01-01 10 NaT
1 0 2010-01-15 10 NaT
2 0 2010-02-01 10 2010-02-01
3 0 2010-02-15 10 NaT
4 0 2010-03-01 10 NaT
5 0 2010-03-15 10 2010-03-15
6 0 2010-04-01 10 NaT
7 1 2010-03-01 5 NaT
8 1 2010-03-15 5 NaT
9 1 2010-04-01 5 2010-04-01
10 1 2010-04-15 5 NaT
11 1 2010-05-01 5 NaT
12 1 2010-05-15 5 2010-05-15
This merge essentially inserts your window end dates into the dataframe and then backfilling the end dates (by group) will give you a structure to easily create you summation windows:
这种合并本质上是将您的窗口结束日期插入到数据框中,然后回填结束日期(按组)将为您提供一个结构来轻松创建您的求和窗口:
df['end_date'] = df.groupby('company')['end_date'].apply(lambda x: x.bfill())
print df
company date measure end_date
0 0 2010-01-01 10 2010-02-01
1 0 2010-01-15 10 2010-02-01
2 0 2010-02-01 10 2010-02-01
3 0 2010-02-15 10 2010-03-15
4 0 2010-03-01 10 2010-03-15
5 0 2010-03-15 10 2010-03-15
6 0 2010-04-01 10 NaT
7 1 2010-03-01 5 2010-04-01
8 1 2010-03-15 5 2010-04-01
9 1 2010-04-01 5 2010-04-01
10 1 2010-04-15 5 2010-05-15
11 1 2010-05-01 5 2010-05-15
12 1 2010-05-15 5 2010-05-15
df = df[df.end_date.notnull()]
df['beg_date'] = (df['end_date'].values.astype('datetime64[D]') -
np.timedelta64(30,'D'))
print df
company date measure end_date beg_date
0 0 2010-01-01 10 2010-02-01 2010-01-02
1 0 2010-01-15 10 2010-02-01 2010-01-02
2 0 2010-02-01 10 2010-02-01 2010-01-02
3 0 2010-02-15 10 2010-03-15 2010-02-13
4 0 2010-03-01 10 2010-03-15 2010-02-13
5 0 2010-03-15 10 2010-03-15 2010-02-13
7 1 2010-03-01 5 2010-04-01 2010-03-02
8 1 2010-03-15 5 2010-04-01 2010-03-02
9 1 2010-04-01 5 2010-04-01 2010-03-02
10 1 2010-04-15 5 2010-05-15 2010-04-15
11 1 2010-05-01 5 2010-05-15 2010-04-15
12 1 2010-05-15 5 2010-05-15 2010-04-15
df = df[(df.date >= df.beg_date) & (df.date <= df.end_date)]
print df.groupby(['company','end_date']).sum()
measure
company end_date
0 2010-02-01 20
2010-03-15 30
1 2010-04-01 10
2010-05-15 15
Another alternative is to resample your first dataframe to daily data and then compute rolling_sums with a 30 day window; and select the dates at the end that you are interested in. This could be quite memory intensive too.
另一种选择是将您的第一个数据帧重新采样为每日数据,然后使用 30 天的窗口计算滚动总和;并在最后选择您感兴趣的日期。这也可能会占用大量内存。
回答by ANKIT MORAL
There is a very easy, and practical (or maybe the only direct way) to do conditional join in pandas. Since there is no direct way to do conditional join in pandas, you will need an additional library, and that is, pandasql
有一种非常简单且实用(或者可能是唯一直接的方法)在 Pandas 中进行条件连接。由于没有直接的方法在 Pandas 中进行条件连接,您将需要一个额外的库,也就是说,pandasql
Install the library pandasql
from pip using the command pip install pandasql
. This library allows you to manipulate the pandas dataframes using the SQL queries.
pandasql
使用命令从 pip安装库pip install pandasql
。该库允许您使用 SQL 查询操作 Pandas 数据帧。
import pandas as pd
from pandasql import sqldf
df = pd.read_excel(r'play_data.xlsx')
df
id Name Amount
0 A001 A 100
1 A002 B 110
2 A003 C 120
3 A005 D 150
Now let's just do a conditional join to compare the Amount of the IDs
现在让我们做一个条件连接来比较 ID 的数量
# Make your pysqldf object:
pysqldf = lambda q: sqldf(q, globals())
# Write your query in SQL syntax, here you can use df as a normal SQL table
cond_join= '''
select
df_left.*,
df_right.*
from df as df_left
join df as df_right
on
df_left.[Amount] > (df_right.[Amount]+10)
'''
# Now, get your queries results as dataframe using the sqldf object that you created
pysqldf(cond_join)
id Name Amount id Name Amount
0 A003 C 120 A001 A 100
1 A005 D 150 A001 A 100
2 A005 D 150 A002 B 110
3 A005 D 150 A003 C 120