如何在 python Pandas 中进行条件连接?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/23508351/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-19 03:03:28  来源:igfitidea点击:

How to do a conditional join in python Pandas?

pythonjoinpandas

提问by JAB

I am trying to calculate time based aggregations in Pandas based on date values stored in a separate tables.

我正在尝试根据存储在单独表中的日期值计算 Pandas 中基于时间的聚合。

The top of the first table table_a looks like this:

第一个表 table_a 的顶部如下所示:

    COMPANY_ID  DATE            MEASURE
    1   2010-01-01 00:00:00     10
    1   2010-01-02 00:00:00     10
    1   2010-01-03 00:00:00     10
    1   2010-01-04 00:00:00     10
    1   2010-01-05 00:00:00     10

Here is the code to create the table:

下面是创建表的代码:

    table_a = pd.concat(\
    [pd.DataFrame({'DATE': pd.date_range("01/01/2010", "12/31/2010", freq="D"),\
    'COMPANY_ID': 1 , 'MEASURE': 10}),\
    pd.DataFrame({'DATE': pd.date_range("01/01/2010", "12/31/2010", freq="D"),\
    'COMPANY_ID': 2 , 'MEASURE': 10})])

The second table, table_b looks like this:

第二个表 table_b 如下所示:

        COMPANY     END_DATE
        1   2010-03-01 00:00:00
        1   2010-06-02 00:00:00
        2   2010-03-01 00:00:00
        2   2010-06-02 00:00:00

and the code to create it is:

创建它的代码是:

    table_b = pd.DataFrame({'END_DATE':pd.to_datetime(['03/01/2010','06/02/2010','03/01/2010','06/02/2010']),\
                    'COMPANY':(1,1,2,2)})

I want to be able to get the sum of the measure column for each COMPANY_ID for each 30 day period prior to the END_DATE in table_b.

我希望能够获得 table_b 中 END_DATE 之前每 30 天期间每个 COMPANY_ID 的度量列的总和。

This is (I think) the SQL equivalent:

这是(我认为)SQL 等效项:

      select
 b.COMPANY_ID,
 b.DATE
 sum(a.MEASURE) AS MEASURE_TO_END_DATE
 from table_a a, table_b b
 where a.COMPANY = b.COMPANY and
       a.DATE < b.DATE and
       a.DATE > b.DATE - 30  
 group by b.COMPANY;

Thanks for any help

谢谢你的帮助

采纳答案by Karl D.

Well, I can think of a few ways. (1) essentially blow up the dataframe by merging on companyand then filter on the 30 day windows after the merge. This should be fast but could use lots of memory. (2) Move the merging and filtering on the 30 day window into a groupby. This results in a merge for each group so it would be slower but it should use less memory

嗯,我能想到几个方法。(1) 通过合并company然后在合并后的 30 天窗口上过滤来基本上炸毁数据帧。这应该很快,但可能会使用大量内存。(2) 将30天窗口的合并和过滤移到groupby中。这会导致每个组的合并,因此它会更慢,但它应该使用更少的内存

Option #1

选项1

Suppose your data looks like the following (I expanded your sample data):

假设您的数据如下所示(我扩展了您的示例数据):

print df

    company       date  measure
0         0 2010-01-01       10
1         0 2010-01-15       10
2         0 2010-02-01       10
3         0 2010-02-15       10
4         0 2010-03-01       10
5         0 2010-03-15       10
6         0 2010-04-01       10
7         1 2010-03-01        5
8         1 2010-03-15        5
9         1 2010-04-01        5
10        1 2010-04-15        5
11        1 2010-05-01        5
12        1 2010-05-15        5

print windows

   company   end_date
0        0 2010-02-01
1        0 2010-03-15
2        1 2010-04-01
3        1 2010-05-15

Create a beginning date for the 30 day windows:

为 30 天的窗口创建一个开始日期:

windows['beg_date'] = (windows['end_date'].values.astype('datetime64[D]') -
                       np.timedelta64(30,'D'))
print windows

   company   end_date   beg_date
0        0 2010-02-01 2010-01-02
1        0 2010-03-15 2010-02-13
2        1 2010-04-01 2010-03-02
3        1 2010-05-15 2010-04-15

Now do a merge and then select based on if datefalls within beg_dateand end_date:

现在做一个合并,然后选择基础上,如果date落入beg_dateend_date

df = df.merge(windows,on='company',how='left')
df = df[(df.date >= df.beg_date) & (df.date <= df.end_date)]
print df

    company       date  measure   end_date   beg_date
2         0 2010-01-15       10 2010-02-01 2010-01-02
4         0 2010-02-01       10 2010-02-01 2010-01-02
7         0 2010-02-15       10 2010-03-15 2010-02-13
9         0 2010-03-01       10 2010-03-15 2010-02-13
11        0 2010-03-15       10 2010-03-15 2010-02-13
16        1 2010-03-15        5 2010-04-01 2010-03-02
18        1 2010-04-01        5 2010-04-01 2010-03-02
21        1 2010-04-15        5 2010-05-15 2010-04-15
23        1 2010-05-01        5 2010-05-15 2010-04-15
25        1 2010-05-15        5 2010-05-15 2010-04-15

You can compute the 30 day window sums by grouping on companyand end_date:

您可以通过对company和分组来计算 30 天窗口总和end_date

print df.groupby(['company','end_date']).sum()

                    measure
company end_date           
0       2010-02-01       20
        2010-03-15       30
1       2010-04-01       10
        2010-05-15       15

Option #2Move all merging into a groupby. This should be better on memory but I would think much slower:

选项 #2将所有合并移动到 groupby 中。这应该在内存上更好,但我认为慢得多:

windows['beg_date'] = (windows['end_date'].values.astype('datetime64[D]') -
                       np.timedelta64(30,'D'))

def cond_merge(g,windows):
    g = g.merge(windows,on='company',how='left')
    g = g[(g.date >= g.beg_date) & (g.date <= g.end_date)]
    return g.groupby('end_date')['measure'].sum()

print df.groupby('company').apply(cond_merge,windows)

company  end_date  
0        2010-02-01    20
         2010-03-15    30
1        2010-04-01    10
         2010-05-15    15

Another optionNow if your windows never overlap (like in the example data), you could do something like the following as an alternative that doesn't blow up a dataframe but is pretty fast:

另一种选择现在,如果您的窗口从不重叠(如示例数据中所示),您可以执行以下操作作为替代方法,它不会炸毁数据帧但速度非常快:

windows['date'] = windows['end_date']

df = df.merge(windows,on=['company','date'],how='outer')
print df

    company       date  measure   end_date
0         0 2010-01-01       10        NaT
1         0 2010-01-15       10        NaT
2         0 2010-02-01       10 2010-02-01
3         0 2010-02-15       10        NaT
4         0 2010-03-01       10        NaT
5         0 2010-03-15       10 2010-03-15
6         0 2010-04-01       10        NaT
7         1 2010-03-01        5        NaT
8         1 2010-03-15        5        NaT
9         1 2010-04-01        5 2010-04-01
10        1 2010-04-15        5        NaT
11        1 2010-05-01        5        NaT
12        1 2010-05-15        5 2010-05-15

This merge essentially inserts your window end dates into the dataframe and then backfilling the end dates (by group) will give you a structure to easily create you summation windows:

这种合并本质上是将您的窗口结束日期插入到数据框中,然后回填结束日期(按组)将为您提供一个结构来轻松创建您的求和窗口:

df['end_date'] = df.groupby('company')['end_date'].apply(lambda x: x.bfill())

print df

    company       date  measure   end_date
0         0 2010-01-01       10 2010-02-01
1         0 2010-01-15       10 2010-02-01
2         0 2010-02-01       10 2010-02-01
3         0 2010-02-15       10 2010-03-15
4         0 2010-03-01       10 2010-03-15
5         0 2010-03-15       10 2010-03-15
6         0 2010-04-01       10        NaT
7         1 2010-03-01        5 2010-04-01
8         1 2010-03-15        5 2010-04-01
9         1 2010-04-01        5 2010-04-01
10        1 2010-04-15        5 2010-05-15
11        1 2010-05-01        5 2010-05-15
12        1 2010-05-15        5 2010-05-15

df = df[df.end_date.notnull()]
df['beg_date'] = (df['end_date'].values.astype('datetime64[D]') -
                   np.timedelta64(30,'D'))

print df

   company       date  measure   end_date   beg_date
0         0 2010-01-01       10 2010-02-01 2010-01-02
1         0 2010-01-15       10 2010-02-01 2010-01-02
2         0 2010-02-01       10 2010-02-01 2010-01-02
3         0 2010-02-15       10 2010-03-15 2010-02-13
4         0 2010-03-01       10 2010-03-15 2010-02-13
5         0 2010-03-15       10 2010-03-15 2010-02-13
7         1 2010-03-01        5 2010-04-01 2010-03-02
8         1 2010-03-15        5 2010-04-01 2010-03-02
9         1 2010-04-01        5 2010-04-01 2010-03-02
10        1 2010-04-15        5 2010-05-15 2010-04-15
11        1 2010-05-01        5 2010-05-15 2010-04-15
12        1 2010-05-15        5 2010-05-15 2010-04-15

df = df[(df.date >= df.beg_date) & (df.date <= df.end_date)]
print df.groupby(['company','end_date']).sum()

                    measure
company end_date           
0       2010-02-01       20
        2010-03-15       30
1       2010-04-01       10
        2010-05-15       15

Another alternative is to resample your first dataframe to daily data and then compute rolling_sums with a 30 day window; and select the dates at the end that you are interested in. This could be quite memory intensive too.

另一种选择是将您的第一个数据帧重新采样为每日数据,然后使用 30 天的窗口计算滚动总和;并在最后选择您感兴趣的日期。这也可能会占用大量内存。

回答by ANKIT MORAL

There is a very easy, and practical (or maybe the only direct way) to do conditional join in pandas. Since there is no direct way to do conditional join in pandas, you will need an additional library, and that is, pandasql

有一种非常简单且实用(或者可能是唯一直接的方法)在 Pandas 中进行条件连接。由于没有直接的方法在 Pandas 中进行条件连接,您将需要一个额外的库,也就是说,pandasql

Install the library pandasqlfrom pip using the command pip install pandasql. This library allows you to manipulate the pandas dataframes using the SQL queries.

pandasql使用命令从 pip安装库pip install pandasql。该库允许您使用 SQL 查询操作 Pandas 数据帧。

import pandas as pd
from pandasql import sqldf

df = pd.read_excel(r'play_data.xlsx')
df

    id    Name  Amount
0   A001    A   100
1   A002    B   110
2   A003    C   120
3   A005    D   150

Now let's just do a conditional join to compare the Amount of the IDs

现在让我们做一个条件连接来比较 ID 的数量

# Make your pysqldf object:
pysqldf = lambda q: sqldf(q, globals())

# Write your query in SQL syntax, here you can use df as a normal SQL table
cond_join= '''
    select 
        df_left.*,
        df_right.*
    from df as df_left
    join df as df_right
    on
        df_left.[Amount] > (df_right.[Amount]+10)

'''

# Now, get your queries results as dataframe using the sqldf object that you created
pysqldf(cond_join)

    id  Name    Amount  id    Name  Amount
0   A003    C   120    A001   A   100
1   A005    D   150    A001   A   100
2   A005    D   150    A002   B   110
3   A005    D   150    A003   C   120