postgresql dplyr left_join 小于、大于条件

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/37289405/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-10-21 02:17:17  来源:igfitidea点击:

dplyr left_join by less than, greater than condition

sqlrpostgresqlleft-joindplyr

提问by rajvijay

This question is somewhat related to issues Efficiently merging two data frames on a non-trivial criteriaand Checking if date is between two dates in r. And the one I have posted here requesting if the feature exist: GitHub issue

这个问题与Efficiently merging two data frames on a non-trivial criteriaChecking if date is between two date in r 的问题有些相关。我在这里发布的一个请求是否存在该功能: GitHub 问题

I am looking to join two dataframes using dplyr::left_join(). The condition I use to join is less-than, greater-than i.e, <=and >. Does dplyr::left_join()support this feature? or do the keys only take =operator between them. This is straightforward to run from SQL (assuming I have the dataframe in the database)

我希望使用dplyr::left_join(). 我用来连接的条件是小于、大于 ie<=>。是否dplyr::left_join()支持此功能?或者做钥匙只需要=他们之间的运营商。这很容易从 SQL 运行(假设我在数据库中有数据框)

Here is a MWE: I have two datasets one firm-year (fdata), while second is sort of survey data that happens once every five years. So for all years in the fdatathat are in between two survey years, I join the corresponding survey year data.

这是一个 MWE:我有两个数据集,一个是公司年 ( fdata),第二个是每五年发生一次的调查数据。因此,对于fdata两个调查年度之间的所有年份,我加入了相应的调查年度数据。

id <- c(1,1,1,1,
        2,2,2,2,2,2,
        3,3,3,3,3,3,
        5,5,5,5,
        8,8,8,8,
        13,13,13)

fyear <- c(1998,1999,2000,2001,1998,1999,2000,2001,2002,2003,
       1998,1999,2000,2001,2002,2003,1998,1999,2000,2001,
       1998,1999,2000,2001,1998,1999,2000)

byear <- c(1990,1995,2000,2005)
eyear <- c(1995,2000,2005,2010)
val <- c(3,1,5,6)

sdata <- tbl_df(data.frame(byear, eyear, val))

fdata <- tbl_df(data.frame(id, fyear))

test1 <- left_join(fdata, sdata, by = c("fyear" >= "byear","fyear" < "eyear"))

I get

我得到

Error: cannot join on columns 'TRUE' x 'TRUE': index out of bounds 
Error: cannot join on columns 'TRUE' x 'TRUE': index out of bounds 

Unless if left_joincan handle the condition, but my syntax is missing something?

除非如果left_join可以处理条件,但我的语法缺少什么?

采纳答案by Ian Gow

Use a filter. (But note that this answer does notproduce a correct LEFT JOIN; but the MWE gives the right result with an INNER JOINinstead.)

使用一个filter. (但请注意,这个答案并不能产生一个正确的LEFT JOIN;但MWE给出了一个正确的结果INNER JOIN,相反)

The dplyrpackage isn't happy if asked merge two tables without something to merge on, so in the following, I make a dummy variable in both tables for this purpose, then filter, then drop dummy:

dplyr若问合并两个表没有的东西合并,所以在下面,我把两个表为此,在一个虚拟变量包不开心,然后过滤,再滴dummy

fdata %>% 
    mutate(dummy=TRUE) %>%
    left_join(sdata %>% mutate(dummy=TRUE)) %>%
    filter(fyear >= byear, fyear < eyear) %>%
    select(-dummy)

And note that if you do this in PostgreSQL (for example), the query optimizer sees through the dummyvariable as evidenced by the following two query explanations:

请注意,如果您在 PostgreSQL 中执行此操作(例如),查询优化器会看穿dummy变量,如以下两个查询解释所证明的那样:

> fdata %>% 
+     mutate(dummy=TRUE) %>%
+     left_join(sdata %>% mutate(dummy=TRUE)) %>%
+     filter(fyear >= byear, fyear < eyear) %>%
+     select(-dummy) %>%
+     explain()
Joining by: "dummy"
<SQL>
SELECT "id" AS "id", "fyear" AS "fyear", "byear" AS "byear", "eyear" AS "eyear", "val" AS "val"
FROM (SELECT * FROM (SELECT "id", "fyear", TRUE AS "dummy"
FROM "fdata") AS "zzz136"

LEFT JOIN 

(SELECT "byear", "eyear", "val", TRUE AS "dummy"
FROM "sdata") AS "zzz137"

USING ("dummy")) AS "zzz138"
WHERE "fyear" >= "byear" AND "fyear" < "eyear"


<PLAN>
Nested Loop  (cost=0.00..50886.88 rows=322722 width=40)
  Join Filter: ((fdata.fyear >= sdata.byear) AND (fdata.fyear < sdata.eyear))
  ->  Seq Scan on fdata  (cost=0.00..28.50 rows=1850 width=16)
  ->  Materialize  (cost=0.00..33.55 rows=1570 width=24)
        ->  Seq Scan on sdata  (cost=0.00..25.70 rows=1570 width=24)

and doing it more cleanly with SQL gives exactlythe same result:

并使用 SQL 更干净地执行它会给出完全相同的结果:

> tbl(pg, sql("
+     SELECT *
+     FROM fdata 
+     LEFT JOIN sdata 
+     ON fyear >= byear AND fyear < eyear")) %>%
+     explain()
<SQL>
SELECT "id", "fyear", "byear", "eyear", "val"
FROM (
    SELECT *
    FROM fdata 
    LEFT JOIN sdata 
    ON fyear >= byear AND fyear < eyear) AS "zzz140"


<PLAN>
Nested Loop Left Join  (cost=0.00..50886.88 rows=322722 width=40)
  Join Filter: ((fdata.fyear >= sdata.byear) AND (fdata.fyear < sdata.eyear))
  ->  Seq Scan on fdata  (cost=0.00..28.50 rows=1850 width=16)
  ->  Materialize  (cost=0.00..33.55 rows=1570 width=24)
        ->  Seq Scan on sdata  (cost=0.00..25.70 rows=1570 width=24)

回答by eddi

data.tableadds non-equi joins starting from v 1.9.8

data.table从 v 1.9.8 开始添加非对等连接

library(data.table) #v>=1.9.8
setDT(sdata); setDT(fdata) # converting to data.table in place

fdata[sdata, on = .(fyear >= byear, fyear < eyear), nomatch = 0,
      .(id, x.fyear, byear, eyear, val)]
#    id x.fyear byear eyear val
# 1:  1    1998  1995  2000   1
# 2:  2    1998  1995  2000   1
# 3:  3    1998  1995  2000   1
# 4:  5    1998  1995  2000   1
# 5:  8    1998  1995  2000   1
# 6: 13    1998  1995  2000   1
# 7:  1    1999  1995  2000   1
# 8:  2    1999  1995  2000   1
# 9:  3    1999  1995  2000   1
#10:  5    1999  1995  2000   1
#11:  8    1999  1995  2000   1
#12: 13    1999  1995  2000   1
#13:  1    2000  2000  2005   5
#14:  2    2000  2000  2005   5
#15:  3    2000  2000  2005   5
#16:  5    2000  2000  2005   5
#17:  8    2000  2000  2005   5
#18: 13    2000  2000  2005   5
#19:  1    2001  2000  2005   5
#20:  2    2001  2000  2005   5
#21:  3    2001  2000  2005   5
#22:  5    2001  2000  2005   5
#23:  8    2001  2000  2005   5
#24:  2    2002  2000  2005   5
#25:  3    2002  2000  2005   5
#26:  2    2003  2000  2005   5
#27:  3    2003  2000  2005   5
#    id x.fyear byear eyear val

You can also get this to work with foverlapsin 1.9.6 with a little more effort.

您也可以foverlaps在 1.9.6 中稍加努力地使用它。

回答by aosmith

This looks like it is the sort of task that package fuzzyjoinaddresses. The various functions of the package look and work similar to the dplyrjoin functions.

这看起来是那种封装了fuzzyjoin地址的任务。包的各种功能的外观和工作方式类似于dplyr连接功能。

In this case one of the fuzzy_*_joinfunctions will work for you. The main difference between dplyr::left_joinand fuzzyjoin::fuzzy_left_joinis that you give a list of functions to use in the matching process with the match.funargument. Note the byargument still is written the same as it would in left_join.

在这种情况下,其中一项fuzzy_*_join功能对您有用。dplyr::left_join和之间的主要区别在于fuzzyjoin::fuzzy_left_join,您给出了在与match.fun参数匹配的过程中使用的函数列表。请注意,该by参数的编写方式仍然与left_join.

Below is an example. The functions I used to match on are >=and <for the fyearto byearand the fyearto eyearcomparisons, respectively. The

下面是一个例子。我用来匹配的函数分别是>=<用于fyeartobyearfyeartoeyear比较。这

library(fuzzyjoin)

fuzzy_left_join(fdata, sdata, 
             by = c("fyear" = "byear", "fyear" = "eyear"), 
             match_fun = list(`>=`, `<`))

Source: local data frame [27 x 5]

      id fyear byear eyear   val
   (dbl) (dbl) (dbl) (dbl) (dbl)
1      1  1998  1995  2000     1
2      1  1999  1995  2000     1
3      1  2000  2000  2005     5
4      1  2001  2000  2005     5
5      2  1998  1995  2000     1
6      2  1999  1995  2000     1
7      2  2000  2000  2005     5
8      2  2001  2000  2005     5
9      2  2002  2000  2005     5
10     2  2003  2000  2005     5
..   ...   ...   ...   ...   ...

回答by alistaire

One option is to join row-wise as a list column, and then unnest the column:

一种选择是按行加入列表列,然后取消嵌套列:

# evaluate each row individually
fdata %>% rowwise() %>% 
    # insert list column of single row of sdata based on conditions
    mutate(s = list(sdata %>% filter(fyear >= byear, fyear < eyear))) %>% 
    # unnest list column
    tidyr::unnest()

# Source: local data frame [27 x 5]
# 
#       id fyear byear eyear   val
#    (dbl) (dbl) (dbl) (dbl) (dbl)
# 1      1  1998  1995  2000     1
# 2      1  1999  1995  2000     1
# 3      1  2000  2000  2005     5
# 4      1  2001  2000  2005     5
# 5      2  1998  1995  2000     1
# 6      2  1999  1995  2000     1
# 7      2  2000  2000  2005     5
# 8      2  2001  2000  2005     5
# 9      2  2002  2000  2005     5
# 10     2  2003  2000  2005     5
# ..   ...   ...   ...   ...   ...