postgresql dplyr left_join 小于、大于条件

Question

提问by rajvijay

This question is somewhat related to issues Efficiently merging two data frames on a non-trivial criteriaand Checking if date is between two dates in r. And the one I have posted here requesting if the feature exist: GitHub issue

这个问题与Efficiently merging two data frames on a non-trivial criteria和Checking if date is between two date in r 的问题有些相关。我在这里发布的一个请求是否存在该功能： GitHub 问题

I am looking to join two dataframes using dplyr::left_join(). The condition I use to join is less-than, greater-than i.e, <=and >. Does dplyr::left_join()support this feature? or do the keys only take =operator between them. This is straightforward to run from SQL (assuming I have the dataframe in the database)

我希望使用dplyr::left_join(). 我用来连接的条件是小于、大于 ie<=和>。是否dplyr::left_join()支持此功能？或者做钥匙只需要=他们之间的运营商。这很容易从 SQL 运行（假设我在数据库中有数据框）

Here is a MWE: I have two datasets one firm-year (fdata), while second is sort of survey data that happens once every five years. So for all years in the fdatathat are in between two survey years, I join the corresponding survey year data.

这是一个 MWE：我有两个数据集，一个是公司年 ( fdata)，第二个是每五年发生一次的调查数据。因此，对于fdata两个调查年度之间的所有年份，我加入了相应的调查年度数据。

id <- c(1,1,1,1,
        2,2,2,2,2,2,
        3,3,3,3,3,3,
        5,5,5,5,
        8,8,8,8,
        13,13,13)

fyear <- c(1998,1999,2000,2001,1998,1999,2000,2001,2002,2003,
       1998,1999,2000,2001,2002,2003,1998,1999,2000,2001,
       1998,1999,2000,2001,1998,1999,2000)

byear <- c(1990,1995,2000,2005)
eyear <- c(1995,2000,2005,2010)
val <- c(3,1,5,6)

sdata <- tbl_df(data.frame(byear, eyear, val))

fdata <- tbl_df(data.frame(id, fyear))

test1 <- left_join(fdata, sdata, by = c("fyear" >= "byear","fyear" < "eyear"))

I get

我得到

Error: cannot join on columns 'TRUE' x 'TRUE': index out of bounds

Error: cannot join on columns 'TRUE' x 'TRUE': index out of bounds

Unless if left_joincan handle the condition, but my syntax is missing something?

除非如果left_join可以处理条件，但我的语法缺少什么？

Answer 1

采纳答案by Ian Gow

Use a filter. (But note that this answer does notproduce a correct LEFT JOIN; but the MWE gives the right result with an INNER JOINinstead.)

使用一个filter. （但请注意，这个答案并不能产生一个正确的LEFT JOIN;但MWE给出了一个正确的结果INNER JOIN，相反）

The dplyrpackage isn't happy if asked merge two tables without something to merge on, so in the following, I make a dummy variable in both tables for this purpose, then filter, then drop dummy:

在dplyr若问合并两个表没有的东西合并，所以在下面，我把两个表为此，在一个虚拟变量包不开心，然后过滤，再滴dummy：

fdata %>% 
    mutate(dummy=TRUE) %>%
    left_join(sdata %>% mutate(dummy=TRUE)) %>%
    filter(fyear >= byear, fyear < eyear) %>%
    select(-dummy)

And note that if you do this in PostgreSQL (for example), the query optimizer sees through the dummyvariable as evidenced by the following two query explanations:

请注意，如果您在 PostgreSQL 中执行此操作（例如），查询优化器会看穿dummy变量，如以下两个查询解释所证明的那样：

> fdata %>% 
+     mutate(dummy=TRUE) %>%
+     left_join(sdata %>% mutate(dummy=TRUE)) %>%
+     filter(fyear >= byear, fyear < eyear) %>%
+     select(-dummy) %>%
+     explain()
Joining by: "dummy"
<SQL>
SELECT "id" AS "id", "fyear" AS "fyear", "byear" AS "byear", "eyear" AS "eyear", "val" AS "val"
FROM (SELECT * FROM (SELECT "id", "fyear", TRUE AS "dummy"
FROM "fdata") AS "zzz136"

LEFT JOIN 

(SELECT "byear", "eyear", "val", TRUE AS "dummy"
FROM "sdata") AS "zzz137"

USING ("dummy")) AS "zzz138"
WHERE "fyear" >= "byear" AND "fyear" < "eyear"


<PLAN>
Nested Loop  (cost=0.00..50886.88 rows=322722 width=40)
  Join Filter: ((fdata.fyear >= sdata.byear) AND (fdata.fyear < sdata.eyear))
  ->  Seq Scan on fdata  (cost=0.00..28.50 rows=1850 width=16)
  ->  Materialize  (cost=0.00..33.55 rows=1570 width=24)
        ->  Seq Scan on sdata  (cost=0.00..25.70 rows=1570 width=24)

and doing it more cleanly with SQL gives exactlythe same result:

并使用 SQL 更干净地执行它会给出完全相同的结果：

> tbl(pg, sql("
+     SELECT *
+     FROM fdata 
+     LEFT JOIN sdata 
+     ON fyear >= byear AND fyear < eyear")) %>%
+     explain()
<SQL>
SELECT "id", "fyear", "byear", "eyear", "val"
FROM (
    SELECT *
    FROM fdata 
    LEFT JOIN sdata 
    ON fyear >= byear AND fyear < eyear) AS "zzz140"


<PLAN>
Nested Loop Left Join  (cost=0.00..50886.88 rows=322722 width=40)
  Join Filter: ((fdata.fyear >= sdata.byear) AND (fdata.fyear < sdata.eyear))
  ->  Seq Scan on fdata  (cost=0.00..28.50 rows=1850 width=16)
  ->  Materialize  (cost=0.00..33.55 rows=1570 width=24)
        ->  Seq Scan on sdata  (cost=0.00..25.70 rows=1570 width=24)

Answer 2

回答by eddi

data.tableadds non-equi joins starting from v 1.9.8

data.table从 v 1.9.8 开始添加非对等连接

library(data.table) #v>=1.9.8
setDT(sdata); setDT(fdata) # converting to data.table in place

fdata[sdata, on = .(fyear >= byear, fyear < eyear), nomatch = 0,
      .(id, x.fyear, byear, eyear, val)]
#    id x.fyear byear eyear val
# 1:  1    1998  1995  2000   1
# 2:  2    1998  1995  2000   1
# 3:  3    1998  1995  2000   1
# 4:  5    1998  1995  2000   1
# 5:  8    1998  1995  2000   1
# 6: 13    1998  1995  2000   1
# 7:  1    1999  1995  2000   1
# 8:  2    1999  1995  2000   1
# 9:  3    1999  1995  2000   1
#10:  5    1999  1995  2000   1
#11:  8    1999  1995  2000   1
#12: 13    1999  1995  2000   1
#13:  1    2000  2000  2005   5
#14:  2    2000  2000  2005   5
#15:  3    2000  2000  2005   5
#16:  5    2000  2000  2005   5
#17:  8    2000  2000  2005   5
#18: 13    2000  2000  2005   5
#19:  1    2001  2000  2005   5
#20:  2    2001  2000  2005   5
#21:  3    2001  2000  2005   5
#22:  5    2001  2000  2005   5
#23:  8    2001  2000  2005   5
#24:  2    2002  2000  2005   5
#25:  3    2002  2000  2005   5
#26:  2    2003  2000  2005   5
#27:  3    2003  2000  2005   5
#    id x.fyear byear eyear val

You can also get this to work with foverlapsin 1.9.6 with a little more effort.

您也可以foverlaps在 1.9.6 中稍加努力地使用它。

Answer 3

回答by aosmith

This looks like it is the sort of task that package fuzzyjoinaddresses. The various functions of the package look and work similar to the dplyrjoin functions.

这看起来是那种封装了fuzzyjoin地址的任务。包的各种功能的外观和工作方式类似于dplyr连接功能。

In this case one of the fuzzy_*_joinfunctions will work for you. The main difference between dplyr::left_joinand fuzzyjoin::fuzzy_left_joinis that you give a list of functions to use in the matching process with the match.funargument. Note the byargument still is written the same as it would in left_join.

在这种情况下，其中一项fuzzy_*_join功能对您有用。dplyr::left_join和之间的主要区别在于fuzzyjoin::fuzzy_left_join，您给出了在与match.fun参数匹配的过程中使用的函数列表。请注意，该by参数的编写方式仍然与left_join.

Below is an example. The functions I used to match on are >=and <for the fyearto byearand the fyearto eyearcomparisons, respectively. The

下面是一个例子。我用来匹配的函数分别是>=和<用于fyeartobyear和fyeartoeyear比较。这

library(fuzzyjoin)

fuzzy_left_join(fdata, sdata, 
             by = c("fyear" = "byear", "fyear" = "eyear"), 
             match_fun = list(`>=`, `<`))

Source: local data frame [27 x 5]

      id fyear byear eyear   val
   (dbl) (dbl) (dbl) (dbl) (dbl)
1      1  1998  1995  2000     1
2      1  1999  1995  2000     1
3      1  2000  2000  2005     5
4      1  2001  2000  2005     5
5      2  1998  1995  2000     1
6      2  1999  1995  2000     1
7      2  2000  2000  2005     5
8      2  2001  2000  2005     5
9      2  2002  2000  2005     5
10     2  2003  2000  2005     5
..   ...   ...   ...   ...   ...

Answer 4

回答by alistaire

One option is to join row-wise as a list column, and then unnest the column:

一种选择是按行加入列表列，然后取消嵌套列：

# evaluate each row individually
fdata %>% rowwise() %>% 
    # insert list column of single row of sdata based on conditions
    mutate(s = list(sdata %>% filter(fyear >= byear, fyear < eyear))) %>% 
    # unnest list column
    tidyr::unnest()

# Source: local data frame [27 x 5]
# 
#       id fyear byear eyear   val
#    (dbl) (dbl) (dbl) (dbl) (dbl)
# 1      1  1998  1995  2000     1
# 2      1  1999  1995  2000     1
# 3      1  2000  2000  2005     5
# 4      1  2001  2000  2005     5
# 5      2  1998  1995  2000     1
# 6      2  1999  1995  2000     1
# 7      2  2000  2000  2005     5
# 8      2  2001  2000  2005     5
# 9      2  2002  2000  2005     5
# 10     2  2003  2000  2005     5
# ..   ...   ...   ...   ...   ...

postgresql dplyr left_join 小于、大于条件

提问by rajvijay

采纳答案by Ian Gow

回答by eddi

回答by aosmith

回答by alistaire

相关推荐

最近更新

标签

postgresql dplyr left_join 小于、大于条件

提问by rajvijay

采纳答案by Ian Gow

回答by eddi

回答by aosmith

回答by alistaire

相关推荐

asp.net mvc 和 Entity Framework 6 中的 PostgreSQL 数据库连接

PostgreSQL - 或附近的语法错误

postgresql 如何使用 psql \copy 元命令忽略错误

postgresql 与 OR 时的情况

相关推荐

最近更新

标签