postgresql dplyr left_join 小于、大于条件
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/37289405/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
dplyr left_join by less than, greater than condition
提问by rajvijay
This question is somewhat related to issues Efficiently merging two data frames on a non-trivial criteriaand Checking if date is between two dates in r. And the one I have posted here requesting if the feature exist: GitHub issue
这个问题与Efficiently merging two data frames on a non-trivial criteria和Checking if date is between two date in r 的问题有些相关。我在这里发布的一个请求是否存在该功能: GitHub 问题
I am looking to join two dataframes using dplyr::left_join()
. The condition I use to join is less-than, greater-than i.e, <=
and >
. Does dplyr::left_join()
support this feature? or do the keys only take =
operator between them. This is straightforward to run from SQL (assuming I have the dataframe in the database)
我希望使用dplyr::left_join()
. 我用来连接的条件是小于、大于 ie<=
和>
。是否dplyr::left_join()
支持此功能?或者做钥匙只需要=
他们之间的运营商。这很容易从 SQL 运行(假设我在数据库中有数据框)
Here is a MWE: I have two datasets one firm-year (fdata
), while second is sort of survey data that happens once every five years. So for all years in the fdata
that are in between two survey years, I join the corresponding survey year data.
这是一个 MWE:我有两个数据集,一个是公司年 ( fdata
),第二个是每五年发生一次的调查数据。因此,对于fdata
两个调查年度之间的所有年份,我加入了相应的调查年度数据。
id <- c(1,1,1,1,
2,2,2,2,2,2,
3,3,3,3,3,3,
5,5,5,5,
8,8,8,8,
13,13,13)
fyear <- c(1998,1999,2000,2001,1998,1999,2000,2001,2002,2003,
1998,1999,2000,2001,2002,2003,1998,1999,2000,2001,
1998,1999,2000,2001,1998,1999,2000)
byear <- c(1990,1995,2000,2005)
eyear <- c(1995,2000,2005,2010)
val <- c(3,1,5,6)
sdata <- tbl_df(data.frame(byear, eyear, val))
fdata <- tbl_df(data.frame(id, fyear))
test1 <- left_join(fdata, sdata, by = c("fyear" >= "byear","fyear" < "eyear"))
I get
我得到
Error: cannot join on columns 'TRUE' x 'TRUE': index out of bounds
Error: cannot join on columns 'TRUE' x 'TRUE': index out of bounds
Unless if left_join
can handle the condition, but my syntax is missing something?
除非如果left_join
可以处理条件,但我的语法缺少什么?
采纳答案by Ian Gow
Use a filter
. (But note that this answer does notproduce a correct LEFT JOIN
; but the MWE gives the right result with an INNER JOIN
instead.)
使用一个filter
. (但请注意,这个答案并不能产生一个正确的LEFT JOIN
;但MWE给出了一个正确的结果INNER JOIN
,相反)
The dplyr
package isn't happy if asked merge two tables without something to merge on, so in the following, I make a dummy variable in both tables for this purpose, then filter, then drop dummy
:
在dplyr
若问合并两个表没有的东西合并,所以在下面,我把两个表为此,在一个虚拟变量包不开心,然后过滤,再滴dummy
:
fdata %>%
mutate(dummy=TRUE) %>%
left_join(sdata %>% mutate(dummy=TRUE)) %>%
filter(fyear >= byear, fyear < eyear) %>%
select(-dummy)
And note that if you do this in PostgreSQL (for example), the query optimizer sees through the dummy
variable as evidenced by the following two query explanations:
请注意,如果您在 PostgreSQL 中执行此操作(例如),查询优化器会看穿dummy
变量,如以下两个查询解释所证明的那样:
> fdata %>%
+ mutate(dummy=TRUE) %>%
+ left_join(sdata %>% mutate(dummy=TRUE)) %>%
+ filter(fyear >= byear, fyear < eyear) %>%
+ select(-dummy) %>%
+ explain()
Joining by: "dummy"
<SQL>
SELECT "id" AS "id", "fyear" AS "fyear", "byear" AS "byear", "eyear" AS "eyear", "val" AS "val"
FROM (SELECT * FROM (SELECT "id", "fyear", TRUE AS "dummy"
FROM "fdata") AS "zzz136"
LEFT JOIN
(SELECT "byear", "eyear", "val", TRUE AS "dummy"
FROM "sdata") AS "zzz137"
USING ("dummy")) AS "zzz138"
WHERE "fyear" >= "byear" AND "fyear" < "eyear"
<PLAN>
Nested Loop (cost=0.00..50886.88 rows=322722 width=40)
Join Filter: ((fdata.fyear >= sdata.byear) AND (fdata.fyear < sdata.eyear))
-> Seq Scan on fdata (cost=0.00..28.50 rows=1850 width=16)
-> Materialize (cost=0.00..33.55 rows=1570 width=24)
-> Seq Scan on sdata (cost=0.00..25.70 rows=1570 width=24)
and doing it more cleanly with SQL gives exactlythe same result:
并使用 SQL 更干净地执行它会给出完全相同的结果:
> tbl(pg, sql("
+ SELECT *
+ FROM fdata
+ LEFT JOIN sdata
+ ON fyear >= byear AND fyear < eyear")) %>%
+ explain()
<SQL>
SELECT "id", "fyear", "byear", "eyear", "val"
FROM (
SELECT *
FROM fdata
LEFT JOIN sdata
ON fyear >= byear AND fyear < eyear) AS "zzz140"
<PLAN>
Nested Loop Left Join (cost=0.00..50886.88 rows=322722 width=40)
Join Filter: ((fdata.fyear >= sdata.byear) AND (fdata.fyear < sdata.eyear))
-> Seq Scan on fdata (cost=0.00..28.50 rows=1850 width=16)
-> Materialize (cost=0.00..33.55 rows=1570 width=24)
-> Seq Scan on sdata (cost=0.00..25.70 rows=1570 width=24)
回答by eddi
data.table
adds non-equi joins starting from v 1.9.8
data.table
从 v 1.9.8 开始添加非对等连接
library(data.table) #v>=1.9.8
setDT(sdata); setDT(fdata) # converting to data.table in place
fdata[sdata, on = .(fyear >= byear, fyear < eyear), nomatch = 0,
.(id, x.fyear, byear, eyear, val)]
# id x.fyear byear eyear val
# 1: 1 1998 1995 2000 1
# 2: 2 1998 1995 2000 1
# 3: 3 1998 1995 2000 1
# 4: 5 1998 1995 2000 1
# 5: 8 1998 1995 2000 1
# 6: 13 1998 1995 2000 1
# 7: 1 1999 1995 2000 1
# 8: 2 1999 1995 2000 1
# 9: 3 1999 1995 2000 1
#10: 5 1999 1995 2000 1
#11: 8 1999 1995 2000 1
#12: 13 1999 1995 2000 1
#13: 1 2000 2000 2005 5
#14: 2 2000 2000 2005 5
#15: 3 2000 2000 2005 5
#16: 5 2000 2000 2005 5
#17: 8 2000 2000 2005 5
#18: 13 2000 2000 2005 5
#19: 1 2001 2000 2005 5
#20: 2 2001 2000 2005 5
#21: 3 2001 2000 2005 5
#22: 5 2001 2000 2005 5
#23: 8 2001 2000 2005 5
#24: 2 2002 2000 2005 5
#25: 3 2002 2000 2005 5
#26: 2 2003 2000 2005 5
#27: 3 2003 2000 2005 5
# id x.fyear byear eyear val
You can also get this to work with foverlaps
in 1.9.6 with a little more effort.
您也可以foverlaps
在 1.9.6 中稍加努力地使用它。
回答by aosmith
This looks like it is the sort of task that package fuzzyjoinaddresses. The various functions of the package look and work similar to the dplyrjoin functions.
这看起来是那种封装了fuzzyjoin地址的任务。包的各种功能的外观和工作方式类似于dplyr连接功能。
In this case one of the fuzzy_*_join
functions will work for you. The main difference between dplyr::left_join
and fuzzyjoin::fuzzy_left_join
is that you give a list of functions to use in the matching process with the match.fun
argument. Note the by
argument still is written the same as it would in left_join
.
在这种情况下,其中一项fuzzy_*_join
功能对您有用。dplyr::left_join
和之间的主要区别在于fuzzyjoin::fuzzy_left_join
,您给出了在与match.fun
参数匹配的过程中使用的函数列表。请注意,该by
参数的编写方式仍然与left_join
.
Below is an example. The functions I used to match on are >=
and <
for the fyear
to byear
and the fyear
to eyear
comparisons, respectively. The
下面是一个例子。我用来匹配的函数分别是>=
和<
用于fyear
tobyear
和fyear
toeyear
比较。这
library(fuzzyjoin)
fuzzy_left_join(fdata, sdata,
by = c("fyear" = "byear", "fyear" = "eyear"),
match_fun = list(`>=`, `<`))
Source: local data frame [27 x 5]
id fyear byear eyear val
(dbl) (dbl) (dbl) (dbl) (dbl)
1 1 1998 1995 2000 1
2 1 1999 1995 2000 1
3 1 2000 2000 2005 5
4 1 2001 2000 2005 5
5 2 1998 1995 2000 1
6 2 1999 1995 2000 1
7 2 2000 2000 2005 5
8 2 2001 2000 2005 5
9 2 2002 2000 2005 5
10 2 2003 2000 2005 5
.. ... ... ... ... ...
回答by alistaire
One option is to join row-wise as a list column, and then unnest the column:
一种选择是按行加入列表列,然后取消嵌套列:
# evaluate each row individually
fdata %>% rowwise() %>%
# insert list column of single row of sdata based on conditions
mutate(s = list(sdata %>% filter(fyear >= byear, fyear < eyear))) %>%
# unnest list column
tidyr::unnest()
# Source: local data frame [27 x 5]
#
# id fyear byear eyear val
# (dbl) (dbl) (dbl) (dbl) (dbl)
# 1 1 1998 1995 2000 1
# 2 1 1999 1995 2000 1
# 3 1 2000 2000 2005 5
# 4 1 2001 2000 2005 5
# 5 2 1998 1995 2000 1
# 6 2 1999 1995 2000 1
# 7 2 2000 2000 2005 5
# 8 2 2001 2000 2005 5
# 9 2 2002 2000 2005 5
# 10 2 2003 2000 2005 5
# .. ... ... ... ... ...