postgresql 如何在正则表达式上连接表

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/8679079/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-10-20 23:30:19  来源:igfitidea点击:

How to join tables on regex

sqlregexpostgresqljoinpattern-matching

提问by z4y4ts

Say I have two tables msg for messages and mnc for mobile network codes. They share no relations. But I want to join them

假设我有两个表 msg 用于消息,而 mnc 用于移动网络代码。他们没有任何关系。但我想加入他们

SELECT msg.message,
    msg.src_addr,
    msg.dst_addr,
    mnc.name,
FROM "msg"
JOIN "mnc"
ON array_to_string(regexp_matches(msg.src_addr || '+' || msg.dst_addr, '38(...)'), '') = mnc.code

But query fails with error:

但查询失败并出现错误:

psql:marketing.sql:28: ERROR:  argument of JOIN/ON must not return a set
LINE 12: ON array_to_string(regexp_matches(msg.src_addr || '+' || msg...

Is there a way to do such join? Or am I moving wrong way?

有没有办法做这样的加入?还是我走错路了?

回答by Erwin Brandstetter

As @Milen already mentioned regexp_matches()is probably the wrong function for your purpose. You want a simple regular expression match (~). Actually, the LIKE operator (~~)will be faster:

正如@Milen 已经提到的那样,regexp_matches()对于您而言可能是错误的功能。您需要一个简单的正则表达式 match ( ~)。实际上,LIKE 运算符 ( ~~)更快

Presumably fastest with LIKE

大概用 LIKE 最快

SELECT msg.message
      ,msg.src_addr
      ,msg.dst_addr
      ,mnc.name
FROM   mnc
JOIN   msg ON msg.src_addr ~~ ('%38' || mnc.code || '%')
           OR msg.dst_addr ~~ ('%38' || mnc.code || '%')
WHERE  length(mnc.code) = 3

In addition, you only want mnc.codeof exactly 3 characters.

此外,您只需要mnc.code3 个字符。



With regexp

使用正则表达式

You couldwrite the same with regular expressions but it will most definitely be slower. Here is a working example close to your original:

可以使用正则表达式编写相同的内容,但它肯定会更慢。这是一个接近您原件的工作示例:

SELECT msg.message
      ,msg.src_addr
      ,msg.dst_addr
      ,mnc.name
FROM   mnc
JOIN   msg ON (msg.src_addr || '+' || msg.dst_addr) ~ (38 || mnc.code)
           AND length(mnc.code) = 3

This also requires msg.src_addrand msg.dst_addrto be NOT NULL.

这也需要msg.src_addr并且msg.dst_addrNOT NULL

The second query demonstrates how the additional check length(mnc.code) = 3can go into the JOINcondition or a WHEREclause. Same effect here.

第二个查询演示了附加检查如何length(mnc.code) = 3进入JOIN条件或WHERE子句。这里效果一样。



With regexp_matches()

使用 regexp_matches()

You couldmake this work with regexp_matches():

可以使用以下方法完成这项工作regexp_matches()

SELECT msg.message
      ,msg.src_addr
      ,msg.dst_addr
      ,mnc.name
FROM   mnc
JOIN   msg ON EXISTS (
    SELECT * 
    FROM   regexp_matches(msg.src_addr ||'+'|| msg.dst_addr, '38(...)', 'g') x(y)
    WHERE  y[1] = mnc.code
    )

But it will be slow in comparison - or so I assume.

但相比之下它会很慢 - 或者我假设。

Explanation:
Your regexp_matches() expression just returns an array of all captured substrings of the firstmatch. As you only capture one substring (one pair of brackets in your pattern), you will exclusively get arrays with one element.

说明:
您的 regexp_matches() 表达式仅返回第一个匹配项的所有捕获子字符串的数组。由于您只捕获一个子字符串(您的模式中的一对括号),您将只获得一个元素的数组

You get all matcheswith the additional "globally" switch 'g'- but in multiple rows. So you need a sub-select to test them all (or aggregate). Put that in an EXISTS- semi-join and you arrive at what you wanted.

您可以通过附加的“全局”开关获得所有匹配项'g'- 但在多行中。所以你需要一个子选择来测试它们(或聚合)。把它放在一个EXISTS- 半连接中,你就会得到你想要的。

Maybe you can report back with a performance testof all three? Use EXPLAIN ANALYZEfor that.

也许您可以通过对所有三个的性能测试来报告?为此使用EXPLAIN ANALYZE

回答by Milen A. Radev

Your immediate problem is that regexp_matchescould return one or more rows.

您当前的问题是regexp_matches可能返回一行或多行。

回答by small_duck

Try using "substring" instead, which extracts a substring given a regex pattern.

尝试改用“子字符串”,它会提取给定正则表达式模式的子字符串。

SELECT msg.message,
    msg.src_addr,
    msg.dst_addr,
    mnc.name
FROM "msg"
JOIN "mnc"
ON substring(msg.src_addr || '+' || msg.dst_addr from '38(...)') = mnc.code