Notice: this page is a Chinese-English parallel translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/7730027/

Date: 2020-09-01 12:25:14  Source: igfitidea

How to create simple fuzzy search with Postgresql only?

Tags: sql, ruby-on-rails, postgresql, fuzzy-search

Asked by Alve

I have a little problem with the search functionality on my RoR-based site. I have many Products, each with a CODE. This code can be any string, like "AB-123-lHdfj". Right now I use the ILIKE operator to find products:

Product.where("code ILIKE ?", "%" + params[:search] + "%")

It works fine, but it can't find products with codes like "AB123-lHdfj" or "AB123lHdfj".

What should I do about this? Maybe PostgreSQL has some string normalization function, or some other method that could help me? :)

Answered by Paul Sasik

Postgres provides a module with several string comparison functions such as soundex and metaphone. But you will want to use the levenshtein edit distance function.

Example:

test=# SELECT levenshtein('GUMBO', 'GAMBOL');
 levenshtein
-------------
           2
(1 row)

The 2 is the edit distance between the two words. When you apply this against a number of words and sort by the edit distance, you get the kind of fuzzy matches you're looking for.
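To make the notion of edit distance concrete, here is a minimal Ruby sketch of the classic dynamic-programming algorithm. It is only an illustration: edit_distance is a hypothetical helper name, and this is not claimed to be how Postgres implements levenshtein() internally.

```ruby
# Illustrative sketch only: edit distance where each insertion,
# deletion, or substitution of one character costs 1.
def edit_distance(a, b)
  # prev[j] is the distance between the already-processed prefix of a
  # and the first j characters of b.
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i] # distance from the first i characters of a to the empty string
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [
        prev[j] + 1,       # delete ca
        curr[j - 1] + 1,   # insert cb
        prev[j - 1] + cost # substitute (free if the characters match)
      ].min
    end
    prev = curr
  end
  prev.last
end

puts edit_distance("GUMBO", "GAMBOL") # prints 2, same as the psql example
```

Here the 2 comes from one substitution (U to A) plus one insertion (the trailing L).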

Try this sample query (with your own object names and data, of course):

SELECT * 
FROM some_table
WHERE levenshtein(code, 'AB123-lHdfj') <= 3
ORDER BY levenshtein(code, 'AB123-lHdfj')
LIMIT 10

This query says:

Give me the top 10 results from some_table where the edit distance between the code value and the input 'AB123-lHdfj' is at most 3. You will get back rows whose code differs from 'AB123-lHdfj' by no more than 3 single-character edits, closest matches first.

Note: if you get an error like:

function levenshtein(character varying, unknown) does not exist

Install the fuzzystrmatch extension using:

test=# CREATE EXTENSION fuzzystrmatch;

Answered by Erwin Brandstetter

Paul told you about levenshtein(). That's a very useful tool, but it's also very slow with big tables: it has to calculate the Levenshtein distance from the search term for every single row, which is expensive.

First off, if your requirements are as simple as the example indicates, you can still use LIKE. Just replace any - in your search term with % to create the WHERE clause

WHERE code LIKE '%AB%123%lHdfj%'

instead of

WHERE code LIKE '%AB-123-lHdfj%'
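The substitution described above can be sketched in Ruby, the language of the question's Rails code. like_pattern is a hypothetical helper, and the escaping of %, _ and \ is an added assumption (so that user input cannot smuggle in its own wildcards); it relies on backslash being the LIKE escape character, which is the PostgreSQL default.

```ruby
# Hypothetical helper: turn a raw search term into a LIKE pattern
# in which every hyphen matches any run of characters.
def like_pattern(term)
  # Assumption: escape LIKE metacharacters typed by the user, so that
  # only the wildcards added below are treated as wildcards.
  escaped = term.gsub(/[\\%_]/) { |ch| "\\#{ch}" }
  "%" + escaped.gsub("-", "%") + "%"
end

puts like_pattern("AB-123-lHdfj") # prints %AB%123%lHdfj%
```

In the question's setup the result would presumably be passed as a bind parameter, along the lines of Product.where("code ILIKE ?", like_pattern(params[:search])).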

If your real problem is more complex and you need something faster, then, depending on your requirements, there are several options.

  • There is full text search, of course. But this may be overkill in your case.

  • A more likely candidate is pg_trgm. Note that you can combine that with LIKE in PostgreSQL 9.1 and later. See this blog post by Depesz.
    Also very interesting in this context: the similarity() function or % operator of that module.

  • Last but not least, you can implement a hand-knit solution with a function to normalize the strings to be searched. For instance, you could transform AB1-23-lHdfj -> ab123lhdfj, save it in an additional column, and search it with search terms that have been transformed the same way.

    Or use an index on an expression instead of the redundant column. (Involved functions must be IMMUTABLE.) And possibly combine that with pg_trgm from above.
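The hand-knit normalization from the last option could look roughly like this on the Ruby side. normalize_code is a hypothetical helper, and the exact rule (lowercase, then drop everything that is not an ASCII letter or digit) is an assumption inferred from the AB1-23-lHdfj -> ab123lhdfj example; the real rule depends on what your codes may contain.

```ruby
# Hypothetical helper: reduce a product code to a canonical form so that
# differently punctuated spellings of the same code compare equal.
def normalize_code(code)
  code.downcase.gsub(/[^a-z0-9]/, "")
end

# The stored code and both spellings from the question collapse to the
# same normalized string.
["AB-123-lHdfj", "AB123-lHdfj", "AB123lHdfj"].each do |code|
  puts normalize_code(code) # prints ab123lhdfj each time
end
```

The same transformation has to be applied both to the stored values (in the extra column or the expression index) and to each incoming search term, otherwise the two sides will not line up.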

Overview of pattern-matching techniques:
