Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/2513501/

PostgreSQL: Full Text Search - How to search partial words?

Tags: sql, postgresql, full-text-search

Asked by Anthoni Gardner

Following a question posted here about how I can increase the speed of one of my SQL search methods, I was advised to update my table to make use of Full Text Search. This is what I have now done, using GiST indexes to make searching faster. On some of the "plain" queries I have noticed a marked increase in speed, which I am very happy about.

However, I am having difficulty in searching for partial words. For example I have several records that contain the word Squire (454) and I have several records that contain Squirrel (173). Now if I search for Squire it only returns the 454 records but I also want it to return the Squirrel records as well.

My query looks like this:

SELECT title 
FROM movies 
WHERE vectors @@ to_tsquery('squire');

I thought I could do to_tsquery('squire%') but that does not work.
How do I get it to search for partial matches?

Also, in my database I have records that are movies and others that are just TV shows. These are differentiated by double quotes around the name, so "Munsters" is a TV show, whereas The Munsters is the film of the show. What I want to be able to do is search for just the TV shows and, separately, for just the movies. Any idea how I can achieve this?

Regards Anthoni

Accepted answer by thetaiko

Even using LIKE you will not be able to get 'squirrel' from squire% because 'squirrel' has two 'r's: the pattern requires an 'e' right after "squir", where 'squirrel' has its second 'r'. To get Squire and Squirrel you could run the following query:

SELECT title FROM movies WHERE vectors @@ to_tsquery('squire|squirrel');
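
The point about LIKE can be checked directly. This is a minimal sketch using only string literals, so no table is assumed. The first comparison should be true, the second false, and the third true, because the shorter prefix squir% is common to both words:

```sql
-- 'squirrel' continues with "r" after "squir", so the "e" required by squire% is missing.
SELECT 'squire'   LIKE 'squire%' AS squire_matches,
       'squirrel' LIKE 'squire%' AS squirrel_matches,
       'squirrel' LIKE 'squir%'  AS squirrel_matches_shorter_prefix;
```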

To differentiate between movies and TV shows you should add a column to your database. However, there are many ways to skin this cat. You could use a sub-query to force Postgres to first find the movies matching 'squire' and 'squirrel' and then search that subset to find titles that begin with a '"'. It is possible to create indexes for use in LIKE '"%...' searches.

Without exploring other indexing possibilities you could also run these - mess around with them to find which is fastest:

SELECT title 
FROM (
   SELECT * 
   FROM movies 
   WHERE vectors @@ to_tsquery('squire|squirrel')
) t
WHERE title ILIKE '"%';

or

SELECT title 
FROM movies 
WHERE vectors @@ to_tsquery('squire|squirrel') 
  AND title ILIKE '"%';
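
The dedicated column suggested at the start of this answer might look roughly like the following. This is only a sketch: the column name is_tv_show and the index name are made up for illustration, and it assumes TV show titles are stored with a leading double quote, as described in the question, and that title is of type text.

```sql
-- Hypothetical column name; flags rows whose title starts with a double quote.
ALTER TABLE movies ADD COLUMN is_tv_show boolean NOT NULL DEFAULT false;

UPDATE movies SET is_tv_show = true WHERE title LIKE '"%';

-- TV shows only; use "AND NOT is_tv_show" for movies only.
SELECT title
FROM movies
WHERE vectors @@ to_tsquery('squire|squirrel')
  AND is_tv_show;

-- Alternative without a new column: a btree index with text_pattern_ops
-- can serve left-anchored LIKE patterns such as '"%'.
CREATE INDEX movies_title_pattern_idx ON movies (title text_pattern_ops);
```

Note that a text_pattern_ops index helps LIKE but not ILIKE. Since a double quote has no upper or lower case, plain LIKE is enough for this particular pattern.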

Answer by Alexander Mera

Try:

SELECT title FROM movies WHERE to_tsvector(title) @@ to_tsquery('squire:*')

This works on PostgreSQL 8.4+
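
To see prefix matching in isolation, here is a small sketch that uses the 'simple' configuration, so stemming plays no part, and the shorter prefix squir, which is common to both words. Both comparisons should come out true. The index name is hypothetical, and the movies table is the one from the question.

```sql
-- Prefix matching: 'squir:*' matches any lexeme that starts with "squir".
SELECT to_tsvector('simple', 'The Squire')       @@ to_tsquery('simple', 'squir:*') AS squire_matches,
       to_tsvector('simple', 'Squirrel Stories') @@ to_tsquery('simple', 'squir:*') AS squirrel_matches;

-- Hypothetical index name. An expression index is only considered when the
-- query repeats the same expression, including the configuration argument.
CREATE INDEX movies_title_fts_idx ON movies USING gin (to_tsvector('simple', title));

SELECT title
FROM movies
WHERE to_tsvector('simple', title) @@ to_tsquery('simple', 'squir:*');
```

The two-argument form of to_tsvector is used in the index because the one-argument form depends on the default_text_search_config setting and cannot be used in an expression index.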

Answer by Joshua Burns

Anthoni,

Assuming you plan on using only ASCII encoding (could be difficult, I'm aware), a very viable option may be the Trigram (pg_trgm) module: http://www.postgresql.org/docs/9.0/interactive/pgtrgm.html

Trigram utilizes built-in indexing methods such as GiST and GIN. The only modification you have to make is, when defining your index, to specify an operator class of either gist_trgm_ops or gin_trgm_ops.

If the contrib modules aren't already installed, on Ubuntu it's as easy as running the following command from the shell:

sudo apt-get install postgresql-contrib

After the contrib modules are made available, you must install the pg_trgm extension into the database in question. You do this by executing the following PostgreSQL query on the database you wish to install the module into:

CREATE EXTENSION pg_trgm;

After the pg_trgm extension has been installed, we're ready to have some fun!

-- Create a test table.
CREATE TABLE test (my_column text);
-- Create a trigram index.
CREATE INDEX test_my_column_trgm_idx ON test USING gist (my_column gist_trgm_ops);
-- Add a few records.
INSERT INTO test (my_column) VALUES ('First Entry'), ('Second Entry'), ('Third Entry');
-- Query using our new index ('Frist' is misspelled on purpose).
SELECT my_column, similarity(my_column, 'Frist Entry') AS similarity
FROM test
WHERE my_column % 'Frist Entry'
ORDER BY similarity DESC;
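
For the partial-word case in the question, the same module can also back substring searches. The following is a sketch against the test table above, with a hypothetical index name. It assumes pg_trgm is installed and a PostgreSQL version of 9.1 or later, where trigram indexes can support LIKE and ILIKE.

```sql
-- Hypothetical index name; GIN variant of the trigram index on the same test table.
CREATE INDEX test_my_column_trgm_gin_idx ON test USING gin (my_column gin_trgm_ops);

-- Inspect the trigrams pg_trgm extracts from a string.
SELECT show_trgm('Squire');

-- Substring search that a trigram index can support.
SELECT my_column FROM test WHERE my_column ILIKE '%entr%';
```

On a table this small the planner will most likely choose a sequential scan anyway; the index only starts to matter with realistic row counts.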

Answer by Greg

@alexander-mera's solution works great!

Note: Also make sure to convert spaces to +, for example if you are searching for squire knight:

SELECT title FROM movies WHERE to_tsvector(title) @@ to_tsquery('squire+knight:*')
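
For comparison, the operators that to_tsquery documents for combining terms are & (AND), | (OR), ! (NOT) and, from PostgreSQL 9.6, <-> (FOLLOWED BY). Two bare words separated only by a space are rejected as a syntax error. A sketch of the same search with explicit operators, assuming the movies table from the question:

```sql
-- Both words must appear; only "knight" is treated as a prefix.
SELECT title FROM movies WHERE to_tsvector(title) @@ to_tsquery('squire & knight:*');

-- Treat both words as prefixes.
SELECT title FROM movies WHERE to_tsvector(title) @@ to_tsquery('squire:* & knight:*');
```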

Answer by brightball

The broad solution to this is to use PG's ts_rewrite function to set up an aliases table that works for alternate matches (see Query Rewriting in the PostgreSQL documentation). This covers cases like yours above while also handling completely different cases, like searching for tree rat and getting results for squirrel.

Full details and explanation are at that link, but the gist of it is that you can set up an aliases table with 2 tsquery columns and pass a query on that table in with your search, like so:

CREATE TABLE aliases (t tsquery primary key, s tsquery);
INSERT INTO aliases VALUES(to_tsquery('supernovae'), to_tsquery('supernovae|sn'));

SELECT ts_rewrite(to_tsquery('supernovae & crab'), 'SELECT * FROM aliases');

Resulting in a final query that looks more like:

WHERE vectors @@ ts_rewrite(to_tsquery('supernovae & crab'), 'SELECT * FROM aliases')

This is similar to the thesaurus setup within PG but works without requiring a full reindex every time you add something. As you come across little spelling variations and cases of "when I search for this I expect results like this", it's very easy to just add them to the table. You can add more columns to that table as well, as long as the query passed to ts_rewrite returns the 2 expected tsquery columns.
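
Applied to the question, that could look something like the sketch below. The note column is a hypothetical addition used only to show that extra columns are fine as long as the inner query selects just the target and substitute columns. The aliases table is the one created above and movies is the table from the question.

```sql
-- Hypothetical extra column for bookkeeping; ts_rewrite never sees it
-- because the inner query selects only the target (t) and substitute (s).
ALTER TABLE aliases ADD COLUMN note text;

INSERT INTO aliases (t, s, note)
VALUES (to_tsquery('squire'), to_tsquery('squire|squirrel'), 'partial-word request');

SELECT title
FROM movies
WHERE vectors @@ ts_rewrite(to_tsquery('squire'), 'SELECT t, s FROM aliases');
```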

When you dig into that documentation you'll see suggested examples for performance tuning as well. There's a balance between using trigram for pure speed and using vector/query/rewrite for robustness.
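
One of the tuning ideas in that documentation is to hand ts_rewrite only the alias rows that can apply, by filtering on the tsquery containment operator @>. A sketch along those lines, reusing the aliases table and the example query from above:

```sql
-- Only fetch alias rows whose target is contained in the query being rewritten.
SELECT ts_rewrite(
    to_tsquery('supernovae & crab'),
    'SELECT t, s FROM aliases WHERE to_tsquery(''supernovae & crab'') @> t'
);
```

to_tsquery is used on both sides so that the query and the stored targets are normalized the same way before they are compared.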

Answer by John Kane

One thing that may work is to break the word you are searching for into smaller parts. So you could look for things that contain squi or quir or squire, etc. I'm not sure how efficient that would be, but it may help.
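
One way to read that suggestion is as a set of substring patterns. This is only a sketch, with fragments picked by hand for illustration and the movies table from the question:

```sql
-- Match titles containing any of the shorter fragments.
-- Leading-wildcard patterns cannot use a plain btree index.
SELECT title
FROM movies
WHERE title ILIKE '%squi%'
   OR title ILIKE '%quir%';
```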

When you search for the show or the movie you could try placing the text in single quotes, so it would be either 'show' or '"show"'. I think that could also work.