使用 PostgreSQL 全文搜索排名的最佳方式
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/12933805/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Best way to use PostgreSQL full text search ranking
提问by Timmmm
Following on from this answerI want to know what the best way to use PostgreSQL's built-in full text search is if I want to sort by rank, andlimit to only matching queries.
继此答案之后,我想知道使用 PostgreSQL 的内置全文搜索的最佳方法是,如果我想按排名排序,并限制为仅匹配查询。
Let's assume a very simple table.
让我们假设一个非常简单的表。
CREATE TABLE pictures (
id SERIAL PRIMARY KEY,
title varchar(300),
...
)
or whatever. Now I want to search the title
field. First I create an index:
管他呢。现在我想搜索该title
字段。首先我创建一个索引:
CREATE INDEX pictures_title ON pictures
USING gin(to_tsvector('english', title));
Now I want to search for 'small dog'
. This works:
现在我想搜索'small dog'
. 这有效:
SELECT pictures.id,
ts_rank_cd(
to_tsvector('english', pictures.title), 'small dog'
) AS score
FROM pictures
ORDER BY score DESC
But what I really want is this:
但我真正想要的是:
SELECT pictures.id,
ts_rank_cd(
to_tsvector('english', pictures.title), to_tsquery('small dog')
) AS score
FROM pictures
WHERE to_tsvector('english', pictures.title) @@ to_tsquery('small dog')
ORDER BY score DESC
Or alternatively this (which doesn't work - can't use score
in the WHERE
clause):
或者这个(这不起作用 - 不能score
在WHERE
子句中使用):
SELECT pictures.id,
ts_rank_cd(
to_tsvector('english', pictures.title), to_tsquery('small dog')
) AS score
FROM pictures WHERE score > 0
ORDER BY score DESC
What's the best way to do this? My questions are many-fold:
做到这一点的最佳方法是什么?我的问题是多方面的:
- If I use the version with repeated
to_tsvector(...)
will it call that twice, or is it smart enough to cache the results somehow? - Is there a way to do it without repeating the
to_ts...
function calls? - Is there a way to use
score
in theWHERE
clause at all? - If there is, would it be better to filter by
score > 0
or use the@@
thing?
- 如果我重复使用该版本,
to_tsvector(...)
它会调用两次,还是它足够聪明以某种方式缓存结果? - 有没有办法在不重复
to_ts...
函数调用的情况下做到这一点? - 有没有办法
score
在WHERE
子句中使用? - 如果有,过滤
score > 0
或使用这个@@
东西会更好吗?
回答by isapir
The use of the @@
operator will utilize the full text GIN index, while the test for score > 0
would not.
@@
运算符的使用将使用全文 GIN 索引,而测试则score > 0
不会。
I created a table as in the Question, but added a column named title_tsv
:
我在问题中创建了一个表,但添加了一个名为的列title_tsv
:
CREATE TABLE test_pictures (
id BIGSERIAL,
title text,
title_tsv tsvector
);
CREATE INDEX ix_pictures_title_tsv ON test_pictures
USING gin(title_tsv);
I populated the table with some test data:
我用一些测试数据填充了表格:
INSERT INTO test_pictures(title, title_tsv)
SELECT T.data, to_tsvector(T.data)
FROM some_table T;
Then I ran the previously accepted answer with explain analyze
:
然后我运行了以前接受的答案explain analyze
:
EXPLAIN ANALYZE
SELECT score, id, title
FROM (
SELECT ts_rank_cd(P.title_tsv, to_tsquery('address & shipping')) AS score
,P.id
,P.title
FROM test_pictures as P
) S
WHERE score > 0
ORDER BY score DESC;
And got the following. Note the execution time of 5,015 ms
并得到以下内容。注意5,015 ms的执行时间
QUERY PLAN |
----------------------------------------------------------------------------------------------------------------------------------------------|
Gather Merge (cost=274895.48..323298.03 rows=414850 width=60) (actual time=5010.844..5011.330 rows=1477 loops=1) |
Workers Planned: 2 |
Workers Launched: 2 |
-> Sort (cost=273895.46..274414.02 rows=207425 width=60) (actual time=4994.539..4994.555 rows=492 loops=3) |
Sort Key: (ts_rank_cd(p.title_tsv, to_tsquery('address & shipping'::text))) DESC |
Sort Method: quicksort Memory: 131kB |
-> Parallel Seq Scan on test_pictures p (cost=0.00..247776.02 rows=207425 width=60) (actual time=17.672..4993.997 rows=492 loops=3) |
Filter: (ts_rank_cd(title_tsv, to_tsquery('address & shipping'::text)) > '0'::double precision) |
Rows Removed by Filter: 497296 |
Planning time: 0.159 ms |
Execution time: 5015.664 ms |
Now compare that with the @@
operator:
现在将其与@@
运算符进行比较:
EXPLAIN ANALYZE
SELECT ts_rank_cd(to_tsvector(P.title), to_tsquery('address & shipping')) AS score
,P.id
,P.title
FROM test_pictures as P
WHERE P.title_tsv @@ to_tsquery('address & shipping')
ORDER BY score DESC;
And the results coming in with an execution time of about 29 ms:
结果以大约 29 毫秒的执行时间出现:
QUERY PLAN |
-------------------------------------------------------------------------------------------------------------------------------------------------|
Gather Merge (cost=13884.42..14288.35 rows=3462 width=60) (actual time=26.472..26.942 rows=1477 loops=1) |
Workers Planned: 2 |
Workers Launched: 2 |
-> Sort (cost=12884.40..12888.73 rows=1731 width=60) (actual time=17.507..17.524 rows=492 loops=3) |
Sort Key: (ts_rank_cd(to_tsvector(title), to_tsquery('address & shipping'::text))) DESC |
Sort Method: quicksort Memory: 171kB |
-> Parallel Bitmap Heap Scan on test_pictures p (cost=72.45..12791.29 rows=1731 width=60) (actual time=1.781..17.268 rows=492 loops=3) |
Recheck Cond: (title_tsv @@ to_tsquery('address & shipping'::text)) |
Heap Blocks: exact=625 |
-> Bitmap Index Scan on ix_pictures_title_tsv (cost=0.00..71.41 rows=4155 width=0) (actual time=3.765..3.765 rows=1477 loops=1) |
Index Cond: (title_tsv @@ to_tsquery('address & shipping'::text)) |
Planning time: 0.214 ms |
Execution time: 28.995 ms |
As you can see in the execution plan, the index ix_pictures_title_tsv
was used in the second query, but not in the first one, making the query with the @@
operator a whopping 172 times faster!
正如您在执行计划中所看到的,索引ix_pictures_title_tsv
在第二个查询中使用,但在第一个查询中没有使用,这使得使用@@
运算符的查询快了 172 倍!
回答by Clodoaldo Neto
select *
from (
SELECT
pictures.id,
ts_rank_cd(to_tsvector('english', pictures.title),
to_tsquery('small dog')) AS score
FROM pictures
) s
WHERE score > 0
ORDER BY score DESC
回答by Wolph
If I use the version with repeated to_tsvector(...) will it call that twice, or is it smart enough to cache the results somehow?
如果我使用重复 to_tsvector(...) 的版本,它会调用两次,还是它足够聪明以某种方式缓存结果?
The best way to notice these things is to do a simple explain, although those can be hard to read.
注意这些事情的最好方法是做一个简单的解释,尽管这些解释可能很难阅读。
Long story short, yes, PostgreSQL is smart enough to reuse computed results.
长话短说,是的,PostgreSQL 足够聪明,可以重用计算结果。
Is there a way to do it without repeating the to_ts... function calls?
有没有办法在不重复 to_ts... 函数调用的情况下做到这一点?
What I usually do is add a tsv
column which is the text search vector. If you make this auto update using triggers it immediately gives you the vector easily accessible but it also allows you to selectively update the search index by making the trigger selective.
我通常做的是添加一tsv
列作为文本搜索向量。如果您使用触发器进行此自动更新,它会立即为您提供易于访问的向量,但它还允许您通过使触发器具有选择性来选择性地更新搜索索引。
Is there a way to use score in the WHERE clause at all?
有没有办法在 WHERE 子句中使用 score ?
Yes, but not with that name. Alternatively you could create a sub-query, but I would personally just repeat it.
是的,但不是用那个名字。或者,您可以创建一个子查询,但我个人会重复它。
If there is, would it be better to filter by score > 0 or use the @@ thing?
如果有,按分数> 0 过滤或使用@@ 会更好吗?
The simplest version I can think of is this:
我能想到的最简单的版本是这样的:
SELECT *
FROM pictures
WHERE 'small dog' @@ text_search_vector
The text_search_vector
could obviously be replaced with something like to_tsvector('english', pictures.title)
该text_search_vector
可以明显地喜欢的东西所取代to_tsvector('english', pictures.title)